bukejiyu
14d46181b8
[Loader] add multi-thread model loading ( #6877 )
* multi-thread-loader
* fix ut
2026-04-09 23:40:15 -07:00
GoldPancake
c1fb3112f8
[FDConfig] Support CLI args for quantization params and add cudagraph validation ( #7281 )
* refactor quant cli param
2026-04-10 14:13:42 +08:00
lizexu123
613f92ee8f
[Feature] support nvfp4 tbo ( #7259 )
2026-04-09 17:29:39 +08:00
fxyfxy777
39ff38aba1
[OP] Unify MoE op with moe_permute path for bf16 GLM ( #7164 )
2026-04-09 16:17:56 +08:00
xiaoxiaohehe001
51efe27d76
[BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn ( #7210 )
* [BugFix] fix_flash_mask_attn_sm90
* [BugFix] fix_flash_mask_attn_sm90
* [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn
* [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn
2026-04-09 11:05:10 +08:00
JYChen
43ace7af25
[RL] support moe-topk using topk_reduce_func ( #7218 )
* support moe-topk using topk_reduce_func
* fix ep error
* fix ut
* fix ut
2026-04-09 11:01:03 +08:00
ShaneGZhu
7005404ce3
[DeepSeekV3.2][Graph Optimization] Remove synchronous operation to avoid capture failure and unnecessary contiguous ops in DSA Backend ( #7253 )
* Delete contiguous ops.
* fix scale
* Delete unnecessary comments
* fix style
2026-04-09 11:00:13 +08:00
AIbin
48d2bbeb74
fix dsa ( #7252 )
2026-04-08 20:21:38 +08:00
K11OntheBoat
bb48bcbaa2
Split enable_mm ( #7183 )
Co-authored-by: liuruian <liuruian@MacBook-Pro.local >
2026-04-08 11:25:41 +08:00
lizhenyun01
446b26bbc0
[Feature] support blackwell gemm in ht ( #7053 )
* [Feature] support blackwell gemm in ht
* [Feature] support ops for convert
* fix cuda error 716
* fix cuda error
* opt memory
* remove unused code
2026-04-07 19:52:51 +08:00
sunxin
ae2f9f4d22
[BugFix] Enable moe_gate_fp32 using FD_ENABLE_RL ( #7130 )
* rl gate fp32
* clean
2026-04-06 21:07:38 -07:00
周周周
18f012457d
[OP][Optimization] Remove ENABLE_PREFILL template parameter in multi_query_append_attention_warp1_4_kernel ( #7201 )
2026-04-07 11:21:57 +08:00
Bingoo
2068656a85
[Optimization] merge matmul and add ( #6986 )
* merge matmul and add
* modify format
* using paddle.nn.functional.linear
* using _C_ops.linear
* using paddle.nn.functional.linear
* add FLAGS_use_legacy_linear env var in test case
* fix format
* add assert and remove env
* modify format
* using matmul for no bias
* modify accurate baseline
2026-04-03 18:02:03 +08:00
AIbin
1090f8b123
[Models]support GLM4.7 Flash && Ernie_MLA ( #7139 )
* support GLM4.7 Flash && Ernie_MLA
2026-04-03 17:41:33 +08:00
lizexu123
5f612a348d
[BugFix] fix flashinfer-cutedsl moe nvfp4 ( #7120 )
* fix nvfp4
* fix
* add document
* fix nvfp4
* support eb5
* support bka
* support eb5
* support xpu
* fix
* fix
* add import cutedsl
* fix
* fix
* fix test
* fix for H-series GPUs
* update document
* fix
* update document
* update document
* fix
2026-04-03 15:43:19 +08:00
fxyfxy777
9f3b3ce7f5
[Optimization] merge_allreduce ( #7039 )
2026-04-02 19:52:13 +08:00
Bingoo
410988d9ec
[OP] support deepgemm for sm103 ( #7073 )
* support deepgemm for sm103
* add assert
* modify code style
* add assert
* modify sm version condition
* remove assert
2026-04-01 21:01:09 +08:00
cmcamdy
7a2e33098f
[XPU] Refactor pre process ( #6993 )
* [XPU] support speculate_pre_process
* merge develop
* fix codestyle
* fix mtp, support cu_seqlens_q_output
* fix mtp, support cu_seqlens_q_output
* fix test
---------
Co-authored-by: lizan1999 <lizan03@baidu.com >
2026-04-01 20:29:55 +08:00
yzwu
ceaf5df350
[Iluvatar] Fix cuda graph error for tp > 1 in ernie models ( #7126 )
2026-04-01 19:13:34 +08:00
sunxin
c29e86fc9d
[Feature] Support mtp overlap schedule ( #7001 )
2026-04-01 14:24:26 +08:00
YilongGuo
dd61e7e421
[Qwen3VL] Add clear_grpah_opt_backend method to Qwen3VLForConditionalGeneration ( #7086 )
Add clear_grpah_opt_backend method that delegates to the underlying model
to clear cuda graph optimization backend.
Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com >
2026-03-31 13:48:25 +08:00
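The delegation described in the commit above can be sketched as follows. This is a minimal illustration, not FastDeploy's actual class: only the method name `clear_grpah_opt_backend` comes from the commit, and the constructor signature is assumed.

```python
class Qwen3VLForConditionalGeneration:
    """Hypothetical skeleton; the real FastDeploy class has many more members."""

    def __init__(self, model):
        # The underlying language model that owns the CUDA graph
        # optimization backend.
        self.model = model

    def clear_grpah_opt_backend(self):
        # Delegate to the wrapped model so callers can clear the CUDA graph
        # optimization backend through the multimodal wrapper.
        self.model.clear_grpah_opt_backend()
```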
yzwu
8789329457
[Iluvatar] Support wi4a16 group_gemm ( #7078 )
2026-03-30 19:03:51 +08:00
zhangbo9674
5c60e2fc6f
fix bug in cudagraph ( #7069 )
2026-03-30 14:24:23 +08:00
mpgemm
1a1d048774
[Feature] Support NVFP4 Flashinfer-cutedsl MoE on SM100 ( #6963 )
2026-03-30 11:37:04 +08:00
Longzhi Wang
2eea6fa97a
[BugFix] Fix kv cache int8 dynamic quant on flash and flash_mask backend ( #7028 )
* [BugFix] Fix kv cache int8 dynamic quant on flash and flash_mask backend
* add constexpr and code style clean
* add test
* fix code style
* fix test
2026-03-30 11:17:15 +08:00
mpgemm
7a20eaebe8
[Feature] Support cute cpp Encoder FA4 ( #7016 )
* add cute cpp fa4
* Delete comments
* Fix merge error
* Move sm_version into the function
* Fix CI errors
2026-03-30 10:54:56 +08:00
huicongyao
25d64efdc4
[Speculative Decoding] Refactor Eagle MTP hidden states copy ( #6812 )
* reformat eagle_get_hidden_states & eagle_get_self_hidden_states
* readability
* fix xpu bug
* fix coverage failure
* change launch params & parallelize position_map compute
* Fix MTP-related bugs in FastDeploy centralized inference
* fix
* refactor mtp hidden_states process
* fix
* add unittest & optimize kernel
* remove useless code
* fix
2026-03-25 22:54:31 -07:00
freeliuzc
7a6c28781b
[Speculative Decoding] Optimize attn_mask_offset and fix mtp bug ( #7005 )
* optimize attn_mask_offset and optimize mtp usage
* delete useless branch
* fix kernel format
* fix kernel runner
2026-03-25 01:52:06 -07:00
SUN Dong
6cff780fdb
[RL] Support moe_topk_select using Paddle native operators; add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization and a swiglu-fp8-quant op for DeepGemmFusedMoE, for training alignment ( #6850 )
* [RL] Add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization
* update
* update
* update
* support custom topk in DeepGemmFusedMoeMethod apply_tp
* apply_ep_prefill support moe_topk_select
* update
* add ut
* add ut
* add ut
* modify doc
* fix env and docs
* add ut
---------
Co-authored-by: zhanghonggeng <zhanghonggeng@baidu.com >
2026-03-24 11:12:39 +08:00
Nyakku Shigure
8b6bbb3504
[Optimization] Use a separate driver when using Triton with Paddle ( #6897 )
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2026-03-24 10:56:00 +08:00
freeliuzc
e87ce4b8cd
[Speculative Decoding] refactor MTP and optimize spec-decoding postprocess ( #6973 )
* support new mtp
* refactor(speculate_decoding and mtp): optimize mtp structure logic. Update spec-branch status-process
* fix cuda-graph for spec-decoding
* fix xpu mtp and fix some note
* fix unittest and optimize note
* fix model status update in eos-branch
2026-03-24 10:19:01 +08:00
周周周
5416da8c6e
remove assert ( #6970 )
Co-authored-by: “liuruian” <liuruian@baidu.com >
2026-03-23 14:22:03 +08:00
jackyYang6
634d23a38a
[Bugfix] Align thinking_budget behavior with ERNIE reasoning flow ( #6934 )
* [Bugfix] Align thinking_budget behavior with ERNIE reasoning flow
* [Docs] Fix thinking_budget markdown formatting
* [Test] Align ernie thinking budget test with process_request_dict
2026-03-23 14:15:55 +08:00
jackyYang6
00eb12f656
[BugFix][Models] Unify PaddleFormers fused QKV TP loading and stabilize fallback TP path ( #6555 )
* [BugFix][Models] avoid custom all-reduce in PaddleFormers fallback TP path and tighten TP-aware layout matching
* [BugFix][Models] unify PaddleFormers fused QKV TP loading and align fallback tests
2026-03-20 16:37:58 +08:00
AIbin
bf7e2424d0
[Optimization][Feature] Supports multiple batches of DSK-DSA. ( #6930 )
* support DSA_MUTI_BATCH
* update test topk
* update dsk-dsa
2026-03-20 15:59:22 +08:00
周周周
1c38da2118
Make seq_lens_this_time/decoder/encoder equal shape ( #6942 )
2026-03-20 15:31:52 +08:00
sunxin
d77edf8fc9
opt wfp8afp8 triton moe ( #6938 )
2026-03-20 11:07:25 +08:00
周周周
b1c800b64b
remove load_up_proj_weight_first ( #6932 )
2026-03-19 17:21:34 +08:00
sunxin
33e01f22a8
[Feature][Sampling] Extend top-k_top-p sampling to all backends and unify greedy decoding with top_k=1 ( #6894 )
* update sampling
* fix
* fix
* fix mtp
* fix test
2026-03-19 01:43:10 -07:00
JYChen
f95d8ca7df
[RL] support qkrmsnorm using proxy-norm ( #6862 )
* support qkrmsnorm use paddle.nn.functional.rms_norm
* remove flags in fd
2026-03-18 23:27:26 -07:00
周周周
1a05744c4e
nvfp4.py support ep ( #6920 )
2026-03-19 14:07:46 +08:00
周周周
c184a7cb69
remove source in weight_loader in moe.py ( #6892 )
2026-03-19 13:31:43 +08:00
Nyakku Shigure
dd93f8ffb4
[Optimization] Skip compat guard when torch is not installed ( #6913 )
2026-03-19 11:29:27 +08:00
AIbin
4794a28f3d
opt glm5 model ( #6916 )
2026-03-19 11:13:33 +08:00
gongweibao
fb6c56dfd5
[BugFix][DataProcessor] Force top_k=1 for greedy decoding when temperature=0 ( #6748 )
* [BugFix] Force top_k=1 for greedy decoding when temperature=0
When temperature is set to 0 (greedy decoding), only setting temperature
to a small epsilon is insufficient — the sampling kernel may still pick
non-top-1 tokens. Explicitly set top_k=1 in all processors to guarantee
argmax behavior.
Additionally, add argmax fast-path in top_k_top_p_sampling() under
FD_DETERMINISTIC_MODE to handle non-rejection sampling backends that
ignore top_k parameter.
* Extract greedy decoding from FD_DETERMINISTIC_MODE guard
top_k=1 → argmax is a correctness optimization, not deterministic-specific.
Remove the FD_DETERMINISTIC_MODE guard so all-greedy fast-path and
mixed-batch override work unconditionally.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com >
* Update test_torch_model.py
---------
Co-authored-by: gongweibao <gognweibao@baidu.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2026-03-18 17:36:43 +08:00
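The greedy-decoding fix described in the commit above can be sketched as follows. This is a hypothetical illustration of the idea, not FastDeploy's actual code: the function name `normalize_sampling_params` and its signature are invented for this sketch.

```python
def normalize_sampling_params(temperature: float, top_k: int, top_p: float):
    """Sketch of the idea in the commit above: force top_k=1 at temperature 0.

    Replacing temperature=0 with a small epsilon alone is insufficient --
    the sampling kernel may still pick a non-top-1 token -- so greedy
    decoding is made explicit with top_k=1, which guarantees argmax behavior.
    """
    if temperature == 0:
        # Neutral temperature/top_p; top_k=1 alone forces the argmax token.
        return 1.0, 1, 1.0
    return temperature, top_k, top_p
```

With top_k pinned to 1, every backend reduces to picking the single highest-probability token, regardless of whether it otherwise ignores the top_k parameter.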
AIbin
9b117aafac
support glm-moe-dsa model ( #6863 )
2026-03-18 17:21:55 +08:00
yzwu
8b890c0d72
[Iluvatar] refactor attn and moe code ( #6887 )
2026-03-18 10:31:00 +08:00
YuBaoku
0359794e08
[CI] Sync _log_softmax_batch_invariant with paddle update ( #6893 )
2026-03-17 23:03:57 +08:00
AIbin
cb6819d086
[Optimization][OP] support per_token_group_fp8_quant cuda kernel ( #6865 )
* support per_token_group_fp8_quant cuda kernel
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com >
* update code
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com >
2026-03-17 19:17:51 +08:00
Longzhi Wang
daaf498213
[Feature] support compute shared experts before combine for better overlap ( #6697 )
* [Feature] support compute shared experts before combine for better overlap
* fix test
* fix xpu
* fix
2026-03-17 15:18:51 +08:00