Commit Graph

722 Commits

Author SHA1 Message Date
bukejiyu 14d46181b8 [Loader] add multi-thread model loading (#6877)
* multi-thread-loader

* fix ut
2026-04-09 23:40:15 -07:00
GoldPancake c1fb3112f8 [FDConfig] Support CLI args for quantization params and add cudagraph validation (#7281)
* refactor quant cli param
2026-04-10 14:13:42 +08:00
lizexu123 613f92ee8f [Feature] support nvfp4 tbo (#7259) 2026-04-09 17:29:39 +08:00
fxyfxy777 39ff38aba1 [OP] Unify MoE op with moe_permute path for bf16 GLM (#7164) 2026-04-09 16:17:56 +08:00
xiaoxiaohehe001 51efe27d76 [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn (#7210)
* [BugFix] fix_flash_mask_attn_sm90

* [BugFix] fix_flash_mask_attn_sm90

* [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn

* [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn
2026-04-09 11:05:10 +08:00
JYChen 43ace7af25 [RL] support moe-topk using topk_reduce_func (#7218)
* support moe-topk using topk_reduce_func

* fix ep error

* fix ut

* fix ut
2026-04-09 11:01:03 +08:00
ShaneGZhu 7005404ce3 [DeepSeekV3.2][Graph Optimization] Remove synchronous operation to avoid capture failures and unnecessary contiguous ops in DSA Backend (#7253)
* Delete contiguous ops.

* fix scale

* Delete unnecessary comments

* fix style
2026-04-09 11:00:13 +08:00
AIbin 48d2bbeb74 fix dsa (#7252) 2026-04-08 20:21:38 +08:00
K11OntheBoat bb48bcbaa2 Split enable_mm (#7183)
Co-authored-by: liuruian <liuruian@MacBook-Pro.local>
2026-04-08 11:25:41 +08:00
lizhenyun01 446b26bbc0 [Feature] support blackwell gemm in ht (#7053)
* [Feature] support blackwell gemm in ht

* [Feature] support ops for convert

* fix cuda error 716

* fix cuda error

* opt memory

* remove unused code
2026-04-07 19:52:51 +08:00
sunxin ae2f9f4d22 [BugFix] Enable moe_gate_fp32 using FD_ENABLE_RL (#7130)
* rl gate fp32

* clean
2026-04-06 21:07:38 -07:00
周周周 18f012457d [OP][Optimization] Remove ENABLE_PREFILL template parameter in multi_query_append_attention_warp1_4_kernel (#7201) 2026-04-07 11:21:57 +08:00
Bingoo 2068656a85 [Optimization] merge matmul and add (#6986)
* merge matmul and add

* modify format

* using paddle.nn.functional.linear

* using _C_ops.linear

* using paddle.nn.functional.linear

* add FLAGS_use_legacy_linear env var in test case

* fix format

* add assert and remove env

* modify format

* using matmul for no bias

* modify accurate baseline
2026-04-03 18:02:03 +08:00
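A minimal sketch of the fusion this commit describes, under assumed illustrative shapes (this is not the FastDeploy code path itself):

```python
import paddle
import paddle.nn.functional as F

x = paddle.randn([8, 1024])     # activations
w = paddle.randn([1024, 4096])  # weight, [in_features, out_features]
b = paddle.randn([4096])        # bias

# Before: two kernels, a matmul followed by an elementwise add.
out_unfused = paddle.matmul(x, w) + b

# After: a single fused call; per the commit's final revision, the
# bias-free case keeps using plain matmul.
out_fused = F.linear(x, w, b)

assert paddle.allclose(out_unfused, out_fused, atol=1e-4).item()
```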
AIbin 1090f8b123 [Models] support GLM4.7 Flash && Ernie_MLA (#7139)
* support GLM4.7 Flash && Ernie_MLA
2026-04-03 17:41:33 +08:00
lizexu123 5f612a348d [BugFix] fix flashinfer-cutedsl moe nvfp4 (#7120)
* fix nvfp4

* fix

* add document

* fix nvfp4

* support eb5

* support bka

* support eb5

* support xpu

* fix

* fix

* add import cutedsl

* fix

* fix

* fix test

* fix H-series GPUs

* update document

* fix

* update document

* update document

* fix
2026-04-03 15:43:19 +08:00
fxyfxy777 9f3b3ce7f5 [Optimization] merge_allreduce (#7039) 2026-04-02 19:52:13 +08:00
Bingoo 410988d9ec [OP] support deepgemm for sm103 (#7073)
* support deepgemm for sm103

* add assert

* modify code style

* add assert

* modify sm version condition

* remove assert
2026-04-01 21:01:09 +08:00
cmcamdy 7a2e33098f [XPU] Refactor pre-process (#6993)
* [XPU] support speculate_pre_process

* merge develop

* fix codestyle

* fix mtp, support cu_seqlens_q_output

* fix mtp, support cu_seqlens_q_output

* fix test

---------

Co-authored-by: lizan1999 <lizan03@baidu.com>
2026-04-01 20:29:55 +08:00
yzwu ceaf5df350 [Iluvatar] Fix cuda graph error for tp > 1 in ernie models (#7126) 2026-04-01 19:13:34 +08:00
sunxin c29e86fc9d [Feature] Support mtp overlap schedule (#7001) 2026-04-01 14:24:26 +08:00
YilongGuo dd61e7e421 [Qwen3VL] Add clear_grpah_opt_backend method to Qwen3VLForConditionalGeneration (#7086)
Add a clear_grpah_opt_backend method that delegates to the underlying model
to clear the CUDA graph optimization backend.

Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com>
2026-03-31 13:48:25 +08:00
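A minimal sketch of the delegation this commit adds (the `model` attribute name is an assumption; the misspelled method name matches the existing API):

```python
class Qwen3VLForConditionalGeneration:
    def clear_grpah_opt_backend(self):
        # Delegate to the wrapped language model, which owns the CUDA graph
        # optimization backend; the multimodal wrapper holds no graph state.
        self.model.clear_grpah_opt_backend()
```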
yzwu 8789329457 [Iluvatar] Support wi4a16 group_gemm (#7078) 2026-03-30 19:03:51 +08:00
zhangbo9674 5c60e2fc6f fix bug in cudagraph (#7069) 2026-03-30 14:24:23 +08:00
mpgemm 1a1d048774 [Feature] Support NVFP4 Flashinfer-cutedsl MoE on SM100 (#6963) 2026-03-30 11:37:04 +08:00
Longzhi Wang 2eea6fa97a [BugFix] Fix kv cache int8 dynamic quant on flash and flash_mask backend (#7028)
* [BugFix] Fix kv cache int8 dynamic quant on flash and flash_mask backend

* add constexpr and code style clean

* add test

* fix code style

* fix test
2026-03-30 11:17:15 +08:00
mpgemm 7a20eaebe8 [Feature] Support cute cpp Encoder FA4 (#7016)
* add cute cpp fa4

* remove comments

* fix merge errors

* move sm_version inside the function

* fix CI errors
2026-03-30 10:54:56 +08:00
huicongyao 25d64efdc4 [Speculative Decoding] Refactor Eagle MTP hidden states copy (#6812)
* reformat eagle_get_hidden_states & eagle_get_self_hidden_states

* readability

* fix xpu bug

* fix coverage failure

* change launch params & parallelize position_map compute

* Fix MTP-related bugs in FastDeploy centralized inference

* fix

* refactor mtp hidden_states process

* fix

* add unittest & optimize kernel

* remove useless code

* fix
2026-03-25 22:54:31 -07:00
freeliuzc 7a6c28781b [Speculative Decoding] Optimize attn_mask_offset and fix mtp bug (#7005)
* optimize attn_mask_offset and optimize mtp usage

* delete useless branch

* fix kernel format

* fix kernel runner
2026-03-25 01:52:06 -07:00
SUN Dong 6cff780fdb [RL] Support moe_topk_select using Paddle native operators; add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization and a swiglu-fp8-quant op for DeepGemmFusedMoE, for training alignment (#6850)
* [RL] Add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization

* update

* update

* update

* support custom topk in DeepGemmFusedMoeMethod apply_tp

* apply_ep_prefill support moe_topk_select

* update

* add ut

* add ut

* add ut

* modify doc

* fix env and docs

* add ut

---------

Co-authored-by: zhanghonggeng <zhanghonggeng@baidu.com>
2026-03-24 11:12:39 +08:00
Nyakku Shigure 8b6bbb3504 [Optimization] Use a separate driver when using Triton with Paddle (#6897)
---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-03-24 10:56:00 +08:00
freeliuzc e87ce4b8cd [Speculative Decoding] refactor MTP and optimize spec-decoding postprocess (#6973)
* support new mtp

* refactor(speculate_decoding and mtp): optimize mtp structure logic. Update spec-branch status-process

* fix cuda-graph for spec-decoding

* fix xpu mtp and fix some notes

* fix unittest and optimize notes

* fix model status update in eos-branch
2026-03-24 10:19:01 +08:00
周周周 5416da8c6e remove assert (#6970)
Co-authored-by: "liuruian" <liuruian@baidu.com>
2026-03-23 14:22:03 +08:00
jackyYang6 634d23a38a [Bugfix] Align thinking_budget behavior with ERNIE reasoning flow (#6934)
* [Bugfix] Align thinking_budget behavior with ERNIE reasoning flow

* [Docs] Fix thinking_budget markdown formatting

* [Test] Align ernie thinking budget test with process_request_dict
2026-03-23 14:15:55 +08:00
jackyYang6 00eb12f656 [BugFix][Models] Unify PaddleFormers fused QKV TP loading and stabilize fallback TP path (#6555)
* [BugFix][Models] avoid custom all-reduce in PaddleFormers fallback TP path and tighten TP-aware layout matching

* [BugFix][Models] unify PaddleFormers fused QKV TP loading and align fallback tests
2026-03-20 16:37:58 +08:00
AIbin bf7e2424d0 [Optimization][Feature] Supports multiple batches of DSK-DSA. (#6930)
* support DSA_MUTI_BATCH

* update test topk

* update dsk-dsa
2026-03-20 15:59:22 +08:00
周周周 1c38da2118 Make seq_lens_this_time/decoder/encoder equal in shape (#6942) 2026-03-20 15:31:52 +08:00
sunxin d77edf8fc9 opt wfp8afp8 triton moe (#6938) 2026-03-20 11:07:25 +08:00
周周周 b1c800b64b remove load_up_proj_weight_first (#6932) 2026-03-19 17:21:34 +08:00
sunxin 33e01f22a8 [Feature][Sampling] Extend top-k_top-p sampling to all backends and unify greedy decoding with top_k=1 (#6894)
* update sampling

* fix

* fix

* fix mtp

* fix test
2026-03-19 01:43:10 -07:00
JYChen f95d8ca7df [RL] support qkrmsnorm using proxy-norm (#6862)
* support qkrmsnorm using paddle.nn.functional.rms_norm

* remove flags in fd
2026-03-18 23:27:26 -07:00
周周周 1a05744c4e nvfp4.py support ep (#6920) 2026-03-19 14:07:46 +08:00
周周周 c184a7cb69 remove source in weight_loader in moe.py (#6892) 2026-03-19 13:31:43 +08:00
Nyakku Shigure dd93f8ffb4 [Optimization] Skip compat guard when torch is not installed (#6913) 2026-03-19 11:29:27 +08:00
AIbin 4794a28f3d opt glm5 model (#6916) 2026-03-19 11:13:33 +08:00
gongweibao fb6c56dfd5 [BugFix][DataProcessor] Force top_k=1 for greedy decoding when temperature=0 (#6748)
* [BugFix] Force top_k=1 for greedy decoding when temperature=0

When temperature is set to 0 (greedy decoding), only setting temperature
to a small epsilon is insufficient — the sampling kernel may still pick
non-top-1 tokens. Explicitly set top_k=1 in all processors to guarantee
argmax behavior.

Additionally, add argmax fast-path in top_k_top_p_sampling() under
FD_DETERMINISTIC_MODE to handle non-rejection sampling backends that
ignore top_k parameter.

* Extract greedy decoding from FD_DETERMINISTIC_MODE guard

top_k=1 → argmax is a correctness optimization, not deterministic-specific.
Remove the FD_DETERMINISTIC_MODE guard so all-greedy fast-path and
mixed-batch override work unconditionally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update test_torch_model.py

---------

Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-03-18 17:36:43 +08:00
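A minimal sketch of the two changes this commit describes, the forced top_k=1 override and the argmax fast path; the function names and dict-based params here are illustrative, not FastDeploy's actual processor API:

```python
import paddle

def force_greedy(params: dict) -> dict:
    # temperature == 0 means greedy decoding. Clamping temperature to a tiny
    # epsilon is not enough, because the sampling kernel can still pick a
    # non-top-1 token; forcing top_k = 1 guarantees argmax behavior.
    if params.get("temperature", 1.0) == 0:
        params["top_k"] = 1
    return params

def sample(logits: paddle.Tensor, top_k: int, top_p: float) -> paddle.Tensor:
    if top_k == 1:
        # Argmax fast path: bypass the stochastic kernel entirely, which
        # also covers backends whose samplers ignore the top_k parameter.
        return paddle.argmax(logits, axis=-1)
    raise NotImplementedError("regular top-k/top-p sampling kernel goes here")
```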
AIbin 9b117aafac support glm-moe-dsa model (#6863) 2026-03-18 17:21:55 +08:00
yzwu 8b890c0d72 [Iluvatar] refactor attn and moe code (#6887) 2026-03-18 10:31:00 +08:00
YuBaoku 0359794e08 [CI] Sync _log_softmax_batch_invariant with paddle update (#6893) 2026-03-17 23:03:57 +08:00
AIbin cb6819d086 [Optimization][OP]support per_token_group_fp8_quant cuda kernel (#6865)
* support per_token_group_fp8_quant cuda kernel

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* update code

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-17 19:17:51 +08:00
Longzhi Wang daaf498213 [Feature] support compute shared experts before combine for better overlap (#6697)
* [Feature] support compute shared experts before combine for better overlap

* fix test

* fix xpu

* fix
2026-03-17 15:18:51 +08:00