Commit Graph

722 Commits

Author SHA1 Message Date
bukejiyu 14d46181b8 [Loader] add multi-thread model loading (#6877)
* multi-thread-loader

* fix ut
2026-04-09 23:40:15 -07:00
GoldPancake c1fb3112f8 [FDConfig] Support CLI args for quantization params and add cudagraph validation (#7281)
* refactor quant cli param
2026-04-10 14:13:42 +08:00
lizexu123 613f92ee8f [Feature] support nvfp4 tbo (#7259) 2026-04-09 17:29:39 +08:00
fxyfxy777 39ff38aba1 [OP] Unify MoE op with moe_permute path for bf16 GLM (#7164) 2026-04-09 16:17:56 +08:00
xiaoxiaohehe001 51efe27d76 [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn (#7210)
* [BugFix] fix_flash_mask_attn_sm90

* [BugFix] fix_flash_mask_attn_sm90

* [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn

* [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn
2026-04-09 11:05:10 +08:00
JYChen 43ace7af25 [RL] support moe-topk using topk_reduce_func (#7218)
* support moe-topk using topk_reduce_func

* fix ep error

* fix ut

* fix ut
2026-04-09 11:01:03 +08:00
ShaneGZhu 7005404ce3 [DeepSeekV3.2][Graph Optimization] Remove synchronous operation to avoid capture failures and unnecessary contiguous ops in DSA Backend (#7253)
* Delete contiguous ops.

* fix scale

* Delete unnecessary comments

* fix style
2026-04-09 11:00:13 +08:00
AIbin 48d2bbeb74 fix dsa (#7252) 2026-04-08 20:21:38 +08:00
K11OntheBoat bb48bcbaa2 Split enable_mm (#7183)
Co-authored-by: liuruian <liuruian@MacBook-Pro.local>
2026-04-08 11:25:41 +08:00
lizhenyun01 446b26bbc0 [Feature] support blackwell gemm in ht (#7053)
* [Feature] support blackwell gemm in ht

* [Feature] support ops for convert

* fix cuda error 716

* fix cuda error

* opt memory

* remove unused code
2026-04-07 19:52:51 +08:00
sunxin ae2f9f4d22 [BugFix] Enable moe_gate_fp32 using FD_ENABLE_RL (#7130)
* rl gate fp32

* clean
2026-04-06 21:07:38 -07:00
周周周 18f012457d [OP][Optimization] Remove ENABLE_PREFILL template parameter in multi_query_append_attention_warp1_4_kernel (#7201) 2026-04-07 11:21:57 +08:00
Bingoo 2068656a85 [Optimization] merge matmul and add (#6986)
* merge matmul and add

* modify format

* using paddle.nn.functional.linear

* using _C_ops.linear

* using paddle.nn.functional.linear

* add FLAGS_use_legacy_linear env var in test case

* fix format

* add assert and remove env

* modify format

* using matmul for no bias

* modify accurate baseline
2026-04-03 18:02:03 +08:00
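A minimal sketch of the fusion this commit describes, under assumed illustrative shapes (this is not the FastDeploy code path itself):

```python
import paddle
import paddle.nn.functional as F

x = paddle.randn([8, 1024])     # activations
w = paddle.randn([1024, 4096])  # weight, [in_features, out_features]
b = paddle.randn([4096])        # bias

# Before: two kernels, a matmul followed by an elementwise add.
out_unfused = paddle.matmul(x, w) + b

# After: a single fused call; per the commit's final revision, the
# bias-free case keeps using plain matmul.
out_fused = F.linear(x, w, b)

assert paddle.allclose(out_unfused, out_fused, atol=1e-4).item()
```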
AIbin 1090f8b123 [Models] support GLM4.7 Flash && Ernie_MLA (#7139)
* support GLM4.7 Flash && Ernie_MLA
2026-04-03 17:41:33 +08:00
lizexu123 5f612a348d [BugFix] fix flashinfer-cutedsl moe nvfp4 (#7120)
* fix nvfp4

* fix

* add document

* fix nvfp4

* support eb5

* support bka

* support eb5

* support xpu

* fix

* fix

* add import cutedsl

* fix

* fix

* fix test

* fix H-series GPUs

* update document

* fix

* update document

* update document

* fix
2026-04-03 15:43:19 +08:00
fxyfxy777 9f3b3ce7f5 [Optimization] merge_allreduce (#7039) 2026-04-02 19:52:13 +08:00
Bingoo 410988d9ec [OP] support deepgemm for sm103 (#7073)
* support deepgemm for sm103

* add assert

* modify code style

* add assert

* modify sm version condition

* remove assert
2026-04-01 21:01:09 +08:00
cmcamdy 7a2e33098f [XPU] Refactor pre-process (#6993)
* [XPU] support speculate_pre_process

* merge develop

* fix codestyle

* fix mtp, support cu_seqlens_q_output

* fix mtp, support cu_seqlens_q_output

* fix test

---------

Co-authored-by: lizan1999 <lizan03@baidu.com>
2026-04-01 20:29:55 +08:00
yzwu ceaf5df350 [Iluvatar] Fix cuda graph error for tp > 1 in ernie models (#7126) 2026-04-01 19:13:34 +08:00
sunxin c29e86fc9d [Feature] Support mtp overlap schedule (#7001) 2026-04-01 14:24:26 +08:00
YilongGuo dd61e7e421 [Qwen3VL] Add clear_grpah_opt_backend method to Qwen3VLForConditionalGeneration (#7086)
Add a clear_grpah_opt_backend method that delegates to the underlying model
to clear the CUDA graph optimization backend.

Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com>
2026-03-31 13:48:25 +08:00
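A minimal sketch of the delegation this commit adds (the `model` attribute name is an assumption; the misspelled method name matches the existing API):

```python
class Qwen3VLForConditionalGeneration:
    def clear_grpah_opt_backend(self):
        # Delegate to the wrapped language model, which owns the CUDA graph
        # optimization backend; the multimodal wrapper holds no graph state.
        self.model.clear_grpah_opt_backend()
```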
yzwu 8789329457 [Iluvatar] Support wi4a16 group_gemm (#7078) 2026-03-30 19:03:51 +08:00
zhangbo9674 5c60e2fc6f fix bug in cudagraph (#7069) 2026-03-30 14:24:23 +08:00
mpgemm 1a1d048774 [Feature] Support NVFP4 Flashinfer-cutedsl MoE on SM100 (#6963) 2026-03-30 11:37:04 +08:00
Longzhi Wang 2eea6fa97a [BugFix] Fix kv cache int8 dynamic quant on flash and flash_mask backend (#7028)
* [BugFix] Fix kv cache int8 dynamic quant on flash and flash_mask backend

* add constexpr and code style clean

* add test

* fix code style

* fix test
2026-03-30 11:17:15 +08:00
mpgemm 7a20eaebe8 [Feature] Support cute cpp Encoder FA4 (#7016)
* add cute cpp fa4

* remove comments

* fix merge errors

* move sm_version inside the function

* fix CI errors
2026-03-30 10:54:56 +08:00
huicongyao 25d64efdc4 [Speculative Decoding] Refactor Eagle MTP hidden states copy (#6812)
* reformat eagle_get_hidden_states & eagle_get_self_hidden_states

* readability

* fix xpu bug

* fix coverage failure

* change launch params & parallelize position_map compute

* Fix MTP-related bugs in FastDeploy centralized inference

* fix

* refactor mtp hidden_states process

* fix

* add unittest & optimize kernel

* remove useless code

* fix
2026-03-25 22:54:31 -07:00
freeliuzc 7a6c28781b [Speculative Decoding] Optimize attn_mask_offset and fix mtp bug (#7005)
* optimize attn_mask_offset and optimize mtp usage

* delete useless branch

* fix kernel format

* fix kernel runner
2026-03-25 01:52:06 -07:00
SUN Dong 6cff780fdb [RL] Support moe_topk_select using Paddle native operators; add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization and a swiglu-fp8-quant op for DeepGemmFusedMoE, for training alignment (#6850)
* [RL] Add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization

* update

* update

* update

* support custom topk in DeepGemmFusedMoeMethod apply_tp

* apply_ep_prefill support moe_topk_select

* update

* add ut

* add ut

* add ut

* modify doc

* fix env and docs

* add ut

---------

Co-authored-by: zhanghonggeng <zhanghonggeng@baidu.com>
2026-03-24 11:12:39 +08:00
Nyakku Shigure 8b6bbb3504 [Optimization] Use a separate driver when using Triton with Paddle (#6897)
---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-03-24 10:56:00 +08:00
freeliuzc e87ce4b8cd [Speculative Decoding] refactor MTP and optimize spec-decoding postprocess (#6973)
* support new mtp

* refactor(speculate_decoding and mtp): optimize mtp structure logic. Update spec-branch status-process

* fix cuda-graph for spec-decoding

* fix xpu mtp and fix some notes

* fix unittest and optimize notes

* fix model status update in eos-branch
2026-03-24 10:19:01 +08:00
周周周 5416da8c6e remove assert (#6970)
Co-authored-by: "liuruian" <liuruian@baidu.com>
2026-03-23 14:22:03 +08:00
jackyYang6 634d23a38a [Bugfix] Align thinking_budget behavior with ERNIE reasoning flow (#6934)
* [Bugfix] Align thinking_budget behavior with ERNIE reasoning flow

* [Docs] Fix thinking_budget markdown formatting

* [Test] Align ernie thinking budget test with process_request_dict
2026-03-23 14:15:55 +08:00
jackyYang6 00eb12f656 [BugFix][Models] Unify PaddleFormers fused QKV TP loading and stabilize fallback TP path (#6555)
* [BugFix][Models] avoid custom all-reduce in PaddleFormers fallback TP path and tighten TP-aware layout matching

* [BugFix][Models] unify PaddleFormers fused QKV TP loading and align fallback tests
2026-03-20 16:37:58 +08:00
AIbin bf7e2424d0 [Optimization][Feature] Supports multiple batches of DSK-DSA. (#6930)
* support DSA_MUTI_BATCH

* update test topk

* update dsk-dsa
2026-03-20 15:59:22 +08:00
周周周 1c38da2118 Make seq_lens_this_time/decoder/encoder equal in shape (#6942) 2026-03-20 15:31:52 +08:00
sunxin d77edf8fc9 opt wfp8afp8 triton moe (#6938) 2026-03-20 11:07:25 +08:00
周周周 b1c800b64b remove load_up_proj_weight_first (#6932) 2026-03-19 17:21:34 +08:00
sunxin 33e01f22a8 [Feature][Sampling] Extend top-k_top-p sampling to all backends and unify greedy decoding with top_k=1 (#6894)
* update sampling

* fix

* fix

* fix mtp

* fix test
2026-03-19 01:43:10 -07:00
JYChen f95d8ca7df [RL] support qkrmsnorm using proxy-norm (#6862)
* support qkrmsnorm using paddle.nn.functional.rms_norm

* remove flags in fd
2026-03-18 23:27:26 -07:00
周周周 1a05744c4e nvfp4.py support ep (#6920) 2026-03-19 14:07:46 +08:00
周周周 c184a7cb69 remove source in weight_loader in moe.py (#6892) 2026-03-19 13:31:43 +08:00
Nyakku Shigure dd93f8ffb4 [Optimization] Skip compat guard when torch is not installed (#6913) 2026-03-19 11:29:27 +08:00
AIbin 4794a28f3d opt glm5 model (#6916) 2026-03-19 11:13:33 +08:00
gongweibao fb6c56dfd5 [BugFix][DataProcessor] Force top_k=1 for greedy decoding when temperature=0 (#6748)
* [BugFix] Force top_k=1 for greedy decoding when temperature=0

When temperature is set to 0 (greedy decoding), only setting temperature
to a small epsilon is insufficient — the sampling kernel may still pick
non-top-1 tokens. Explicitly set top_k=1 in all processors to guarantee
argmax behavior.

Additionally, add argmax fast-path in top_k_top_p_sampling() under
FD_DETERMINISTIC_MODE to handle non-rejection sampling backends that
ignore top_k parameter.

* Extract greedy decoding from FD_DETERMINISTIC_MODE guard

top_k=1 → argmax is a correctness optimization, not deterministic-specific.
Remove the FD_DETERMINISTIC_MODE guard so all-greedy fast-path and
mixed-batch override work unconditionally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update test_torch_model.py

---------

Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-03-18 17:36:43 +08:00
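A minimal sketch of the two changes this commit describes, the forced top_k=1 override and the argmax fast path; the function names and dict-based params here are illustrative, not FastDeploy's actual processor API:

```python
import paddle

def force_greedy(params: dict) -> dict:
    # temperature == 0 means greedy decoding. Clamping temperature to a tiny
    # epsilon is not enough, because the sampling kernel can still pick a
    # non-top-1 token; forcing top_k = 1 guarantees argmax behavior.
    if params.get("temperature", 1.0) == 0:
        params["top_k"] = 1
    return params

def sample(logits: paddle.Tensor, top_k: int, top_p: float) -> paddle.Tensor:
    if top_k == 1:
        # Argmax fast path: bypass the stochastic kernel entirely, which
        # also covers backends whose samplers ignore the top_k parameter.
        return paddle.argmax(logits, axis=-1)
    raise NotImplementedError("regular top-k/top-p sampling kernel goes here")
```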
AIbin 9b117aafac support glm-moe-dsa model (#6863) 2026-03-18 17:21:55 +08:00
yzwu 8b890c0d72 [Iluvatar] refactor attn and moe code (#6887) 2026-03-18 10:31:00 +08:00
YuBaoku 0359794e08 [CI] Sync _log_softmax_batch_invariant with paddle update (#6893) 2026-03-17 23:03:57 +08:00
AIbin cb6819d086 [Optimization][OP]support per_token_group_fp8_quant cuda kernel (#6865)
* support per_token_group_fp8_quant cuda kernel

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* update code

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-17 19:17:51 +08:00
Longzhi Wang daaf498213 [Feature] support compute shared experts before combine for better overlap (#6697)
* [Feature] support compute shared experts before combine for better overlap

* fix test

* fix xpu

* fix
2026-03-17 15:18:51 +08:00