FastDeploy

mirror of https://github.com/PaddlePaddle/FastDeploy.git synced 2026-04-23 00:17:25 +08:00

Author	SHA1	Message	Date
ShaneGZhu	2d8338f9e4	[Optimization][DeepSeekV3.2]Reducing slot_mapping compute frequency from twice per layer to a single pre-processing step. (#7367 )	2026-04-16 19:54:12 +08:00
RichardWooSJTU	420a8c1af5	fix deep gemm import (#7425 )	2026-04-16 17:56:56 +08:00
Bingoo	6b891da02b	[Optimization] enable trtllm_all_reduce fusion kernel in glm model (#6660 ) * enable trtllm_all_reduce fusion kernel in glm model * fix conflict * format update * fix a bug * modify test * modify test * support empty tensor and modify test * fix test_linear config issues * modify test name * add edge test case * modify format * fix conflict * modify default max token num in trtllm_allreduce_fusion * add max token num branch for trtllm_allreduce_fusion * fix format * fix rmsnorm config issue * modify 2025 to 2026 * using compat guard * Lazily import flashinfer.comm and fix test config issue * fix test issues * add flashinfer cache dir clean machine * fix some issues	2026-04-16 14:10:19 +08:00
GoldPancake	a498720a75	[RL] Add clear_graph_opt_backend for glm4_mtp (#7378 ) * add clear_grpah func * fix spell	2026-04-15 19:44:15 +08:00
AIbin	8995a38fa4	fix dsa indexer norm to layernorm (#7398 )	2026-04-15 11:42:45 +08:00
AIbin	bb30f88f1a	[Models] support MLA gate attention (#7404 ) * support mla gate attn * support mla gate attn	2026-04-15 11:42:34 +08:00
zhupengyang	27b00cf385	[XPU] glm-4.5-air (#7071 )	2026-04-14 11:31:49 +08:00
周周周	73bd4ab318	[Feature] Add explicit hidden_size parameter support to FusedMoE (#7361 ) [Feature] Add explicit hidden_size parameter support to FusedMoE	2026-04-13 20:24:58 +08:00
AIbin	1fb8194191	[OP][Models][Optimization] Optimize the RoPE CUDA kernel and update the DeepSeek V3 configuration (#7359 ) * dsk del prefill mask * dsk support 1M+ seq_len rope * update rope tests * Replace max_position_embeddings with max_model_len * 1D grid: gridDim.x has a maximum size of 2^31-1, far exceeding the actual number of tokens.	2026-04-13 19:12:36 +08:00
AIbin	ba01d7a823	[Optimization] [OP] [Models] dsk del prefill mask (#7313 ) * dsk del prefill mask * dsk support 1M+ seq_len rope * update rope tests	2026-04-11 19:32:27 +08:00
zhangbo9674	627f0d9cc8	[RL] change rms norm for glm (#7269 ) * change rms norm for glm * refine code * refine code * refine code	2026-04-10 01:02:37 -07:00
JYChen	43ace7af25	[RL] support moe-topk use topk_reduce_func (#7218 ) * support moe-topk use topk_reduce_func * fix ep error * fix ut * fix ut	2026-04-09 11:01:03 +08:00
AIbin	48d2bbeb74	fix dsa (#7252 )	2026-04-08 20:21:38 +08:00
sunxin	ae2f9f4d22	[BugFix] Enable moe_gate_fp32 using FD_ENABLE_RL (#7130 ) * rl gate fp32 * clean	2026-04-06 21:07:38 -07:00
AIbin	1090f8b123	[Models]support GLM4.7 Flash && Ernie_MLA (#7139 ) * support GLM4.7 Flash && Ernie_MLA	2026-04-03 17:41:33 +08:00
fxyfxy777	9f3b3ce7f5	[Optimization] merge_allreduce (#7039 )	2026-04-02 19:52:13 +08:00
YilongGuo	dd61e7e421	[Qwen3VL] Add clear_grpah_opt_backend method to Qwen3VLForConditionalGeneration (#7086 ) Add clear_grpah_opt_backend method that delegates to the underlying model to clear cuda graph optimization backend. Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com>	2026-03-31 13:48:25 +08:00
Nyakku Shigure	8b6bbb3504	[Optimization] Use a separate driver when using Triton with Paddle (#6897 ) --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-03-24 10:56:00 +08:00
jackyYang6	00eb12f656	[BugFix][Models] Unify PaddleFormers fused QKV TP loading and stabilize fallback TP path (#6555 ) * [BugFix][Models] avoid custom all-reduce in PaddleFormers fallback TP path and tighten TP-aware layout matching * [BugFix][Models] unify PaddleFormers fused QKV TP loading and align fallback tests	2026-03-20 16:37:58 +08:00
AIbin	bf7e2424d0	[Optimization][Feature]Supports multiple batches of DSK-DSA. (#6930 ) * support DSA_MUTI_BATCH * update test topk * update dsk-dsa	2026-03-20 15:59:22 +08:00
AIbin	4794a28f3d	opt glm5 model (#6916 )	2026-03-19 11:13:33 +08:00
AIbin	9b117aafac	support glm-moe-dsa model (#6863 )	2026-03-18 17:21:55 +08:00
gongweibao	a6351dea0b	[BugFix][Optimization] Replace silent failures with catchable exceptions and informative error messages (#6533 ) * init * init * fix format * add * add files * add ut * fix some * add ut * add more * add * fix pre-commit * fix pre-commit * fix cover * skip long seq * add * add * fix * remove not need * fix set attr * fix comments * fix comments * fix failed tests --------- Co-authored-by: gongweibao <gognweibao@baidu.com>	2026-03-16 21:32:43 +08:00
AIbin	c9f7f5234e	[Optimization][BugFix]Optimize Deepseek networking code (#6861 ) * update dsk model * update dsk model	2026-03-16 16:52:43 +08:00
ming1753	bb925c605f	[Other] Adjust GPUModelRunner to enhance compatibility (#6851 )	2026-03-16 14:49:19 +08:00
fxyfxy777	8eb177147c	[BugFix]rm draft code for glm (#6810 ) * rm draft code for glm * fix baseline * fix baseline 2	2026-03-12 23:26:05 -07:00
AIbin	2b8a5b0d81	update indexer model (#6791 )	2026-03-13 14:11:39 +08:00
fxyfxy777	250ce40b40	[Feature] use phi permute/unpermute & rm swiglu (#6361 ) * TP text output is correct * B eb5 mini text output is correct * eb5mini EP text output is correct on B cards * default use phi moe op * stash * TP works on H cards * ep ok * rm debug * rm debug tool * rm del ffn_out * rm swiglu * add envs to swiglu * merge dev * fix ci baseline Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix ci baseline 2 --------- Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-12 02:01:57 -07:00
AIbin	1118351b27	[Optimization] Update Deepseekv3.2 model and dsa-indexer networking and add some unit tests (#6762 ) * add deepseek model doc * update deepseek model doc * update deepseek model doc * update deepseek model doc * cwb support DSK_V32 Model * update DSK_V32_DSA modeling * Ibin Support DSK_DSA * update kernel * update yaml * update requirements * update pre_commit * update model-runner * fix CI bug * del start.sh * fix iluvatar_model_runner * update DSA & add unit tests * update import deep_gemm	2026-03-11 15:52:54 +08:00
bukejiyu	cffa8c246c	[Others]update paddleformer 1.0.0 (#6496 ) * update paddleformer 1.0.0 * update	2026-03-11 15:06:29 +08:00
AIbin	c3aceb6bdc	[Models][OP][Optimization] Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM (#6689 ) * Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM	2026-03-10 15:05:14 +08:00
周周周	cebe6f7dae	clean nvfp4 related code (#6644 )	2026-03-05 15:48:33 +08:00
周周周	3cc09418f1	support dsv3 use flashmla (#6593 )	2026-03-03 11:09:43 +08:00
yzwu	6674131b0b	[Iluvatar] Support CudaGraph and optimize flash_attn_unpadded and fused_neox_rope_embedding (#6553 )	2026-03-02 14:07:17 +08:00
周周周	1503443871	add dsv3 mixed deploy as EP16 TP8 (#6525 )	2026-02-27 14:08:25 +08:00
sunxin	53aaac69da	[Optimization] Enable BF16 gate computation for GLM and Qwen (#6457 ) * gate bf16 * add gate-fp32 * fix * update baseline * update * update * fix	2026-02-26 21:08:46 -08:00
jackyYang6	38c3e02470	fix paddleformers fallback (#6465 )	2026-02-23 15:29:13 +08:00
bukejiyu	dc5917289d	[loader]support wint2 backend (#6139 ) * support wint2 * update	2026-02-08 22:42:36 -08:00
chen	72fe94cb13	[Feature] support glm tp+dp+ep (#6317 )	2026-02-05 21:47:01 +08:00
GoldPancake	183b8d325a	[RL] Support GLM MTP RL Model (#6267 )	2026-02-04 20:14:35 +08:00
GoldPancake	fb374238e1	Revert "[RL] Support GLM MTP RL Model (#6223 )" (#6301 ) This reverts commit `af6c84d48d`.	2026-02-02 14:08:13 +08:00
GoldPancake	af6c84d48d	[RL] Support GLM MTP RL Model (#6223 ) * support glm mtp rl model * fix * fix * fix ut * update baseline	2026-01-28 08:28:03 -08:00
ddchenhao66	6d33d5e370	[Models][BugFix] shared experts and dense mlp layer do not require TP split (#6180 ) Co-authored-by: ddchenhao66 <dhaochen163.com>	2026-01-28 18:58:19 +08:00
Haonan Luo	82057cb71f	Support MXFP4 for GPT-OSS (#5435 ) * support mxfp4 in gpt-oss * support mxfp4 in gpt-oss * add scope for flashinfer * remove torch code * update envs.FD_MXFP4_BACKEND * update process_weights_after_loading * update env name * support tp in gpt-oss, add e2e test * add flashinfer-python-paddle in requirements * fix import error * add test * add test * add test * add test	2026-01-22 14:21:01 +08:00
jackyYang6	988e0bc338	[Feature] Add PaddleFormers fallback backend (#5999 ) * feat(paddleformers): add dense text model fallback backend * docs(paddleformers): add user guide and fix code review issues * add fallback unit test * precommit format * fix pre-commit * fix: address code review feedback * docs: add PaddleFormers backend documentation (EN) and simplify installation --------- Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>	2026-01-19 21:50:50 +08:00
GoldPancake	879e45f6b3	fix compute logits problem (#6093 )	2026-01-19 20:12:14 +08:00
sunxin	9dc1c74d36	fix opt qknorm (#6080 )	2026-01-19 12:07:20 +08:00
GoldPancake	bda38aa519	[Speculative Decoding] Support MTP for GLM-4.5-Air (#6047 ) * glm mtp * add spec neox partial rope	2026-01-16 14:35:24 +08:00
Cheng Yanfei	fbcccaa750	[Intel HPU] enable MoE EP for hpu (#5855 ) * enable HPU MoE EP * MoE intermediate_scale stack * enable loader_v1 esp for tensor_wise_fp8 TP or EP * modify activation_scale name	2026-01-15 13:08:00 +08:00
xiaoxiaohehe001	6f72be7c3e	[Optimize] Qwen2.5-VL vision model with merged linear layers and unif… (#6037 ) * [Optimize] Qwen2.5-VL vision model with merged linear layers and unified normalization * [Optimize] Qwen2.5-VL vision model with merged linear layers and unified normalization	2026-01-14 19:21:31 +08:00

219 Commits