FastDeploy

mirror of https://github.com/PaddlePaddle/FastDeploy.git synced 2026-04-23 17:11:21 +08:00

Author	SHA1	Message	Date
AIbin	6ce4854714	[Feature] Support MOE Cutlass backend for latent MOE (#7428 ) * support moe cutlass backend latent moe	2026-04-16 22:11:49 +08:00
周周周	5e54770b2e	[Feature] Add latent mode support for the MoE layer (#7382 )	2026-04-15 13:57:07 +08:00
周周周	73bd4ab318	[Feature] Add explicit hidden_size parameter support to FusedMoE (#7361 )	2026-04-13 20:24:58 +08:00
liuruyan	b34708604c	[TI-consistent] support quant use pow2scale (#7308 ) * support quant use pow2scale * fix * fix	2026-04-13 00:01:53 -07:00
Nyako Shigure	d659099415	[Cleanup] Replace torch proxy alias with public compat API (#7348 )	2026-04-13 11:43:26 +08:00
周周周	225fc8d222	use self.hidden_size not use self.fd_config.model_config.hidden_size (#7340 )	2026-04-11 22:39:43 +08:00
chen	4982aa000e	[RL]moe bf16 ep support paddle batch_gemm (#7337 ) * moe bf16 ep support paddle batch_gemm	2026-04-11 21:51:12 +08:00
fxyfxy777	39ff38aba1	[OP]Unify MoE op with moe_permute path for bf16 GLM (#7164 )	2026-04-09 16:17:56 +08:00
JYChen	43ace7af25	[RL] support moe-topk use topk_reduce_func (#7218 ) * support moe-topk use topk_reduce_func * fix ep error * fix ut * fix ut	2026-04-09 11:01:03 +08:00
lizhenyun01	446b26bbc0	[Feature] support blackwell gemm in ht (#7053 ) * [Feature] support blackwell gemm in ht * [Feature] support ops for convert * fix cuda error 716 * fix cuda error * opt memory * remove unused code	2026-04-07 19:52:51 +08:00
lizexu123	5f612a348d	[BugFix] fix flashinfer-cutedsl moe nvfp4 (#7120 ) * fix nvfp4 * fix * add document * fix nvfp4 * support eb5 * support bka * support eb5 * support xpu * fix * fix * add import cutedsl * fix * fix * fix test * fix H GPU * update document * fix * update document * update document * fix	2026-04-03 15:43:19 +08:00
zhangbo9674	5c60e2fc6f	fix bug in cudagraph (#7069 )	2026-03-30 14:24:23 +08:00
mpgemm	1a1d048774	[Feature] Support NVFP4 Flashinfer-cutedsl MoE on SM100 (#6963 )	2026-03-30 11:37:04 +08:00
SUN Dong	6cff780fdb	[RL] Support moe_topk_select using Paddle native operators and Add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization and swiglu-fp8-quant op for DeepGemmFusedMoE for training alignment (#6850 ) * [RL] Add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization * update * update * update * support custom topk in DeepGemmFusedMoeMethod apply_tp * apply_ep_prefill support moe_topk_select * update * add ut * add ut * add ut * modify doc * fix env and docs * add ut --------- Co-authored-by: zhanghonggeng <zhanghonggeng@baidu.com>	2026-03-24 11:12:39 +08:00
Nyakku Shigure	8b6bbb3504	[Optimization] Use a separate driver when using Triton with Paddle (#6897 ) --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-03-24 10:56:00 +08:00
sunxin	d77edf8fc9	opt wfp8afp8 triton moe (#6938 )	2026-03-20 11:07:25 +08:00
周周周	b1c800b64b	remove load_up_proj_weight_first (#6932 )	2026-03-19 17:21:34 +08:00
周周周	c184a7cb69	remove source in weight_loader in moe.py (#6892 )	2026-03-19 13:31:43 +08:00
yzwu	8b890c0d72	[Iluvatar] refactor attn and moe code (#6887 )	2026-03-18 10:31:00 +08:00
Longzhi Wang	daaf498213	[Feature] support compute shared experts before combine for better overlap (#6697 ) * [Feature] support compute shared experts before combine for better overlap * fix test * fix xpu * fix	2026-03-17 15:18:51 +08:00
周周周	ea998dd26f	clean code in _load_per_tensor_weight_scale (#6868 ) Co-authored-by: “liuruian” <liuruian@baidu.com>	2026-03-17 14:06:57 +08:00
RichardWooSJTU	4ed483d20b	[BugFix] Fix ep compatibility issues & Optimize permute operator (#6821 ) * fix ep compatibility issues & optimize permute operator * fix ut * fix ut	2026-03-17 10:32:11 +08:00
fxyfxy777	4d39232553	[BugFix] add ut for fused_moe_degemm (#6840 ) * add ut * add skip	2026-03-16 12:22:18 +08:00
liufengwei0103	62110045f3	[RL] add stream guard (#6814 ) * add stream guard * format	2026-03-13 11:22:26 +08:00
fxyfxy777	250ce40b40	[Feature] use phi permute/unpermute & rm swiglu (#6361 ) * tp text output is normal * B eb5 mini text output is normal * eb5mini ep text output is normal on B GPU * default use phi moe op * stash * tp works on H GPU * ep ok * rm debug * rm debug tool * rm del ffn_out * rm swiglu * add envs to swiglu * merge dev * fix ci baseline Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix ci baseline 2 --------- Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-12 02:01:57 -07:00
RAM	cdaf6dd400	[RL][Cherry-Pick] Support Fully Async and PrefixCache (#6599 ) * cherry-pick Support Fully Async and PrefixCache step 1 * copy routing_indices_cache.py from 2.4 * cherry-pick [RL] R3 Fix the bug for determining the end of a request (#6388) * cherry-pick [RL] Clear Requests status of R3 (#6569) * delete code * fix rename bug * fix status shape bug * fix ci	2026-03-12 01:13:30 -07:00
RichardWooSJTU	9f0778f991	[Feature] Support EP prefill with num_worst_tokens (#6574 ) * support num worst tokens * support num worst tokens * fix build error * support num worst tokens: fix errors * support num worst tokens: fix field * support num worst tokens: delete requirements * replace permute and depermute op by pure cuda * replace permute and depermute op by pure cuda * fix ci * fix op * fix nan * fix code style --------- Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>	2026-03-11 17:09:07 +08:00
bukejiyu	598cce8545	[RL] Support SM100 FP8 quantization in RL (#6601 ) * RL SM100 Fix * update	2026-03-04 04:55:04 -08:00
RichardWooSJTU	61789febb9	[Quantization] Support to load static quant ue8m0 scale of DeepGEMM via v0_loader (#6433 ) * support to load static quant ue8m0 scale of deepgemm via v0_loader * [Fix] Fix ue8m0 scale pack dimension calculation and block size validation 1. Fix pack dimension calculation in fused_moe_triton_backend.py: - Changed from `ceil_div(...) // 4` to `(num_scales + 3) // 4` for correct ceiling division - This ensures sufficient pack allocation when num_scales is not a multiple of 4 2. Fix block size hardcoding in block_wise_fp8.py: - Use `self.quant_config.weight_block_size` instead of hardcoded `[128, 128]` - Add assertion to ensure weight_block_size is `[128, 128]` for ue8m0 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 11:32:35 +08:00
RichardWooSJTU	7bd86f99a5	[BugFix] Fix tbo nan (#6439 )	2026-03-02 14:28:48 +08:00
yzwu	6674131b0b	[Iluvatar] Support CudaGraph and optimize flash_attn_unpadded and fused_neox_rope_embedding (#6553 )	2026-03-02 14:07:17 +08:00
RichardWooSJTU	7cfb0ffba0	fix pfcc deep ep in low latency mode (#6440 )	2026-03-02 10:35:51 +08:00
Weiguo Zhu	8fb24122b8	fix reshard error (#6536 )	2026-02-27 22:22:37 +08:00
sunxin	53aaac69da	[Optimization] Enable BF16 gate computation for GLM and Qwen (#6457 ) * gate bf16 * add gate-fp32 * fix * update baseline * update * update * fix	2026-02-26 21:08:46 -08:00
AIbin	0eb87467f8	[BugFix]fix RL bug about blockwisefp8 (#6466 ) * fix RL bug about blockwisefp8 * fix moe same bug * fix RL FP8 bug	2026-02-12 09:15:29 +08:00
bukejiyu	dc5917289d	[loader]support wint2 backend (#6139 ) * support wint2 * update	2026-02-08 22:42:36 -08:00
Mattheliu	c776d483e4	[BugFix]fix handle 4 return values from noaux_tc_redundant op (#6384 ) * fix: handle 4 return values from noaux_tc_redundant op The noaux_tc_redundant CUDA op is defined with 4 outputs in PD_BUILD_STATIC_OP: - output_tensor (scores) - topk_values - topk_indices - tokens_per_expert_stats_list_out (inplace updated) The Python code was only unpacking 3 values, causing: ValueError: too many values to unpack (expected 3) This fix correctly unpacks all 4 return values, ignoring the inplace updated tensor which is the same as the input tokens_per_expert_stats_list. Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com> * fix: make noaux_tc_redundant return 4 values to match OP definition The PD_BUILD_STATIC_OP defines 4 outputs but the function only returned 3, causing inconsistent behavior across different Paddle framework versions. This fix explicitly returns 4 values: - scores (inplace modified) - topk_values - topk_indices - tokens_per_expert_stats_list (inplace modified via atomicAdd) Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com> --------- Co-authored-by: Claude (Claude Opus 4.5) <noreply@anthropic.com>	2026-02-09 13:17:47 +08:00
JYChen	9bcd863902	[Others] support import deepgemm/deepep from fleet ops (#6351 ) * update paddleformers to v1.0 * only change import fleetpath	2026-02-09 11:53:13 +08:00
fxyfxy777	36547cfdb3	[Feature] FD_USE_PHI_FP8_QUANT (#6320 ) * add ut * add use_fd_quant env * rm mask_per_token_quant * add make ops list * USE_FD_FP8_QUANT -> FD_USE_PHI_FP8_QUANT, default is true * modify comments * use bool type * Add function declaration	2026-02-03 22:33:03 -08:00
RAM	5b22e5dfe7	[RL] R3 Support Fused Put the Routing of All Layers (#6099 ) * fused put routing * fix bug * [draft commit]dynamic dtype * fix async put & numpy bug * fix unit8 test case	2026-02-03 04:13:16 -08:00
JYChen	c745a22420	[Feature] Support Ernie FP8 on sm100 ( the fixed version) (#6304 )	2026-02-03 17:47:38 +08:00
周周周	cbdb2462ea	cp 1131 tbo to develop (#6281 )	2026-02-03 15:23:23 +08:00
fxyfxy777	f3413c4caa	[BugFix] fix fused_mask_swiglu_fp8_quant bug (#6316 ) * optimize mask_quant op speed up 1.5 * fix calculate sequence * add fused * rm log * push kernel code * add ut * accuracy ok * add ue8m0 * add ut * add merge develop * rm ut of mask_per_token_quant * Revert "[Optimize] optimize mask_quant & swiglu (#6222)" This reverts commit `2ada119a38`. * add block_size * pre-commit	2026-02-03 13:54:12 +08:00
fxyfxy777	2ada119a38	[Optimize] optimize mask_quant & swiglu (#6222 ) * optimize mask_quant op speed up 1.5 * fix calculate sequence * add fused * rm log * push kernel code * add ut * accuracy ok * add ue8m0 * add ut * add merge develop * rm ut of mask_per_token_quant	2026-02-02 13:52:38 +08:00
JYChen	6c685c9474	Revert "[Feature] Support Ernie FP8 on sm100 (#5593 )" (#6275 ) This reverts commit `eb80724b71`.	2026-01-30 11:22:01 +08:00
yuxuan	44b52701f6	[Feature] Support NVFP4 MoE on SM100 (#6003 ) * fp4 dense * [WIP] support nvfp4, dense part * [wip] developing loading qwen model * loading * update * dense fp4 OK, cudagraph error * [WIP] moe forward part * with flashinfer-backend * qwen3_moe_fp4 * update * support flashinfer-cutlass moe, qwen3-moe-fp4 OK * support ernie4.5-fp4 * fix load error * add some ut * add docs * fix CLA, test * fix the apply() in ModelOptNvFp4FusedMoE * fix CodeStyle * del the PADDLE_COMPATIBLE_API * fix broken url: nvidia_gpu.md * fix docs * fix token_ids * fix CI in Hopper * move flashinfer imports inside the function * fix model_runner Removed the logic for generating random padding IDs. * Remove skip condition for CUDA version in nvfp4 test * add test for nvfp4 * fix according to review * Add Chinese translation link to NVFP4 documentation * del flashinfer.py * fix unittest --------- Co-authored-by: zoooo0820 <zoooo0820@qq.com> Co-authored-by: bukejiyu <395822456@qq.com>	2026-01-29 14:16:07 +08:00
JYChen	eb80724b71	[Feature] Support Ernie FP8 on sm100 (#5593 ) * temporarily usable Deepgemm version * dense part e8m0 ok * version where the EB model runs with E8M0 * code check * support 21b-tp2, dev_paddle * version with single-node 4.5T ep OK * restore deleted code, single-node 4.5T ep (non-cudagraph) * eb tp * Support SM100 block-wise FP8 inference * refine codes, support deepgemm on sm100 * add thirdparty PFCC/DeepGEMM * fix ep decode * use deepep ue8m0 to fix the accuracy issue * fix FP8 TP accuracy * adapt Hopper logic for the Deepgemm upgrade * add ue8m0 kernel * add ue8m0 kernel * fix custom_ops/gpu_ops/cpp_extensions.cc * eb output is normal * eb5 text is right * accuracy looks consistent on visual inspection * accuracy aligned in self-test * replace masked_per_token_quant, ep accuracy OK * performance improved by about 30% * ep runs for now but still has issues * self-test results consistent * rm test fun * fix ep event * update Deepgemm in graph optimization ops * fix build * temporarily work around the deepgemm CI build issue * select deepgemm version by SM * remove useless code --------- Co-authored-by: ckl117 <ckl117@163.com> Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com”> Co-authored-by: fxyfxy777 <fxyfxy777@163.com>	2026-01-29 13:49:54 +08:00
Yuanle Liu	8b05774fad	[Others] enhance deep_ep import and support mixed mode flash_mask_attn (#6238 ) * support flashmaskattn mixed and enhance deepep import * update * fix	2026-01-28 00:02:02 +08:00
Yuanle Liu	253c5cc16c	Improve deep_ep import handling with logging (#6207 ) * Improve deep_ep import handling with logging Refactor deep_ep import logic to handle PaddleFleet and PFCCLab imports with error logging. * Add traceback import to ep.py	2026-01-24 22:41:42 -08:00
GoldPancake	646aced1eb	[UT] Add GLM E2E tests for non-MTP and MTP (#6163 ) * add glm ut	2026-01-23 10:34:29 +08:00


214 Commits