Commit Graph

196 Commits

Author SHA1 Message Date
yzwu 8b890c0d72 [Iluvatar] refactor attn and moe code (#6887) 2026-03-18 10:31:00 +08:00
Longzhi Wang daaf498213 [Feature] support compute shared experts before combine for better overlap (#6697)
* [Feature] support compute shared experts before combine for better overlap

* fix test

* fix xpu

* fix
2026-03-17 15:18:51 +08:00
周周周 ea998dd26f clean up code in _load_per_tensor_weight_scale (#6868)
Co-authored-by: "liuruian" <liuruian@baidu.com>
2026-03-17 14:06:57 +08:00
RichardWooSJTU 4ed483d20b [BugFix] Fix ep compatibility issues & Optimize permute operator (#6821)
* fix ep compatibility issues & optimize permute operator

* fix ut

* fix ut
2026-03-17 10:32:11 +08:00
fxyfxy777 4d39232553 [BugFix] add ut for fused_moe_degemm (#6840)
* add ut

* add skip
2026-03-16 12:22:18 +08:00
liufengwei0103 62110045f3 [RL] add stream guard (#6814)
* add stream guard

* format
2026-03-13 11:22:26 +08:00
fxyfxy777 250ce40b40 [Feature] use phi permute/unpermute & rm swiglu (#6361)
* tp text output is normal

* eb5 mini on B cards: text output is normal

* eb5mini ep on B cards: text output is normal

* default use phi moe op

* stash

* tp works on H cards

* ep ok

* rm debug

* rm debug tool

* rm del ffn_out

* rm swiglu

* add envs to swiglu

* merge dev

* fix ci baseline

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix ci baseline 2

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 02:01:57 -07:00
RAM cdaf6dd400 [RL][Cherry-Pick] Support Fully Async and PrefixCache (#6599)
* cherry-pick Support Fully Async and PrefixCache step 1

* copy routing_indices_cache.py from 2.4

* cherry-pick [RL] R3 Fix the bug for determining the end of a request (#6388)

* cherry-pick [RL] Clear Requests status of R3 (#6569)

* delete code

* fix rename bug

* fix status shape bug

* fix ci
2026-03-12 01:13:30 -07:00
RichardWooSJTU 9f0778f991 [Feature] Support EP prefill with num_worst_tokens (#6574)
* support num worst tokens

* support num worst tokens

* fix build error

* support num worst tokens: fix errors

* support num worst tokens: fix field

* support num worst tokens: delete requirements

* replace permute and depermute op by pure cuda

* replace permute and depermute op by pure cuda

* fix ci

* fix op

* fix nan

* fix code style

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-03-11 17:09:07 +08:00
bukejiyu 598cce8545 [RL] Support SM100 FP8 quantization in RL (#6601)
* RL SM100 Fix

* update
2026-03-04 04:55:04 -08:00
RichardWooSJTU 61789febb9 [Quantization] Support to load static quant ue8m0 scale of DeepGEMM via v0_loader (#6433)
* support to load static quant ue8m0 scale of deepgemm via v0_loader

* [Fix] Fix ue8m0 scale pack dimension calculation and block size validation

1. Fix pack dimension calculation in fused_moe_triton_backend.py:
   - Changed from `ceil_div(...) // 4` to `(num_scales + 3) // 4` for correct ceiling division
   - This ensures sufficient pack allocation when num_scales is not a multiple of 4

2. Fix block size hardcoding in block_wise_fp8.py:
   - Use `self.quant_config.weight_block_size` instead of hardcoded `[128, 128]`
   - Add assertion to ensure weight_block_size is `[128, 128]` for ue8m0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 11:32:35 +08:00
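The ceiling-division fix described in the commit above can be sketched in a few lines of Python; `packed_size` is a hypothetical helper name used only for illustration, not a function from the repo:

```python
def packed_size(num_scales: int) -> int:
    """Number of 4-wide packs needed to hold num_scales scale values.

    The correct ceiling division (num_scales + 3) // 4 allocates a
    final, partially-filled pack when num_scales is not a multiple
    of 4, whereas a plain floor division would under-allocate.
    """
    return (num_scales + 3) // 4

assert packed_size(8) == 2   # exact multiple of 4: two full packs
assert packed_size(9) == 3   # one extra scale forces a third pack
```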
RichardWooSJTU 7bd86f99a5 [BugFix] Fix tbo nan (#6439) 2026-03-02 14:28:48 +08:00
yzwu 6674131b0b [Iluvatar] Support CudaGraph and optimize flash_attn_unpadded and fused_neox_rope_embedding (#6553) 2026-03-02 14:07:17 +08:00
RichardWooSJTU 7cfb0ffba0 fix pfcc deep ep in low latency mode (#6440) 2026-03-02 10:35:51 +08:00
Weiguo Zhu 8fb24122b8 fix reshard error (#6536) 2026-02-27 22:22:37 +08:00
sunxin 53aaac69da [Optimization] Enable BF16 gate computation for GLM and Qwen (#6457)
* gate bf16

* add gate-fp32

* fix

* update baseline

* update

* update

* fix
2026-02-26 21:08:46 -08:00
AIbin 0eb87467f8 [BugFix] fix RL bug about blockwisefp8 (#6466)
* fix RL bug about blockwisefp8

* fix the same bug in moe

* fix RL FP8 bug
2026-02-12 09:15:29 +08:00
bukejiyu dc5917289d [loader] support wint2 backend (#6139)
* support wint2

* update
2026-02-08 22:42:36 -08:00
Mattheliu c776d483e4 [BugFix] fix handling of 4 return values from noaux_tc_redundant op (#6384)
* fix: handle 4 return values from noaux_tc_redundant op

The noaux_tc_redundant CUDA op is defined with 4 outputs in PD_BUILD_STATIC_OP:
- output_tensor (scores)
- topk_values
- topk_indices
- tokens_per_expert_stats_list_out (inplace updated)

The Python code was only unpacking 3 values, causing:
  ValueError: too many values to unpack (expected 3)

This fix correctly unpacks all 4 return values, ignoring the inplace
updated tensor which is the same as the input tokens_per_expert_stats_list.

Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com>

* fix: make noaux_tc_redundant return 4 values to match OP definition

The PD_BUILD_STATIC_OP defines 4 outputs but the function only returned 3,
causing inconsistent behavior across different Paddle framework versions.

This fix explicitly returns 4 values:
- scores (inplace modified)
- topk_values
- topk_indices
- tokens_per_expert_stats_list (inplace modified via atomicAdd)

Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com>

---------

Co-authored-by: Claude (Claude Opus 4.5) <noreply@anthropic.com>
2026-02-09 13:17:47 +08:00
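The unpack mismatch described in the commit above can be reproduced with a plain-Python stub; the stub and its return values are illustrative only, not the real CUDA op:

```python
def noaux_tc_redundant_stub(tokens_per_expert_stats_list):
    """Illustrative stand-in for a 4-output op: the 4th output is the
    inplace-updated stats tensor (the same object as the input)."""
    scores = [0.9, 0.1]
    topk_values = [0.9]
    topk_indices = [0]
    return scores, topk_values, topk_indices, tokens_per_expert_stats_list

stats = [0, 0]

# Old code unpacked only 3 targets, which raises:
#   ValueError: too many values to unpack (expected 3)
try:
    scores, topk_values, topk_indices = noaux_tc_redundant_stub(stats)
except ValueError:
    pass

# Fixed: unpack all 4 outputs, discarding the inplace-updated tensor.
scores, topk_values, topk_indices, _ = noaux_tc_redundant_stub(stats)
```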
JYChen 9bcd863902 [Others] support import deepgemm/deepep from fleet ops (#6351)
* update paddleformers to v1.0

* only change import fleetpath
2026-02-09 11:53:13 +08:00
fxyfxy777 36547cfdb3 [Feature] FD_USE_PHI_FP8_QUANT (#6320)
* add ut

* add use_fd_quant env

* rm mask_per_token_quant

* add make ops list

* USE_FD_FP8_QUANT -> FD_USE_PHI_FP8_QUANT, defaults to true

* modify comments

* use bool type

* Add function declaration
2026-02-03 22:33:03 -08:00
RAM 5b22e5dfe7 [RL] R3 Support Fused Put the Routing of All Layers (#6099)
* fused put routing

* fix bug

* [draft commit]dynamic dtype

* fix async put & numpy bug

* fix uint8 test case
2026-02-03 04:13:16 -08:00
JYChen c745a22420 [Feature] Support Ernie FP8 on sm100 ( the fixed version) (#6304) 2026-02-03 17:47:38 +08:00
周周周 cbdb2462ea cp 1131 tbo to develop (#6281) 2026-02-03 15:23:23 +08:00
fxyfxy777 f3413c4caa [BugFix] fix fused_mask_swiglu_fp8_quant bug (#6316)
* optimize mask_quant op, ~1.5x speedup

* fix calculate sequence

* add fused

* rm log

* push kernel code

* add ut

* accuracy ok

* add ue8m0

* add ut

* add merge develop

* rm ut of mask_per_token_quant

* Revert "[Optimize] optimize mask_quant & swiglu (#6222)"

This reverts commit 2ada119a38.

* add block_size

* pre-commit
2026-02-03 13:54:12 +08:00
fxyfxy777 2ada119a38 [Optimize] optimize mask_quant & swiglu (#6222)
* optimize mask_quant op, ~1.5x speedup

* fix calculate sequence

* add fused

* rm log

* push kernel code

* add ut

* accuracy ok

* add ue8m0

* add ut

* add merge develop

* rm ut of mask_per_token_quant
2026-02-02 13:52:38 +08:00
JYChen 6c685c9474 Revert "[Feature] Support Ernie FP8 on sm100 (#5593)" (#6275)
This reverts commit eb80724b71.
2026-01-30 11:22:01 +08:00
yuxuan 44b52701f6 [Feature] Support NVFP4 MoE on SM100 (#6003)
* fp4 dense

* [WIP] support nvfp4, dense part

* [WIP] developing qwen model loading

* loading

* update

* dense fp4 OK, cudagraph error

* [WIP] moe forward part

* with flashinfer-backend

* qwen3_moe_fp4

* update

* support flashinfer-cutlass moe, qwen3-moe-fp4 OK

* support ernie4.5-fp4

* fix load error

* add some ut

* add docs

* fix CLA, test

* fix the apply() in ModelOptNvFp4FusedMoE

* fix CodeStyle

* del the PADDLE_COMPATIBLE_API

* fix broken url: nvidia_gpu.md

* fix docs

* fix token_ids

* fix CI in Hopper

* move flashinfer imports inside the function

* fix model_runner

Removed the logic for generating random padding IDs.

* Remove skip condition for CUDA version in nvfp4 test

* add test for nvfp4

* fix according to review

* Add Chinese translation link to NVFP4 documentation

* del flashinfer.py

* fix unittest

---------

Co-authored-by: zoooo0820 <zoooo0820@qq.com>
Co-authored-by: bukejiyu <395822456@qq.com>
2026-01-29 14:16:07 +08:00
JYChen eb80724b71 [Feature] Support Ernie FP8 on sm100 (#5593)
* temporarily working DeepGEMM version

* dense part e8m0 OK

* version where the EB model runs with E8M0

* code check

* support 21b-tp2, dev_paddle

* version with single-machine 4.5T ep OK

* restore deleted code; single-machine 4.5T ep (non-cudagraph)

* eb tp

* Support SM100 block-wise FP8 inference

* refine codes, support deepgemm on sm100

* add thirdparty PFCC/DeepGEMM

* fix ep decode

* use deepep ue8m0 to fix accuracy issues

* fix FP8 TP accuracy

* upgrade DeepGEMM to adapt to the Hopper logic

* add ue8m0 kernel

* add ue8m0 kernel

* fix custom_ops/gpu_ops/cpp_extensions.cc

* eb output is normal

* eb5 text is right

* accuracy looks consistent by inspection

* self-tested accuracy is aligned

* replace masked_per_token_quant; ep accuracy OK

* performance improved by ~30%

* ep runs for now but still has issues

* self-test consistent

* rm test fun

* fix ep event

* update DeepGEMM in the graph-optimization ops

* fix build

* temporarily work around the deepgemm CI build issue

* select deepgemm version by SM

* remove useless code

---------

Co-authored-by: ckl117 <ckl117@163.com>
Co-authored-by: K11OntheBoat <"ruianmaidanglao@163.com">
Co-authored-by: fxyfxy777 <fxyfxy777@163.com>
2026-01-29 13:49:54 +08:00
Yuanle Liu 8b05774fad [Others] enhance deep_ep import and support mixed mode flash_mask_attn (#6238)
* support flashmaskattn mixed and enhance deepep import

* update

* fix
2026-01-28 00:02:02 +08:00
Yuanle Liu 253c5cc16c Improve deep_ep import handling with logging (#6207)
* Improve deep_ep import handling with logging

Refactor deep_ep import logic to handle PaddleFleet and PFCCLab imports with error logging.

* Add traceback import to ep.py
2026-01-24 22:41:42 -08:00
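The fallback-import-with-logging refactor described in the commit above follows a common pattern: try each candidate provider in order, logging a traceback on each failure. A minimal sketch, assuming hypothetical module paths (`paddle_fleet.deep_ep`, `pfcc.deep_ep`) rather than the repo's real ones:

```python
import importlib
import logging
import traceback

logger = logging.getLogger("ep")

# Candidate providers, tried in order; both paths are assumptions
# for illustration, not the actual FastDeploy import paths.
_DEEP_EP_CANDIDATES = ("paddle_fleet.deep_ep", "pfcc.deep_ep")

def import_deep_ep():
    """Return the first importable deep_ep module, or None.

    Each failed import is logged with its full traceback so the
    worker log shows *why* a provider was skipped.
    """
    for path in _DEEP_EP_CANDIDATES:
        try:
            return importlib.import_module(path)
        except ImportError:
            logger.warning("deep_ep import from %s failed:\n%s",
                           path, traceback.format_exc())
    return None
```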
GoldPancake 646aced1eb [UT] Add GLM E2E tests for non-MTP and MTP (#6163)
* add glm ut
2026-01-23 10:34:29 +08:00
Haonan Luo 82057cb71f Support MXFP4 for GPT-OSS (#5435)
* support mxfp4 in gpt-oss

* support mxfp4 in gpt-oss

* add scope for flashinfer

* remove torch code

* update envs.FD_MXFP4_BACKEND

* update process_weights_after_loading

* update env name

* support tp in gpt-oss, add e2e test

* add flashinfer-python-paddle in requirements

* fix import error

* add test

* add test

* add test

* add test
2026-01-22 14:21:01 +08:00
yzwu 837ddca273 [Iluvatar][CI] Fix the error max_tokens_per_expert referenced before assignment (#6083) 2026-01-21 16:01:29 +08:00
lizexu123 f4902fe42d [BugFix] fix wint2 (#6109)
* fix

* fix

* fix
2026-01-20 21:46:21 +08:00
fxyfxy777 4c92035f2d [Feature] Unify fp8 block_wise quant ops (#5991)
* quant stash

* blockwise_quant

* precommit

* rm tensor.cut

* tp ok

* add swiglu

* rm outdate code

* fix activate ut

* change baseline

* fix baseline error
2026-01-15 05:50:37 -08:00
lizexu123 6619298b50 [Optim] Optimize grid dimensions using max_tokens_per_expert for MoE models (#6007)
* update w4afp8

* build.sh ok

* support cuda_graph

* fix

* add test

* fix max_tokens_per_expert

* >=70

* fix

* compute_max_tokens_from_prefix_sum in w4afp8

* compute_max_tokens use cub
2026-01-15 19:18:42 +08:00
Cheng Yanfei fbcccaa750 [Intel HPU] enable MoE EP for hpu (#5855)
* enable HPU MoE EP

* MoE intermediate_scale stack

* enable loader_v1 esp for tensor_wise_fp8 TP or EP

* modify activation_scale name
2026-01-15 13:08:00 +08:00
RAM b3f59fd9b5 [RL][CI] Support Async R3 And Add Accuracy Test (#5937)
* add bs1 r3 test case

* async put

* r3 test case 1.0

* successfully run eb5

* refine test case

* pre-commit

* add eb45 & glm testcase

* format code

* add p2pstore requirements

* support only last turn

* R3 use worker log

* refine code & fix ci bug

* refine error message

* fix empty input bug

* successfully set up accuracy CI for eb45 and glm45

* refine code

* fix bug
2026-01-14 04:25:06 -08:00
xiaoxiaohehe001 00a01ae024 [Feature] Support redundant expert for eplb (#5918)
* [BugFix] support redundant expert for eplb

* support redundant expert for eplb

* support redundant expert for eplb

* update

* fix ci eplb
2026-01-09 17:13:24 +08:00
Ryan 3e74bacc5e add m_grouped_gemm_fp8_fp8_bf16_nt_contiguous_custom_python_op (#5847) 2026-01-07 16:17:55 +08:00
lizexu123 1d3ae7c024 [BugFix] fix w4afp8 tp=8 (#5868)
* fix w4afp8 tp=8

* fix
2026-01-05 18:59:02 +08:00
ming1753 f50e1bcc16 [Others] enable use PFCC deep_ep (#5822)
* upstream deep_ep

* fix bug

* fix bug

* modify env name
2026-01-05 02:07:01 -08:00
周周周 dc13344ab8 [Optimization] add del to decrease peak memory in MoE prefill (#5863) 2026-01-05 14:01:48 +08:00
lizexu123 44a13e4557 [Feature] support w4afp8 v1_loader and v0_loader(tp>1) (#5757)
* support

* fix

* support w4afp8 v1_loader and v0_loader

* fix

* fix test

* fix test

* fix test

* fix moe.py

* add test_ernie_4_5_w4afp8

* add test

* delete tensor

* fix test

* fix

* add

* fix test
2025-12-30 14:11:52 +08:00
Ryan eb782a0225 [BugFix] Fix return value inconsistency for ep_moe_expert_combine op (#5812) 2025-12-29 16:44:00 +08:00
Nyakku Shigure 11227e00bb [GraphOptimization] Wrap deep gemm and triton as python op (#5673)
* [GraphOptimization] Wrap deep gemm and triton as python op

* add unitest to _base_test && compatibility

* paddle.static.MetaTensor -> "paddle.static.MetaTensor"

* mv register_custom_python_op

* rename yaml

---------

Co-authored-by: DrRyanHuang <zihaohuang@aliyun.com>
2025-12-24 15:23:46 +08:00
bukejiyu d1c6e57341 [Others] upgrade paddleformers to 0.4.0 (#5599) 2025-12-23 05:08:01 -08:00
Sunny-bot1 04035e4ebf support w4afp8 two stage (#5608) 2025-12-22 15:13:05 +08:00
Sunny-bot1 40f3897a4e support w4afp8 moe offline permute & load (#5613) 2025-12-22 15:12:57 +08:00