FastDeploy

mirror of https://github.com/PaddlePaddle/FastDeploy.git synced 2026-04-23 00:17:25 +08:00

Author	SHA1	Message	Date
yzwu	8b890c0d72	[Iluvatar] refactor attn and moe code (#6887 )	2026-03-18 10:31:00 +08:00
AIbin	cb6819d086	[Optimization][OP]support per_token_group_fp8_quant cuda kernel (#6865 ) * support per_token_group_fp8_quant cuda kernel * Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * update code --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-03-17 19:17:51 +08:00
Longzhi Wang	daaf498213	[Feature] support compute shared experts before combine for better overlap (#6697 ) * [Feature] support compute shared experts before combine for better overlap * fix test * fix xpu * fix	2026-03-17 15:18:51 +08:00
gongweibao	a6351dea0b	[BugFix][Optimization] Replace silent failures with catchable exceptions and informative error messages (#6533 ) * init * init * fix format * add * add files * add ut * fix some * add ut * add more * add * fix pre-commit * fix pre-commit * fix cover * skip long seq * add * add * fix * remove not need * fix set attr * fix comments * fix comments * fix failed tests --------- Co-authored-by: gongweibao <gognweibao@baidu.com>	2026-03-16 21:32:43 +08:00
Longzhi Wang	5c92f4d0cd	[Feature] Add deepgemm bias epilogue for SM100 (#6857 ) * [Feature] Add deepgemm bias epilogue for SM100 * fix	2026-03-16 20:12:00 +08:00
RichardWooSJTU	9f0778f991	[Feature] Support EP prefill with num_worst_tokens (#6574 ) * support num worst tokens * support num worst tokens * fix build error * support num worst tokens: fix errors * support num worst tokens: fix feild * support num worst tokens: delete requiements * replace permute and depermute op by pure cuda * replace permute and depermute op by pure cuda * fix ci * fix op * fix nan * fix code style --------- Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>	2026-03-11 17:09:07 +08:00
AIbin	c3aceb6bdc	[Models][OP][Optimization] Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM (#6689 ) * Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM	2026-03-10 15:05:14 +08:00
周周周	3897a0b4fc	nvfp4 clean code (#6671 )	2026-03-09 18:00:34 +08:00
周周周	cebe6f7dae	clean nvfp4 related code (#6644 )	2026-03-05 15:48:33 +08:00
bukejiyu	598cce8545	[RL] Support SM100 FP8 quantization in RL (#6601 ) * RL SM100 Fix * update	2026-03-04 04:55:04 -08:00
RichardWooSJTU	61789febb9	[Quantization] Support to load static quant ue8m0 scale of DeepGEMM via v0_loader (#6433 ) * support to load static quant ue8m0 scale of deepgemm via v0_loader * [Fix] Fix ue8m0 scale pack dimension calculation and block size validation 1. Fix pack dimension calculation in fused_moe_triton_backend.py: - Changed from `ceil_div(...) // 4` to `(num_scales + 3) // 4` for correct ceiling division - This ensures sufficient pack allocation when num_scales is not a multiple of 4 2. Fix block size hardcoding in block_wise_fp8.py: - Use `self.quant_config.weight_block_size` instead of hardcoded `[128, 128]` - Add assertion to ensure weight_block_size is `[128, 128]` for ue8m0 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 11:32:35 +08:00
chen	1cae7a0d53	weight only quant method support QKVGate_proj (#6612 )	2026-03-03 11:19:32 +08:00
Weiguo Zhu	8fb24122b8	fix reshard error (#6536 )	2026-02-27 22:22:37 +08:00
JYChen	c6d8fbe526	[BugFix] fix log with paddlefleet.ops (#6528 )	2026-02-27 14:34:29 +08:00
Longzhi Wang	22566168c3	[Feature] support qkv&gate linear fusion (#6455 ) * [Feature] support qkv&gate linear fusion * add test	2026-02-24 15:20:29 +08:00
AIbin	0eb87467f8	[BugFix]fix RL bug about blockwisefp8 (#6466 ) * fix RL bug about blockwisefp8 * fix moe same bug * fix RL FP8 bug	2026-02-12 09:15:29 +08:00
JYChen	40c952e7b5	fix deepgemm import (#6451 ) Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>	2026-02-11 20:10:01 +08:00
JYChen	9bcd863902	[Others] support import deepgemm/deepep from fleet ops (#6351 ) * update paddleformers to v1.0 * only change import fleetpath	2026-02-09 11:53:13 +08:00
fxyfxy777	36547cfdb3	[Feature] FD_USE_PHI_FP8_QUANT (#6320 ) * add ut * add use_fd_quant env * rm mask_per_token_quant * add make ops list * USE_FD_FP8_QUANT -> FD_USE_PHI_FP8_QUANT 默认是true * modify comments * use bool type * Add function declaration	2026-02-03 22:33:03 -08:00
JYChen	c745a22420	[Feature] Support Ernie FP8 on sm100 ( the fixed version) (#6304 )	2026-02-03 17:47:38 +08:00
JYChen	6c685c9474	Revert "[Feature] Support Ernie FP8 on sm100 (#5593 )" (#6275 ) This reverts commit `eb80724b71`.	2026-01-30 11:22:01 +08:00
yuxuan	44b52701f6	[Feature] Support NVFP4 MoE on SM100 (#6003 ) * fp4 dense * [WIP] support nvfp4, dense part * [wip] developing loading qwen model * loading * update * dense fp4 OK, cudagraph error * [WIP] moe forward part * with flashinfer-backend * qwen3_moe_fp4 * update * support flashinfer-cutlass moe, qwen3-moe-fp4 OK * support ernie4.5-fp4 * fix load error * add some ut * add docs * fix CLA, test * fix the apply() in ModelOptNvFp4FusedMoE * fix CodeStyle * del the PADDLE_COMPATIBLE_API * fix broken url: nvidia_gpu.md * fix docs * fix token_ids * fix CI in Hopper * move flashinfer imports inside the function * fix model_runner Removed the logic for generating random padding IDs. * Remove skip condition for CUDA version in nvfp4 test * add test for nvfp4 * fix according to review * Add Chinese translation link to NVFP4 documentation * del flashinfer.py * fix unittest --------- Co-authored-by: zoooo0820 <zoooo0820@qq.com> Co-authored-by: bukejiyu <395822456@qq.com>	2026-01-29 14:16:07 +08:00
JYChen	eb80724b71	[Feature] Support Ernie FP8 on sm100 (#5593 ) * Deepgemm暂时可用版本 * dense部分 e8m0 ok * EB模型E8M0跑通的版本 * code check * support 21b-tp2, dev_paddle * 单机4.5T ep OK的版本 * 修复删除的代码,单机4.5T ep(非cudagraph) * eb tp * Support SM100 block-wise FP8 inference * refine codes, support deepgemm on sm100 * add thirdparty PFCC/DeepGEMM * fix ep decode * 使用deepep ue8m0, 解决精度问题 * 修复FP8 TP精度 * Deepgemm升级适配Hopper逻辑 * add ue8m0 kernel * add ue8m0 kernel * fix custom_ops/gpu_ops/cpp_extensions.cc * eb 输出正常 * eb5 text is right * 目测精度一致 * 自测精度对齐 * 替换masked_per_token_quant, ep精度OK * 性能提升约30% * 暂时跑通ep但是有问题 * 自测一致 * rm test fun * fix ep event * 图优化算子更新Deepgemm * fix build * 暂时绕过deepgemm CI编译问题 * 根据SM区分deepgemm版本 * remove useless code --------- Co-authored-by: ckl117 <ckl117@163.com> Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com”> Co-authored-by: fxyfxy777 <fxyfxy777@163.com>	2026-01-29 13:49:54 +08:00
CSWYF3634076	08c411518f	[Loader] support dummy load weight (#6169 ) * [Loader] support dummy load weight * [Loader] support dummy load weight v2 * [Loader] support dummy load weight unittest * [Loader] support dummy load weight unittest v2 * [Loader] support dummy load weight v3 docs and fp8	2026-01-26 13:58:53 +08:00
Haonan Luo	82057cb71f	Support MXFP4 for GPT-OSS (#5435 ) * support mxfp4 in gpt-oss * support mxfp4 in gpt-oss * add scope for flashinfer * remove torch code * update envs.FD_MXFP4_BACKEND * update process_weights_after_loading * update env name * support tp in gpt-oss, add e2e test * add flashinfer-python-paddle in requirements * fix import error * add test * add test * add test * add test	2026-01-22 14:21:01 +08:00
fxyfxy777	4c92035f2d	[Feature] Unify fp8 block_wise quant ops (#5991 ) * quant stash * blockwise_quant * precommit * rm tensor.cut * tp ok * add swiglu * rm outdate code * fix activate ut * change baseline * fix baseline error	2026-01-15 05:50:37 -08:00
lizexu123	acdf0cd1d9	fix hadamard_block_size (#5888 )	2026-01-06 14:12:14 +08:00
lizexu123	44a13e4557	[Feature] support w4afp8 v1_loader and v0_loader(tp>1) (#5757 ) * support * fix * support w4afp8 v1_loader and v0_loader * fix * fix test * fix test * fix test * fix moe.py * add test_ernie_4_5_w4afp8 * add test * delete tensor * fix test * fix * add * fix test	2025-12-30 14:11:52 +08:00
Nyakku Shigure	11227e00bb	[GraphOptimization] Wrap deep gemm and triton as python op (#5673 ) * [GraphOptimization] Wrap deep gemm and triton as python op * add unitest to _base_test && compatibility * paddle.static.MetaTensor -> "paddle.static.MetaTensor" * mv register_custom_python_op * rename yaml --------- Co-authored-by: DrRyanHuang <zihaohuang@aliyun.com>	2025-12-24 15:23:46 +08:00
Sunny-bot1	40f3897a4e	support w4afp8 moe offline permute & load (#5613 )	2025-12-22 15:12:57 +08:00
Longzhi Wang	d8587e987e	[Model] tp+ep support v1_loader (#5465 ) * [Model] tp+ep support v1_loader * fix * fix mtp_linear * fix mtp_linear * fix * fix * fix v0 loader * fix * Add get_tensor for ep * fix linear weight_loader * fix typo * fix	2025-12-18 14:31:54 +08:00
zhupengyang	8735cb5045	[XPU] refactor moe ffn (#5501 ) - remove BKCL_DISPATCH_ALL_GATHER - support sparse mode - support moe quant_method	2025-12-18 14:14:05 +08:00
fmiao2372	404cf0ece4	[Intel HPU] enable tensor_wise_fp8 (#5324 ) * [Intel HPU] enable tensor_wise_fp8 * update code based on comments * fix code style issue * fix bug about RP 5138 * mv kv_cache modifications to HPU backend * fix FP8 Precision Issues * fix FP8 Precision Issues * Add quantization UT --------- Co-authored-by: yanfeich <yanfei.cheng@intel.com> Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>	2025-12-17 16:45:03 +08:00
Yuanle Liu	b8e4828373	[BugFix] fix dynamic c8 in v1 loader (#5562 )	2025-12-15 04:07:54 -08:00
Haonan Luo	e397c4fba6	[Others] remove add_bias option (#5425 )	2025-12-09 17:39:35 +08:00
Yuanle Liu	be0c960260	[BugFix] dynamic cache kv block_wise_fp8 not need create layer.cache_k_scale (#5362 ) CE Compile Job / ce_job_pre_check (push) Has been cancelled Details CE Compile Job / print_ce_job_pre_check_outputs (push) Has been cancelled Details CE Compile Job / FD-Clone-Linux (push) Has been cancelled Details CE Compile Job / Show Code Archive Output (push) Has been cancelled Details CE Compile Job / BUILD_SM8090 (push) Has been cancelled Details CE Compile Job / BUILD_SM8689 (push) Has been cancelled Details CE Compile Job / CE_UPLOAD (push) Has been cancelled Details Deploy GitHub Pages / deploy (push) Has been cancelled Details Publish Job / publish_pre_check (push) Has been cancelled Details Publish Job / print_publish_pre_check_outputs (push) Has been cancelled Details Publish Job / FD-Clone-Linux (push) Has been cancelled Details Publish Job / Show Code Archive Output (push) Has been cancelled Details Publish Job / BUILD_SM8090 (push) Has been cancelled Details Publish Job / BUILD_SM8689 (push) Has been cancelled Details Publish Job / PADDLE_PYPI_UPLOAD_8090 (push) Has been cancelled Details Publish Job / PADDLE_PYPI_UPLOAD_8689 (push) Has been cancelled Details Publish Job / Run FD Image Build (push) Has been cancelled Details Publish Job / Run FastDeploy Unit Tests and Coverage (push) Has been cancelled Details Publish Job / Run FastDeploy LogProb Tests (push) Has been cancelled Details Publish Job / Extracted partial CE model tasks to run in CI. (push) Has been cancelled Details Publish Job / Run Base Tests (push) Has been cancelled Details Publish Job / Run Accuracy Tests (push) Has been cancelled Details Publish Job / Run Stable Tests (push) Has been cancelled Details CI Images Build / FD-Clone-Linux (push) Has been cancelled Details CI Images Build / Show Code Archive Output (push) Has been cancelled Details CI Images Build / CI Images Build (push) Has been cancelled Details CI Images Build / BUILD_SM8090 (push) Has been cancelled Details CI Images Build / Run FastDeploy Unit Tests and Coverage (push) Has been cancelled Details CI Images Build / Run FastDeploy LogProb Tests (push) Has been cancelled Details CI Images Build / Extracted partial CE model tasks to run in CI. (push) Has been cancelled Details CI Images Build / Run Base Tests (push) Has been cancelled Details CI Images Build / Publish Docker Images Pre Check (push) Has been cancelled Details	2025-12-03 05:32:59 -08:00
Sunny-bot1	3629db4129	[Quantization] Support w4afp8 MoE dynamic quantization (#5282 ) * support dynamic activation quant for w4afp8 * support dynamic w4afp8 * add test * fix * fix --------- Co-authored-by: zhoutianzi666 <17801055074@163.com>	2025-12-02 18:56:16 +08:00
xiaoxiaohehe001	e150a418d4	support moe offline quant (#5142 )	2025-11-24 18:59:18 +08:00
Sunny-bot1	43f0c7557e	[Feature] Add an unquantized option for MoE and Dense quant type (#4813 )	2025-11-19 16:24:03 +08:00
yangjianfengo1	ae7bee8122	【New Feature】W4afp8 supports per group quantization (#4987 ) * w4afp8 支持per group * code style * fix transpose * revert fast hardmard --------- Co-authored-by: yuanxiaolan <yuanxiaolan01@baidu.com> Co-authored-by: plusNew001 <95567040+plusNew001@users.noreply.github.com>	2025-11-13 19:17:27 +08:00
MingkunZhang	9d9f5df8d0	[Metax] support default_v1 loader & thinking model (#4956 ) Co-authored-by: plusNew001 <95567040+plusNew001@users.noreply.github.com>	2025-11-12 16:32:26 +08:00
bukejiyu	b09ebb2813	refactor pt loading (#4532 ) CE Compile Job / ce_job_pre_check (push) Has been cancelled Details CE Compile Job / print_ce_job_pre_check_outputs (push) Has been cancelled Details CE Compile Job / FD-Clone-Linux (push) Has been cancelled Details CE Compile Job / Show Code Archive Output (push) Has been cancelled Details CE Compile Job / BUILD_SM8090 (push) Has been cancelled Details CE Compile Job / BUILD_SM8689 (push) Has been cancelled Details CE Compile Job / CE_UPLOAD (push) Has been cancelled Details Deploy GitHub Pages / deploy (push) Has been cancelled Details Publish Job / publish_pre_check (push) Has been cancelled Details Publish Job / print_publish_pre_check_outputs (push) Has been cancelled Details Publish Job / FD-Clone-Linux (push) Has been cancelled Details Publish Job / Show Code Archive Output (push) Has been cancelled Details Publish Job / BUILD_SM8090 (push) Has been cancelled Details Publish Job / BUILD_SM8689 (push) Has been cancelled Details Publish Job / PADDLE_PYPI_UPLOAD_8090 (push) Has been cancelled Details Publish Job / PADDLE_PYPI_UPLOAD_8689 (push) Has been cancelled Details Publish Job / Run FD Image Build (push) Has been cancelled Details Publish Job / Run FastDeploy Unit Tests and Coverage (push) Has been cancelled Details Publish Job / Run FastDeploy LogProb Tests (push) Has been cancelled Details Publish Job / Extracted partial CE model tasks to run in CI. (push) Has been cancelled Details Publish Job / Run Base Tests (push) Has been cancelled Details Publish Job / Run Accuracy Tests (push) Has been cancelled Details Publish Job / Run Stable Tests (push) Has been cancelled Details CI Images Build / FD-Clone-Linux (push) Has been cancelled Details CI Images Build / Show Code Archive Output (push) Has been cancelled Details CI Images Build / CI Images Build (push) Has been cancelled Details CI Images Build / BUILD_SM8090 (push) Has been cancelled Details CI Images Build / Run FastDeploy Unit Tests and Coverage (push) Has been cancelled Details CI Images Build / Run FastDeploy LogProb Tests (push) Has been cancelled Details CI Images Build / Extracted partial CE model tasks to run in CI. (push) Has been cancelled Details CI Images Build / Run Base Tests (push) Has been cancelled Details CI Images Build / Publish Docker Images Pre Check (push) Has been cancelled Details	2025-11-11 21:30:39 +08:00
Yuanle Liu	3dc0ffa46d	[TSP] Support qwen3 moe tsp + cudagraph (#4871 ) CE Compile Job / ce_job_pre_check (push) Has been cancelled Details CE Compile Job / print_ce_job_pre_check_outputs (push) Has been cancelled Details CE Compile Job / FD-Clone-Linux (push) Has been cancelled Details CE Compile Job / Show Code Archive Output (push) Has been cancelled Details CE Compile Job / BUILD_SM8090 (push) Has been cancelled Details CE Compile Job / BUILD_SM8689 (push) Has been cancelled Details CE Compile Job / CE_UPLOAD (push) Has been cancelled Details Deploy GitHub Pages / deploy (push) Has been cancelled Details * support qwen3_moe tsp mode * fix * fix * update * update * update * fix * support external_rmsnorm * update * fix	2025-11-10 23:37:51 +08:00
Sunny-bot1	59d2edde29	[BugFix] Add support for weight shape constraints and group size selection in Machete (#4911 )	2025-11-10 20:57:35 +08:00
YuBaoku	819b2dbbae	Revert "【New Feature】W4afp8 supports per group quantization (#4272 )" (#4854 ) This reverts commit `93fcf7e4ec`.	2025-11-06 17:48:28 +08:00
yangjianfengo1	93fcf7e4ec	【New Feature】W4afp8 supports per group quantization (#4272 ) * w4afp8 支持per group * code style * 精度完成 * revert append attn utils * ffn1 动态量化 * ffn2 支持动态量化 * code style * code style * 修改单测 * 修改单测 * fix bug * Implement conditional parameter creation for layers Add parameter creation for up_gate_proj_in_scale when ep_size > 1. * code style * fix conflict * code style * code style * 修复w4aint8 精度 * fix ci --------- Co-authored-by: yuanxiaolan <yuanxiaolan01@baidu.com>	2025-11-05 21:00:23 +08:00
AIbin	316f784016	fix wint2 config (#4721 )	2025-10-31 15:44:14 +08:00
Sunny-bot1	4ffe41a747	WINT4/WINT8 dense gemm default use Machete (#4451 )	2025-10-23 17:57:59 +08:00
chen	db82e9a022	[BugFix]Fix wfp8afp8 triton moe group_topk renormalized=True (#4449 ) CE Compile Job / ce_job_pre_check (push) Has been cancelled Details CE Compile Job / print_ce_job_pre_check_outputs (push) Has been cancelled Details CE Compile Job / FD-Clone-Linux (push) Has been cancelled Details CE Compile Job / Show Code Archive Output (push) Has been cancelled Details CE Compile Job / BUILD_SM8090 (push) Has been cancelled Details CE Compile Job / BUILD_SM8689 (push) Has been cancelled Details CE Compile Job / CE_UPLOAD (push) Has been cancelled Details Deploy GitHub Pages / deploy (push) Has been cancelled Details * fix group_topk renormalized=True * check test	2025-10-16 23:17:48 +08:00
zhupengyang	26ff2f8683	[XPU] refine fused moe (#4219 )	2025-10-16 19:04:07 +08:00

1 2

96 Commits