bukejiyu
5bfc0938e2
[BugFix] PD reorder fix and add ut ( #6375 )
2026-02-09 04:42:48 -08:00
sunxin
783d56e28a
[Optimization] Support logprob async copy ( #6362 )
...
* support logprob async copy
* fix prompt logprob
* fix xpu
2026-02-09 17:32:12 +08:00
bukejiyu
dc5917289d
[loader]supoort wint2 backend ( #6139 )
...
* support wint2
* update
2026-02-08 22:42:36 -08:00
Mattheliu
c776d483e4
[BugFix]fix handle 4 return values from noaux_tc_redundant op ( #6384 )
...
* fix: handle 4 return values from noaux_tc_redundant op
The noaux_tc_redundant CUDA op is defined with 4 outputs in PD_BUILD_STATIC_OP:
- output_tensor (scores)
- topk_values
- topk_indices
- tokens_per_expert_stats_list_out (inplace updated)
The Python code was only unpacking 3 values, causing:
ValueError: too many values to unpack (expected 3)
This fix correctly unpacks all 4 return values, ignoring the inplace
updated tensor which is the same as the input tokens_per_expert_stats_list.
Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com >
* fix: make noaux_tc_redundant return 4 values to match OP definition
The PD_BUILD_STATIC_OP defines 4 outputs but the function only returned 3,
causing inconsistent behavior across different Paddle framework versions.
This fix explicitly returns 4 values:
- scores (inplace modified)
- topk_values
- topk_indices
- tokens_per_expert_stats_list (inplace modified via atomicAdd)
Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com >
---------
Co-authored-by: Claude (Claude Opus 4.5) <noreply@anthropic.com >
2026-02-09 13:17:47 +08:00
JYChen
9bcd863902
[Others] support import deepgemm/deepep from fleet ops ( #6351 )
...
* update paddleformers to v1.0
* only change import fleetpath
2026-02-09 11:53:13 +08:00
周周周
2b4748de4f
[MTP] refactor MTP pre_process ( #6358 )
2026-02-09 10:47:15 +08:00
K11OntheBoat
116e2aea7a
Support Norm before Rope ( #6332 )
...
Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com ”>
2026-02-05 15:28:52 +08:00
chen
29a313a402
[Optimization] Support FA2/FA3/FA4 with attn_mask_q ( #6354 )
...
* support FA4 sm100
* flash attn backend support mask
* flash attn backend run flashmask correct
* add test for flash_attn_backend and flash_attn_func
* check
* add test for fa4
* requirements.txt add fa4 whl
* check test on sm100
* fix CI conflict
* add enable_torch_proxy for flash_mask
* lazy import fa4
* check
* fix tests import
* check test_load_mpt import
2026-02-05 14:39:00 +08:00
GoldPancake
183b8d325a
[RL] Support GLM MTP RL Model ( #6267 )
2026-02-04 20:14:35 +08:00
fxyfxy777
36547cfdb3
[Feature] FD_USE_PHI_FP8_QUANT ( #6320 )
...
* add ut
* add use_fd_quant env
* rm mask_per_token_quant
* add make ops list
* USE_FD_FP8_QUANT -> FD_USE_PHI_FP8_QUANT 默认是true
* modify comments
* use bool type
* Add function declaration
2026-02-03 22:33:03 -08:00
RAM
5b22e5dfe7
[RL] R3 Support Fused Put the Routing of All Layers ( #6099 )
...
* fused put routing
* fix bug
* [draft commit]dynamic dtype
* fix async put & numpy bug
* fix unit8 test case
2026-02-03 04:13:16 -08:00
JYChen
c745a22420
[Feature] Support Ernie FP8 on sm100 ( the fixed version) ( #6304 )
2026-02-03 17:47:38 +08:00
bukejiyu
12d4b4cb87
[Feature]Support reorder ids to split prefill and decodes ( #5779 )
...
* support reorder ids
* perfect code
* fix
* fix unittest
* delete code
* fix
* add python api
* delete custom op
* update algorithm
* fix swap
* support condense
* support condense
* support mtp
* delete code
* update
* update
* update
* update
* update for other platfrom
* update
* fix
* fix mtp
* fix ut
* update
* fix ut
* update ut
* fix
* fix encoder_cache
* fix ci
* fix
* fix vl
* Fix performance regression
* fix
* fix
* fix mtp
* fix index->req_id mapping
* fix ut
---------
Co-authored-by: root <root@yqlcc01-sys-rpm12rzmwjd.yqlcc01.baidu.com >
Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com ”>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2026-02-03 00:28:02 -08:00
周周周
cbdb2462ea
cp 1131 tbo to develop ( #6281 )
2026-02-03 15:23:23 +08:00
fxyfxy777
f3413c4caa
[BugFix] fix fused_mask_swiglu_fp8_quant bug ( #6316 )
...
* optimize mask_quant op speed up 1.5
* fix calculate sequence
* add fused
* rm log
* push kernel code
* add ut
* accuracy ok
* add ue8m0
* add ut
* add merge develop
* rm ut of mask_per_token_quant
* Revert "[Optimize] optimize mask_quant & swiglu (#6222 )"
This reverts commit 2ada119a38 .
* add block_size
* pre-commit
2026-02-03 13:54:12 +08:00
fxyfxy777
2ada119a38
[Optimize] optimize mask_quant & swiglu ( #6222 )
...
* optimize mask_quant op speed up 1.5
* fix calculate sequence
* add fused
* rm log
* push kernel code
* add ut
* accuracy ok
* add ue8m0
* add ut
* add merge develop
* rm ut of mask_per_token_quant
2026-02-02 13:52:38 +08:00
xiaozude
030647521a
[Metax] adapt to the latest develop ( #6282 )
2026-01-29 23:21:20 -08:00
JYChen
6c685c9474
Revert "[Feature] Support Ernie FP8 on sm100 ( #5593 )" ( #6275 )
...
This reverts commit eb80724b71 .
2026-01-30 11:22:01 +08:00
MingkunZhang
c4abb01f9c
[Metax][Fix] fix 'get_token_penalty_multi_scores' input error based (PaddlePaddle#6069) ( #6266 )
2026-01-29 19:24:36 +08:00
yuxuan
44b52701f6
[Feature] Support NVFP4 MoE on SM100 ( #6003 )
...
* fp4 dense
* [WIP] support nvfp4, dense part
* [wip] developing loading qwen model
* loading
* update
* dense fp4 OK, cudagraph error
* [WIP] moe forward part
* with flashinfer-backend
* qwen3_moe_fp4
* update
* support flashinfer-cutlass moe, qwen3-moe-fp4 OK
* support ernie4.5-fp4
* fix load error
* add some ut
* add docs
* fix CLA, test
* fix the apply() in ModelOptNvFp4FusedMoE
* fix CodeStyle
* del the PADDLE_COMPATIBLE_API
* fix broken url: nvidia_gpu.md
* fix docs
* fix token_ids
* fix CI in Hopper
* move flashinfer imports inside the function
* fix model_runner
Removed the logic for generating random padding IDs.
* Remove skip condition for CUDA version in nvfp4 test
* add test for nvfp4
* fix according to review
* Add Chinese translation link to NVFP4 documentation
* del flashinfer.py
* fix unittest
---------
Co-authored-by: zoooo0820 <zoooo0820@qq.com >
Co-authored-by: bukejiyu <395822456@qq.com >
2026-01-29 14:16:07 +08:00
JYChen
eb80724b71
[Feature] Support Ernie FP8 on sm100 ( #5593 )
...
* Deepgemm暂时可用版本
* dense部分 e8m0 ok
* EB模型E8M0跑通的版本
* code check
* support 21b-tp2, dev_paddle
* 单机4.5T ep OK的版本
* 修复删除的代码,单机4.5T ep(非cudagraph)
* eb tp
* Support SM100 block-wise FP8 inference
* refine codes, support deepgemm on sm100
* add thirdparty PFCC/DeepGEMM
* fix ep decode
* 使用deepep ue8m0, 解决精度问题
* 修复FP8 TP精度
* Deepgemm升级适配Hopper逻辑
* add ue8m0 kernel
* add ue8m0 kernel
* fix custom_ops/gpu_ops/cpp_extensions.cc
* eb 输出正常
* eb5 text is right
* 目测精度一致
* 自测精度对齐
* 替换masked_per_token_quant, ep精度OK
* 性能提升约30%
* 暂时跑通ep但是有问题
* 自测一致
* rm test fun
* fix ep event
* 图优化算子更新Deepgemm
* fix build
* 暂时绕过deepgemm CI编译问题
* 根据SM区分deepgemm版本
* remove useless code
---------
Co-authored-by: ckl117 <ckl117@163.com >
Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com ”>
Co-authored-by: fxyfxy777 <fxyfxy777@163.com >
2026-01-29 13:49:54 +08:00
ddchenhao66
6d33d5e370
[Models][BugFix] shared experts and dense mlp layer do not require TP split ( #6180 )
...
Co-authored-by: ddchenhao66 <dhaochen163.com>
2026-01-28 18:58:19 +08:00
GoldPancake
7d6c87c29e
[Others] Support constrained decoding when enable_thinking is false ( #6248 )
...
* support constrained decoding when enable_thinking is false
* fix
* fix
* fix
2026-01-28 00:05:17 -08:00
freeliuzc
ce06c6dfb3
[BugFix] Fix token_penalty kernel ( #6069 )
...
* fix token_penalty kernel
* try to fix xpu
* fix xpu
* fix unit test
2026-01-28 12:03:05 +08:00
Yuanle Liu
8b05774fad
[Others] enhance deep_ep import and support mixed mode flash_mask_attn ( #6238 )
...
* support flashmaskattn mixed and enhance deepep import
* update
* fix
2026-01-28 00:02:02 +08:00
周周周
aa57864c5b
remove unneeded para from flash_mask_attention ( #6218 )
2026-01-27 14:04:27 +08:00
xiaoxiaohehe001
7ffa88bb01
[BugFix] fix mask_attn ( #6214 )
...
* [BugFix] fix mask attn
* [BugFix] fix mask attn
2026-01-26 07:46:51 -08:00
CSWYF3634076
08c411518f
[Loader] support dummy load weight ( #6169 )
...
* [Loader] support dummy load weight
* [Loader] support dummy load weight v2
* [Loader] support dummy load weight unittest
* [Loader] support dummy load weight unittest v2
* [Loader] support dummy load weight v3 docs and fp8
2026-01-26 13:58:53 +08:00
sunxin
adc69c15d0
[Model Runner] Prepare token count and move FA3 initialization into the graph ( #6170 )
...
* prepare for token num and put FA3 init in graph
2026-01-26 12:16:57 +08:00
Yuanle Liu
253c5cc16c
Improve deep_ep import handling with logging ( #6207 )
...
* Improve deep_ep import handling with logging
Refactor deep_ep import logic to handle PaddleFleet and PFCCLab imports with error logging.
* Add traceback import to ep.py
2026-01-24 22:41:42 -08:00
fxyfxy777
79f42209bf
add scale_wrapper for per_block_cast_to_fp8 ( #6183 )
2026-01-23 00:37:20 -08:00
GoldPancake
646aced1eb
[UT] Add GLM E2E tests for non-MTP and MTP ( #6163 )
...
* add glm ut
2026-01-23 10:34:29 +08:00
Haonan Luo
82057cb71f
Support MXFP4 for GPT-OSS ( #5435 )
...
* support mxfp4 in gpt-oss
* support mxfp4 in gpt-oss
* add scope for flashinfer
* remove torch code
* update envs.FD_MXFP4_BACKEND
* update process_weights_after_loading
* update env name
* support tp in gpt-oss, add e2e test
* add flashinfer-python-paddle in requirements
* fix import error
* add test
* add test
* add test
* add test
2026-01-22 14:21:01 +08:00
fxyfxy777
9c4db0ac3f
[BugFix] fix weight quant op ( #6137 )
...
* fix weight quant
* fix weight quant
* bit equal
* code style
2026-01-22 09:50:57 +08:00
zccjjj
14a64e9b3b
[XPU] change XPU EP interface from xDeepEP to paddle ( #5706 )
...
* add ENV VAR to controll low lantency buffer
2026-01-21 18:23:45 +08:00
K11OntheBoat
490a6551dc
rename params of normalization layer ( #6133 )
...
Co-authored-by: “liuruian” <liuruian@baidu.com >
2026-01-21 17:18:35 +08:00
yzwu
837ddca273
[Iluvartar][CI] Fix the error max_tokens_per_expert referenced before assignment ( #6083 )
2026-01-21 16:01:29 +08:00
lizexu123
f4902fe42d
[BugFix] fix wint2 ( #6109 )
...
* fix
* fix
* fix
2026-01-20 21:46:21 +08:00
yinwei
51a8a2ed57
[XPU] Support CudaGraph(add block attn cuda_graph support) ( #6116 )
...
* add block attn cuda_graph support
2026-01-20 19:33:11 +08:00
Ryan
dda27e50f5
[Graph Optimization] remove static_op_get_block_shape_and_split_kv_block from cudagraph ( #6081 )
...
* rm static_op_get_block_shape_and_split_kv_block from cudagraph
* update max_capture_shape
* fallback: zeros -> empty to avoid coverage check
* check graph_opt_config exists
* add max_capture_shape_dy2st && full_cuda_graph: false -> true in 28B vl test
* add use_cudagraph flag to control step_use_cudagraph
2026-01-20 14:05:18 +08:00
zhupengyang
45ebb2efb4
[XPU] support plugin model ( #6092 )
2026-01-20 13:00:09 +08:00
jackyYang6
988e0bc338
[Feature] Add PaddleFormers fallback backend ( #5999 )
...
* feat(paddleformers): add dense text model fallback backend
* docs(paddleformers): add user guide and fix code review issues
* add fallback unit test
* precommit format
* fix pre-commit
* fix: address code review feedback
* docs: add PaddleFormers backend documentation (EN) and simplify installation
---------
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com >
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2026-01-19 21:50:50 +08:00
sunxin
9dc1c74d36
fix opt qknorm ( #6080 )
2026-01-19 12:07:20 +08:00
fxyfxy777
4c92035f2d
[Feature] Unify fp8 block_wise quant ops ( #5991 )
...
* quant stash
* blockwise_quant
* precommit
* rm tensor.cut
* tp ok
* add swiglu
* rm outdate code
* fix activate ut
* change baseline
* fix baseline error
2026-01-15 05:50:37 -08:00
freeliuzc
49617d9832
[Feature]Support tag phase token enforce generation ( #6034 )
...
* support tag phase token enforce generation
* optimize note and some feature
* fix sampler unit test
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2026-01-15 03:59:55 -08:00
lizexu123
6619298b50
【Optim】Optimize grid dimensions using max_tokens_per_expert for MoE models ( #6007 )
...
* update w4afp8
* build.sh ok
* support cuda_graph
* fix
* add test
* fix max_tokens_per_expert
* >=70
* fix
* compute_max_tokens_from_prefix_sum in w4afp8
* compute_max_tokens use cub
2026-01-15 19:18:42 +08:00
Cheng Yanfei
fbcccaa750
[Intel HPU] enable MoE EP for hpu ( #5855 )
...
* enable HPU MoE EP
* MoE intermediate_scale stack
* enable loader_v1 esp for tensor_wise_fp8 TP or EP
* modify activation_scale name
2026-01-15 13:08:00 +08:00
zhupengyang
24ffa7c991
[XPU] fix moe num_expert ( #6014 )
2026-01-15 10:49:36 +08:00
RAM
b3f59fd9b5
[RL][CI] Support Async R3 And Add Accuracy Test ( #5937 )
...
* add bs1 r3 test case
* async put
* r3 test case 1.0
* success run eb5
* refine test case
* pre-commit
* add eb45 & glm testcase
* format code
* add p2pstore requirements
* support only last turn
* R3 use worker log
* refine code &fix ci bug
* refine error mesg
* fix empty input bug
* Success set acc ci of eb45 and glm45
* refine code
* fix bug
2026-01-14 04:25:06 -08:00
Ryan
0d1a5e70bc
[Graph Optimization] Add full_cuda_graph to control subgraph split ( #6027 )
2026-01-14 11:43:59 +08:00