YuBaoku
54f7d9f621
[CI] Sync mm_batch_invariant with paddle.mm update ( #6557 )
2026-02-28 14:56:42 +08:00
Weiguo Zhu
8fb24122b8
fix reshard error ( #6536 )
2026-02-27 22:22:37 +08:00
cmcamdy
13447279aa
[XPU] Fix PD + MTP ( #6495 )
...
* fix pd + mtp
* fix code style
* fix PD + MTP, D get P's first token
* add anno for gpu(speculate_update)
* update draft insertv1
* fix wrapper & kernel
* fix wrapper
* fix code style
2026-02-27 19:07:35 +08:00
JYChen
c6d8fbe526
[BugFix] fix log with paddlefleet.ops ( #6528 )
2026-02-27 14:34:29 +08:00
周周周
1503443871
add dsv3 mixed deploy as EP16 TP8 ( #6525 )
2026-02-27 14:08:25 +08:00
sunxin
53aaac69da
[Optimization] Enable BF16 gate computation for GLM and Qwen ( #6457 )
...
* gate bf16
* add gate-fp32
* fix
* update baseline
* update
* update
* fix
2026-02-26 21:08:46 -08:00
gongweibao
edd31e8849
[Feature] Add Deterministic Inference Support ( #6476 )
...
* add
* [tests] Add Paddle attention determinism tests and refactor resource manager
Add comprehensive determinism tests for Paddle attention layer and refactor
resource manager for deterministic mode support.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com >
* add
* add
* add
* add
* add more
* add more
* fix some
* fix some
* fix bugs
* fix bugs
* only in gpu
* add docs
* fix comments
* fix some
* fix some
* fix comments
* add more
* fix potential problem
* remove not need
* remove not need
* remove no need
* fix bug
* fix bugs
* fix comments
* fix comments
* Update tests/ce/deterministic/test_determinism_verification.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/inter_communicator/test_ipc_signal.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/layers/test_paddle_attention_determinism.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/engine/test_sampling_params_determinism.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/layers/test_paddle_attention_determinism.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/layers/test_paddle_attention_determinism_standalone.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* fix comments
* fix import error
* fix a bug
* fix bugs
* fix bugs
* fix coverage
* refine codes
* refine code
* fix comments
* fix comments
* fix comments
* rm not need
* fix allreduce large tensor bug
* mv log files
* mv log files
* add files
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2026-02-26 19:31:51 -08:00
zccjjj
c34cb2a8c2
[XPU] [bugfix] fix moe_ffn_quant_type_map bugs about datatype and tensorshape ( #6337 )
2026-02-27 09:55:41 +08:00
chen
2d1531f3cb
dev open-source model support fa4/flashmaskV2/V3 ( #6518 )
2026-02-26 17:46:05 +08:00
GoldPancake
2178f2829b
[Speculative Decoding] Support suffix decoding ( #6403 )
...
* support suffix decoding
2026-02-26 11:42:05 +08:00
Yuanle Liu
6d3fede240
[OP][Feature] Unify the limit_thinking_content_length CUDA operator, supporting response length limits and injected sequences ( #6493 )
...
* Initial plan
* Migrate PRs #6311 , #6129 , #6305 to develop and merge unit tests
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
* fix
* update
* fix
* fix ci
* fix ci
* Initial plan
* test: add test_chat_with_response_max_tokens to test_EB_VL_Lite_serving.py
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
* test: add disable-thinking case to test_chat_with_response_max_tokens
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
* test: add both reasoning_max_tokens and response_max_tokens case
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
* fix ci
* fix ci
* fix ci
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
2026-02-25 21:36:50 +08:00
zhupengyang
a303eacf62
[XPU] support norm before rope ( #6475 )
2026-02-25 18:43:44 +08:00
jackyYang6
a29ee57e15
[Feature] Support ThinkingBudget Logits processor to control thinking content length ( #6367 )
...
* feat: add thinking budget logits processor
* add unittest
* fix pre-commit
* add unittest
* docs: clarify operator-level vs logits processor usage and conflict guidance
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2026-02-25 14:17:09 +08:00
Longzhi Wang
22566168c3
[Feature] support qkv&gate linear fusion ( #6455 )
...
* [Feature] support qkv&gate linear fusion
* add test
2026-02-24 15:20:29 +08:00
jackyYang6
38c3e02470
fix paddleformers fallback ( #6465 )
2026-02-23 15:29:13 +08:00
AIbin
0eb87467f8
[BugFix]fix RL bug about blockwisefp8 ( #6466 )
...
* fix RL bug about blockwisefp8
* fix moe same bug
* fix RL FP8 bug
2026-02-12 09:15:29 +08:00
JYChen
40c952e7b5
fix deepgemm import ( #6451 )
...
Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com >
2026-02-11 20:10:01 +08:00
zhupengyang
4a8c54926b
[XPU] topk_method=noaux_tc ( #6355 )
2026-02-11 16:12:20 +08:00
yzwu
60e75ea8e8
[Iluvatar][CI] Fix cannot import get_stop ( #6165 )
2026-02-10 16:57:23 +08:00
chen
d937d6ebfd
check ( #6424 )
2026-02-10 15:55:17 +08:00
yuxuan
83b4b082ab
[BugFix] Fix model loading error for 300B FP8 EP parallel test case ( #6382 )
...
* fix fp8 bug
* fix
* fix comment, cn to en
* fix ci
* del else in utils
* fix review
2026-02-10 11:32:57 +08:00
chen
a8ffcaa068
fix fa4 test ( #6408 )
2026-02-10 10:57:21 +08:00
bukejiyu
5bfc0938e2
[BugFix] PD reorder fix and add ut ( #6375 )
2026-02-09 04:42:48 -08:00
Mattheliu
d75b1b8df1
[Fix] Use paddle.device.get_device_properties for multi-platform compatibility ( #6400 )
...
Replace paddle.device.cuda.get_device_properties() with paddle.device.get_device_properties()
to support all hardware platforms (NVIDIA, ILUVATAR, HPU, etc.)
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2026-02-09 19:15:41 +08:00
sunxin
783d56e28a
[Optimization] Support logprob async copy ( #6362 )
...
* support logprob async copy
* fix prompt logprob
* fix xpu
2026-02-09 17:32:12 +08:00
MingkunZhang
268276e287
[Metax][CI] e2e ci tests enable cuda graph ( #6401 )
2026-02-09 16:25:23 +08:00
bukejiyu
dc5917289d
[loader]support wint2 backend ( #6139 )
...
* support wint2
* update
2026-02-08 22:42:36 -08:00
Mattheliu
c776d483e4
[BugFix]fix handle 4 return values from noaux_tc_redundant op ( #6384 )
...
* fix: handle 4 return values from noaux_tc_redundant op
The noaux_tc_redundant CUDA op is defined with 4 outputs in PD_BUILD_STATIC_OP:
- output_tensor (scores)
- topk_values
- topk_indices
- tokens_per_expert_stats_list_out (inplace updated)
The Python code was only unpacking 3 values, causing:
ValueError: too many values to unpack (expected 3)
This fix correctly unpacks all 4 return values, ignoring the inplace
updated tensor which is the same as the input tokens_per_expert_stats_list.
Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com >
* fix: make noaux_tc_redundant return 4 values to match OP definition
The PD_BUILD_STATIC_OP defines 4 outputs but the function only returned 3,
causing inconsistent behavior across different Paddle framework versions.
This fix explicitly returns 4 values:
- scores (inplace modified)
- topk_values
- topk_indices
- tokens_per_expert_stats_list (inplace modified via atomicAdd)
Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com >
---------
Co-authored-by: Claude (Claude Opus 4.5) <noreply@anthropic.com >
2026-02-09 13:17:47 +08:00
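The unpacking error described in the noaux_tc_redundant fix above can be reproduced in plain Python. This is a minimal sketch only: `fake_noaux_tc_redundant` is a hypothetical stand-in written for illustration, not the real CUDA op, and its top-k logic is an assumption chosen to keep the example small. It only shows why unpacking four outputs into three names raises `ValueError`, and how discarding the in-place-updated tensor with `_` resolves it.

```python
def fake_noaux_tc_redundant(scores, tokens_per_expert_stats_list, top_k=2):
    """Hypothetical stand-in for an op with 4 outputs (not the real kernel)."""
    topk_values = sorted(scores, reverse=True)[:top_k]
    topk_indices = [scores.index(v) for v in topk_values]
    for idx in topk_indices:
        tokens_per_expert_stats_list[idx] += 1  # updated in place
    # Four outputs, mirroring the four outputs listed in the commit message.
    return scores, topk_values, topk_indices, tokens_per_expert_stats_list


scores = [0.1, 0.7, 0.2]

# Before the fix: three names on the left, four values on the right.
try:
    s, v, i = fake_noaux_tc_redundant(scores, [0, 0, 0])
    error_message = ""
except ValueError as exc:
    error_message = str(exc)

# After the fix: unpack all four, ignoring the in-place-updated stats tensor,
# which is the same object as the input.
stats = [0, 0, 0]
scores_out, topk_values, topk_indices, _ = fake_noaux_tc_redundant(scores, stats)
```

Because the fourth output is the same object as the input stats list, ignoring it loses no information; the caller can keep reading the updated counts through `stats`.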
JYChen
9bcd863902
[Others] support import deepgemm/deepep from fleet ops ( #6351 )
...
* update paddleformers to v1.0
* only change import fleetpath
2026-02-09 11:53:13 +08:00
周周周
2b4748de4f
[MTP] refactor MTP pre_process ( #6358 )
2026-02-09 10:47:15 +08:00
chen
72fe94cb13
[Feature] support glm tp+dp+ep ( #6317 )
2026-02-05 21:47:01 +08:00
K11OntheBoat
116e2aea7a
Support Norm before Rope ( #6332 )
...
Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com ”>
2026-02-05 15:28:52 +08:00
chen
29a313a402
[Optimization] Support FA2/FA3/FA4 with attn_mask_q ( #6354 )
...
* support FA4 sm100
* flash attn backend support mask
* flash attn backend run flashmask correct
* add test for flash_attn_backend and flash_attn_func
* check
* add test for fa4
* requirements.txt add fa4 whl
* check test on sm100
* fix CI conflict
* add enable_torch_proxy for flash_mask
* lazy import fa4
* check
* fix tests import
* check test_load_mpt import
2026-02-05 14:39:00 +08:00
GoldPancake
183b8d325a
[RL] Support GLM MTP RL Model ( #6267 )
2026-02-04 20:14:35 +08:00
fxyfxy777
36547cfdb3
[Feature] FD_USE_PHI_FP8_QUANT ( #6320 )
...
* add ut
* add use_fd_quant env
* rm mask_per_token_quant
* add make ops list
* USE_FD_FP8_QUANT -> FD_USE_PHI_FP8_QUANT, defaults to true
* modify comments
* use bool type
* Add function declaration
2026-02-03 22:33:03 -08:00
sunxin
9b0a82cfa9
[Model Runner] Support overlap schedule ( #6259 )
2026-02-04 10:49:44 +08:00
RAM
5b22e5dfe7
[RL] R3 Support Fused Put the Routing of All Layers ( #6099 )
...
* fused put routing
* fix bug
* [draft commit]dynamic dtype
* fix async put & numpy bug
* fix uint8 test case
2026-02-03 04:13:16 -08:00
JYChen
c745a22420
[Feature] Support Ernie FP8 on sm100 ( the fixed version) ( #6304 )
2026-02-03 17:47:38 +08:00
bukejiyu
12d4b4cb87
[Feature]Support reorder ids to split prefill and decodes ( #5779 )
...
* support reorder ids
* perfect code
* fix
* fix unittest
* delete code
* fix
* add python api
* delete custom op
* update algorithm
* fix swap
* support condense
* support condense
* support mtp
* delete code
* update
* update
* update
* update
* update for other platform
* update
* fix
* fix mtp
* fix ut
* update
* fix ut
* update ut
* fix
* fix encoder_cache
* fix ci
* fix
* fix vl
* Fix performance regression
* fix
* fix
* fix mtp
* fix index->req_id mapping
* fix ut
---------
Co-authored-by: root <root@yqlcc01-sys-rpm12rzmwjd.yqlcc01.baidu.com >
Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com ”>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2026-02-03 00:28:02 -08:00
周周周
cbdb2462ea
cp 1131 tbo to develop ( #6281 )
2026-02-03 15:23:23 +08:00
周周周
8277b95fa6
remove speculate_get_padding_offset op ( #6308 )
2026-02-03 15:18:12 +08:00
fxyfxy777
f3413c4caa
[BugFix] fix fused_mask_swiglu_fp8_quant bug ( #6316 )
...
* optimize mask_quant op speed up 1.5
* fix calculate sequence
* add fused
* rm log
* push kernel code
* add ut
* accuracy ok
* add ue8m0
* add ut
* add merge develop
* rm ut of mask_per_token_quant
* Revert "[Optimize] optimize mask_quant & swiglu (#6222 )"
This reverts commit 2ada119a38 .
* add block_size
* pre-commit
2026-02-03 13:54:12 +08:00
GoldPancake
fb374238e1
Revert "[RL] Support GLM MTP RL Model ( #6223 )" ( #6301 )
...
This reverts commit af6c84d48d .
2026-02-02 14:08:13 +08:00
fxyfxy777
2ada119a38
[Optimize] optimize mask_quant & swiglu ( #6222 )
...
* optimize mask_quant op speed up 1.5
* fix calculate sequence
* add fused
* rm log
* push kernel code
* add ut
* accuracy ok
* add ue8m0
* add ut
* add merge develop
* rm ut of mask_per_token_quant
2026-02-02 13:52:38 +08:00
xiaozude
030647521a
[Metax] adapt to the latest develop ( #6282 )
2026-01-29 23:21:20 -08:00
JYChen
6c685c9474
Revert "[Feature] Support Ernie FP8 on sm100 ( #5593 )" ( #6275 )
...
This reverts commit eb80724b71 .
2026-01-30 11:22:01 +08:00
MingkunZhang
c4abb01f9c
[Metax][Fix] fix 'get_token_penalty_multi_scores' input error based (PaddlePaddle#6069) ( #6266 )
2026-01-29 19:24:36 +08:00
Ryan
5e78c1ac87
[Graph Optimization] Support CUDAGraph for P/PD mixed Batch using SOT subgraph splitting mode ( #6196 )
...
* refine comment && refine variable name
* replace comment
2026-01-29 16:29:54 +08:00
yuxuan
44b52701f6
[Feature] Support NVFP4 MoE on SM100 ( #6003 )
...
* fp4 dense
* [WIP] support nvfp4, dense part
* [wip] developing loading qwen model
* loading
* update
* dense fp4 OK, cudagraph error
* [WIP] moe forward part
* with flashinfer-backend
* qwen3_moe_fp4
* update
* support flashinfer-cutlass moe, qwen3-moe-fp4 OK
* support ernie4.5-fp4
* fix load error
* add some ut
* add docs
* fix CLA, test
* fix the apply() in ModelOptNvFp4FusedMoE
* fix CodeStyle
* del the PADDLE_COMPATIBLE_API
* fix broken url: nvidia_gpu.md
* fix docs
* fix token_ids
* fix CI in Hopper
* move flashinfer imports inside the function
* fix model_runner
Removed the logic for generating random padding IDs.
* Remove skip condition for CUDA version in nvfp4 test
* add test for nvfp4
* fix according to review
* Add Chinese translation link to NVFP4 documentation
* del flashinfer.py
* fix unittest
---------
Co-authored-by: zoooo0820 <zoooo0820@qq.com >
Co-authored-by: bukejiyu <395822456@qq.com >
2026-01-29 14:16:07 +08:00
JYChen
eb80724b71
[Feature] Support Ernie FP8 on sm100 ( #5593 )
...
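* temporarily usable version of Deepgemm
* dense part: e8m0 ok
* version in which the EB model runs end to end with E8M0
* code check
* support 21b-tp2, dev_paddle
* version with single-node 4.5T ep OK
* restore deleted code; single-node 4.5T ep (non-cudagraph)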
* eb tp
* Support SM100 block-wise FP8 inference
* refine codes, support deepgemm on sm100
* add thirdparty PFCC/DeepGEMM
* fix ep decode
* use deepep ue8m0 to resolve the accuracy issue
* fix FP8 TP accuracy
* upgrade Deepgemm and adapt the Hopper logic
* add ue8m0 kernel
* add ue8m0 kernel
* fix custom_ops/gpu_ops/cpp_extensions.cc
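* eb output is normal
* eb5 text is right
* accuracy matches on visual inspection
* accuracy aligned in self-testing
* replace masked_per_token_quant; ep accuracy OK
* performance improved by about 30%
* ep runs for now but still has problems
* self-test results consistent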
* rm test fun
* fix ep event
* update graph-optimization operators for Deepgemm
* fix build
* temporarily work around the deepgemm CI build issue
* select the deepgemm version based on SM
* remove useless code
---------
Co-authored-by: ckl117 <ckl117@163.com >
Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com ”>
Co-authored-by: fxyfxy777 <fxyfxy777@163.com >
2026-01-29 13:49:54 +08:00