Commit Graph

377 Commits

Author SHA1 Message Date
chen 6a4efa011a update attn_mask_q 2 (#7372) 2026-04-13 23:34:21 +08:00
K11OntheBoat 10a5e1c7c3 Check optional params before .get() call in gqa_rope_write_cache (#7311)
Co-authored-by: K11OntheBoat <"ruianmaidanglao@163.com">
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-04-13 19:05:43 +08:00
JYChen ffe2cf10f9 [Cherry-Pick][RL] change glm rope_emb calculation #7316 (#7317)
* change glm rope_emb calculation

* glm without EnforceFmulRN

* fix ci
2026-04-11 18:37:27 +08:00
YuBaoku 9985b192b4 [XPU][CI]Update xtdk version in download_dependencies.sh (#7320) (#7321)
Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
2026-04-11 00:27:32 +08:00
YuBaoku 1e0ab318e0 [BugFix] Fix Async D2H copy bug & flash mask attn cache V out-of-bound bug (#7221) (#7294)
Co-authored-by: ming1753 <61511741+ming1753@users.noreply.github.com>
2026-04-10 13:54:09 +08:00
fxyfxy777 7f55586e63 [OP]Unify MoE op with moe_permute path for bf16 GLM (#7164) (#7282) 2026-04-09 21:37:53 +08:00
YuBaoku 19cac90117 [XPU][CI] lock xvllm version for fix bug (#7264) (#7265)
* Remove duplicate NICs from environment variables

* Update version for xvllm in download_dependencies.sh

Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
2026-04-09 12:46:37 +08:00
xiaoxiaohehe001 2647d80699 [Cherry-Pick][BugFix] Fix flash mask attn 2.5 (#7249)
* [CherryPick] Fix flash_mask_attn

* [CherryPick] Fix flash_mask_attn
2026-04-09 11:05:16 +08:00
Bingoo 1fa58fcb34 support moe for sm103 (#7239) 2026-04-08 20:57:07 +08:00
JYChen 566699303c solve conflict (#7135)
Co-authored-by: wangyifei <mitu626@163.com>
2026-04-02 10:55:15 +08:00
cmcamdy d854e4ee4b [Cherry-Pick][XPU] Fix speculate schedule(#7049) (#7051)
* [BugFix] xpu fix speculate schedule cache kernel

* fix code style

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-03-27 18:30:06 +08:00
chen 2efab46261 add instantiations for decoder rope enforce_fmul_rn=true (#7010) 2026-03-25 20:06:52 +08:00
chen 49c2310854 [RL][Cherry-Pick] RoPE without fmad opt (#6901) (#6902)
* [RL] RoPE without fmad opt (#6901)

* env FD_ENABLE_RL=1 do fmul_rn(a*b) in rope

* pre_commit
2026-03-25 10:42:16 +08:00
GoldPancake cf0df470cf [Cherry-Pick][Speculative Decoding] Support suffix decoding (#6403) (#6967) 2026-03-23 17:33:58 +08:00
Yonghua Li a4f36cc8db [Cherry-Pick] [BugFix] replace ftok with custom_ftok in get_output/save_output ops (#6822) (#6824)
* [BugFix] replace ftok with custom_ftok in get_output/save_output ops

* [Test] add unit test for custom_ftok

* [Chore] create custom_ftok.h

* [Chore] reorganize header file

* [Fix] fix syntax

* [Fix] fix cache messager msg_queue_id+rank_id conflict
2026-03-16 14:22:30 +08:00
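The commit above replaces System V `ftok` with a `custom_ftok` to avoid key conflicts between cache messager queues. The motivation can be sketched in Python: `ftok` only folds the low 8 bits of `proj_id` with inode/device bits, so distinct (msg_queue_id, rank_id) pairs can map to the same IPC key, while a hash over the full identifiers does not. All names, the path, and the hashing scheme below are illustrative assumptions, not the repository's actual implementation:

```python
import hashlib

def custom_ftok(path: str, msg_queue_id: int, rank_id: int) -> int:
    """Derive a System V-style IPC key from the full identifiers.

    Unlike ftok(), which keeps only 8 bits of its proj_id argument,
    this hashes the whole (path, msg_queue_id, rank_id) triple.
    """
    digest = hashlib.sha256(f"{path}:{msg_queue_id}:{rank_id}".encode()).digest()
    # Truncate to a positive 31-bit value, since IPC keys are C ints.
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF

# Distinct (msg_queue_id, rank_id) pairs get distinct keys with
# overwhelming probability, where ftok(path, queue_id + rank_id)
# would collide whenever the sums agree modulo 256.
k1 = custom_ftok("/dev/shm/cache", 1, 0)
k2 = custom_ftok("/dev/shm/cache", 0, 1)
print(k1 != k2)
```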
yinwei f103a143db [XPU][CI]Cherry-Pick PR and Update CI Case (#6619)
* [XPU] Fix PD + MTP (#6495)

* fix pd + mtp

* fix code style

* fix PD + MTP, D get P's first token

* add anno for gpu(speculate_update)

* update draft insertv1

* fix wrapper & kernel

* fix wrapper

* fix code style

* fix tp4 dp1 (#6624)

* update paddle whl package

---------

Co-authored-by: cmcamdy <1027740945@qq.com>
2026-03-11 10:57:30 +08:00
AIbin 01e6ca734a [Cherry-Pick][Feature]Supports SWA based on appendattn #6547 (#6594)
* support SWA V1
2026-03-02 20:15:23 +08:00
Yuanle Liu 0a5ad26f6f [Cherry-Pick][OP][Feature] Unify the limit_thinking_content_length CUDA operator, supporting response length limits and injected sequences (#6511)
* [OP][Feature] Unify the limit_thinking_content_length CUDA operator, supporting response length limits and injected sequences (#6493)

* Initial plan

* Migrate PRs #6311, #6129, #6305 to develop and merge unit tests

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* fix

* update

* fix

* fix ci

* fix ci

* Initial plan

* test: add test_chat_with_response_max_tokens to test_EB_VL_Lite_serving.py

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* test: add disable-thinking case to test_chat_with_response_max_tokens

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* test: add both reasoning_max_tokens and response_max_tokens case

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* fix ci

* fix ci

* fix ci

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* Delete tests/model_executor/test_thinking_budget.py

* fix

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>
2026-02-26 13:29:38 +08:00
sunxin d36ff9ebfa [Cherry-Pick 2.5][BugFix] Fix get_padding_offset in empty run (#6460) 2026-02-11 20:21:53 +08:00
GoldPancake 375d7fbffb fix bug (#6425) 2026-02-10 17:07:51 +08:00
Mattheliu c776d483e4 [BugFix]fix handle 4 return values from noaux_tc_redundant op (#6384)
* fix: handle 4 return values from noaux_tc_redundant op

The noaux_tc_redundant CUDA op is defined with 4 outputs in PD_BUILD_STATIC_OP:
- output_tensor (scores)
- topk_values
- topk_indices
- tokens_per_expert_stats_list_out (inplace updated)

The Python code was only unpacking 3 values, causing:
  ValueError: too many values to unpack (expected 3)

This fix correctly unpacks all 4 return values, ignoring the inplace
updated tensor which is the same as the input tokens_per_expert_stats_list.

Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com>

* fix: make noaux_tc_redundant return 4 values to match OP definition

The PD_BUILD_STATIC_OP defines 4 outputs but the function only returned 3,
causing inconsistent behavior across different Paddle framework versions.

This fix explicitly returns 4 values:
- scores (inplace modified)
- topk_values
- topk_indices
- tokens_per_expert_stats_list (inplace modified via atomicAdd)

Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com>

---------

Co-authored-by: Claude (Claude Opus 4.5) <noreply@anthropic.com>
2026-02-09 13:17:47 +08:00
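The unpacking mismatch this commit describes can be reproduced in plain Python. The `noaux_tc_redundant` below is a hypothetical mock that only mimics the op's declared 4-output signature from PD_BUILD_STATIC_OP; the tensor values are placeholders, and only the unpacking pattern reflects the fix:

```python
# Hypothetical mock of the op's Python binding: PD_BUILD_STATIC_OP declares
# 4 outputs, so the binding returns a 4-tuple.
def noaux_tc_redundant(gating_logits, tokens_per_expert_stats_list):
    scores = [0.9, 0.1]   # output_tensor (scores)
    topk_values = [0.9]   # topk_values
    topk_indices = [0]    # topk_indices
    # 4th output: the inplace-updated stats tensor (same object as the input).
    return scores, topk_values, topk_indices, tokens_per_expert_stats_list

stats = [0, 0]

# Before the fix: unpacking 3 names from a 4-tuple raises
# "ValueError: too many values to unpack (expected 3)".
try:
    scores, topk_values, topk_indices = noaux_tc_redundant([1.0, 0.5], stats)
except ValueError as err:
    print(err)

# After the fix: unpack all 4 values, discarding the inplace-updated tensor
# (it is the same object as the `stats` argument passed in).
scores, topk_values, topk_indices, _ = noaux_tc_redundant([1.0, 0.5], stats)
```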
周周周 2b4748de4f [MTP] refactor MTP pre_process (#6358) 2026-02-09 10:47:15 +08:00
jc d6b3c722c1 [KVCache] Storage cache supports c8 model (#6298)
* Refine cache transfer manager
* Storage cache supports c8 model
2026-02-06 12:01:17 +08:00
周周周 e3fb8796b4 Remove useless MTP rebuild_padding code (#6336) 2026-02-05 16:28:44 +08:00
chen 29a313a402 [Optimization] Support FA2/FA3/FA4 with attn_mask_q (#6354)
* support FA4 sm100

* flash attn backend support mask

* flash attn backend run flashmask correct

* add test for flash_attn_backend and flash_attn_func

* check

* add test for fa4

* requirements.txt add fa4 whl

* check test on sm100

* fix CI conflict

* add enable_torch_proxy for flash_mask

* lazy import fa4

* check

* fix tests import

* check test_load_mpt import
2026-02-05 14:39:00 +08:00
lizan1999 72edd394d9 [XPU] support noaux_tc (#6326) 2026-02-05 12:04:16 +08:00
fxyfxy777 36547cfdb3 [Feature] FD_USE_PHI_FP8_QUANT (#6320)
* add ut

* add use_fd_quant env

* rm mask_per_token_quant

* add make ops list

* USE_FD_FP8_QUANT -> FD_USE_PHI_FP8_QUANT, defaults to true

* modify comments

* use bool type

* Add function declaration
2026-02-03 22:33:03 -08:00
周周周 6225439778 add PADDLE_ENFORCE (#6321) 2026-02-04 10:47:19 +08:00
JYChen c745a22420 [Feature] Support Ernie FP8 on sm100 ( the fixed version) (#6304) 2026-02-03 17:47:38 +08:00
周周周 8277b95fa6 remove speculate_get_padding_offset op (#6308) 2026-02-03 15:18:12 +08:00
fxyfxy777 2ada119a38 [Optimize] optimize mask_quant & swiglu (#6222)
* optimize mask_quant op, ~1.5x speedup

* fix calculate sequence

* add fused

* rm log

* push kernel code

* add ut

* accuracy ok

* add ue8m0

* add ut

* add merge develop

* rm ut of mask_per_token_quant
2026-02-02 13:52:38 +08:00
xiaozude 030647521a [Metax] adapt to the latest develop (#6282) 2026-01-29 23:21:20 -08:00
JYChen 6c685c9474 Revert "[Feature] Support Ernie FP8 on sm100 (#5593)" (#6275)
This reverts commit eb80724b71.
2026-01-30 11:22:01 +08:00
JYChen eb80724b71 [Feature] Support Ernie FP8 on sm100 (#5593)
* Temporarily working DeepGEMM version

* dense part e8m0 OK

* Version where the EB model runs with E8M0

* code check

* support 21b-tp2, dev_paddle

* Single-node 4.5T EP working version

* Restore deleted code; single-node 4.5T EP (non-cudagraph)

* eb tp

* Support SM100 block-wise FP8 inference

* refine codes, support deepgemm on sm100

* add thirdparty PFCC/DeepGEMM

* fix ep decode

* Use DeepEP ue8m0 to fix accuracy issues

* Fix FP8 TP accuracy

* Upgrade DeepGEMM to adapt to the Hopper logic

* add ue8m0 kernel

* add ue8m0 kernel

* fix custom_ops/gpu_ops/cpp_extensions.cc

* EB output is normal

* eb5 text is right

* Accuracy looks consistent on inspection

* Self-tested accuracy aligned

* Replace masked_per_token_quant; EP accuracy OK

* ~30% performance improvement

* EP runs for now but still has issues

* Self-tested consistent

* rm test fun

* fix ep event

* Update DeepGEMM in graph-optimization ops

* fix build

* Temporarily work around the DeepGEMM CI build issue

* Select DeepGEMM version by SM architecture

* remove useless code

---------

Co-authored-by: ckl117 <ckl117@163.com>
Co-authored-by: K11OntheBoat <"ruianmaidanglao@163.com">
Co-authored-by: fxyfxy777 <fxyfxy777@163.com>
2026-01-29 13:49:54 +08:00
jc 7da5f54fb3 [CI] Add unit test for swap_layout && remove unit test of splitwise_scheduler (#6250)
* Add unit test for swap_layout

* remove splitwise_scheduler test
2026-01-28 19:20:20 +08:00
GoldPancake 7d6c87c29e [Others] Support constrained decoding when enable_thinking is false (#6248)
* support constrained decoding when enable_thinking is false

* fix

* fix

* fix
2026-01-28 00:05:17 -08:00
sunxin 27f8799f04 [Model Runner] Refactor execute_model for GPU async scheduling (#6176) 2026-01-28 14:19:33 +08:00
freeliuzc ce06c6dfb3 [BugFix] Fix token_penalty kernel (#6069)
* fix token_penalty kernel

* try to fix xpu

* fix xpu

* fix unit test
2026-01-28 12:03:05 +08:00
周周周 aa57864c5b remove unneeded para from flash_mask_attention (#6218) 2026-01-27 14:04:27 +08:00
yangjianfengo1 b3627b59f8 [Bug Fix] fix mask attention (#6216) 2026-01-26 07:46:26 -08:00
sunxin adc69c15d0 [Model Runner] Prepare token count and move FA3 initialization into the graph (#6170)
* prepare for token num and put FA3 init in graph
2026-01-26 12:16:57 +08:00
周周周 0966df78dc [Others] remove stop_nums (#6182) 2026-01-26 12:12:47 +08:00
RuohengMa 976203cf60 [XPU] fix text_image_gather_scatter in cudagraph mode (#6049) 2026-01-23 19:48:43 +08:00
lizan1999 b3a48529ab [XPU] add more type for recover batch sequence (#6142) 2026-01-23 15:16:05 +08:00
jc 309c7d9764 router support divided rollout (#6150) 2026-01-22 10:39:39 +08:00
lizexu123 f4902fe42d [BugFix] fix wint2 (#6109)
* fix

* fix

* fix
2026-01-20 21:46:21 +08:00
yinwei 51a8a2ed57 [XPU] Support CudaGraph(add block attn cuda_graph support) (#6116)
* add block attn cuda_graph support
2026-01-20 19:33:11 +08:00
zhupengyang 45ebb2efb4 [XPU] support plugin model (#6092) 2026-01-20 13:00:09 +08:00
sunxin a4144e0b8e [Optimization] Avoid unnecessary penalty computation (#6078) 2026-01-19 15:24:12 +08:00
GoldPancake bda38aa519 [Speculative Decoding] Support MTP for GLM-4.5-Air (#6047)
* glm mtp
* add spec neox partial rope
2026-01-16 14:35:24 +08:00