Commit Graph

275 Commits

Author SHA1 Message Date
fxyfxy777 36547cfdb3 [Feature] FD_USE_PHI_FP8_QUANT (#6320)
* add ut

* add use_fd_quant env

* rm mask_per_token_quant

* add make ops list

* USE_FD_FP8_QUANT -> FD_USE_PHI_FP8_QUANT, defaults to true

* modify comments

* use bool type

* Add function declaration
2026-02-03 22:33:03 -08:00
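As context for the commit above: an environment toggle like FD_USE_PHI_FP8_QUANT is typically parsed once into a bool, with the commit's stated default of true. A minimal sketch, assuming a hypothetical env_flag helper (illustrative only, not the repository's actual code):

```python
import os

def env_flag(name: str, default: bool = True) -> bool:
    """Parse a boolean switch from the environment (hypothetical helper)."""
    val = os.getenv(name)
    if val is None:
        return default
    return val.strip().lower() in ("1", "true", "yes", "on")

# Per the commit message the flag defaults to true, so the PHI FP8
# quant path is used unless it is explicitly switched off.
use_phi_fp8_quant = env_flag("FD_USE_PHI_FP8_QUANT", default=True)
```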
周周周 6225439778 add PADDLE_ENFORCE (#6321) 2026-02-04 10:47:19 +08:00
JYChen c745a22420 [Feature] Support Ernie FP8 on sm100 ( the fixed version) (#6304) 2026-02-03 17:47:38 +08:00
周周周 8277b95fa6 remove speculate_get_padding_offset op (#6308) 2026-02-03 15:18:12 +08:00
fxyfxy777 2ada119a38 [Optimize] optimize mask_quant & swiglu (#6222)
* optimize mask_quant op, ~1.5x speedup

* fix calculation sequence

* add fused

* rm log

* push kernel code

* add ut

* accuracy ok

* add ue8m0

* add ut

* merge develop

* rm ut of mask_per_token_quant
2026-02-02 13:52:38 +08:00
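For reference, the fused pattern this commit names (SwiGLU followed by masked per-token FP8 quantization) can be sketched in NumPy as below. This is an illustrative restatement of the math, not the repository's CUDA kernel; the 448.0 bound assumes the float8 e4m3 format:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest magnitude representable in float8 e4m3

def swiglu_mask_quant(gate, up, mask):
    """Fused SwiGLU + masked per-token FP8 quant (illustrative sketch).

    gate, up: [num_tokens, hidden]; mask: [num_tokens] bool (True = active token).
    Returns clipped values (kept in fp32 here) and one scale per token.
    """
    silu = gate / (1.0 + np.exp(-gate))           # SiLU(x) = x * sigmoid(x)
    y = silu * up                                 # SwiGLU activation
    amax = np.abs(y).max(axis=-1, keepdims=True)  # per-token absolute max
    scale = np.maximum(amax / FP8_E4M3_MAX, 1e-12)
    q = np.clip(y / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    q = np.where(mask[:, None], q, 0.0)           # skip masked-out tokens
    return q, scale
```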
xiaozude 030647521a [Metax] adapt to the latest develop (#6282) 2026-01-29 23:21:20 -08:00
JYChen 6c685c9474 Revert "[Feature] Support Ernie FP8 on sm100 (#5593)" (#6275)
This reverts commit eb80724b71.
2026-01-30 11:22:01 +08:00
JYChen eb80724b71 [Feature] Support Ernie FP8 on sm100 (#5593)
* Deepgemm: provisionally working version

* dense part works with e8m0

* version where the EB model runs end-to-end with E8M0

* code check

* support 21b-tp2, dev_paddle

* version with single-node 4.5T ep working

* restore the deleted code; single-node 4.5T ep (non-cudagraph)

* eb tp

* Support SM100 block-wise FP8 inference

* refine codes, support deepgemm on sm100

* add thirdparty PFCC/DeepGEMM

* fix ep decode

* use deepep ue8m0 to fix the accuracy issue

* fix FP8 TP accuracy

* upgrade Deepgemm to match the Hopper logic

* add ue8m0 kernel

* add ue8m0 kernel

* fix custom_ops/gpu_ops/cpp_extensions.cc

* eb output looks normal

* eb5 text is right

* accuracy looks consistent on inspection

* self-tested accuracy is aligned

* replace masked_per_token_quant; ep accuracy OK

* performance improved by about 30%

* ep runs for now but still has issues

* self-test results are consistent

* rm test fun

* fix ep event

* update Deepgemm in the graph-optimization ops

* fix build

* temporarily work around the deepgemm CI build issue

* select the deepgemm version by SM architecture

* remove useless code

---------

Co-authored-by: ckl117 <ckl117@163.com>
Co-authored-by: K11OntheBoat <ruianmaidanglao@163.com>
Co-authored-by: fxyfxy777 <fxyfxy777@163.com>
2026-01-29 13:49:54 +08:00
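Several bullets above mention UE8M0 scales, which DeepGEMM-style kernels require on newer architectures: the scale is stored as an unsigned 8-bit exponent, i.e. it must be a power of two. A minimal NumPy sketch of rounding ordinary quantization scales up into that format (illustrative, not the repository's kernel):

```python
import numpy as np

def round_scale_to_ue8m0(scale):
    """Round scales up to the next power of two, since a UE8M0 scale
    stores only an 8-bit exponent (illustrative sketch)."""
    safe = np.maximum(scale, np.finfo(np.float32).tiny)
    return np.exp2(np.ceil(np.log2(safe))).astype(np.float32)

# Rounding *up* keeps quantized values inside the FP8 range,
# at the cost of slightly coarser scales.
print(round_scale_to_ue8m0(np.array([0.3, 1.0, 5.0])))  # [0.5 1. 8.]
```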
jc 7da5f54fb3 [CI] Add unit test for swap_layout && remove unit test of splitwise_scheduler (#6250)
* Add unit test for swap_layout

* remove splitwise_scheduler test
2026-01-28 19:20:20 +08:00
GoldPancake 7d6c87c29e [Others] Support constrained decoding when enable_thinking is false (#6248)
* support constrained decoding when enable_thinking is false

* fix

* fix

* fix
2026-01-28 00:05:17 -08:00
sunxin 27f8799f04 [Model Runner] Refactor execute_model for GPU async scheduling (#6176) 2026-01-28 14:19:33 +08:00
freeliuzc ce06c6dfb3 [BugFix] Fix token_penalty kernel (#6069)
* fix token_penalty kernel

* try to fix xpu

* fix xpu

* fix unit test
2026-01-28 12:03:05 +08:00
周周周 aa57864c5b remove unneeded param from flash_mask_attention (#6218) 2026-01-27 14:04:27 +08:00
yangjianfengo1 b3627b59f8 [Bug Fix] fix mask attention (#6216) 2026-01-26 07:46:26 -08:00
sunxin adc69c15d0 [Model Runner] Prepare token count and move FA3 initialization into the graph (#6170)
* prepare for token num and put FA3 init in graph
2026-01-26 12:16:57 +08:00
周周周 0966df78dc [Others] remove stop_nums (#6182) 2026-01-26 12:12:47 +08:00
jc 309c7d9764 router support divided rollout (#6150) 2026-01-22 10:39:39 +08:00
lizexu123 f4902fe42d [BugFix] fix wint2 (#6109)
* fix

* fix

* fix
2026-01-20 21:46:21 +08:00
sunxin a4144e0b8e [Optimization] Avoid unnecessary penalty computation (#6078) 2026-01-19 15:24:12 +08:00
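The optimization named in this commit is a common sampler fast path: when every request in the batch uses neutral penalty parameters, the penalty kernel need not run at all. A hedged NumPy sketch of the idea (not the repository's implementation):

```python
import numpy as np

def apply_repetition_penalty(logits, seen_mask, penalty):
    """logits: [batch, vocab]; seen_mask: [batch, vocab] bool; penalty: [batch].

    Skips all work when every penalty is the neutral value 1.0
    (illustrative sketch of the fast path, not the repo's kernel).
    """
    if np.all(penalty == 1.0):
        return logits  # fast path: nothing to compute
    p = penalty[:, None]
    penalized = np.where(logits > 0, logits / p, logits * p)
    return np.where(seen_mask, penalized, logits)
```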
GoldPancake bda38aa519 [Speculative Decoding] Support MTP for GLM-4.5-Air (#6047)
* glm mtp
* add spec neox partial rope
2026-01-16 14:35:24 +08:00
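The "spec neox partial rope" bullet refers to NeoX-style rotary embeddings applied to only the first rotary_dim channels of each head, with the remaining channels passed through unchanged. A NumPy sketch under those assumptions (not the repository's kernel):

```python
import numpy as np

def neox_partial_rope(x, positions, rotary_dim, base=10000.0):
    """x: [seq, head_dim]; positions: [seq]. Rotates x[:, :rotary_dim]
    NeoX-style (first/second half pairing) and passes the rest through."""
    rot, rest = x[:, :rotary_dim], x[:, rotary_dim:]
    half = rotary_dim // 2
    inv_freq = base ** (-np.arange(half) / half)     # theta_i = base^(-2i/rotary_dim)
    angles = positions[:, None] * inv_freq[None, :]  # [seq, half]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = rot[:, :half], rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, rest], axis=-1)
```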
fxyfxy777 4c92035f2d [Feature] Unify fp8 block_wise quant ops (#5991)
* quant stash

* blockwise_quant

* precommit

* rm tensor.cut

* tp ok

* add swiglu

* rm outdate code

* fix activate ut

* change baseline

* fix baseline error
2026-01-15 05:50:37 -08:00
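For orientation, "fp8 block_wise quant" here means quantizing with one scale per contiguous group of channels (128 is the group size DeepGEMM-style kernels commonly use) rather than per tensor or per token. A NumPy sketch under that assumption (not the unified op itself):

```python
import numpy as np

FP8_E4M3_MAX = 448.0
BLOCK = 128  # assumed group size along the hidden dimension

def blockwise_fp8_quant(x):
    """x: [tokens, hidden] with hidden divisible by BLOCK.
    Returns clipped values (fp32 here) and scales shaped [tokens, hidden // BLOCK]."""
    t, h = x.shape
    g = x.reshape(t, h // BLOCK, BLOCK)
    scale = np.maximum(np.abs(g).max(axis=-1, keepdims=True) / FP8_E4M3_MAX, 1e-12)
    q = np.clip(g / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q.reshape(t, h), scale.squeeze(-1)
```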
freeliuzc 49617d9832 [Feature] Support tag phase token enforce generation (#6034)
* support tag phase token enforce generation

* optimize notes and some features

* fix sampler unit test

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-15 03:59:55 -08:00
lizexu123 6619298b50 [Optim] Optimize grid dimensions using max_tokens_per_expert for MoE models (#6007)
* update w4afp8

* build.sh ok

* support cuda_graph

* fix

* add test

* fix max_tokens_per_expert

* >=70

* fix

* compute_max_tokens_from_prefix_sum in w4afp8

* compute_max_tokens uses cub
2026-01-15 19:18:42 +08:00
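The last two bullets describe sizing the launch grid from the actual busiest expert rather than a worst-case bound. Given the inclusive prefix sum of per-expert token counts that MoE dispatch already produces, the maximum is one pass of adjacent differences; an illustrative sketch (the commit's version runs on GPU with CUB):

```python
import numpy as np

def max_tokens_from_prefix_sum(prefix_sum):
    """Recover the largest per-expert token count from an inclusive
    prefix sum (illustrative sketch of the CPU-side math)."""
    counts = np.diff(prefix_sum, prepend=0)
    return int(counts.max())

# counts [3, 0, 7, 2] -> prefix [3, 3, 10, 12] -> busiest expert has 7 tokens
assert max_tokens_from_prefix_sum(np.array([3, 3, 10, 12])) == 7
```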
Daci e10b51b8c6 [Feature] get_output_kv_signal blocking read mode & send_first_token (#5836)
* get_output_kv_signal blocking read mode

* send first token before recycle

* xpu get_output_kv_signal blocking read mode

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-01-15 14:11:03 +08:00
chenjian 74d0f1c01f [Optim] Robust sync status when preemption happens (#5796)
* [Bug fix] Sync status when caching output

* fix

* fix

* fix bug

* fix

* fix

* support xpu

* fix

* fix

* fix

* fix

* fix

* fix ci

* fix ci

* fix xpu

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-14 12:07:33 +08:00
周周周 ad8d05a8de [Optimization] Do not compute the ATTN padding part in CUDA graph mode (#5985) 2026-01-13 11:32:27 +08:00
lzy 223b2f5d86 Support setting communication groups in custom_allreduce and the all-to-all/transpose fused operator during the decoding phase (#5917) 2026-01-12 14:09:39 +08:00
sunxin 17ef3920f3 remove decoder_num_blocks_device memset (#5982) 2026-01-10 21:22:06 +08:00
周周周 b8d9daa785 MLA clean code (#5979) 2026-01-10 21:05:00 +08:00
xiaoxiaohehe001 00a01ae024 [Feature] Support redundant expert for eplb (#5918)
* [BugFix] support redundant expert for eplb

* support redundant expert for eplb

* support redundant expert for eplb

* update

* fix ci eplb
2026-01-09 17:13:24 +08:00
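For context on EPLB (expert-parallel load balancing) with redundant experts: hot logical experts get extra physical replicas, so routing needs a physical-to-logical map. A deliberately simplified sketch (names and layout are hypothetical, not the repository's):

```python
def build_physical_expert_map(num_logical, extra_replicas):
    """extra_replicas: {logical_id: count of redundant copies}.
    Returns a list mapping each physical expert slot to its logical expert."""
    phys_to_logical = list(range(num_logical))        # one base copy per expert
    for logical_id, extra in sorted(extra_replicas.items()):
        phys_to_logical.extend([logical_id] * extra)  # redundant copies of hot experts
    return phys_to_logical

# e.g. 4 logical experts, expert 2 duplicated once -> 5 physical slots
print(build_physical_expert_map(4, {2: 1}))  # [0, 1, 2, 3, 2]
```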
GoldPancake a1fc4e249e [Bugfix] Fix mtp logprob hang when stop_seq is included (#5927)
* fix mtp logprob hang when stop_seq is included
2026-01-08 14:21:24 +08:00
lizhenyun01 2be8656c29 [BugFix] fix mtp split kv attention (#5920)
* [BugFix] fix mtp split kv attention

* clean code

* clean code
2026-01-07 04:07:31 -08:00
kevin a76e8ae40c [Feature] support rdma pd dy-c8 (#5788)
* add rdma pd dy-c8

* update code
2026-01-07 14:55:25 +08:00
周周周 f15df1ec89 Revert cuda check (#5915)
* commit

* commit
2026-01-07 14:40:18 +08:00
yangjianfengo1 59523b27de opt w4afp8 (#5853) 2026-01-07 12:22:35 +08:00
周周周 83ae59431e [BugFix] fix BatchMLAWithPagedKVCacheKernel O_tmp (#5895) 2026-01-06 15:39:06 +08:00
Yuanle Liu 5e729bc2ba [OPs] ep_moe_expert_dispatch.cu dispatch num_experts_per_rank 5 (#5890) 2026-01-06 10:39:35 +08:00
周周周 ab553b3b8b revert cuda_check (#5883) 2026-01-05 20:51:31 +08:00
lizexu123 1d3ae7c024 [BugFix] fix w4afp8 tp=8 (#5868)
* fix w4afp8 tp=8

* fix
2026-01-05 18:59:02 +08:00
chen ac39c0f887 support fa3 qwen-vl rope (#5869) 2026-01-05 15:29:34 +08:00
sunxin adb91dcacc [BugFix] Fix wint4 ep issue caused by empty run (#5870) 2026-01-05 14:24:37 +08:00
周周周 e3957a5ebc [Others] remove template NUM_EXPERTS_PER_RANK in permute_x_fp8_kernel (#5620) 2026-01-04 11:21:15 +08:00
Sunny-bot1 598d292a69 w4afp8 fix quant (#5830) 2025-12-30 21:16:13 +08:00
Yonghua Li a8d3e3ba12 [BugFix] fix shm opened but not closed in set_data_ipc (#5826) 2025-12-29 23:35:07 +08:00
CSWYF3634076 9286403570 [Models] Add Qwen3-VL Model Support (#5763)
* support v1 loader

* remove useless code

* remove useless

* [Model] support Qwen3VL images success

* [Model] support Qwen3VL rope_3d

* [Model] support Qwen3VL remove log

* [Model] support Qwen3VL RL

* [Model] support Qwen3VL tp

* [Model] support Qwen3VL video

* [Model] support Qwen3VL fix ernievl

* [Model] support Qwen3VL fix get_image_boundaries.cc array out of bounds

* [Model] support Qwen3VL fix multi card

* [Model] support Qwen3VL file close

* [Model] support Qwen3VL fix ce

* [Model] support Qwen3VL fix unittest

* [Model] support Qwen3VL add unittest

---------

Co-authored-by: Ayakouji <yuhongh@qq.com>
2025-12-29 17:39:33 +08:00
周周周 a3f0696e35 [BugFix] fix compile error in sm89 (#5809) 2025-12-29 16:55:52 +08:00
Longzhi Wang 11329ee35e [Model] support mode config for expert_dispatch (#5748) 2025-12-29 13:37:20 +08:00
Ryan 09229d8953 change count_tokens_per_expert_func declaration: Tensor -> vector<Tensor> (#5794) 2025-12-26 19:02:28 +08:00
Ryan 724045c426 add some op infershape&dtype (#5762) 2025-12-26 16:17:39 +08:00
周周周 03363cab4c make flash_mask attention pybind (#5783) 2025-12-26 14:31:35 +08:00