332 Commits

Author SHA1 Message Date
lizexu123 f4902fe42d [BugFix] fix wint2 (#6109)
* fix

* fix

* fix
2026-01-20 21:46:21 +08:00
yinwei 51a8a2ed57 [XPU] Support CudaGraph(add block attn cuda_graph support) (#6116)
* add block attn cuda_graph support
2026-01-20 19:33:11 +08:00
zhupengyang 45ebb2efb4 [XPU] support plugin model (#6092) 2026-01-20 13:00:09 +08:00
sunxin a4144e0b8e [Optimization] Avoid unnecessary penalty computation (#6078) 2026-01-19 15:24:12 +08:00
GoldPancake bda38aa519 [Speculative Decoding] Support MTP for GLM-4.5-Air (#6047)
* glm mtp
* add spec neox partial rope
2026-01-16 14:35:24 +08:00
fxyfxy777 4c92035f2d [Feature] Unify fp8 block_wise quant ops (#5991)
* quant stash

* blockwise_quant

* precommit

* rm tensor.cut

* tp ok

* add swiglu

* rm outdated code

* fix activate ut

* change baseline

* fix baseline error
2026-01-15 05:50:37 -08:00
freeliuzc 49617d9832 [Feature]Support tag phase token enforce generation (#6034)
* support tag phase token enforce generation

* optimize note and some feature

* fix sampler unit test

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-15 03:59:55 -08:00
cmcamdy 59d8ae0a25 [XPU] Speculate Decoding + PD, benchmark fix (#6036)
* fix mtp pd

* fix kernel

* fix code style

* fix kernel

* fix test / clear debug code

* fix test / clear debug code

* fix codestyle

* fix codestyle

* fix codestyle
2026-01-15 19:19:03 +08:00
lizexu123 6619298b50 [Optim] Optimize grid dimensions using max_tokens_per_expert for MoE models (#6007)
* update w4afp8

* build.sh ok

* support cuda_graph

* fix

* add test

* fix max_tokens_per_expert

* >=70

* fix

* compute_max_tokens_from_prefix_sum in w4afp8

* compute_max_tokens use cub
2026-01-15 19:18:42 +08:00
Daci e10b51b8c6 [Feature] get_output_kv_signal blocking read mode & send_first_token (#5836)
* get_output_kv_signal blocking read mode

* send first token before recycle

* xpu get_output_kv_signal blocking read mode

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-01-15 14:11:03 +08:00
chenjian 74d0f1c01f [Optim] Robust sync status when preempted happens (#5796)
* [Bug fix] Sync status for caching output cache

* fix

* fix

* fix bug

* fix

* fix

* support xpu

* fix

* fix

* fix

* fix

* fix

* fix ci

* fix ci

* fix xpu

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-14 12:07:33 +08:00
周周周 ad8d05a8de [Optimization] Do not compute ATTN padding part in CUDA graph mode (#5985) 2026-01-13 11:32:27 +08:00
lzy 223b2f5d86 Support setting communication groups in custom_allreduce and the all-to-all/transpose fused operator during the decoding phase. (#5917) 2026-01-12 14:09:39 +08:00
sunxin 17ef3920f3 remove decoder_num_blocks_device memset (#5982) 2026-01-10 21:22:06 +08:00
周周周 b8d9daa785 MLA clean code (#5979) 2026-01-10 21:05:00 +08:00
zhupengyang 9db48ecb34 [XPU] fix dp4 (#5946) 2026-01-09 20:36:53 +08:00
xiaoxiaohehe001 00a01ae024 [Feature] Support redundant expert for eplb (#5918)
* [BugFix] support redundant expert for eplb

* support redundant expert for eplb

* support redundant expert for eplb

* update

* fix ci eplb
2026-01-09 17:13:24 +08:00
yangjianfengo1 16e1992eba [Bugfix] Increase the shape of w4afp8 gemm (#5957)
* Increase w4afp8 shape

* Increase w4afp8 shape

* code style
2026-01-09 14:11:17 +08:00
GoldPancake a1fc4e249e [Bugfix] Fix mtp logprob hang problem when include stop_seq (#5927)
* fix mtp logprob hang when include stop_seq
2026-01-08 14:21:24 +08:00
lizhenyun01 2be8656c29 [BugFix] fix mtp split kv attention (#5920)
* [BugFix] fix mtp split kv attention

* clean code

* clean code
2026-01-07 04:07:31 -08:00
kevin a76e8ae40c [Feature] support rdma pd dy-c8 (#5788)
* add rdma pd dy-c8

* update code
2026-01-07 14:55:25 +08:00
周周周 f15df1ec89 Revert cuda check (#5915)
* commit

* commit
2026-01-07 14:40:18 +08:00
yangjianfengo1 59523b27de opt w4afp8 (#5853) 2026-01-07 12:22:35 +08:00
MingkunZhang 7ad5737560 [Metax] adapt to gemm interface on different versions of maca (#5905)
Co-authored-by: root <root@lt-wks-10-0-180-15.pub.metax-tech.com>
2026-01-07 10:02:24 +08:00
周周周 83ae59431e [BugFix] fix BatchMLAWithPagedKVCacheKernel O_tmp (#5895) 2026-01-06 15:39:06 +08:00
ddchenhao66 733014bf32 [XPU] Support EP4TP1 in pd disaggregation (#5860)
Co-authored-by: ddchenhao66 <dhaochen163.com>
2026-01-06 15:25:36 +08:00
lizexu123 acdf0cd1d9 fix hadamard_block_size (#5888) 2026-01-06 14:12:14 +08:00
Yuanle Liu 5e729bc2ba [OPs] ep_moe_expert_dispatch.cu dispatch num_experts_per_rank 5 (#5890) 2026-01-06 10:39:35 +08:00
Neil Zhu 272a371635 [Metax] optimize flash attention backend (#5876) 2026-01-06 09:52:09 +08:00
周周周 ab553b3b8b revert cuda_check (#5883) 2026-01-05 20:51:31 +08:00
lizexu123 1d3ae7c024 [BugFix] fix w4afp8 tp=8 (#5868)
* fix w4afp8 tp=8

* fix
2026-01-05 18:59:02 +08:00
cmcamdy 690d4bcdb0 [XPU] Speculative Decoding with PD (#5856)
* [XPU] Speculative Decoding with PD

* fix post process

* share kv cache sender

* support speculate decoding step system cache

* support speculate decoding step system cache

---------

Co-authored-by: root <root@gajl-bbc-onlinec-com-1512108.gajl.baidu.com>
2026-01-05 17:31:03 +08:00
chen ac39c0f887 support fa3 qwen-vl rope (#5869) 2026-01-05 15:29:34 +08:00
sunxin adb91dcacc [BugFix] Fix wint4 ep issue caused by empty run (#5870) 2026-01-05 14:24:37 +08:00
周周周 e3957a5ebc [Others] remove template NUM_EXPERTS_PER_RANK in permute_x_fp8_kernel (#5620) 2026-01-04 11:21:15 +08:00
MingkunZhang f732d7d2ad [Metax] adapt prefix caching & cpu swap (#5844)
Co-authored-by: root <root@lt-wks-10-0-180-15.pub.metax-tech.com>
2025-12-31 17:02:48 +08:00
ddchenhao66 9e45ef7ca9 [XPU] MAX_BSZ aligns with gpu settings and disables prefix cache in OCR VL (#5831) 2025-12-31 09:49:12 +08:00
Sunny-bot1 598d292a69 w4afp8 fix quant (#5830) 2025-12-30 21:16:13 +08:00
lizexu123 44a13e4557 [Feature] support w4afp8 v1_loader and v0_loader(tp>1) (#5757)
* support

* fix

* support w4afp8 v1_loader and v0_loader

* fix

* fix test

* fix test

* fix test

* fix moe.py

* add test_ernie_4_5_w4afp8

* add test

* delete tensor

* fix test

* fix

* add

* fix test
2025-12-30 14:11:52 +08:00
Yonghua Li a8d3e3ba12 [BugFix] fix shm opened but not closed in set_data_ipc (#5826) 2025-12-29 23:35:07 +08:00
CSWYF3634076 9286403570 [Models] Add Qwen3-VL Model Support (#5763)
* support v1 loader

* remove useless code

* remove useless

* [Model] support Qwen3VL images success

* [Model] support Qwen3VL rope_3d

* [Model] support Qwen3VL remove log

* [Model] support Qwen3VL RL

* [Model] support Qwen3VL tp

* [Model] support Qwen3VL video

* [Model] support Qwen3VL fix ernievl

* [Model] support Qwen3VL fix get_image_boundaries.cc array out of bounds

* [Model] support Qwen3VL fix multi card

* [Model] support Qwen3VL file close

* [Model] support Qwen3VL fix ce

* [Model] support Qwen3VL fix unittest

* [Model] support Qwen3VL add unittest

---------

Co-authored-by: Ayakouji <yuhongh@qq.com>
2025-12-29 17:39:33 +08:00
周周周 a3f0696e35 [BugFix] fix compile error in sm89 (#5809) 2025-12-29 16:55:52 +08:00
Longzhi Wang 11329ee35e [Model] support mode config for expert_dispatch (#5748) 2025-12-29 13:37:20 +08:00
Ryan 09229d8953 change count_tokens_per_expert_func declaration: Tensor -> vector<Tensor> (#5794) 2025-12-26 19:02:28 +08:00
Ryan 724045c426 add some op infershape&dtype (#5762) 2025-12-26 16:17:39 +08:00
周周周 03363cab4c make flash_mask attention pybind (#5783) 2025-12-26 14:31:35 +08:00
kevin 5538dda3c8 [Feature] pd support dy-c8 ipc (#5750)
* pd support dy-c8 ipc

* update code

* support v0

* update code
2025-12-25 21:22:34 +08:00
freeliuzc 9018ccf74e [Speculative Decoding] Fix attn_mask_offset for multi-step MTP in mixed and PD-split modes (#5738)
* fix attn_mask_offset in mtp with multi-step and pd-split-mode

* fix xpu operator register

* update pmtp multi-step mtp strategy in pd-split-mode

* add note

* fix xpu register
2025-12-25 01:54:59 -08:00
Juncai 412867fd99 [Feature] Support KV Cache Storage (#5571)
* Support Mooncake Store

* up

* up

* add op

* fix conflict

* fix error

* up for comments

* avoid thread lock

* up

* fix unittest

* fix unittest

* remove debug info

* consider tp_size > 1

* add default rdma_nics

* add utils

* up

* fix error

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2025-12-25 16:30:35 +08:00
RuohengMa e154c03416 [XPU] refine moe_expert_ffn ut (#5743) 2025-12-25 10:35:24 +08:00