lizexu123
f4902fe42d
[BugFix] fix wint2 ( #6109 )
...
* fix
* fix
* fix
2026-01-20 21:46:21 +08:00
yinwei
51a8a2ed57
[XPU] Support CudaGraph(add block attn cuda_graph support) ( #6116 )
...
* add block attn cuda_graph support
2026-01-20 19:33:11 +08:00
zhupengyang
45ebb2efb4
[XPU] support plugin model ( #6092 )
2026-01-20 13:00:09 +08:00
sunxin
a4144e0b8e
[Optimization] Avoid unnecessary penalty computation ( #6078 )
2026-01-19 15:24:12 +08:00
GoldPancake
bda38aa519
[Speculative Decoding] Support MTP for GLM-4.5-Air ( #6047 )
...
* glm mtp
* add spec neox partial rope
2026-01-16 14:35:24 +08:00
fxyfxy777
4c92035f2d
[Feature] Unify fp8 block_wise quant ops ( #5991 )
...
* quant stash
* blockwise_quant
* precommit
* rm tensor.cut
* tp ok
* add swiglu
* rm outdate code
* fix activate ut
* change baseline
* fix baseline error
2026-01-15 05:50:37 -08:00
freeliuzc
49617d9832
[Feature]Support tag phase token enforce generation ( #6034 )
...
* support tag phase token enforce generation
* optimize note and some feature
* fix sampler unit test
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-15 03:59:55 -08:00
cmcamdy
59d8ae0a25
[XPU] Speculate Decoding + PD, benchmark fix ( #6036 )
...
* fix mtp pd
* fix kernel
* fix code style
* fix kernel
* fix test / clear debug code
* fix test / clear debug code
* fix codestyle
* fix codestyle
* fix codestyle
2026-01-15 19:19:03 +08:00
lizexu123
6619298b50
[Optim] Optimize grid dimensions using max_tokens_per_expert for MoE models ( #6007 )
...
* update w4afp8
* build.sh ok
* support cuda_graph
* fix
* add test
* fix max_tokens_per_expert
* >=70
* fix
* compute_max_tokens_from_prefix_sum in w4afp8
* compute_max_tokens use cub
2026-01-15 19:18:42 +08:00
Daci
e10b51b8c6
[Feature] get_output_kv_signal blocking read mode & send_first_token ( #5836 )
...
* get_output_kv_signal blocking read mode
* send first token before recycle
* xpu get_output_kv_signal blocking read mode
---------
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-01-15 14:11:03 +08:00
chenjian
74d0f1c01f
[Optim] Robust sync status when preempted happens ( #5796 )
...
* [Bug fix] Sync status for caching output cache
* fix
* fix
* fix bug
* fix
* fix
* support xpu
* fix
* fix
* fix
* fix
* fix
* fix ci
* fix ci
* fix xpu
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-14 12:07:33 +08:00
周周周
ad8d05a8de
[Optimization] Do not compute ATTN padding part in CUDA graph mode ( #5985 )
2026-01-13 11:32:27 +08:00
lzy
223b2f5d86
Support setting communication groups in custom_allreduce and the all-to-all/transpose fused operator during the decoding phase. ( #5917 )
2026-01-12 14:09:39 +08:00
sunxin
17ef3920f3
remove decoder_num_blocks_device memset ( #5982 )
2026-01-10 21:22:06 +08:00
周周周
b8d9daa785
MLA clean code ( #5979 )
2026-01-10 21:05:00 +08:00
zhupengyang
9db48ecb34
[XPU] fix dp4 ( #5946 )
2026-01-09 20:36:53 +08:00
xiaoxiaohehe001
00a01ae024
[Feature] Support redundant expert for eplb ( #5918 )
...
* [BugFix] support redundant expert for eplb
* support redundant expert for eplb
* support redundant expert for eplb
* update
* fix ci eplb
2026-01-09 17:13:24 +08:00
yangjianfengo1
16e1992eba
[Bugfix] Increase the shape of w4afp8 gemm ( #5957 )
...
* increase w4afp8 shape
* increase w4afp8 shape
* code style
2026-01-09 14:11:17 +08:00
GoldPancake
a1fc4e249e
[Bugfix] Fix mtp logprob hang problem when include stop_seq ( #5927 )
...
* fix mtp logprob hang when include stop_seq
2026-01-08 14:21:24 +08:00
lizhenyun01
2be8656c29
[BugFix] fix mtp split kv attention ( #5920 )
...
* [BugFix] fix mtp split kv attention
* clean code
* clean code
2026-01-07 04:07:31 -08:00
kevin
a76e8ae40c
[Feature] support rdma pd dy-c8 ( #5788 )
...
* add rdma pd dy-c8
* update code
2026-01-07 14:55:25 +08:00
周周周
f15df1ec89
Revert cuda check ( #5915 )
...
* commit
* commit
2026-01-07 14:40:18 +08:00
yangjianfengo1
59523b27de
opt w4afp8 ( #5853 )
2026-01-07 12:22:35 +08:00
MingkunZhang
7ad5737560
[Metax] adapt to gemm interface on different versions of maca ( #5905 )
...
Co-authored-by: root <root@lt-wks-10-0-180-15.pub.metax-tech.com>
2026-01-07 10:02:24 +08:00
周周周
83ae59431e
[BugFix] fix BatchMLAWithPagedKVCacheKernel O_tmp ( #5895 )
2026-01-06 15:39:06 +08:00
ddchenhao66
733014bf32
[XPU] Support EP4TP1 in pd disaggregation ( #5860 )
...
Co-authored-by: ddchenhao66 <dhaochen163.com>
2026-01-06 15:25:36 +08:00
lizexu123
acdf0cd1d9
fix hadamard_block_size ( #5888 )
2026-01-06 14:12:14 +08:00
Yuanle Liu
5e729bc2ba
[OPs] ep_moe_expert_dispatch.cu dispatch num_experts_per_rank 5 ( #5890 )
2026-01-06 10:39:35 +08:00
Neil Zhu
272a371635
[Metax] optimize flash attention backend ( #5876 )
2026-01-06 09:52:09 +08:00
周周周
ab553b3b8b
revert cuda_check ( #5883 )
2026-01-05 20:51:31 +08:00
lizexu123
1d3ae7c024
[BugFix] fix w4afp8 tp=8 ( #5868 )
...
* fix w4afp8 tp=8
* fix
2026-01-05 18:59:02 +08:00
cmcamdy
690d4bcdb0
[XPU] Speculative Decoding with PD ( #5856 )
...
* [XPU] Speculative Decoding with PD
* fix post process
* share kv cache sender
* support speculate decoding step system cache
* support speculate decoding step system cache
---------
Co-authored-by: root <root@gajl-bbc-onlinec-com-1512108.gajl.baidu.com>
2026-01-05 17:31:03 +08:00
chen
ac39c0f887
support fa3 qwen-vl rope ( #5869 )
2026-01-05 15:29:34 +08:00
sunxin
adb91dcacc
[BugFix] Fix wint4 ep issue caused by empty run ( #5870 )
2026-01-05 14:24:37 +08:00
周周周
e3957a5ebc
[Others] remove template NUM_EXPERTS_PER_RANK in permute_x_fp8_kernel ( #5620 )
2026-01-04 11:21:15 +08:00
MingkunZhang
f732d7d2ad
[Metax] adapt prefix caching & cpu swap ( #5844 )
...
Co-authored-by: root <root@lt-wks-10-0-180-15.pub.metax-tech.com>
2025-12-31 17:02:48 +08:00
ddchenhao66
9e45ef7ca9
[XPU]MAX_BSZ aligns gpu settings and disable prefix cache in OCR VL ( #5831 )
2025-12-31 09:49:12 +08:00
Sunny-bot1
598d292a69
w4afp8 fix quant ( #5830 )
2025-12-30 21:16:13 +08:00
lizexu123
44a13e4557
[Feature] support w4afp8 v1_loader and v0_loader(tp>1) ( #5757 )
...
* support
* fix
* support w4afp8 v1_loader and v0_loader
* fix
* fix test
* fix test
* fix test
* fix moe.py
* add test_ernie_4_5_w4afp8
* add test
* delete tensor
* fix test
* fix
* add
* fix test
2025-12-30 14:11:52 +08:00
Yonghua Li
a8d3e3ba12
[BugFix] fix shm opened but not closed in set_data_ipc ( #5826 )
2025-12-29 23:35:07 +08:00
CSWYF3634076
9286403570
[Models] Add Qwen3-VL Model Support ( #5763 )
...
* support v1 loader
* remove useless code
* remove useless
* [Model] support Qwen3VL images success
* [Model] support Qwen3VL rope_3d
* [Model] support Qwen3VL remove log
* [Model] support Qwen3VL RL
* [Model] support Qwen3VL tp
* [Model] support Qwen3VL video
* [Model] support Qwen3VL fix ernievl
* [Model] support Qwen3VL fix get_image_boundaries.cc array out of bounds
* [Model] support Qwen3VL fix multi card
* [Model] support Qwen3VL file close
* [Model] support Qwen3VL fix ce
* [Model] support Qwen3VL fix unittest
* [Model] support Qwen3VL add unittest
---------
Co-authored-by: Ayakouji <yuhongh@qq.com>
2025-12-29 17:39:33 +08:00
周周周
a3f0696e35
[BugFix] fix compile error in sm89 ( #5809 )
2025-12-29 16:55:52 +08:00
Longzhi Wang
11329ee35e
[Model] support mode config for expert_dispatch ( #5748 )
2025-12-29 13:37:20 +08:00
Ryan
09229d8953
change count_tokens_per_expert_func declaration: Tensor -> vector<Tensor> ( #5794 )
2025-12-26 19:02:28 +08:00
Ryan
724045c426
add some op infershape&dtype ( #5762 )
2025-12-26 16:17:39 +08:00
周周周
03363cab4c
make flash_mask attention pybind ( #5783 )
2025-12-26 14:31:35 +08:00
kevin
5538dda3c8
[Feature] pd support dy-c8 ipc ( #5750 )
...
* pd support dy-c8 ipc
* update code
* support v0
* update code
2025-12-25 21:22:34 +08:00
freeliuzc
9018ccf74e
[Speculative Decoding] Fix attn_mask_offset for multi-step MTP in mixed and PD-split modes ( #5738 )
...
* fix attn_mask_offset in mtp with multi-step and pd-split-mode
* fix xpu operator register
* update pmtp multi-step mtp strategy in d-split -mode
* add note
* fix xpu register
2025-12-25 01:54:59 -08:00
Juncai
412867fd99
[Feature] Support KV Cache Storage ( #5571 )
...
* Support Mooncake Store
* up
* up
* add op
* fix conflict
* fix error
* up for comments
* avoid thread lock
* up
* fix unittest
* fix unittest
* remove debug info
* consider tp_size > 1
* add default rdma_nics
* add utils
* up
* fix error
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2025-12-25 16:30:35 +08:00
RuohengMa
e154c03416
[XPU] refine moe_expert_ffn ut ( #5743 )
2025-12-25 10:35:24 +08:00