sunxin
a4144e0b8e
[Optimization] Avoid unnecessary penalty computation ( #6078 )
2026-01-19 15:24:12 +08:00
GoldPancake
bda38aa519
[Speculative Decoding] Support MTP for GLM-4.5-Air ( #6047 )
...
* glm mtp
* add spec neox partial rope
2026-01-16 14:35:24 +08:00
fxyfxy777
4c92035f2d
[Feature] Unify fp8 block_wise quant ops ( #5991 )
...
* quant stash
* blockwise_quant
* precommit
* rm tensor.cut
* tp ok
* add swiglu
* rm outdated code
* fix activate ut
* change baseline
* fix baseline error
2026-01-15 05:50:37 -08:00
freeliuzc
49617d9832
[Feature]Support tag phase token enforce generation ( #6034 )
...
* support tag phase token enforce generation
* optimize note and some feature
* fix sampler unit test
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-15 03:59:55 -08:00
lizexu123
6619298b50
[Optim] Optimize grid dimensions using max_tokens_per_expert for MoE models ( #6007 )
...
* update w4afp8
* build.sh ok
* support cuda_graph
* fix
* add test
* fix max_tokens_per_expert
* >=70
* fix
* compute_max_tokens_from_prefix_sum in w4afp8
* compute_max_tokens use cub
2026-01-15 19:18:42 +08:00
Daci
e10b51b8c6
[Feature] get_output_kv_signal blocking read mode & send_first_token ( #5836 )
...
* get_output_kv_signal blocking read mode
* send first token before recycle
* xpu get_output_kv_signal blocking read mode
---------
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-01-15 14:11:03 +08:00
chenjian
74d0f1c01f
[Optim] Robust sync status when preemption happens ( #5796 )
...
* [Bug fix] Sync status for caching output cache
* fix
* fix
* fix bug
* fix
* fix
* support xpu
* fix
* fix
* fix
* fix
* fix
* fix ci
* fix ci
* fix xpu
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-14 12:07:33 +08:00
周周周
ad8d05a8de
[Optimization] Do not compute ATTN padding part in CUDA graph mode ( #5985 )
2026-01-13 11:32:27 +08:00
lzy
223b2f5d86
Support setting communication groups in custom_allreduce and the all-to-all/transpose fused operator during the decoding phase ( #5917 )
2026-01-12 14:09:39 +08:00
sunxin
17ef3920f3
remove decoder_num_blocks_device memset ( #5982 )
2026-01-10 21:22:06 +08:00
周周周
b8d9daa785
MLA clean code ( #5979 )
2026-01-10 21:05:00 +08:00
xiaoxiaohehe001
00a01ae024
[Feature] Support redundant expert for eplb ( #5918 )
...
* [BugFix] support redundant expert for eplb
* support redundant expert for eplb
* support redundant expert for eplb
* update
* fix ci eplb
2026-01-09 17:13:24 +08:00
GoldPancake
a1fc4e249e
[Bugfix] Fix mtp logprob hang problem when stop_seq is included ( #5927 )
...
* fix mtp logprob hang when stop_seq is included
2026-01-08 14:21:24 +08:00
lizhenyun01
2be8656c29
[BugFix] fix mtp split kv attention ( #5920 )
...
* [BugFix] fix mtp split kv attention
* clean code
* clean code
2026-01-07 04:07:31 -08:00
kevin
a76e8ae40c
[Feature] support rdma pd dy-c8 ( #5788 )
...
* add rdma pd dy-c8
* update code
2026-01-07 14:55:25 +08:00
周周周
f15df1ec89
Revert cuda check ( #5915 )
...
* commit
* commit
2026-01-07 14:40:18 +08:00
yangjianfengo1
59523b27de
opt w4afp8 ( #5853 )
2026-01-07 12:22:35 +08:00
周周周
83ae59431e
[BugFix] fix BatchMLAWithPagedKVCacheKernel O_tmp ( #5895 )
2026-01-06 15:39:06 +08:00
Yuanle Liu
5e729bc2ba
[OPs] ep_moe_expert_dispatch.cu dispatch num_experts_per_rank 5 ( #5890 )
2026-01-06 10:39:35 +08:00
周周周
ab553b3b8b
revert cuda_check ( #5883 )
2026-01-05 20:51:31 +08:00
lizexu123
1d3ae7c024
[BugFix] fix w4afp8 tp=8 ( #5868 )
...
* fix w4afp8 tp=8
* fix
2026-01-05 18:59:02 +08:00
chen
ac39c0f887
support fa3 qwen-vl rope ( #5869 )
2026-01-05 15:29:34 +08:00
sunxin
adb91dcacc
[BugFix] Fix wint4 ep issue caused by empty run ( #5870 )
2026-01-05 14:24:37 +08:00
周周周
e3957a5ebc
[Others] remove template NUM_EXPERTS_PER_RANK in permute_x_fp8_kernel ( #5620 )
2026-01-04 11:21:15 +08:00
Sunny-bot1
598d292a69
w4afp8 fix quant ( #5830 )
2025-12-30 21:16:13 +08:00
Yonghua Li
a8d3e3ba12
[BugFix] fix shm opened but not closed in set_data_ipc ( #5826 )
2025-12-29 23:35:07 +08:00
CSWYF3634076
9286403570
[Models] Add Qwen3-VL Model Support ( #5763 )
...
* support v1 loader
* remove useless code
* remove useless
* [Model] support Qwen3VL images success
* [Model] support Qwen3VL rope_3d
* [Model] support Qwen3VL remove log
* [Model] support Qwen3VL RL
* [Model] support Qwen3VL tp
* [Model] support Qwen3VL video
* [Model] support Qwen3VL fix ernievl
* [Model] support Qwen3VL fix get_image_boundaries.cc array out of bounds
* [Model] support Qwen3VL fix multi card
* [Model] support Qwen3VL file close
* [Model] support Qwen3VL fix ce
* [Model] support Qwen3VL fix unittest
* [Model] support Qwen3VL add unittest
---------
Co-authored-by: Ayakouji <yuhongh@qq.com>
2025-12-29 17:39:33 +08:00
周周周
a3f0696e35
[BugFix] fix compile error in sm89 ( #5809 )
2025-12-29 16:55:52 +08:00
Longzhi Wang
11329ee35e
[Model] support mode config for expert_dispatch ( #5748 )
2025-12-29 13:37:20 +08:00
Ryan
09229d8953
change count_tokens_per_expert_func declaration: Tensor -> vector<Tensor> ( #5794 )
2025-12-26 19:02:28 +08:00
Ryan
724045c426
add some op infershape&dtype ( #5762 )
2025-12-26 16:17:39 +08:00
周周周
03363cab4c
make flash_mask attention pybind ( #5783 )
2025-12-26 14:31:35 +08:00
kevin
5538dda3c8
[Feature] pd support dy-c8 ipc ( #5750 )
...
* pd support dy-c8 ipc
* update code
* support v0
* update code
2025-12-25 21:22:34 +08:00
freeliuzc
9018ccf74e
[Speculative Decoding] Fix attn_mask_offset for multi-step MTP in mixed and PD-split modes ( #5738 )
...
* fix attn_mask_offset in mtp with multi-step and pd-split-mode
* fix xpu operator register
* update multi-step mtp strategy in pd-split-mode
* add note
* fix xpu register
2025-12-25 01:54:59 -08:00
Juncai
412867fd99
[Feature] Support KV Cache Storage ( #5571 )
...
* Support Mooncake Store
* up
* up
* add op
* fix conflict
* fix error
* up for comments
* avoid thread lock
* up
* fix unittest
* fix unittest
* remove debug info
* consider tp_size > 1
* add default rdma_nics
* add utils
* up
* fix error
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2025-12-25 16:30:35 +08:00
chen
c7ab32d154
check ( #5736 )
2025-12-24 16:49:20 +08:00
周周周
922a73ddd6
[Others] clean code ( #5691 )
2025-12-24 11:28:47 +08:00
lizexu123
6d323769dd
fix w4afp8 ( #5634 )
2025-12-22 13:39:41 +08:00
chen
a32cb54d0b
[BugFix] Fix custom_all_reduce overflow ( #5662 )
...
* check
* check
* code style
2025-12-19 18:24:21 +08:00
yzwu
ac013803f3
[Iluvatar] Support V1_KVCACHE_SCHEDULER and paddleocr-vl rope mode ( #5555 )
2025-12-18 02:14:25 -08:00
Yuanle Liu
cdc0004894
Revert "[Feature] add ue8m0 for per_token_quant_fp8 ( #5563 )" ( #5611 )
...
This reverts commit 73e1d6aa90.
2025-12-17 13:59:06 +08:00
Yuanle Liu
867803ae10
[BugFix] fix speculate_limit_thinking_content_length ( #5590 )
...
* fix speculate_limit_thinking_content_length
* update
2025-12-16 04:31:45 -08:00
chen
27ef3610b5
support glm fa3 ( #5586 )
2025-12-16 19:33:27 +08:00
fxyfxy777
73e1d6aa90
[Feature] add ue8m0 for per_token_quant_fp8 ( #5563 )
...
* ue8m0
* add default arg
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2025-12-16 18:40:12 +08:00
Echo-Nie
50100f98d7
[Feature] Support fusedmoe on Blackwell ( #5325 )
...
* update sm100
* fix
* fix style
2025-12-16 11:58:50 +08:00
freeliuzc
532f9ba227
[BugFix][Speculative Decoding] (Spent many days to solve) Fix write qknorm cache bug in speculative decoding ( #5491 )
...
* [liuzichang spent 10 days] fix write qknorm cache bug
* fix 'fix cachekv bug'
2025-12-15 18:27:11 +08:00
chen
a389bb7c5c
[Feature][Optimization] Qwen Support Dynamic block_wise_fp8 cache ( #5486 )
2025-12-12 17:10:17 +08:00
Juncai
d67388a479
[PD Disaggregation] Distinguish the pipelines for sending kv signal in different prefill ( #5514 )
...
* Distinguish the pipelines for sending kv signal in different prefill
* up
2025-12-12 14:05:36 +08:00
Neil Zhu
4403a21d4b
[Metax] refactor cutlass moe and optimize flash attention ( #5361 )
...
* [Metax] refactor moe and flash attention backend
---------
Co-authored-by: zhangchenyi_dl <16219492+zhangchenyidl@user.noreply.gitee.com>
2025-12-10 17:15:17 +08:00
Copilot
e38709b499
[BugFix] Fix limit_thinking early return logic in CUDA kernels ( #5471 )
...
* Initial plan
* [BugFix] Fix limit_thinking bug - change AND to OR in condition checks
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>
* Update Chinese comments to reflect OR logic instead of AND
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>
2025-12-10 11:03:19 +08:00