RuohengMa
9d3551cfbb
[XPU] add support for rope3d ( #7518 )
...
* [XPU] add support for rope3d
* support decoder
---------
Co-authored-by: yinwei <yinwei_hust@163.com>
2026-04-21 13:39:00 +08:00
周周周
609f649dd7
[OP] Add flashmla baseline implementation and precision test ( #7477 )
2026-04-21 13:37:52 +08:00
YuBaoku
3c8c82d5d4
[CI] Remove flashinfer cache cleanup to reduce unit test runtime ( #7476 )
2026-04-21 11:38:30 +08:00
YuBaoku
5e866e3e21
[CI] Add --workers=1 to keep test behavior consistent with default change
2026-04-20 22:31:42 +08:00
Zhang Yulong
30db3e9d8f
[benchmark] update tools ( #7512 )
2026-04-20 19:40:17 +08:00
YuBaoku
c9783a84a6
[CI] Temporarily pin paddlepaddle-gpu to 3.5.0.dev20260417 ( #7486 )
2026-04-20 19:35:34 +08:00
K11OntheBoat
b79b094dcc
Change default workers and max-concurrency when launch api-server ( #7457 )
...
Co-authored-by: zhangxiao35 <zhangxiao35@baidu.com>
2026-04-20 15:55:06 +08:00
ZhijunLStudio
a0c39cc9af
[Typo] Fix parameter name typo in slice_fn: paramter -> parameter ( #7462 )
...
Fix the typo in the internal parameter name of slice_fn():
weight_or_paramter -> weight_or_parameter.
No functional changes.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-20 10:06:02 +08:00
RuohengMa
cf5bc5e510
[XPU] fix bug and temporary fix for rope 3d ( #7465 )
2026-04-20 09:51:27 +08:00
YuBaoku
b2aca6c550
[CI] Improve logging check accuracy and unify error log cleanup ( #7473 )
2026-04-18 19:41:21 +08:00
freeliuzc
22a4f6019d
[Speculative Decoding][BugFix] Fix apply repeat times penalty kernel and change spec default verify strategy ( #7467 )
...
* fix repeat_time kernel and change default spec verify strategy
* fix unit_test
2026-04-18 00:38:01 +08:00
GoldPancake
df3b4e12f4
[Speculative Decoding] Add MTP logprob support for PD disaggregation ( #7442 )
...
* support mtp logprob in pd
* fix
* fix
* fix
* fix xpu bugs
2026-04-17 21:37:38 +08:00
yzwu
3b9d6c60d3
[Iluvatar] fix ci error and update readme ( #7453 )
2026-04-17 20:42:56 +08:00
jackyYang6
a729e0f729
[Bugfix][RL] fix control request timeout in async update weights pipeline ( #7430 )
2026-04-17 16:45:33 +08:00
freeliuzc
43685a98a7
[BugFix] Fix real token exceeding max_batched_tokens limit ( #7438 )
...
* fix max_num_batched_tokens error compute
* add temporary solution
* fix bug
2026-04-17 16:18:07 +08:00
jc
6847891241
Mooncake storage register local buffer by chunk ( #7416 )
2026-04-17 10:39:34 +08:00
YuBaoku
91b8bf20f0
[CI] Add pytest failure log collection and persistence ( #7405 )
2026-04-16 22:56:17 +08:00
AIbin
6ce4854714
[Feature] Support MOE Cutlass backend for latent MOE ( #7428 )
...
* support moe cutlass backend latent moe
2026-04-16 22:11:49 +08:00
ShaneGZhu
2d8338f9e4
[Optimization][DeepSeekV3.2] Reduce slot_mapping compute frequency from twice per layer to a single pre-processing step. ( #7367 )
2026-04-16 19:54:12 +08:00
RichardWooSJTU
d2d633b05c
Allow parallel DP startup ( #7426 )
2026-04-16 18:43:09 +08:00
RichardWooSJTU
420a8c1af5
fix deep gemm import ( #7425 )
2026-04-16 17:56:56 +08:00
ddchenhao66
e9527208d9
[BugFix][XPU] Fix kv_cache management bug ( #7420 )
2026-04-16 15:45:45 +08:00
zhouchong
6e16438a57
[Feature] implement log channel separation and request log level system ( #7190 )
...
* feat: implement log channel separation and request log level system
* fix: log system improvements based on review
* add request_id to error logs, use RequestLogLevel enum, and unify logger implementation from utils to logger module
2026-04-16 15:13:05 +08:00
Jiajun Ji
29495b2cf1
[XPU] Unify Spec and non-spec branch ( #6947 ) ( #7180 )
...
* [XPU] cherry-pick PR-6947
* [XPU] use unified_update_model_status.
* refactor xpu_model_runner.
* refactor sampler.
* fix codestyle.
* Fix XPU speculative decoding: rename output tensors to cu_seqlens_q_output/batch_id_per_token_output, correct
WRAPPER_CHECK_PTR types, and fix dynamic gather shape in verify_draft_tokens path.
* fix codestyle.
* replace output_padding_offset with is_speculative flag in gather_next_token.
* rename hidden_states.
* unify cu_seqlens_q_output and batch_id_per_token_output init.
---------
Co-authored-by: cmcamdy <1027740945@qq.com>
2026-04-16 14:58:38 +08:00
YuBaoku
17002edc47
[CI] Add approval check for logging-related modifications ( #7429 )
2026-04-16 14:50:22 +08:00
RuohengMa
de0c5e68fb
[XPU] Split the block_attn operator into smaller operators ( #6798 )
...
* split block_attn
* adapt to latest vllm
* fix unit tests
* delete mtp+cudagraph 4 cards test
* fix vl model
* fix mtp
* fix slot mapping
2026-04-16 14:28:40 +08:00
Bingoo
6b891da02b
[Optimization] enable trtllm_all_reduce fusion kernel in glm model ( #6660 )
...
* enable trtllm_all_reduce fusion kernel in glm model
* fix conflict
* format update
* fix a bug
* modify test
* modify test
* support empty tensor and modify test
* fix test_linear config issues
* modify test name
* add edge test case
* modify format
* fix conflict
* modify default max token num in trtllm_allreduce_fusion
* add max token num branch for trtllm_allreduce_fusion
* fix format
* fix rmsnorm config issue
* modify 2025 to 2026
* using compat guard
* Lazily import flashinfer.comm and fix test config issue
* fix test issues
* add flashinfer cache dir clean mechanism
* fix some issues
2026-04-16 14:10:19 +08:00
jc
e53f5184ac
PD deployment support without router ( #7412 )
2026-04-15 20:13:07 +08:00
GoldPancake
a498720a75
[RL] Add clear_graph_opt_backend for glm4_mtp ( #7378 )
...
* add clear_graph func
* fix spell
2026-04-15 19:44:15 +08:00
RichardWooSJTU
dec0b060fc
[Optimization] Auto set num_max_dispatch_tokens_per_rank ( #7237 )
...
* auto set num_max_dispatch_tokens_per_rank
* fix ci
* fix ci
* fix ci
2026-04-15 19:13:38 +08:00
luukunn
3f84d8d893
[DataProcessor] Refactor multimodal processor: extract encoding strategies and unify MM processing pipeline ( #7298 )
...
* merge mm processor
2026-04-15 19:01:06 +08:00
Bingoo
a218d29488
modify flash_mask version ( #7413 )
2026-04-15 18:16:58 +08:00
luukunn
14d556692b
[BugFix] fix tool call parser ( #7369 )
...
* fix tool call parser
* add unit test
* fix unit test
* add unit test
2026-04-15 16:21:46 +08:00
AIbin
8eebbcaf15
[BugFix][Scheduler]Fix FD_DISABLE_CHUNKED_PREFILL max_num_batched_tokens limit ( #7407 )
...
* fix FD_DISABLE_CHUNKED_PREFILL max_num_batched_tokens=max_model_len
* fix FD_DISABLE_CHUNKED_PREFILL max_num_batched_tokens=max_model_len
2026-04-15 15:55:11 +08:00
周周周
5e54770b2e
[Feature] Add latent mode support for the MoE layer ( #7382 )
2026-04-15 13:57:07 +08:00
lonelygsh
f7a2418ce2
[Speculate Decoding] Fix reasoning_phase_token_constraint call args in SpeculativeSampler ( #7402 )
2026-04-15 12:45:23 +08:00
AIbin
8995a38fa4
fix dsa indexer norm to layernorm ( #7398 )
2026-04-15 11:42:45 +08:00
AIbin
bb30f88f1a
[Models] support MLA gate attention ( #7404 )
...
* support mla gate attn
* support mla gate attn
2026-04-15 11:42:34 +08:00
chen
616b29ce08
check init_flash_attn_version log ( #7399 )
2026-04-15 11:05:10 +08:00
cmcamdy
13b9fe7299
[XPU] add verify draft tokens ( #6947 )
...
* [XPU] add verify draft tokens
* fix test
* fix code style
* use sync cpy
* fix code style
* fix kernel check
* fix random seed
* fix test
* fix check
* fix eos set
* fix verify
* fix verify
2026-04-15 10:18:33 +08:00
lonelygsh
e0a1653b26
[Speculate Decoding] Fix bug of reasoning_phase_token_constraint kernel ( #7349 )
...
Co-authored-by: guanshihui <guanshihui@baidu.com>
2026-04-14 20:57:11 +08:00
sunxin
7b0baced17
fix rl moe gate type ( #7393 )
2026-04-14 20:04:04 +08:00
Echo-Nie
8819a039c9
[Others] Fix typo ( #7280 )
...
* typo
* typo
* typo
* typo
2026-04-14 17:28:22 +08:00
luukunn
9d9d79c457
[DataProcessor] add strict ( #7307 )
...
* add strict
* fix
2026-04-14 17:25:38 +08:00
kevin
ff47701f31
[BugFix][PD Disaggregation][KVCache] Fix low cache hit rate in PD split scenario ( #7364 )
...
## Motivation
In the PD disaggregation scenario, after the decode node receives a request forwarded by the prefill node, it does not promptly update the cache block hit information, resulting in a low prefix cache hit rate and degraded inference performance.
## Modifications
1. In the `_free_blocks_when_stop` method, additionally exclude the prefill node (`splitwise_role == "prefill"`) from cache block updates, so the prefill node does not update the cache redundantly and corrupt its state.
2. After the decode node successfully allocates a request (`_alloc_requests_with_cache`), proactively call `update_cache_blocks` with `need_prefill_tokens` to update the cache block information, ensuring the decode node correctly accounts for prefix cache hits.
2026-04-14 16:15:43 +08:00
Bingoo
9c23e6154c
[Others] replace tool_helpers with fast_dataindex ( #7353 )
...
* replace tool_helpers with fast_dataindex
* modify others requirement
2026-04-14 15:13:54 +08:00
xiaoxiaohehe001
abba29b348
[BugFix] fix mm rope ( #7274 )
2026-04-14 11:36:08 +08:00
Yuanle Liu
8f21c9caa6
[BugFix] fix gitignore claude ( #7381 )
2026-04-13 20:32:45 -07:00
zhupengyang
27b00cf385
[XPU] glm-4.5-air ( #7071 )
2026-04-14 11:31:49 +08:00
chen
26c47c2afc
update attn_mask_q 2 ( #7371 )
2026-04-13 23:06:04 +08:00