Commit Graph

5089 Commits

Author SHA1 Message Date
yzwu e4a4573080 [Iluvatar] Fix cannot import name mtp_save_first_token (#7495)
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-04-21 14:09:08 +08:00
RuohengMa 9d3551cfbb [XPU] add support for rope3d (#7518)
* [XPU] add support for rope3d

* support decoder

---------

Co-authored-by: yinwei <yinwei_hust@163.com>
2026-04-21 13:39:00 +08:00
周周周 609f649dd7 [OP] Add flashmla baseline implementation and precision test (#7477) 2026-04-21 13:37:52 +08:00
YuBaoku 3c8c82d5d4 [CI] Remove flashinfer cache cleanup to reduce unit test runtime (#7476) 2026-04-21 11:38:30 +08:00
YuBaoku 5e866e3e21 [CI] Add --workers=1 to keep test behavior consistent with default change 2026-04-20 22:31:42 +08:00
Zhang Yulong 30db3e9d8f [benchmark] update tools (#7512) 2026-04-20 19:40:17 +08:00
YuBaoku c9783a84a6 [CI] Temporarily pin paddlepaddle-gpu to 3.5.0.dev20260417 (#7486) 2026-04-20 19:35:34 +08:00
K11OntheBoat b79b094dcc Change default workers and max-concurrency when launch api-server (#7457)
Co-authored-by: zhangxiao35 <zhangxiao35@baidu.com>
2026-04-20 15:55:06 +08:00
ZhijunLStudio a0c39cc9af [Typo] Fix parameter name typo in slice_fn: paramter -> parameter (#7462)
Fix the typo in the internal parameter name of slice_fn():
weight_or_paramter -> weight_or_parameter.

No functional changes.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-20 10:06:02 +08:00
RuohengMa cf5bc5e510 [XPU] fix bug and temporary fix for rope 3d (#7465) 2026-04-20 09:51:27 +08:00
YuBaoku b2aca6c550 [CI] Improve logging check accuracy and unify error log cleanup (#7473) 2026-04-18 19:41:21 +08:00
freeliuzc 22a4f6019d [Speculative Decoding][BugFix] Fix apply repeat times penalty kernel and change spec default verify strategy (#7467)
* fix repeat_time kernel and change default spec verify strategy

* fix unit_test
2026-04-18 00:38:01 +08:00
GoldPancake df3b4e12f4 [Speculative Decoding] Add MTP logprob support for PD disaggregation (#7442)
* support mtp logprob in pd

* fix

* fix

* fix

* fix xpu bugs
2026-04-17 21:37:38 +08:00
yzwu 3b9d6c60d3 [Iluvatar] fix ci error and update readme (#7453) 2026-04-17 20:42:56 +08:00
jackyYang6 a729e0f729 [Bugfix][RL] fix control request timeout in async update weights pipeline (#7430) 2026-04-17 16:45:33 +08:00
freeliuzc 43685a98a7 [BugFix] Fix real token exceeding max_batched_tokens limit (#7438)
* fix max_num_batched_tokens error compute

* add temporary solution

* fix bug
2026-04-17 16:18:07 +08:00
jc 6847891241 Mooncake storage register local buffer by chunk (#7416) 2026-04-17 10:39:34 +08:00
YuBaoku 91b8bf20f0 [CI] Add pytest failure log collection and persistence (#7405) 2026-04-16 22:56:17 +08:00
AIbin 6ce4854714 [Feature] Support MOE Cutlass backend for latent MOE (#7428)
* support moe cutlass backend for latent moe
2026-04-16 22:11:49 +08:00
ShaneGZhu 2d8338f9e4 [Optimization][DeepSeekV3.2]Reducing slot_mapping compute frequency from twice per layer to a single pre-processing step. (#7367) 2026-04-16 19:54:12 +08:00
RichardWooSJTU d2d633b05c allow parallel dp starting (#7426) 2026-04-16 18:43:09 +08:00
RichardWooSJTU 420a8c1af5 fix deep gemm import (#7425) 2026-04-16 17:56:56 +08:00
ddchenhao66 e9527208d9 [BugFix][XPU] Fix kv_cache management bug (#7420) 2026-04-16 15:45:45 +08:00
zhouchong 6e16438a57 [Feature] implement log channel separation and request log level system (#7190)
* feat: implement log channel separation and request log level system

* fix: log system improvements based on review

* add request_id to error logs, use RequestLogLevel enum, and unify logger implementation from utils to logger module
2026-04-16 15:13:05 +08:00
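The request log level system described in the commit above might look like the following minimal sketch. Only the `RequestLogLevel` enum name and the idea of adding `request_id` to error logs come from the commit message; the helper name, values, and signature are assumptions for illustration.

```python
import logging
from enum import IntEnum


class RequestLogLevel(IntEnum):
    # Per-request verbosity, independent of the global logger level.
    # The enum name comes from the commit; the values are illustrative.
    DEBUG = logging.DEBUG
    INFO = logging.INFO
    ERROR = logging.ERROR


def log_request_error(logger, request_id, message, level=RequestLogLevel.ERROR):
    # The commit adds request_id to error logs; tagging every line with it
    # lets a separated error channel be grepped or filtered per request.
    logger.log(int(level), "[request_id=%s] %s", request_id, message)
```

Routing such tagged records to a dedicated error-log handler is one way to realize the "log channel separation" the commit title refers to.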
Jiajun Ji 29495b2cf1 [XPU] Unify Spec and non-spec branch.(#6947) (#7180)
* [XPU] cherry-pick PR-6947

* [XPU] use unified_update_model_status.

* refactor xpu_model_runner.

* refactor sampler.

* fix codestyle.

* Fix XPU speculative decoding: rename output tensors to cu_seqlens_q_output/batch_id_per_token_output, correct
  WRAPPER_CHECK_PTR types, and fix dynamic gather shape in verify_draft_tokens path.

* fix codestyle.

* replace output_padding_offset with is_speculative flag in gather_next_token.

* rename hiddden_states.

* unify cu_seqlens_q_output and batch_id_per_token_output init.

---------

Co-authored-by: cmcamdy <1027740945@qq.com>
2026-04-16 14:58:38 +08:00
YuBaoku 17002edc47 [CI] Add approval check for logging-related modifications (#7429) 2026-04-16 14:50:22 +08:00
RuohengMa de0c5e68fb [XPU] Split the block_attn operator into smaller operators (#6798)
* split block_attn

* adapt to latest vllm

* fix unit tests

* delete mtp+cudagraph 4 cards test

* fix vl model

* fix mtp

* fix slot mapping
2026-04-16 14:28:40 +08:00
Bingoo 6b891da02b [Optimization] enable trtllm_all_reduce fusion kernel in glm model (#6660)
* enable trtllm_all_reduce fusion kernel in glm model

* fix conflict

* format update

* fix a bug

* modify test

* modify test

* support empty tensor and modify test

* fix test_linear config issues

* modify test name

* add edge test case

* modify format

* fix conflict

* modify default max token num in trtllm_allreduce_fusion

* add max token num branch for trtllm_allreduce_fusion

* fix format

* fix rmsnorm config issue

* modify 2025 to 2026

* use compat guard

* Lazily import flashinfer.comm and fix test config issue

* fix test issues

* add flashinfer cache dir clean mechanism

* fix some issues
2026-04-16 14:10:19 +08:00
jc e53f5184ac PD deployment support without router (#7412) 2026-04-15 20:13:07 +08:00
GoldPancake a498720a75 [RL] Add clear_graph_opt_backend for glm4_mtp (#7378)
* add clear_grpah func

* fix spell
2026-04-15 19:44:15 +08:00
RichardWooSJTU dec0b060fc [Optimization] Auto set num_max_dispatch_tokens_per_rank (#7237)
* auto set num_max_dispatch_tokens_per_rank

* fix ci

* fix ci

* fix ci
2026-04-15 19:13:38 +08:00
luukunn 3f84d8d893 [DataProcessor] Refactor multimodal processor: extract encoding strategies and unify MM processing pipeline (#7298)
* merge mm processor
2026-04-15 19:01:06 +08:00
Bingoo a218d29488 modify flash_mask version (#7413) 2026-04-15 18:16:58 +08:00
luukunn 14d556692b [BugFix] fix tool call parser (#7369)
* fix tool call parser

* add unit test

* fix unit test

* add unit test
2026-04-15 16:21:46 +08:00
AIbin 8eebbcaf15 [BugFix][Scheduler]Fix FD_DISABLE_CHUNKED_PREFILL max_num_batched_tokens limit (#7407)
* fix FD_DISABLE_CHUNKED_PREFILL max_num_batched_tokens=max_model_len

* fix FD_DISABLE_CHUNKED_PREFILL max_num_batched_tokens=max_model_len
2026-04-15 15:55:11 +08:00
周周周 5e54770b2e [Feature] Add latent mode support for MoE layers (#7382) 2026-04-15 13:57:07 +08:00
lonelygsh f7a2418ce2 [Speculate Decoding] Fix reasoning_phase_token_constraint call args in SpeculativeSampler (#7402) 2026-04-15 12:45:23 +08:00
AIbin 8995a38fa4 fix dsa indexer norm to layernorm (#7398) 2026-04-15 11:42:45 +08:00
AIbin bb30f88f1a [Models] support MLA gate attention (#7404)
* support mla gate attn

* support mla gate attn
2026-04-15 11:42:34 +08:00
chen 616b29ce08 check init_flash_attn_version log (#7399) 2026-04-15 11:05:10 +08:00
cmcamdy 13b9fe7299 [XPU] add verify draft tokens (#6947)
* [XPU] add verify draft tokens

* fix test

* fix code style

* use sync cpy

* fix code style

* fix kernel check

* fix random seed

* fix test

* fix check

* fix eos set

* fix verify

* fix verify
2026-04-15 10:18:33 +08:00
lonelygsh e0a1653b26 [Speculate Decoding] Fix bug of reasoning_phase_token_constraint kernel (#7349)
Co-authored-by: guanshihui <guanshihui@baidu.com>
2026-04-14 20:57:11 +08:00
sunxin 7b0baced17 fix rl moe gate type (#7393) 2026-04-14 20:04:04 +08:00
Echo-Nie 8819a039c9 [Others] Fix typo (#7280)
* typo

* typo

* typo

* typo
2026-04-14 17:28:22 +08:00
luukunn 9d9d79c457 [DataProcessor] add strict (#7307)
* add strict

* fix
2026-04-14 17:25:38 +08:00
kevin ff47701f31 [BugFix][PD Disaggregation][KVCache] Fix low cache hit rate in PD split scenario (#7364)
## Motivation

In the PD-disaggregated scenario, the decode node did not promptly update cache-block hit information after receiving requests forwarded by the prefill node, resulting in a low prefix-cache hit rate and degraded inference performance.

## Modifications

1. In `_free_blocks_when_stop`, additionally exclude prefill nodes (`splitwise_role == "prefill"`)
   from the cache-block update, preventing prefill nodes from corrupting the state with duplicate updates.
2. After the decode node successfully allocates a request (`_alloc_requests_with_cache`), proactively call
   `update_cache_blocks` with `need_prefill_tokens` to update the cache-block information,
   ensuring the decode node correctly tracks already-hit prefix cache.
2026-04-14 16:15:43 +08:00
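The two fixes described in the commit above can be sketched as follows. The method names `_free_blocks_when_stop`, `_alloc_requests_with_cache`, `update_cache_blocks`, and the `splitwise_role` / `need_prefill_tokens` names come from the commit message; the surrounding classes, fields, and signatures are invented for illustration and are not FastDeploy's actual implementation.

```python
class CacheManager:
    """Toy prefix-cache bookkeeping: tracks how many leading tokens of each
    request are known to be resident in cache blocks."""

    def __init__(self):
        self.cached_tokens = {}  # request_id -> cached prefix token count

    def update_cache_blocks(self, request_id, num_tokens):
        # Record that this request's first `num_tokens` tokens are cached,
        # so later requests sharing the prefix count as hits.
        self.cached_tokens[request_id] = max(
            self.cached_tokens.get(request_id, 0), num_tokens)


class Scheduler:
    def __init__(self, splitwise_role, cache):
        self.splitwise_role = splitwise_role  # "prefill" or "decode"
        self.cache = cache

    def _alloc_requests_with_cache(self, request_id, need_prefill_tokens):
        # ... actual block allocation elided ...
        allocated = True
        if allocated and self.splitwise_role == "decode":
            # Fix 2: after a forwarded request is allocated on the decode
            # node, record the hit info so prefix-cache state stays accurate.
            self.cache.update_cache_blocks(request_id, need_prefill_tokens)
        return allocated

    def _free_blocks_when_stop(self, request_id, num_tokens):
        # Fix 1: prefill nodes are excluded from the cache-block update here,
        # so they do not overwrite the decode node's bookkeeping.
        if self.splitwise_role != "prefill":
            self.cache.update_cache_blocks(request_id, num_tokens)
        # ... actual block freeing elided ...
```

The key point is the asymmetry by role: only the decode node records hit information on allocation, and prefill nodes skip the update on stop.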
Bingoo 9c23e6154c [Others] replace tool_helpers with fast_dataindex (#7353)
* replace tool_helpers with fast_dataindex

* modify others requirement
2026-04-14 15:13:54 +08:00
xiaoxiaohehe001 abba29b348 [BugFix] fix mm rope (#7274) 2026-04-14 11:36:08 +08:00
Yuanle Liu 8f21c9caa6 [BugFix] fix gitignore claude (#7381) 2026-04-13 20:32:45 -07:00
zhupengyang 27b00cf385 [XPU] glm-4.5-air (#7071) 2026-04-14 11:31:49 +08:00