Commit Graph

5069 Commits

Author SHA1 Message Date
RichardWooSJTU d2d633b05c allow parallel dp starting (#7426) 2026-04-16 18:43:09 +08:00
RichardWooSJTU 420a8c1af5 fix deep gemm import (#7425) 2026-04-16 17:56:56 +08:00
ddchenhao66 e9527208d9 [BugFix][XPU] Fix kv_cache management bug (#7420) 2026-04-16 15:45:45 +08:00
zhouchong 6e16438a57 [Feature] implement log channel separation and request log level system (#7190)
* feat: implement log channel separation and request log level system

* fix: log system improvements based on review

* add request_id to error logs, use RequestLogLevel enum, and unify logger implementation from utils to logger module
2026-04-16 15:13:05 +08:00
Jiajun Ji 29495b2cf1 [XPU] Unify Spec and non-spec branch.(#6947) (#7180)
* [XPU] cherry-pick PR-6947

* [XPU] use unified_update_model_status.

* refactor xpu_model_runner.

* refactor sampler.

* fix codestyle.

* Fix XPU speculative decoding: rename output tensors to cu_seqlens_q_output/batch_id_per_token_output, correct
  WRAPPER_CHECK_PTR types, and fix dynamic gather shape in verify_draft_tokens path.

* fix codestyle.

* replace output_padding_offset with is_speculative flag in gather_next_token.

* rename hiddden_states.

* unify cu_seqlens_q_output and batch_id_per_token_output init.

---------

Co-authored-by: cmcamdy <1027740945@qq.com>
2026-04-16 14:58:38 +08:00
YuBaoku 17002edc47 [CI] Add approval check for logging-related modifications (#7429) 2026-04-16 14:50:22 +08:00
RuohengMa de0c5e68fb [XPU] Split the block_attn operator into smaller operators (#6798)
* spliced block_attn

* adapt to latest vllm

* fix unit tests

* delete mtp+cudagraph 4 cards test

* fix vl model

* fix mtp

* fix slot mapping
2026-04-16 14:28:40 +08:00
Bingoo 6b891da02b [Optimization] enable trtllm_all_reduce fusion kernel in glm model (#6660)
* enable trtllm_all_reduce fusion kernel in glm model

* fix conflict

* format update

* fix a bug

* modify test

* modify test

* support empty tensor and modify test

* fix test_linear config issues

* modify test name

* add edge test case

* modify format

* fix conflict

* modify default max token num in trtllm_allreduce_fusion

* add max token num branch for trtllm_allreduce_fusion

* fix format

* fix rmsnorm config issue

* modify 2025 to 2026

* using compat guard

* Lazily import flashinfer.comm and fix test config issue

* fix test issues

* add flashinfer cache dir clean machine

* fix some issues
2026-04-16 14:10:19 +08:00
jc e53f5184ac PD deployment support without router (#7412) 2026-04-15 20:13:07 +08:00
GoldPancake a498720a75 [RL] Add clear_graph_opt_backend for glm4_mtp (#7378)
* add clear_graph func

* fix spell
2026-04-15 19:44:15 +08:00
RichardWooSJTU dec0b060fc [Optimization] Auto set num_max_dispatch_tokens_per_rank (#7237)
* auto set num_max_dispatch_tokens_per_rank

* fix ci

* fix ci

* fix ci
2026-04-15 19:13:38 +08:00
luukunn 3f84d8d893 [DataProcessor] Refactor multimodal processor: extract encoding strategies and unify MM processing pipeline (#7298)
* merge mm processor
2026-04-15 19:01:06 +08:00
Bingoo a218d29488 modify flash_mask version (#7413) 2026-04-15 18:16:58 +08:00
luukunn 14d556692b [BugFix] fix tool call parser (#7369)
* fix tool call parser

* add unit test

* fix unit test

* add unit test
2026-04-15 16:21:46 +08:00
AIbin 8eebbcaf15 [BugFix][Scheduler]Fix FD_DISABLE_CHUNKED_PREFILL max_num_batched_tokens limit (#7407)
* fix FD_DISABLE_CHUNKED_PREFILL max_num_batched_tokens=max_model_len

* fix FD_DISABLE_CHUNKED_PREFILL max_num_batched_tokens=max_model_len
2026-04-15 15:55:11 +08:00
周周周 5e54770b2e [Feature] Add latent mode support for MoE layers (#7382) 2026-04-15 13:57:07 +08:00
lonelygsh f7a2418ce2 [Speculate Decoding] Fix reasoning_phase_token_constraint call args in SpeculativeSampler (#7402) 2026-04-15 12:45:23 +08:00
AIbin 8995a38fa4 fix dsa indexer norm to layernorm (#7398) 2026-04-15 11:42:45 +08:00
AIbin bb30f88f1a [Models] support MLA gate attention (#7404)
* support mla gate attn

* support mla gate attn
2026-04-15 11:42:34 +08:00
chen 616b29ce08 check init_flash_attn_version log (#7399) 2026-04-15 11:05:10 +08:00
cmcamdy 13b9fe7299 [XPU] add verify draft tokens (#6947)
* [XPU] add verify draft tokens

* fix test

* fix code style

* use sync cpy

* fix code style

* fix kernel check

* fix random seed

* fix test

* fix check

* fix eos set

* fix verify

* fix verify
2026-04-15 10:18:33 +08:00
lonelygsh e0a1653b26 [Speculate Decoding] Fix bug of reasoning_phase_token_constraint kernel (#7349)
Co-authored-by: guanshihui <guanshihui@baidu.com>
2026-04-14 20:57:11 +08:00
sunxin 7b0baced17 fix rl moe gate type (#7393) 2026-04-14 20:04:04 +08:00
Echo-Nie 8819a039c9 [Others] Fix typo (#7280)
* typo

* typo

* typo

* typo
2026-04-14 17:28:22 +08:00
luukunn 9d9d79c457 [DataProcessor] add strict (#7307)
* add strict

* fix
2026-04-14 17:25:38 +08:00
kevin ff47701f31 [BugFix][PD Disaggregation][KVCache] Fix low cache hit rate in PD split scenario (#7364)
## Motivation

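In the PD disaggregation scenario, after the decode node receives a request forwarded from the prefill node, it does not promptly update the cache block hit information,
which lowers the prefix cache hit rate and hurts inference performance.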

## Modifications

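1. In the `_free_blocks_when_stop` method, additionally exclude prefill nodes (`splitwise_role == "prefill"`)
   from the cache block update, so that prefill nodes do not update the cache a second time and leave its state inconsistent.
2. After a decode node successfully allocates a request (`_alloc_requests_with_cache`), explicitly call
   `update_cache_blocks` with `need_prefill_tokens` to update the cache block information,
   so that the decode node correctly recognizes the prefix cache blocks that were hit.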
2026-04-14 16:15:43 +08:00
Bingoo 9c23e6154c [Others] replace tool_helpers to fast_dataindex (#7353)
* replace tool_helpers to fast_dataindex

* modify others requirement
2026-04-14 15:13:54 +08:00
xiaoxiaohehe001 abba29b348 [BugFix] fix mm rope (#7274) 2026-04-14 11:36:08 +08:00
Yuanle Liu 8f21c9caa6 [BugFix] fix gitignore claude (#7381) 2026-04-13 20:32:45 -07:00
zhupengyang 27b00cf385 [XPU] glm-4.5-air (#7071) 2026-04-14 11:31:49 +08:00
chen 26c47c2afc update attn_mask_q 2 (#7371) 2026-04-13 23:06:04 +08:00
Yuanle Liu 0ddb6e461c [Optimization] Remove the upper limit on num_blocks (#7241) 2026-04-13 07:07:41 -07:00
lonelygsh e83d45833f [Speculate Decoding] Fix step_idx semantics in limit_thinking and set_stop_value kernels (#7166)
- speculate_limit_thinking_content_length: update current_base_step to
  step_idx+1 (step_idx now records history count before current round);
  remove incorrect step_idx decrement on accept_num truncation; mark
  step_idx param as const.
- speculate_set_stop_value_multi_seqs: fix can_stop gate to use
  step_idx_now+accept_num>=min_token_limit; fix skip check and pre_ids_idx
  formula (remove stale -accept_num offset); use <= condition so accept_idx
  maps directly to the accepted token that ends the stop sequence; fix
  accept_tokens index (remove -1).
- Update unit tests for speculate_set_stop_value_multi_seqs kernel.
2026-04-13 20:53:42 +08:00
周周周 73bd4ab318 [Feature] Add explicit hidden_size parameter support to FusedMoE (#7361)
[Feature] Add explicit hidden_size parameter support to FusedMoE
2026-04-13 20:24:58 +08:00
YuBaoku 1e08ee74e5 [CI] Modify 4-card container startup config and move test case (#7363) 2026-04-13 05:23:49 -07:00
freeliuzc 31e2a8bbad [Speculative Decoding] Support mtp super ultra overlap in pd-split mode with insert_task overlap (#7323)
* support mtp overlap in pd-split mode with insert_task overlap
2026-04-13 19:41:17 +08:00
JYChen 5ddd1af756 remove fa4 requirements (#7143) 2026-04-13 19:24:20 +08:00
AIbin 1fb8194191 [OP][Models][Optimization] Optimize the RoPE CUDA kernel and update the DeepSeek V3 config (#7359)
* dsk del prefill mask

* dsk support 1M+ seq_len rope

* update rope tests

* Replace max_position_embeddings with max_model_len

* 1D grid: gridDim.x has a maximum size of 2^31-1, far exceeding the actual number of tokens.
2026-04-13 19:12:36 +08:00
Zhang Yulong 738c658c54 [Benchmark] Update seed argument handling in benchmark_serving.py (#7356) 2026-04-13 16:05:50 +08:00
周周周 a6f0055d51 add ips check (#7352)
* commit

* commit

---------

Co-authored-by: “liuruian” <liuruian@baidu.com>
2026-04-13 15:24:22 +08:00
liuruyan b34708604c [TI-consistent] support quant use pow2scale (#7308)
* support quant use pow2scale

* fix

* fix
2026-04-13 00:01:53 -07:00
AIbin 6213ad5340 [Docs][BugFix] fix mla log (#7243)
* [Docs] Fix Chinese punctuation issues
2026-04-13 12:15:43 +08:00
Nyako Shigure d659099415 [Cleanup] Replace torch proxy alias with public compat API (#7348) 2026-04-13 11:43:26 +08:00
Jiajun Ji cb03958b52 [XPU] Refactor get_padding_offset to single kernel. (#7029)
* [XPU] Refactor get_padding_offset to single kernel.

* add unittest.

* fix codestyle.

* remove cum_offsets_now.

* remove max_len.
2026-04-13 11:04:50 +08:00
Jiang-Jia-Jun 26d6a20c2f [Optim] Remove IPCLock between CacheManager and WorkerProcess (#7299)
* [Optim] Remove IPCLock between CacheManager and WorkerProcess

* Update envs.py

* Update worker_process.py

---------

Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com>
2026-04-12 13:59:34 +08:00
周周周 225fc8d222 use self.hidden_size not use self.fd_config.model_config.hidden_size (#7340) 2026-04-11 22:39:43 +08:00
chen 4982aa000e [RL]moe bf16 ep support paddle batch_gemm (#7337)
* moe bf16 ep support paddle batch_gemm
2026-04-11 21:51:12 +08:00
AIbin ba01d7a823 [Optimization] [OP] [Models] dsk del prefill mask (#7313)
* dsk del prefill mask

* dsk support 1M+ seq_len rope

* update rope tests
2026-04-11 19:32:27 +08:00
JYChen 076ab07528 [RL] change glm rope_emb calculation (#7316)
* change glm rope_emb calculation

* glm without EnforceFmulRN

* fix ci
2026-04-11 18:36:28 +08:00
YuBaoku fcf8b1336d [CI] Fix nightly test error and add container cleanup in build_rl (#7335)
* [CI] Fix nightly test error and add container cleanup in build_rl
2026-04-11 12:14:46 +08:00