Commit Graph

500 Commits

Author SHA1 Message Date
GoldPancake df3b4e12f4 [Speculative Decoding] Add MTP logprob support for PD disaggregation (#7442)
* support mtp logprob in pd

* fix

* fix

* fix

* fix xpu bugs
2026-04-17 21:37:38 +08:00
ShaneGZhu 2d8338f9e4 [Optimization][DeepSeekV3.2]Reducing slot_mapping compute frequency from twice per layer to a single pre-processing step. (#7367) 2026-04-16 19:54:12 +08:00
ddchenhao66 e9527208d9 [BugFix][XPU] Fix kv_cache management bug (#7420) 2026-04-16 15:45:45 +08:00
Jiajun Ji 29495b2cf1 [XPU] Unify Spec and non-spec branch.(#6947) (#7180)
* [XPU] cherry-pick PR-6947

* [XPU] use unified_update_model_status.

* refactor xpu_model_runner.

* refactor sampler.

* fix codestyle.

* Fix XPU speculative decoding: rename output tensors to cu_seqlens_q_output/batch_id_per_token_output, correct
  WRAPPER_CHECK_PTR types, and fix dynamic gather shape in verify_draft_tokens path.

* fix codestyle.

* replace output_padding_offset with is_speculative flag in gather_next_token.

* rename hiddden_states.

* unify cu_seqlens_q_output and batch_id_per_token_output init.

---------

Co-authored-by: cmcamdy <1027740945@qq.com>
2026-04-16 14:58:38 +08:00
RuohengMa de0c5e68fb [XPU] Split the block_attn operator into smaller operators (#6798)
* spliced block_attn

* adapt to latest vllm

* fix unit tests

* delete mtp+cudagraph 4 cards test

* fix vl model

* fix mtp

* fix slot mapping
2026-04-16 14:28:40 +08:00
Bingoo 6b891da02b [Optimization] enable trtllm_all_reduce fusion kernel in glm model (#6660)
* enable trtllm_all_reduce fusion kernel in glm model

* fix conflict

* format update

* fix a bug

* modify test

* modify test

* support empty tensor and modify test

* fix test_linear config issues

* modify test name

* add edge test case

* modify format

* fix conflict

* modify default max token num in trtllm_allreduce_fusion

* add max token num branch for trtllm_allreduce_fusion

* fix format

* fix rmsnorm config issue

* modify 2025 to 2026

* using compat grard

* Lazily import flashinfer.comm and fix test config issue

* fix test issues

* add flashinfer cache dir clean machine

* fix some issues
2026-04-16 14:10:19 +08:00
GoldPancake a498720a75 [RL] Add clear_graph_opt_backend for glm4_mtp (#7378)
* add clear_grpah func

* fix spell
2026-04-15 19:44:15 +08:00
luukunn 3f84d8d893 [DataProcessor] Refactor multimodal processor: extract encoding strategies and unify MM processing pipeline (#7298)
* merge mm processor
2026-04-15 19:01:06 +08:00
Echo-Nie 8819a039c9 [Others] Fix typo (#7280)
* typo

* typo

* typo

* typo
2026-04-14 17:28:22 +08:00
xiaoxiaohehe001 abba29b348 [BugFix] fix mm rope (#7274) 2026-04-14 11:36:08 +08:00
zhupengyang 27b00cf385 [XPU] glm-4.5-air (#7071) 2026-04-14 11:31:49 +08:00
Yuanle Liu 0ddb6e461c [Optimization] 移除 num_blocks 上限限制 (#7241) 2026-04-13 07:07:41 -07:00
freeliuzc 31e2a8bbad [Speculative Decoding] Support mtp super ultra overlap in pd-split mode with insert_task overlap (#7323)
* support mtp overlap in pd-split mode with insert_task overlap
2026-04-13 19:41:17 +08:00
Nyako Shigure d659099415 [Cleanup] Replace torch proxy alias with public compat API (#7348) 2026-04-13 11:43:26 +08:00
Jiang-Jia-Jun 26d6a20c2f [Optim] Remove IPCLock between CacheManager and WorkerProcess (#7299)
* [Optim] Remove IPCLock between CacheManager and WorkerProcess

* Update envs.py

* Update worker_process.py

---------

Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com>
2026-04-12 13:59:34 +08:00
sunxin 00005c92e0 [BugFix] Fix mtp empty run issue in overlap schedule and EP model (#7300) 2026-04-10 03:29:45 -07:00
bukejiyu 14d46181b8 [Loader] add multi-thread model loading (#6877)
* multi-thread-loader

* fix ut
2026-04-09 23:40:15 -07:00
GoldPancake c1fb3112f8 [FDConfig] Support CLI args for quantization params and add cudagraph validation (#7281)
* refactor quant cli param
2026-04-10 14:13:42 +08:00
chenjian 427efadaee [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1 (#7159)
* [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1

* [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1

* fix
2026-04-08 19:30:54 +08:00
Jiajun Ji 9b970de029 [XPU] Add TP broadcast after sampling in XPU model runner to ensure consistent results across ranks. (#7096) 2026-04-08 19:26:53 +08:00
RichardWooSJTU 771d42c90b [TBO] Apply tbo to gpu_model_runner (#7165)
* apply tbo in gpu_model_runner

* fix
2026-04-08 16:55:17 +08:00
K11OntheBoat bb48bcbaa2 Split enable_mm (#7183)
Co-authored-by: liuruian <liuruian@MacBook-Pro.local>
2026-04-08 11:25:41 +08:00
GoldPancake 9d4fd19c3f [Speculative Decoding] Auto-scale CUDA graph capture sizes for speculative decoding (#7215) 2026-04-07 20:22:28 +08:00
Nana 367d37b523 fix typo (#7147) 2026-04-07 16:30:32 +08:00
huicongyao 095a11d932 fix MTP bugs in TP and overlap (#7172)
* fix MTP bugs in TP and overlap

* fix
2026-04-03 14:19:11 +08:00
cmcamdy 7a2e33098f [XPU] Refactor pre process (#6993)
* [XPU] support speculate_pre_process

* merge develop

* fix codestype

* fix mtp, support cu_seqlens_q_output

* fix mtp, support cu_seqlens_q_output

* fix test

---------

Co-authored-by: lizan1999 <lizan03@baidu.com>
2026-04-01 20:29:55 +08:00
yzwu ceaf5df350 [Iluvatar] Fix cuda graph error for tp > 1 in ernie models (#7126) 2026-04-01 19:13:34 +08:00
sunxin c29e86fc9d [Feature] Support mtp overlap schedule (#7001) 2026-04-01 14:24:26 +08:00
Yonghua Li a3cc3aa777 [BugFix] reset exist tasks signal in clear_data (#7111)
* [BugFix] reset exist tasks signal in clear_data

* [Fix] fix stale exist tasks signal after weight update

* [Chore] downgrade detected new requests log to DEBUG level

* [fix] adjust continue place
2026-03-31 21:24:08 +08:00
chenjian 6727df8286 [Optimization] Optimize ttft for prefill pd (#6680)
* optimize ttft

* fix

* fix

* fix ci

* fix ci

* fix

* fix bug

* fix

* add comments

* fix ci

* fix

* fix ci

* fix format

* update according to review

* add comment

* fix

* fix format
2026-03-30 20:36:23 +08:00
jackyYang6 05f2d95729 [RL] Adapt async rollout checkpoint update flow (#7042)
* update checkpoint-transfer flow and control update_weights params

* test: add update_weights route validation
2026-03-30 19:19:34 +08:00
yzwu 8789329457 [Iluvatar] Support wi4a16 group_gemm (#7078) 2026-03-30 19:03:51 +08:00
GoldPancake 6693bcd0e4 [BugFix] fix clear_parameters in draft cudagraph (#7035) 2026-03-27 15:28:50 +08:00
Yonghua Li 442514252c [fix] remove all gather ep group control requests in normal cases (#7026) 2026-03-26 18:41:29 +08:00
freeliuzc 4fd877ed43 [Speculative Decoding] Support mtp expert-parallel and support different modality deploy (#7018)
* support mtp ep and support different modality

* fix default arg
2026-03-26 13:52:16 +08:00
Yonghua Li a7f52c300d [Feature] support v1 update/clear api for RL (#6761)
* [Feature] support v1 update/clear api for RL

* [fix] fix execute_model and add sleep/wakeup api

* [fix] fix mtp and key_prefix

* [chore] move _update_key_prefix to resume method

* [fix] make the interface safe to call multiple times

* [fix] fix some tiny bugs

* [chore] make small changes against pr review

* [docs] add docs for weight update

* [test] add some tests and update docs

* [style] fix code style check

* [test] fix ci

* [fix] fix stale control responses when control method timed out

* [chore] remove unused code

* [chore] fix code style

* [chore] optimize tags and key_prefix

* [test] fix ci

* [chore] fix code style

* [test] fix ci

* [fix] fix ep control

* [fix] fix ep control for engine cache queue
2026-03-25 19:18:46 +08:00
freeliuzc e87ce4b8cd [Speculative Decoding] refactor MTP and optimize spec-decoding postprocess (#6973)
* support new mtp

* refactor(speculate_decoding and mtp): optimize mtp sturcture logic. Update spec-branch status-process

* fix cuda-graph for spec-decoding

* fix xpu mtp and fix some note

* fix unittest and optmize note

* fix model status update in eos-branch
2026-03-24 10:19:01 +08:00
bukejiyu c62f6b4ea5 [Others] Fix PD reorder for MTP (#6792)
* fix pd reorder in mtp

* add ut

* update

* fix mtp
2026-03-23 21:10:22 +08:00
xiaoxiaohehe001 c1f7991aec [BugFix] add worker_process no grad (#6971) 2026-03-23 02:10:56 -07:00
sunxin 7a78001be2 fix execute_model_normal in empty run (#6968) 2026-03-23 14:07:46 +08:00
周周周 1c38da2118 Make seq_lens_this_time/decoder/encoder equal shape (#6942) 2026-03-20 15:31:52 +08:00
yzwu 8b890c0d72 [Iluvatar] refactor attn and moe code (#6887) 2026-03-18 10:31:00 +08:00
qwes5s5 3b7507a4c2 test_abort (#6743) 2026-03-17 14:06:40 +08:00
huicongyao eab429d05e fix performance drop while no spec (#6866) 2026-03-17 13:06:36 +08:00
gongweibao a6351dea0b [BugFix][Optimization] Replace silent failures with catchable exceptions and informative error messages (#6533)
* init

* init

* fix format

* add

* add files

* add ut

* fix some

* add ut

* add more

* add

* fix pre-commit

* fix pre-commit

* fix cover

* skip long seq

* add

* add

* fix

* remove not need

* fix set attr

* fix comments

* fix comments

* fix failed tests

---------

Co-authored-by: gongweibao <gognweibao@baidu.com>
2026-03-16 21:32:43 +08:00
ming1753 bb925c605f [Other] Adjust GPUModelRunner to enhance compatibility (#6851) 2026-03-16 14:49:19 +08:00
huicongyao 2e63d88f7a [Optimization][Speculative Decoding]Fuse padding sampling params (#6765)
* optimize speculate pre process unit test

* Add CUDA kernel for building sampling params in speculative decoding

* init infer seed in device

* format code

* add unittest & fix

* fix

* format-code

* format-code

* fix rebase

* .

* fix unitest
2026-03-12 05:05:15 -07:00
MingkunZhang a9ace998db [Metax][Fix] fix ci error based pr#6805 caused by pr#6685 (#6807) 2026-03-12 19:30:16 +08:00
RAM cdaf6dd400 [RL][Cherry-Pick] Support Fully Async and PrefixCache (#6599)
* cherry-pick  Support Fully Async and PrefixCache step 1

* copy routing_indices_cache.py from 2.4

* cherry-pick [RL] R3 Fix the bug for determining the end of a request (#6388)

* cherry-pick [RL] Clear Requests status of R3 (#6569)

* delete code

* fix rename bug

* fix status shape bug

* fix ci
2026-03-12 01:13:30 -07:00
cmcamdy 3543088d3e [XPU] rm stop nums (#6651)
* rm stop nums

* fix conflict

---------

Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
2026-03-12 14:05:58 +08:00