GoldPancake
df3b4e12f4
[Speculative Decoding] Add MTP logprob support for PD disaggregation ( #7442 )
...
* support mtp logprob in pd
* fix
* fix
* fix
* fix xpu bugs
2026-04-17 21:37:38 +08:00
ShaneGZhu
2d8338f9e4
[Optimization][DeepSeekV3.2] Reducing slot_mapping compute frequency from twice per layer to a single pre-processing step. ( #7367 )
2026-04-16 19:54:12 +08:00
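The slot_mapping optimization above amounts to hoisting a loop-invariant computation: the token-to-KV-cache-slot mapping depends only on the step's batch, not on the layer, so it can be built once per step instead of inside every layer. A minimal sketch of the idea, with hypothetical names (`compute_slot_mapping`, `run_model`) standing in for the real FastDeploy code:

```python
# Illustrative sketch only: names and shapes are assumptions, not the
# project's real API. It shows the hoisting the commit describes.

def compute_slot_mapping(block_tables, seq_lens, block_size):
    """Map each token position of each request to its physical KV-cache slot."""
    slot_mapping = []
    for blocks, seq_len in zip(block_tables, seq_lens):
        for pos in range(seq_len):
            block_id = blocks[pos // block_size]
            slot_mapping.append(block_id * block_size + pos % block_size)
    return slot_mapping

def run_model(block_tables, seq_lens, block_size, num_layers):
    # Before the change: the mapping was recomputed (twice) per layer.
    # After: computed once in pre-processing and reused by all layers.
    slot_mapping = compute_slot_mapping(block_tables, seq_lens, block_size)
    for _ in range(num_layers):
        _ = slot_mapping  # each layer's attention reuses the same mapping
    return slot_mapping
```

With one request of 3 tokens over blocks `[2, 5]` and `block_size=2`, positions 0 and 1 land in block 2 (slots 4, 5) and position 2 in block 5 (slot 10).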
ddchenhao66
e9527208d9
[BugFix][XPU] Fix kv_cache management bug ( #7420 )
2026-04-16 15:45:45 +08:00
Jiajun Ji
29495b2cf1
[XPU] Unify Spec and non-spec branch ( #6947 ) ( #7180 )
...
* [XPU] cherry-pick PR-6947
* [XPU] use unified_update_model_status.
* refactor xpu_model_runner.
* refactor sampler.
* fix codestyle.
* Fix XPU speculative decoding: rename output tensors to cu_seqlens_q_output/batch_id_per_token_output, correct WRAPPER_CHECK_PTR types, and fix dynamic gather shape in verify_draft_tokens path.
* fix codestyle.
* replace output_padding_offset with is_speculative flag in gather_next_token.
* rename hiddden_states.
* unify cu_seqlens_q_output and batch_id_per_token_output init.
---------
Co-authored-by: cmcamdy <1027740945@qq.com>
2026-04-16 14:58:38 +08:00
RuohengMa
de0c5e68fb
[XPU] Split the block_attn operator into smaller operators ( #6798 )
...
* split block_attn
* adapt to latest vllm
* fix unit tests
* delete mtp+cudagraph 4 cards test
* fix vl model
* fix mtp
* fix slot mapping
2026-04-16 14:28:40 +08:00
Bingoo
6b891da02b
[Optimization] enable trtllm_all_reduce fusion kernel in glm model ( #6660 )
...
* enable trtllm_all_reduce fusion kernel in glm model
* fix conflict
* format update
* fix a bug
* modify test
* modify test
* support empty tensor and modify test
* fix test_linear config issues
* modify test name
* add edge test case
* modify format
* fix conflict
* modify default max token num in trtllm_allreduce_fusion
* add max token num branch for trtllm_allreduce_fusion
* fix format
* fix rmsnorm config issue
* modify 2025 to 2026
* using compat guard
* Lazily import flashinfer.comm and fix test config issue
* fix test issues
* add flashinfer cache dir clean machine
* fix some issues
2026-04-16 14:10:19 +08:00
GoldPancake
a498720a75
[RL] Add clear_graph_opt_backend for glm4_mtp ( #7378 )
...
* add clear_graph func
* fix spell
2026-04-15 19:44:15 +08:00
luukunn
3f84d8d893
[DataProcessor] Refactor multimodal processor: extract encoding strategies and unify MM processing pipeline ( #7298 )
...
* merge mm processor
2026-04-15 19:01:06 +08:00
Echo-Nie
8819a039c9
[Others] Fix typo ( #7280 )
...
* typo
* typo
* typo
* typo
2026-04-14 17:28:22 +08:00
xiaoxiaohehe001
abba29b348
[BugFix] fix mm rope ( #7274 )
2026-04-14 11:36:08 +08:00
zhupengyang
27b00cf385
[XPU] glm-4.5-air ( #7071 )
2026-04-14 11:31:49 +08:00
Yuanle Liu
0ddb6e461c
[Optimization] Remove the num_blocks upper limit ( #7241 )
2026-04-13 07:07:41 -07:00
freeliuzc
31e2a8bbad
[Speculative Decoding] Support mtp super ultra overlap in pd-split mode with insert_task overlap ( #7323 )
...
* support mtp overlap in pd-split mode with insert_task overlap
2026-04-13 19:41:17 +08:00
Nyako Shigure
d659099415
[Cleanup] Replace torch proxy alias with public compat API ( #7348 )
2026-04-13 11:43:26 +08:00
Jiang-Jia-Jun
26d6a20c2f
[Optim] Remove IPCLock between CacheManager and WorkerProcess ( #7299 )
...
* [Optim] Remove IPCLock between CacheManager and WorkerProcess
* Update envs.py
* Update worker_process.py
---------
Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com>
2026-04-12 13:59:34 +08:00
sunxin
00005c92e0
[BugFix] Fix mtp empty run issue in overlap schedule and EP model ( #7300 )
2026-04-10 03:29:45 -07:00
bukejiyu
14d46181b8
[Loader] add multi-thread model loading ( #6877 )
...
* multi-thread-loader
* fix ut
2026-04-09 23:40:15 -07:00
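The multi-thread loader entry above reflects a common pattern: checkpoint loading is I/O-bound, so reading weight shards concurrently hides disk latency even under the GIL. A hedged sketch under assumed names (`load_shard` and the shard list are illustrative, not the loader's real interface):

```python
# Illustrative sketch of multi-threaded model loading. load_shard is a
# hypothetical stand-in for reading one checkpoint shard from disk.
from concurrent.futures import ThreadPoolExecutor

def load_shard(name):
    # Stand-in for a blocking disk read of one weight shard.
    return {name: f"tensor-data-of-{name}"}

def load_model(shard_names, num_threads=4):
    weights = {}
    # Threads help because shard loading is I/O-bound: the GIL is
    # released while each thread waits on the read.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for shard in pool.map(load_shard, shard_names):
            weights.update(shard)
    return weights
```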
GoldPancake
c1fb3112f8
[FDConfig] Support CLI args for quantization params and add cudagraph validation ( #7281 )
...
* refactor quant cli param
2026-04-10 14:13:42 +08:00
chenjian
427efadaee
[Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1 ( #7159 )
...
* [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1
* [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1
* fix
2026-04-08 19:30:54 +08:00
Jiajun Ji
9b970de029
[XPU] Add TP broadcast after sampling in XPU model runner to ensure consistent results across ranks. ( #7096 )
2026-04-08 19:26:53 +08:00
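The broadcast-after-sampling fix above guards against rank divergence: if per-rank RNG state differs, each tensor-parallel rank may sample a different token from identical logits, so rank 0's sample is broadcast to all ranks. A pure-Python simulation of that behavior (the real code would use a distributed collective; everything here is an illustrative model, not the runner's API):

```python
# Toy model of TP ranks whose RNG state has diverged; after the
# "broadcast", every rank proceeds with rank 0's sampled token.
import random

def sample_on_rank(rank, logits):
    rng = random.Random(rank)  # per-rank RNG may diverge in practice
    return rng.choice(range(len(logits)))  # toy sampling stand-in

def step_with_broadcast(logits, tp_size):
    local_samples = [sample_on_rank(r, logits) for r in range(tp_size)]
    token = local_samples[0]        # rank 0's result
    return [token] * tp_size        # broadcast: all ranks adopt it
```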
RichardWooSJTU
771d42c90b
[TBO] Apply tbo to gpu_model_runner ( #7165 )
...
* apply tbo in gpu_model_runner
* fix
2026-04-08 16:55:17 +08:00
K11OntheBoat
bb48bcbaa2
Split enable_mm ( #7183 )
...
Co-authored-by: liuruian <liuruian@MacBook-Pro.local>
2026-04-08 11:25:41 +08:00
GoldPancake
9d4fd19c3f
[Speculative Decoding] Auto-scale CUDA graph capture sizes for speculative decoding ( #7215 )
2026-04-07 20:22:28 +08:00
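The auto-scaling entry above follows from a simple count: with k draft tokens verified per step, a batch of B requests feeds the target model B * (k + 1) tokens (k drafts plus one bonus token), so CUDA graph capture sizes must grow by that factor. A sketch with assumed names (`scale_capture_sizes` and its parameters are illustrative):

```python
# Hedged sketch: scale base capture sizes by (num_speculative_tokens + 1),
# capped at the runner's maximum, and deduplicate. Not FastDeploy's
# actual function, just the arithmetic the commit title implies.
def scale_capture_sizes(base_sizes, num_speculative_tokens, max_tokens):
    factor = num_speculative_tokens + 1
    scaled = {min(size * factor, max_tokens) for size in base_sizes}
    return sorted(scaled)
```

For example, base sizes [1, 2, 4, 8] with one draft token and a cap of 12 become [2, 4, 8, 12].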
Nana
367d37b523
fix typo ( #7147 )
2026-04-07 16:30:32 +08:00
huicongyao
095a11d932
fix MTP bugs in TP and overlap ( #7172 )
...
* fix MTP bugs in TP and overlap
* fix
2026-04-03 14:19:11 +08:00
cmcamdy
7a2e33098f
[XPU] Refactor pre process ( #6993 )
...
* [XPU] support speculate_pre_process
* merge develop
* fix codestyle
* fix mtp, support cu_seqlens_q_output
* fix mtp, support cu_seqlens_q_output
* fix test
---------
Co-authored-by: lizan1999 <lizan03@baidu.com>
2026-04-01 20:29:55 +08:00
yzwu
ceaf5df350
[Iluvatar] Fix cuda graph error for tp > 1 in ernie models ( #7126 )
2026-04-01 19:13:34 +08:00
sunxin
c29e86fc9d
[Feature] Support mtp overlap schedule ( #7001 )
2026-04-01 14:24:26 +08:00
Yonghua Li
a3cc3aa777
[BugFix] reset exist tasks signal in clear_data ( #7111 )
...
* [BugFix] reset exist tasks signal in clear_data
* [Fix] fix stale exist tasks signal after weight update
* [Chore] downgrade detected new requests log to DEBUG level
* [fix] adjust continue place
2026-03-31 21:24:08 +08:00
chenjian
6727df8286
[Optimization] Optimize ttft for prefill pd ( #6680 )
...
* optimize ttft
* fix
* fix
* fix ci
* fix ci
* fix
* fix bug
* fix
* add comments
* fix ci
* fix
* fix ci
* fix format
* update according to review
* add comment
* fix
* fix format
2026-03-30 20:36:23 +08:00
jackyYang6
05f2d95729
[RL] Adapt async rollout checkpoint update flow ( #7042 )
...
* update checkpoint-transfer flow and control update_weights params
* test: add update_weights route validation
2026-03-30 19:19:34 +08:00
yzwu
8789329457
[Iluvatar] Support wi4a16 group_gemm ( #7078 )
2026-03-30 19:03:51 +08:00
GoldPancake
6693bcd0e4
[BugFix] fix clear_parameters in draft cudagraph ( #7035 )
2026-03-27 15:28:50 +08:00
Yonghua Li
442514252c
[fix] remove all gather ep group control requests in normal cases ( #7026 )
2026-03-26 18:41:29 +08:00
freeliuzc
4fd877ed43
[Speculative Decoding] Support mtp expert-parallel and support different modality deploy ( #7018 )
...
* support mtp ep and support different modality
* fix default arg
2026-03-26 13:52:16 +08:00
Yonghua Li
a7f52c300d
[Feature] support v1 update/clear api for RL ( #6761 )
...
* [Feature] support v1 update/clear api for RL
* [fix] fix execute_model and add sleep/wakeup api
* [fix] fix mtp and key_prefix
* [chore] move _update_key_prefix to resume method
* [fix] make the interface safe to call multiple times
* [fix] fix some tiny bugs
* [chore] make small changes against pr review
* [docs] add docs for weight update
* [test] add some tests and update docs
* [style] fix code style check
* [test] fix ci
* [fix] fix stale control responses when control method timed out
* [chore] remove unused code
* [chore] fix code style
* [chore] optimize tags and key_prefix
* [test] fix ci
* [chore] fix code style
* [test] fix ci
* [fix] fix ep control
* [fix] fix ep control for engine cache queue
2026-03-25 19:18:46 +08:00
freeliuzc
e87ce4b8cd
[Speculative Decoding] refactor MTP and optimize spec-decoding postprocess ( #6973 )
...
* support new mtp
* refactor(speculate_decoding and mtp): optimize mtp structure logic. Update spec-branch status-process
* fix cuda-graph for spec-decoding
* fix xpu mtp and fix some notes
* fix unittest and optimize notes
* fix model status update in eos-branch
2026-03-24 10:19:01 +08:00
bukejiyu
c62f6b4ea5
[Others] Fix PD reorder for MTP ( #6792 )
...
* fix pd reorder in mtp
* add ut
* update
* fix mtp
2026-03-23 21:10:22 +08:00
xiaoxiaohehe001
c1f7991aec
[BugFix] add worker_process no grad ( #6971 )
2026-03-23 02:10:56 -07:00
sunxin
7a78001be2
fix execute_model_normal in empty run ( #6968 )
2026-03-23 14:07:46 +08:00
周周周
1c38da2118
Make seq_lens_this_time/decoder/encoder equal shape ( #6942 )
2026-03-20 15:31:52 +08:00
yzwu
8b890c0d72
[Iluvatar] refactor attn and moe code ( #6887 )
2026-03-18 10:31:00 +08:00
qwes5s5
3b7507a4c2
test_abort ( #6743 )
2026-03-17 14:06:40 +08:00
huicongyao
eab429d05e
fix performance drop while no spec ( #6866 )
2026-03-17 13:06:36 +08:00
gongweibao
a6351dea0b
[BugFix][Optimization] Replace silent failures with catchable exceptions and informative error messages ( #6533 )
...
* init
* init
* fix format
* add
* add files
* add ut
* fix some
* add ut
* add more
* add
* fix pre-commit
* fix pre-commit
* fix cover
* skip long seq
* add
* add
* fix
* remove not need
* fix set attr
* fix comments
* fix comments
* fix failed tests
---------
Co-authored-by: gongweibao <gognweibao@baidu.com>
2026-03-16 21:32:43 +08:00
ming1753
bb925c605f
[Other] Adjust GPUModelRunner to enhance compatibility ( #6851 )
2026-03-16 14:49:19 +08:00
huicongyao
2e63d88f7a
[Optimization][Speculative Decoding]Fuse padding sampling params ( #6765 )
...
* optimize speculate pre process unit test
* Add CUDA kernel for building sampling params in speculative decoding
* init infer seed in device
* format code
* add unittest & fix
* fix
* format-code
* format-code
* fix rebase
* .
* fix unittest
2026-03-12 05:05:15 -07:00
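The fused-padding entry above targets a classic small-kernel problem: padding each sampling parameter (temperature, top_p, seed, ...) to per-draft-token layout in separate passes launches many tiny ops, whereas one fused pass expands them all together. A pure-Python sketch of the layout transform (field names and the dict-based interface are assumptions, not the CUDA kernel's signature):

```python
# Illustrative fused pass: expand every per-request sampling parameter
# to per-token layout in a single traversal, instead of one pass per field.
def build_padded_sampling_params(params, tokens_per_request):
    padded = {key: [] for key in params}
    for i, n in enumerate(tokens_per_request):
        for key, values in params.items():
            padded[key].extend([values[i]] * n)  # repeat request i's value n times
    return padded
```

With two requests contributing 2 and 3 tokens, `temperature=[0.7, 1.0]` expands to `[0.7, 0.7, 1.0, 1.0, 1.0]`, and every other field expands in the same traversal.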
MingkunZhang
a9ace998db
[Metax][Fix] fix ci error based on pr#6805 caused by pr#6685 ( #6807 )
2026-03-12 19:30:16 +08:00
RAM
cdaf6dd400
[RL][Cherry-Pick] Support Fully Async and PrefixCache ( #6599 )
...
* cherry-pick Support Fully Async and PrefixCache step 1
* copy routing_indices_cache.py from 2.4
* cherry-pick [RL] R3 Fix the bug for determining the end of a request (#6388 )
* cherry-pick [RL] Clear Requests status of R3 (#6569 )
* delete code
* fix rename bug
* fix status shape bug
* fix ci
2026-03-12 01:13:30 -07:00
cmcamdy
3543088d3e
[XPU] rm stop nums ( #6651 )
...
* rm stop nums
* fix conflict
---------
Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
2026-03-12 14:05:58 +08:00