Commit Graph

238 Commits

Author SHA1 Message Date
google-labs-jules[bot] 69c7dd0a19 Bolt: Optimize single element list appends
Replaced instances of `.extend([item])` with `.append(item)` in multiple files.
Using `.extend([item])` incurs memory overhead by allocating a new single-element
list and is computationally slower than calling `.append(item)` directly.
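
As a rough illustration of the change described above (a minimal sketch, not code taken from the listed files; the helper names `build_with_extend` and `build_with_append` are hypothetical), both forms produce the same list, but the `.extend([item])` form builds a temporary one-element list on every call:

```python
import timeit


def build_with_extend(items):
    """Hypothetical helper: adds each element via a temporary one-element list."""
    out = []
    for item in items:
        out.extend([item])  # allocates a throwaway list for every element
    return out


def build_with_append(items):
    """Hypothetical helper: adds each element directly."""
    out = []
    for item in items:
        out.append(item)  # no intermediate list is created
    return out


data = list(range(1000))

# Both approaches yield identical results; only the cost differs.
assert build_with_extend(data) == build_with_append(data)

if __name__ == "__main__":
    # Timings vary by machine and interpreter, so no expected numbers are given.
    t_extend = timeit.timeit(lambda: build_with_extend(data), number=200)
    t_append = timeit.timeit(lambda: build_with_append(data), number=200)
    print(f"extend([item]): {t_extend:.4f}s  append(item): {t_append:.4f}s")
```

Running the script prints the two timings side by side, which is one way to check the claim on a given interpreter.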

Files updated:
- fastdeploy/input/encodings/ernie_encoding.py
- fastdeploy/input/ernie4_5_vl_processor/process.py
- fastdeploy/output/token_processor.py
- fastdeploy/worker/gpu_model_runner.py
- fastdeploy/worker/metax_model_runner.py
2026-04-15 16:45:13 +00:00
GoldPancake a498720a75 [RL] Add clear_graph_opt_backend for glm4_mtp (#7378)
* add clear_graph func

* fix spell
2026-04-15 19:44:15 +08:00
luukunn 3f84d8d893 [DataProcessor] Refactor multimodal processor: extract encoding strategies and unify MM processing pipeline (#7298)
* merge mm processor
2026-04-15 19:01:06 +08:00
Echo-Nie 8819a039c9 [Others] Fix typo (#7280)
* typo

* typo

* typo

* typo
2026-04-14 17:28:22 +08:00
xiaoxiaohehe001 abba29b348 [BugFix] fix mm rope (#7274) 2026-04-14 11:36:08 +08:00
freeliuzc 31e2a8bbad [Speculative Decoding] Support mtp super ultra overlap in pd-split mode with insert_task overlap (#7323)
* support mtp overlap in pd-split mode with insert_task overlap
2026-04-13 19:41:17 +08:00
sunxin 00005c92e0 [BugFix] Fix mtp empty run issue in overlap schedule and EP model (#7300) 2026-04-10 03:29:45 -07:00
chenjian 427efadaee [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1 (#7159)
* [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1

* [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1

* fix
2026-04-08 19:30:54 +08:00
RichardWooSJTU 771d42c90b [TBO] Apply tbo to gpu_model_runner (#7165)
* apply tbo in gpu_model_runner

* fix
2026-04-08 16:55:17 +08:00
K11OntheBoat bb48bcbaa2 Split enable_mm (#7183)
Co-authored-by: liuruian <liuruian@MacBook-Pro.local>
2026-04-08 11:25:41 +08:00
GoldPancake 9d4fd19c3f [Speculative Decoding] Auto-scale CUDA graph capture sizes for speculative decoding (#7215) 2026-04-07 20:22:28 +08:00
Nana 367d37b523 fix typo (#7147) 2026-04-07 16:30:32 +08:00
huicongyao 095a11d932 fix MTP bugs in TP and overlap (#7172)
* fix MTP bugs in TP and overlap

* fix
2026-04-03 14:19:11 +08:00
sunxin c29e86fc9d [Feature] Support mtp overlap schedule (#7001) 2026-04-01 14:24:26 +08:00
jackyYang6 05f2d95729 [RL] Adapt async rollout checkpoint update flow (#7042)
* update checkpoint-transfer flow and control update_weights params

* test: add update_weights route validation
2026-03-30 19:19:34 +08:00
GoldPancake 6693bcd0e4 [BugFix] fix clear_parameters in draft cudagraph (#7035) 2026-03-27 15:28:50 +08:00
freeliuzc 4fd877ed43 [Speculative Decoding] Support mtp expert-parallel and support different modality deploy (#7018)
* support mtp ep and support different modality

* fix default arg
2026-03-26 13:52:16 +08:00
Yonghua Li a7f52c300d [Feature] support v1 update/clear api for RL (#6761)
* [Feature] support v1 update/clear api for RL

* [fix] fix execute_model and add sleep/wakeup api

* [fix] fix mtp and key_prefix

* [chore] move _update_key_prefix to resume method

* [fix] make the interface safe to call multiple times

* [fix] fix some tiny bugs

* [chore] make small changes against pr review

* [docs] add docs for weight update

* [test] add some tests and update docs

* [style] fix code style check

* [test] fix ci

* [fix] fix stale control responses when control method timed out

* [chore] remove unused code

* [chore] fix code style

* [chore] optimize tags and key_prefix

* [test] fix ci

* [chore] fix code style

* [test] fix ci

* [fix] fix ep control

* [fix] fix ep control for engine cache queue
2026-03-25 19:18:46 +08:00
freeliuzc e87ce4b8cd [Speculative Decoding] refactor MTP and optimize spec-decoding postprocess (#6973)
* support new mtp

* refactor(speculate_decoding and mtp): optimize mtp structure logic. Update spec-branch status-process

* fix cuda-graph for spec-decoding

* fix xpu mtp and fix some note

* fix unittest and optimize note

* fix model status update in eos-branch
2026-03-24 10:19:01 +08:00
bukejiyu c62f6b4ea5 [Others] Fix PD reorder for MTP (#6792)
* fix pd reorder in mtp

* add ut

* update

* fix mtp
2026-03-23 21:10:22 +08:00
sunxin 7a78001be2 fix execute_model_normal in empty run (#6968) 2026-03-23 14:07:46 +08:00
周周周 1c38da2118 Make seq_lens_this_time/decoder/encoder equal shape (#6942) 2026-03-20 15:31:52 +08:00
qwes5s5 3b7507a4c2 test_abort (#6743) 2026-03-17 14:06:40 +08:00
huicongyao eab429d05e fix performance drop while no spec (#6866) 2026-03-17 13:06:36 +08:00
gongweibao a6351dea0b [BugFix][Optimization] Replace silent failures with catchable exceptions and informative error messages (#6533)
* init

* init

* fix format

* add

* add files

* add ut

* fix some

* add ut

* add more

* add

* fix pre-commit

* fix pre-commit

* fix cover

* skip long seq

* add

* add

* fix

* remove not need

* fix set attr

* fix comments

* fix comments

* fix failed tests

---------

Co-authored-by: gongweibao <gognweibao@baidu.com>
2026-03-16 21:32:43 +08:00
ming1753 bb925c605f [Other] Adjust GPUModelRunner to enhance compatibility (#6851) 2026-03-16 14:49:19 +08:00
huicongyao 2e63d88f7a [Optimization][Speculative Decoding]Fuse padding sampling params (#6765)
* optimize speculate pre process unit test

* Add CUDA kernel for building sampling params in speculative decoding

* init infer seed in device

* format code

* add unittest & fix

* fix

* format-code

* format-code

* fix rebase

* .

* fix unittest
2026-03-12 05:05:15 -07:00
RAM cdaf6dd400 [RL][Cherry-Pick] Support Fully Async and PrefixCache (#6599)
* cherry-pick  Support Fully Async and PrefixCache step 1

* copy routing_indices_cache.py from 2.4

* cherry-pick [RL] R3 Fix the bug for determining the end of a request (#6388)

* cherry-pick [RL] Clear Requests status of R3 (#6569)

* delete code

* fix rename bug

* fix status shape bug

* fix ci
2026-03-12 01:13:30 -07:00
Yonghua Li 7811eeccaa [fix] resolve get_save_output_v1 socket name conflicts between multiple instances (#6758) 2026-03-11 15:02:32 +08:00
freeliuzc cf7934a4b2 [Speculative Decoding] Unify Spec and non-spec branch (#6685)
* optimize spec-inference architecture

* delete debug log

* optimize spec_method usage  && fix unit_test

* add claude unit-test skill

* fix some ugly bug

* enhance robustness and bounds check

* unify method & spec_method to method to avoid bug

* activate CI

* fix unit test

* Unify logprobs computation for naive and speculative decoding, fix CUDA kernel

* fix logprob bug && optimize verify kernel

* fix exist_decode() judge
2026-03-10 23:58:44 -07:00
sunxin 812657beee fix pd overlap (#6753) 2026-03-10 20:29:54 +08:00
AIbin c3aceb6bdc [Models][OP][Optimization] Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM (#6689)
* Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM
2026-03-10 15:05:14 +08:00
sunxin 28f7727a3d [Feature] Set overlap schedule as default (#6668)
* overlap default
2026-03-09 22:34:54 +08:00
jc b0fd242add [BugFix] Fix error in dynamic c8 cache (#6544)
* [BugFix] Fix error in dynamic c8 cache

* fix device id
2026-03-06 10:11:23 +08:00
sunxin 0dc7034ce0 [Model Runner] Deprecate not_need_stop (#6356)
* Deprecate not_need_stop
2026-03-05 10:55:42 +08:00
sunxin aee97e3aae fix exist_prefill_flag when preempted task (#6629) 2026-03-04 11:11:40 +08:00
huicongyao 0f718baaf2 [Speculative Decoding]Reformat input preprocess for spec decode (#6501)
* add speculate_pre_process kernel

* reduce one slice

* make d2h async && fix mtp bug for new pre_process

* fix

* add unittest

* fix: code style formatting

* fix

* fix: thread race in speculate_preprocess && rename d2h event
2026-03-03 10:22:07 +08:00
ming1753 97eee75677 [Feature] GPU Memory Optimization and Retirement of V0 Scheduler (#6407)
* Optim GPU Mem Usage

---------

Co-authored-by: huzesen <huzesen@baidu.com>
2026-02-28 15:07:43 +08:00
cmcamdy 13447279aa [XPU] Fix PD + MTP (#6495)
* fix pd + mtp

* fix code style

* fix PD + MTP, D get P's first token

* add anno for gpu(speculate_update)

* update draft insertv1

* fix wrapper & kernel

* fix wrapper

* fix code style
2026-02-27 19:07:35 +08:00
gongweibao edd31e8849 [Feature] Add Deterministic Inference Support (#6476)
* add

* [tests] Add Paddle attention determinism tests and refactor resource manager

Add comprehensive determinism tests for Paddle attention layer and refactor
resource manager for deterministic mode support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* add

* add

* add

* add

* add more

* add more

* fix some

* fix some

* fix bugs

* fix bugs

* only in gpu

* add docs

* fix comments

* fix some

* fix some

* fix comments

* add more

* fix potential problem

* remove not need

* remove not need

* remove no need

* fix bug

* fix bugs

* fix comments

* fix comments

* Update tests/ce/deterministic/test_determinism_verification.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tests/inter_communicator/test_ipc_signal.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tests/layers/test_paddle_attention_determinism.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tests/engine/test_sampling_params_determinism.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tests/layers/test_paddle_attention_determinism.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tests/layers/test_paddle_attention_determinism_standalone.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix comments

* fix import error

* fix a bug

* fix bugs

* fix bugs

* fix coverage

* refine codes

* refine code

* fix comments

* fix comments

* fix comments

* rm not need

* fix allreduce large tensor bug

* mv log files

* mv log files

* add files

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-02-26 19:31:51 -08:00
GoldPancake 2178f2829b [Speculative Decoding] Support suffix decoding (#6403)
* support suffix decoding
2026-02-26 11:42:05 +08:00
Yuanle Liu 6d3fede240 [OP][Feature] Unify the limit_thinking_content_length CUDA operators, supporting response length limits and injected sequences (#6493)
* Initial plan

* Migrate PRs #6311, #6129, #6305 to develop and merge unit tests

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* fix

* update

* fix

* fix ci

* fix ci

* Initial plan

* test: add test_chat_with_response_max_tokens to test_EB_VL_Lite_serving.py

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* test: add disable-thinking case to test_chat_with_response_max_tokens

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* test: add both reasoning_max_tokens and response_max_tokens case

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* fix ci

* fix ci

* fix ci

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>
2026-02-25 21:36:50 +08:00
Yonghua Li e2332a1112 [BugFix] fix num_cpu_blocks computation (#6438)
* [BugFix] fix num_cpu_blocks computation

* [fix] fix syntax and log

* [fix] pre-commit

* [fix] use getattr

* [fix] ci test
2026-02-13 11:05:14 +08:00
yzwu 60e75ea8e8 [Iluvatar][CI] Fix cannot import get_stop (#6165) 2026-02-10 16:57:23 +08:00
kevin 3ce842b55b [BugFix] add reset shared inputs when update weight dummy run (#6331)
* fix dummy run input bug

* update code

* update code

* update code

* update code
2026-02-10 10:29:03 +08:00
kevin d60daca4a8 [Feature] consider multimodal model when dummy run (#6045)
* add mm do profile

* update code

* update code

* update code

* update code

* update test case

* update code

* update code

* fix xpu bug

* update code

* add mm do profile

* update test case

* update code
2026-02-09 17:49:55 +08:00
sunxin 783d56e28a [Optimization] Support logprob async copy (#6362)
* support logprob async copy

* fix prompt logprob

* fix xpu
2026-02-09 17:32:12 +08:00
周周周 2b4748de4f [MTP] refactor MTP pre_process (#6358) 2026-02-09 10:47:15 +08:00
GoldPancake 183b8d325a [RL] Support GLM MTP RL Model (#6267) 2026-02-04 20:14:35 +08:00
sunxin 9b0a82cfa9 [Model Runner] Support overlap schedule (#6259) 2026-02-04 10:49:44 +08:00