freeliuzc
43685a98a7
[BugFix] Fix real token exceeding max_batched_tokens limit ( #7438 )
...
* fix max_num_batched_tokens error compute
* add temperatory solution
* fix bug
2026-04-17 16:18:07 +08:00
kevin
ff47701f31
[BugFix][PD Disaggregation][KVCache] Fix low cache hit rate in PD split scenario ( #7364 )
...
## Motivation
在 PD 分离场景下,decode 节点在接收 prefill 节点转发的请求后,没有及时更新 cache block 的命中信息,
导致 prefix cache 命中率低,影响推理性能。
## Modifications
1. 在 `_free_blocks_when_stop` 方法中,额外排除 prefill 节点(`splitwise_role == "prefill"`)
的 cache block 更新,避免 prefill 节点重复更新 cache 导致状态混乱。
2. 在 decode 节点分配请求(`_alloc_requests_with_cache`)成功后,主动调用
`update_cache_blocks` 使用 `need_prefill_tokens` 更新 cache block 信息,
确保 decode 节点能正确感知已命中的 prefix cache。
2026-04-14 16:15:43 +08:00
K11OntheBoat
bb48bcbaa2
Split enable_mm ( #7183 )
...
Co-authored-by: liuruian <liuruian@MacBook-Pro.local >
2026-04-08 11:25:41 +08:00
Yonghua Li
3b8dac3b97
[BugFix] prevent requests from entering running state without a slot ( #7141 )
...
* [fix] prevent requests from entering running state without a slot
* [fix] count abort set
* [fix] count preempted task in waiting list
2026-04-03 14:07:57 +08:00
chenjian
2632e6cf32
[Feature] Support chunk prefill disabled in scheduler v1 ( #7152 )
2026-04-03 10:18:14 +08:00
jc
af51fc46d6
[PD Disaggregation] Write the cache of preempted req to storage and refine PD Disaggregation ( #7107 )
...
* Write the cache of preempted req to storage
* up
* fix
2026-04-01 13:15:52 +08:00
qwes5s5
daa95244f7
abort requests ( #6992 )
2026-03-31 11:02:26 +08:00
freeliuzc
4fd877ed43
[Speculative Decoding] Support mtp expert-parallel and support different modality deploy ( #7018 )
...
* support mtp ep and support different modality
* fix default arg
2026-03-26 13:52:16 +08:00
qwes5s5
3b7507a4c2
test_abort ( #6743 )
2026-03-17 14:06:40 +08:00
jc
04fde3b227
[PD Disaggregation] Prefill and decode support cache storage ( #6768 )
...
* Prefill and decode support cache storage
* up
* up
* update docs and refine mooncake store
* up
2026-03-16 14:44:49 +08:00
freeliuzc
cf7934a4b2
[Speculative Decoding] Unify Spec and non-spec branch ( #6685 )
...
* optimize spec-inference architecture
* delete debug log
* optimize spec_method usage && fix unit_test
* add claude unit-test skill
* fix some ugly bug
* enhance robustness and bounds check
* unify method & spec_method to method to avoid bug
* activate CI
* fix unit test
* Unify logprobs computation for naive and speculative decoding, fix CUDA kernel
* fix logprob bug && optimize verify kernel
* fix exist_decode() judge
2026-03-10 23:58:44 -07:00
ddchenhao66
a502dda1fe
[BugFix] fix multi-step mtp bug ( #6754 )
2026-03-11 10:16:04 +08:00
Yonghua Li
fa1906bd6f
[BugFix] Fix inaccurate cache hit rate and TTFT after request preemption ( #6620 )
...
* [chore] add has_been_rescheduled flag for requests
* [refactor] rename reschedule to preempted for accuracy and fix cache hit metrics
* [chore] add ttft_s
2026-03-05 16:25:02 +08:00
kevin
5d42f19e0a
[BugFix][Scheduler] Fix can_schedule_block_num_threshold calculation ( #6541 )
...
* fix mtp acceptance rate decline
* [BugFix][Scheduler] Fix can_schedule_block_num_threshold calculation
Fix the calculation of can_schedule_block_num_threshold in
ResourceManagerV1. The original formula using need_prefill_tokens
could lead to incorrect threshold values. Now directly use
num_chunk_new_block for accurate block scheduling.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com >
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-28 16:23:18 +08:00
gongweibao
edd31e8849
[Feature] Add Deterministic Inference Support ( #6476 )
...
* add
* [tests] Add Paddle attention determinism tests and refactor resource manager
Add comprehensive determinism tests for Paddle attention layer and refactor
resource manager for deterministic mode support.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com >
* add
* add
* add
* add
* add more
* add more
* fixsome
* fixsome
* fix bugs
* fix bugs
* only in gpu
* add docs
* fix comments
* fix some
* fix some
* fix comments
* add more
* fix potential problem
* remove not need
* remove not need
* remove no need
* fix bug
* fix bugs
* fix comments
* fix comments
* Update tests/ce/deterministic/test_determinism_verification.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/inter_communicator/test_ipc_signal.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/layers/test_paddle_attention_determinism.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/engine/test_sampling_params_determinism.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/layers/test_paddle_attention_determinism.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/layers/test_paddle_attention_determinism_standalone.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* fix comments
* fix import error
* fix a bug
* fix bugs
* fix bugs
* fix coverage
* refine codes
* refine code
* fix comments
* fix comments
* fix comments
* rm not need
* fix allreduce large tensor bug
* mv log files
* mv log files
* add files
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2026-02-26 19:31:51 -08:00
Yuanle Liu
6d3fede240
[OP][Feature] 统一 limit_thinking_content_length CUDA 算子,支持回复长度限制与注入序列 ( #6493 )
...
* Initial plan
* Migrate PRs #6311 , #6129 , #6305 to develop and merge unit tests
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
* fix
* update
* fix
* fix ci
* fix ci
* Initial plan
* test: add test_chat_with_response_max_tokens to test_EB_VL_Lite_serving.py
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
* test: add disable-thinking case to test_chat_with_response_max_tokens
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
* test: add both reasoning_max_tokens and response_max_tokens case
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
* fix ci
* fix ci
* fix ci
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
2026-02-25 21:36:50 +08:00
CSWYF3634076
7380bfb476
[BugFix]fix console log metrics waitting queue count ( #6432 )
...
* [BugFix]fix console log metrics waitting queue count
* [BugFix]fix console log metrics waitting queue count unittest
2026-02-11 10:51:49 +08:00
Dangweichong
62ac1e543f
[BugFix] Compatibility fix for download feature links ( #6276 )
...
* [BugFix] Compatibility fix for download feature links
* add download time log
* remove paddle tensor case
2026-02-10 14:21:08 +08:00
CSWYF3634076
335ab70b1c
[Feature] console print metrics add env ( #6413 )
2026-02-10 09:37:11 +08:00
Yonghua Li
5ac5ecd0b0
[BugFix] fix cache transfer tasks failure after cache cleared ( #6202 )
...
* [fix] fix cache transfer tasks failure after cache cleared
* [fix] fix submit_task
* [fix] fix cache manager hang when clearing prefix cache
* [fix] fix list_proxy has no clear method
* [fix] fix barrier
* [fix] add barrier0
* [fix] add cache_task_is_paused_signal
* [fix] fix condition
* [fix] fix cache transfer sync and delay prefix cache tree clearing
* [fix] fix typo
* [chore] polish code
* [fix] revert only rank0 write kv_cache_status_signal
* [fix] fix thread pool and prefix cache manager hang
* [fix] add timeout for task_swapping_event
* [fix] tolerate prefix cache manager error while prefix tree is cleared
* [chore] add more log
* [fix] fix test_prefix_cache_manager
* [fix] fix prefix_cache_status_signal usage
2026-02-08 15:33:56 +08:00
CSWYF3634076
1c0a2b055f
[Feature] console print statistical metrics ( #6339 )
...
* [Feature] console print statistical data
* [Feature] console print statistical data v2 dp_rank
* [Feature] console print statistical data v2 unittest
* [Feature] console print statistical data v3 unittest
2026-02-05 19:20:36 +08:00
chenjian
292bab7e6d
[BugFix] Fix bug for enable output caching ( #6226 )
...
* [BugFix] Fix bug for enable output caching
* fix
* Fix
* fix
* fix ci
2026-01-30 10:55:36 +08:00
wangyifei
b7c5daa316
[RL] add pause, update_weights, resume interface for async RL ( #6052 )
...
* support dynamic run_control_request through zmq from apiserver to common_engine
* support pause/resume/is_paused/update_weights in apiserver->common_engine by common run_control_method
* change /is_puased from HTTP POST method to GET method
* add pause、resume、is_paused implementation
* support engine <==> worker communication(request&response)
* support sync weights through RDMA from checkpoint_transfer
* support specified version, rsync_config in update_weights rpc call
* add pause, update_weights, resume interface for async RL
* bug fix: update_weights support using default arguments
* fix typo
* typo fix
* typo fix
* typo fix
* add unitest for control request/response, localscheduler.get_inflight_requests, resource_manager_v1.preempted_all
* add "rsync" to LoadConfig.load_strategy Literal type hints
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* typo fix
* typo fix
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* check version/rsync params
* add error log when version.txt not exists
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* raise specified ValueError when paramters check failed
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* tp barrier after run_control_method
* encode 'engine_worker_queue_port' to unique name of worker2engine fmq queue
* typo fix
* typo fix
---------
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2026-01-23 10:18:07 +08:00
Yonghua Li
8d27a523e7
[Feature] [KVCache] support attention_store kv cache backend ( #5823 )
...
* [feat] support attention_store kv cache backend
* [fix] fix codestyle
* [chore] optimize log
* [fix] fix write storage task
* [fix] fix read storage
* [fix] fix code conflict after merge develop
* [fix] fix cache bytes and read task token ids
* [chore] add model for cache transfer manager
* [chore] add some log
* [chore] remove launched_cache_manager_signal
* [fix] fix write_back_storage_task match_block_num condition
* [fix] fix swap_cost_time
* [ci] fix ci
* Update fastdeploy/engine/sched/resource_manager_v1.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update fastdeploy/cache_manager/cache_transfer_manager.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update fastdeploy/cache_manager/transfer_factory/mooncake_store/attention_store.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
---------
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2026-01-22 21:01:23 +08:00
ddchenhao66
3685474799
[XPU] xpu support mm prefill batch ( #6072 )
...
Co-authored-by: ddchenhao66 <dhaochen163.com>
2026-01-19 14:36:35 +08:00
kevin
0e0eaa1c57
[BugFix] fix mm revert bug ( #6061 )
...
* fix mm revert bug
* update code
2026-01-16 08:13:34 -08:00
qwes5s5
b2a2e11551
[Feature] Support stopping the inference for the corresponding request in the online service after a disconnection request. ( #5320 )
...
* request disconnect
* request disconnect
* fix bug
* fix bug--amend
---------
Co-authored-by: root <root@yq01-sys-rpm26xc1knu.yq01.baidu.com >
2026-01-16 11:46:13 +08:00
ddchenhao66
9373f373dc
[XPU] fix multi-batch bug in VL model ( #6015 )
...
* [XPU] fix multi-batch bug in VL model
* Add command to kill additional port processes
---------
Co-authored-by: ddchenhao66 <dhaochen163.com>
Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com >
2026-01-14 19:44:58 +08:00
chenjian
74d0f1c01f
[Optim] Robust sync status when preempted happens ( #5796 )
...
* [Bug fix] Sync status for caching output cache
* fix
* fix
* fix bug
* fix
* fix
* support xpu
* fix
* fix
* fix
* fix
* fix
* fix ci
* fix ci
* fix xpu
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2026-01-14 12:07:33 +08:00
kevin
cb9f952f32
[BugFix] fix metrics cache tokens ( #6001 )
...
* fix metrics cache tokens
* update code
2026-01-12 22:50:56 -08:00
zhupengyang
9db48ecb34
[XPU] fix dp4 ( #5946 )
2026-01-09 20:36:53 +08:00
kevin
2d2b156252
[BugFix] fix dyc8 cache bug ( #5958 )
...
* fix dyc8 cache bug
* update code
2026-01-08 19:25:47 -08:00
Daci
d8c6ba61f3
[BugFix] resource_manager_v1 lock PD ( #5616 )
...
* bugfix resource_manager_v1 lock PD
* with lock add_prefilled_request
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com >
2026-01-08 10:02:54 +08:00
chenjian
925e7edd3c
[Bug fix] Limit multi-modal request to 1 ( #5901 )
2026-01-07 20:25:07 +08:00
chenjian
c883a2d3ec
[Optimization] Reduce preemption occurrence when blocks not enough ( #5696 )
...
* [Optimize] Reduce preemption occurrence when blocks not enough for decoding
* fix
* fix
* fix spell
* optimize performance
* fix
2026-01-07 20:01:16 +08:00
kevin
eabd01cd21
[BugFix] fix eb5 prefix bug ( #5879 )
...
* fix eb5 prefix bug
* update ci test
* update code
* update code
* update code
* update code
* update code
* update code
* update code
2026-01-06 23:50:39 -08:00
fmiao2372
1ee285c2d6
[Intel HPU] enable chunked prefill ( #5903 )
...
* [Intel HPU] enable chunked prefill
* fix bug by copilot comments
2026-01-06 21:01:50 +08:00
jc
e911ac2ce7
[BugFix] Refine the preparation of cpu and storage cache ( #5777 )
...
* Refine the preparation of cpu and storage cache
* fix error
* fix error
* up
* fix
* up docs
* fix unittest
* remove debug info
2026-01-05 10:13:30 +08:00
kevin
52dc9a7b85
[BugFix] skip mm revert ( #5848 )
...
* skip mm revert
* update code
* update test
2026-01-04 14:25:45 +08:00
kevin
74e162697f
eb5 mm skip prefix cache ( #5838 )
2025-12-30 05:30:48 -08:00
CSWYF3634076
9286403570
[Models] Add Qwen3-VL Model Support ( #5763 )
...
* support v1 loader
* remove useless code
* remove useless
* [Model] support Qwen3VL images success
* [Model] support Qwen3VL rope_3d
* [Model] support Qwen3VL remove log
* [Model] support Qwen3VL RL
* [Model] support Qwen3VL tp
* [Model] support Qwen3VL video
* [Model] support Qwen3VL fix ernievl
* [Model] support Qwen3VL fix get_image_boundaries.cc array out of bounds
* [Model] support Qwen3VL fix multi card
* [Model] support Qwen3VL file close
* [Model] support Qwen3VL fix ce
* [Model] support Qwen3VL fix unittest
* [Model] support Qwen3VL add unittest
---------
Co-authored-by: Ayakouji <yuhongh@qq.com >
2025-12-29 17:39:33 +08:00
kevin
894f4e312b
[FDConfig] disable chunked_mm_input in ernie5 ( #5774 )
...
* disable chunked_mm_input in ernie5
* update code
* update code
* update test case
* update testcase
* upate case
2025-12-26 15:31:27 +08:00
RichardWooSJTU
01c18f328f
rename need_block_num_signal ( #5623 )
2025-12-26 11:02:29 +08:00
Juncai
412867fd99
[Feature] Support KV Cache Storage ( #5571 )
...
* Support Mooncake Store
* up
* up
* add op
* fix conflict
* fix error
* up for comments
* avoid thread lock
* up
* fix unittest
* fix unittest
* remove debug info
* consider tp_size > 1
* add default rdma_nics
* add utils
* up
* fix error
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2025-12-25 16:30:35 +08:00
ming1753
85db9d5e56
[Others] reschedule preempt task support optional func ( #5649 )
...
CE Compile Job / ce_job_pre_check (push) Has been cancelled
CE Compile Job / print_ce_job_pre_check_outputs (push) Has been cancelled
CE Compile Job / FD-Clone-Linux (push) Has been cancelled
CE Compile Job / Show Code Archive Output (push) Has been cancelled
CE Compile Job / BUILD_SM8090 (push) Has been cancelled
CE Compile Job / BUILD_SM8689 (push) Has been cancelled
CE Compile Job / CE_UPLOAD (push) Has been cancelled
Deploy GitHub Pages / deploy (push) Has been cancelled
Publish Job / publish_pre_check (push) Has been cancelled
Publish Job / print_publish_pre_check_outputs (push) Has been cancelled
Publish Job / FD-Clone-Linux (push) Has been cancelled
CI Images Build / FD-Clone-Linux (push) Has been cancelled
CI Images Build / Show Code Archive Output (push) Has been cancelled
CI Images Build / CI Images Build (push) Has been cancelled
CI Images Build / BUILD_SM8090 (push) Has been cancelled
CI Images Build / Run FastDeploy Unit Tests and Coverage (push) Has been cancelled
CI Images Build / Run FastDeploy LogProb Tests (push) Has been cancelled
CI Images Build / Extracted partial CE model tasks to run in CI. (push) Has been cancelled
CI Images Build / Run Base Tests (push) Has been cancelled
CI Images Build / Publish Docker Images Pre Check (push) Has been cancelled
* [Others] reschedule preempt task support optional func
* fix bug
* fix bug
2025-12-23 20:45:52 +08:00
ming1753
81384ef29e
[BugFix] fix download feature bug ( #5669 )
2025-12-22 13:46:39 +08:00
kevin
807e404369
[BugFix] fix eb5 mm prefix cache bug ( #5638 )
...
* fix eb5 mm prefix cache bug
* update code
---------
Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com >
2025-12-19 14:57:37 +08:00
yzwu
ac013803f3
[Iluvatar] Support V1_KVCACHE_SCHEDULER and paddleocr-vl rope mode ( #5555 )
2025-12-18 02:14:25 -08:00
Jiang-Jia-Jun
8b6395478a
Revert "[BugFix] reschedule_preempt_task append waiting & PREEMPTED blocksize…" ( #5575 )
...
This reverts commit dbedb0797b .
2025-12-16 11:12:57 +08:00
Daci
dbedb0797b
[BugFix] reschedule_preempt_task append waiting & PREEMPTED blocksize ( #5506 )
...
* bugfix reschedule_preempt_task append waiting & PREEMPTED blocksize
* bugfix reschedule_preempt_task append waiting & PREEMPTED blocksize
* 注释
* [bugfix] PREEMPTED task blocksize
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2025-12-12 17:43:29 +08:00