K11OntheBoat
bb48bcbaa2
Split enable_mm ( #7183 )
...
Co-authored-by: liuruian <liuruian@MacBook-Pro.local >
2026-04-08 11:25:41 +08:00
sunxin
ae2f9f4d22
[BugFix] Enable moe_gate_fp32 using FD_ENABLE_RL ( #7130 )
...
* rl gate fp32
* clean
2026-04-06 21:07:38 -07:00
Yonghua Li
3b8dac3b97
[BugFix] prevent requests from entering running state without a slot ( #7141 )
...
* [fix] prevent requests from entering running state without a slot
* [fix] count abort set
* [fix] count preempted task in waiting list
2026-04-03 14:07:57 +08:00
chenjian
2632e6cf32
[Feature] Support chunk prefill disabled in scheduler v1 ( #7152 )
2026-04-03 10:18:14 +08:00
Yonghua Li
98f3fc9267
[RL] [KVCache] let cache transfer managers update key prefix after weight update and add unit tests ( #7083 )
...
* [test] add a few unit tests
* [feat] update key prefix when model weights are updated
* [test] try to fix test_worker_process
2026-04-02 19:58:41 +08:00
sunxin
c29e86fc9d
[Feature] Support mtp overlap schedule ( #7001 )
2026-04-01 14:24:26 +08:00
zhouchong
91c832f607
[Feature] Add logging parameters and error output to terminal ( #7098 )
2026-04-01 13:18:42 +08:00
jc
af51fc46d6
[PD Disaggregation] Write the cache of preempted req to storage and refine PD Disaggregation ( #7107 )
...
* Write the cache of preempted req to storage
* up
* fix
2026-04-01 13:15:52 +08:00
luukunn
3651113ee5
[DataProcessor]Remove ENABLE_V1_DATA_PROCESSOR ( #7052 )
...
* remove ENABLE_V1_DATA_PROCESSOR
* fix unit test
* fix unit test
2026-04-01 09:53:41 +08:00
qwes5s5
daa95244f7
abort requests ( #6992 )
2026-03-31 11:02:26 +08:00
chenjian
6727df8286
[Optimization] Optimize ttft for prefill pd ( #6680 )
...
* optimize ttft
* fix
* fix
* fix ci
* fix ci
* fix
* fix bug
* fix
* add comments
* fix ci
* fix
* fix ci
* fix format
* update according to review
* add comment
* fix
* fix format
2026-03-30 20:36:23 +08:00
freeliuzc
4fd877ed43
[Speculative Decoding] Support mtp expert-parallel and support different modality deploy ( #7018 )
...
* support mtp ep and support different modality
* fix default arg
2026-03-26 13:52:16 +08:00
Yonghua Li
a7f52c300d
[Feature] support v1 update/clear api for RL ( #6761 )
...
* [Feature] support v1 update/clear api for RL
* [fix] fix execute_model and add sleep/wakeup api
* [fix] fix mtp and key_prefix
* [chore] move _update_key_prefix to resume method
* [fix] make the interface safe to call multiple times
* [fix] fix some tiny bugs
* [chore] make small changes against pr review
* [docs] add docs for weight update
* [test] add some tests and update docs
* [style] fix code style check
* [test] fix ci
* [fix] fix stale control responses when control method timed out
* [chore] remove unused code
* [chore] fix code style
* [chore] optimize tags and key_prefix
* [test] fix ci
* [chore] fix code style
* [test] fix ci
* [fix] fix ep control
* [fix] fix ep control for engine cache queue
2026-03-25 19:18:46 +08:00
jc
bb881c2c0a
[PD Disaggregation] pd + cache_storage support vl model ( #6906 )
...
* pd + cache_storage support vl model
* support vl model
* fix test
2026-03-23 15:35:20 +08:00
luukunn
f4a79d4c00
[Optimization]Unified data processing for online and offline ( #6891 )
...
* remove process_request
* fix chat
* fix unit test
* remove process response
* fix unit test
* fix offline decode
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com >
* fix sampling_params
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com >
2026-03-19 21:56:09 +08:00
luukunn
c3d8db85c4
[Optimization] Update ZMQ server ( #6735 )
...
* add batch zmq send response
* update
* Revert "update"
This reverts commit 0234a25b47 .
* update
* remove lock
* fix unit test
* add unit test
* add unit test
* pre commit
* add unit test
* fix unit test
* add unit test
* fix worker>1
* update zmq_worker_pid
* fix unit test
* fix unit test
* fix unit test
* add unit test
* fix unit test
* fix first token time
* fix logprobs
* add unit test
* op
* remove debug log
---------
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com >
2026-03-19 21:53:16 +08:00
Jiang-Jia-Jun
12eb001d0c
Remove comments on multi-mode request handling
...
Removed comments about multi-mode scenarios and request pulling.
2026-03-17 14:49:00 +08:00
jc
950366e58d
[PD Disaggregation][RL] Register to router with version and support rdma eager connect for pd ( #6718 )
...
* [Feature] Register to router with version info for PD disaggregation
Add RegisterManager for PD (Prefill-Decode) disaggregated deployment:
- All instances (Prefill/Decode) register to Router with heartbeat
- Prefill instances fetch Decode instance list from Router
- Prefill instances establish eager RDMA connections to Decode instances
- Register info includes: host_ip, port, role, version, is_paused, connected_decodes
Changes:
- Add RegisterManager class for managing PD registration and RDMA connections
- Add version field to ModelConfig for model version tracking
- Add connected_decodes to register_info for tracking connected Decode instances
- Add FD_ENABLE_PD_RDMA_EAGER_CONNECT environment variable
Test fixes:
- Add None checks for load_config in FDConfig.__init__
- Add version attribute to test mock model configs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com >
* refine
* remove test
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-17 14:43:35 +08:00
qwes5s5
3b7507a4c2
test_abort ( #6743 )
2026-03-17 14:06:40 +08:00
gongweibao
a6351dea0b
[BugFix][Optimization] Replace silent failures with catchable exceptions and informative error messages ( #6533 )
...
* init
* init
* fix format
* add
* add files
* add ut
* fix some
* add ut
* add more
* add
* fix pre-commit
* fix pre-commit
* fix cover
* skip long seq
* add
* add
* fix
* remove not need
* fix set attr
* fix comments
* fix comments
* fix failed tests
---------
Co-authored-by: gongweibao <gognweibao@baidu.com >
2026-03-16 21:32:43 +08:00
Jiang-Jia-Jun
d113397b09
Simplify available_blocks assignment logic ( #6819 )
2026-03-16 20:12:30 +08:00
jc
04fde3b227
[PD Disaggregation] Prefill and decode support cache storage ( #6768 )
...
* Prefill and decode support cache storage
* up
* up
* update docs and refine mooncake store
* up
2026-03-16 14:44:49 +08:00
RichardWooSJTU
9f0778f991
[Feature] Support EP prefill with num_worst_tokens ( #6574 )
...
* support num worst tokens
* support num worst tokens
* fix build error
* support num worst tokens: fix errors
* support num worst tokens: fix field
* support num worst tokens: delete requirements
* replace permute and depermute op by pure cuda
* replace permute and depermute op by pure cuda
* fix ci
* fix op
* fix nan
* fix code style
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2026-03-11 17:09:07 +08:00
freeliuzc
cf7934a4b2
[Speculative Decoding] Unify Spec and non-spec branch ( #6685 )
...
* optimize spec-inference architecture
* delete debug log
* optimize spec_method usage && fix unit_test
* add claude unit-test skill
* fix some ugly bug
* enhance robustness and bounds check
* unify method & spec_method to method to avoid bug
* activate CI
* fix unit test
* Unify logprobs computation for naive and speculative decoding, fix CUDA kernel
* fix logprob bug && optimize verify kernel
* fix exist_decode() judge
2026-03-10 23:58:44 -07:00
ddchenhao66
a502dda1fe
[BugFix] fix multi-step mtp bug ( #6754 )
2026-03-11 10:16:04 +08:00
Jiang-Jia-Jun
b05a6c4206
[BugFix][KVCache] Add inter-process lock to fix NaN error under DP+EP ( #6724 )
...
* [BugFix] Support to fix NaN bug in EP
* Optimize annotations for all functions
* Fix potential lock contention failure issues
* Update fastdeploy/inter_communicator/ipc_signal.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update envs.py
* Update default value for USE_KVCACHE_LOCK
Change default value of USE_KVCACHE_LOCK from 1 to 0.
* Update worker_process.py
* Fix wrong suffix
* Update test_prefix_cache_manager.py
---------
Co-authored-by: Jiang-Jia-Jun <jiangjiajun@baidu.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2026-03-10 21:55:32 +08:00
sunxin
812657beee
fix pd overlap ( #6753 )
2026-03-10 20:29:54 +08:00
sunxin
28f7727a3d
[Feature] Set overlap schedule as default ( #6668 )
...
* overlap default
2026-03-09 22:34:54 +08:00
1
3a85ecf3bc
[Others] Fix typos in log messages and comments ( #6707 )
...
Fix spelling errors in log messages, docstrings, and comments:
- 'occured' -> 'occurred' (8 instances)
- 'Recieve'/'recieved' -> 'Receive'/'received' (7 instances)
- 'happend' -> 'happened' (3 instances)
- 'expet_servic' -> 'expert_service' (2 instances)
- 'meas' -> 'means' (1 instance)
No functional changes. Only log strings, docstrings, and comments are affected.
Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com >
2026-03-09 10:26:25 +08:00
ddchenhao66
3c0ff20328
[BugFix] fix incorrect function parameters of start_data_parallel_service ( #6674 )
2026-03-09 10:15:50 +08:00
SunLei
5d9524fc3c
[Models][Feature] Support new ERNIE reward model and add return_token_ids to reward API ( #6638 )
...
* reward model
* Add support for pooling-based inference in the reward model
* bugfix
---------
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com >
2026-03-06 18:51:00 +08:00
Yonghua Li
fa1906bd6f
[BugFix] Fix inaccurate cache hit rate and TTFT after request preemption ( #6620 )
...
* [chore] add has_been_rescheduled flag for requests
* [refactor] rename reschedule to preempted for accuracy and fix cache hit metrics
* [chore] add ttft_s
2026-03-05 16:25:02 +08:00
ddchenhao66
fa4815b93a
[BugFix] fix dp scheduler bug in ep4tp1 when started using multi_api_server ( #6598 )
...
* [BugFix] fix dp scheduler bug in ep4tp1 when started using multi_api_server
* [BugFix] modify request_queue and result_queue of dp scheduler
2026-03-05 10:04:12 +08:00
yzwu
3345641f4e
[Iluvatar][CI] fix the dim error of seq_lens_encoder and seq_lens_decoder ( #6637 )
2026-03-04 14:00:40 +08:00
qwes5s5
375b5b7b21
[Feature]Log Format Normalization and Trace Log Optimization ( #6370 )
...
* log refactor
* log refactor 2
* log refactor 3
2026-03-03 11:31:45 +08:00
kevin
5d42f19e0a
[BugFix][Scheduler] Fix can_schedule_block_num_threshold calculation ( #6541 )
...
* fix mtp acceptance rate decline
* [BugFix][Scheduler] Fix can_schedule_block_num_threshold calculation
Fix the calculation of can_schedule_block_num_threshold in
ResourceManagerV1. The original formula using need_prefill_tokens
could lead to incorrect threshold values. Now directly use
num_chunk_new_block for accurate block scheduling.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com >
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-28 16:23:18 +08:00
sunxin
53aaac69da
[Optimization] Enable BF16 gate computation for GLM and Qwen ( #6457 )
...
* gate bf16
* add gate-fp32
* fix
* update baseline
* update
* update
* fix
2026-02-26 21:08:46 -08:00
gongweibao
edd31e8849
[Feature] Add Deterministic Inference Support ( #6476 )
...
* add
* [tests] Add Paddle attention determinism tests and refactor resource manager
Add comprehensive determinism tests for Paddle attention layer and refactor
resource manager for deterministic mode support.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com >
* add
* add
* add
* add
* add more
* add more
* fix some
* fix some
* fix bugs
* fix bugs
* only in gpu
* add docs
* fix comments
* fix some
* fix some
* fix comments
* add more
* fix potential problem
* remove not need
* remove not need
* remove no need
* fix bug
* fix bugs
* fix comments
* fix comments
* Update tests/ce/deterministic/test_determinism_verification.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/inter_communicator/test_ipc_signal.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/layers/test_paddle_attention_determinism.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/engine/test_sampling_params_determinism.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/layers/test_paddle_attention_determinism.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/layers/test_paddle_attention_determinism_standalone.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* fix comments
* fix import error
* fix a bug
* fix bugs
* fix bugs
* fix coverage
* refine codes
* refine code
* fix comments
* fix comments
* fix comments
* rm not need
* fix allreduce large tensor bug
* mv log files
* mv log files
* add files
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2026-02-26 19:31:51 -08:00
Yuanle Liu
6d3fede240
[OP][Feature] Unify the limit_thinking_content_length CUDA operators, supporting response length limits and injected sequences ( #6493 )
...
* Initial plan
* Migrate PRs #6311 , #6129 , #6305 to develop and merge unit tests
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
* fix
* update
* fix
* fix ci
* fix ci
* Initial plan
* test: add test_chat_with_response_max_tokens to test_EB_VL_Lite_serving.py
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
* test: add disable-thinking case to test_chat_with_response_max_tokens
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
* test: add both reasoning_max_tokens and response_max_tokens case
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
* fix ci
* fix ci
* fix ci
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
2026-02-25 21:36:50 +08:00
jackyYang6
a29ee57e15
[Feature] Support ThinkingBudget Logits processor to control thinking content length ( #6367 )
...
* feat: add thinking budget logits processor
* add unittest
* fix pre-commit
* add unittest
* docs: clarify operator-level vs logits processor usage and conflict guidance
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2026-02-25 14:17:09 +08:00
CSWYF3634076
7380bfb476
[BugFix] fix console log metrics waiting queue count ( #6432 )
...
* [BugFix] fix console log metrics waiting queue count
* [BugFix] fix console log metrics waiting queue count unittest
2026-02-11 10:51:49 +08:00
Dangweichong
62ac1e543f
[BugFix] Compatibility fix for download feature links ( #6276 )
...
* [BugFix] Compatibility fix for download feature links
* add download time log
* remove paddle tensor case
2026-02-10 14:21:08 +08:00
CSWYF3634076
335ab70b1c
[Feature] console print metrics add env ( #6413 )
2026-02-10 09:37:11 +08:00
CSWYF3634076
ec128068b7
[Others] Exit to ensure no residual processes (cpu cache & dp) ( #6377 )
...
* [Others] good exit single dp
* [Others] good exit cpu cache dp>1
* [Others] good exit cpu cache dp>1 unittest
2026-02-09 20:38:38 +08:00
chenjian
35c24f3f71
Revert "[Optimize] Optimize ttft for ep ( #6098 )" ( #6402 )
...
This reverts commit 90db0bdd0d .
2026-02-09 19:01:23 +08:00
kevin
d60daca4a8
[Feature] consider multimodal model when dummy run ( #6045 )
...
* add mm do profile
* update code
* update code
* update code
* update code
* update test case
* update code
* update code
* fix xpu bug
* update code
* add mm do profile
* update test case
* update code
2026-02-09 17:49:55 +08:00
CSWYF3634076
eb8d639fe3
[Engine] apiserver&engine exit when work failed to start ( #6322 )
2026-02-09 15:07:40 +08:00
Yonghua Li
5ac5ecd0b0
[BugFix] fix cache transfer tasks failure after cache cleared ( #6202 )
...
* [fix] fix cache transfer tasks failure after cache cleared
* [fix] fix submit_task
* [fix] fix cache manager hang when clearing prefix cache
* [fix] fix list_proxy has no clear method
* [fix] fix barrier
* [fix] add barrier0
* [fix] add cache_task_is_paused_signal
* [fix] fix condition
* [fix] fix cache transfer sync and delay prefix cache tree clearing
* [fix] fix typo
* [chore] polish code
* [fix] revert only rank0 write kv_cache_status_signal
* [fix] fix thread pool and prefix cache manager hang
* [fix] add timeout for task_swapping_event
* [fix] tolerate prefix cache manager error while prefix tree is cleared
* [chore] add more log
* [fix] fix test_prefix_cache_manager
* [fix] fix prefix_cache_status_signal usage
2026-02-08 15:33:56 +08:00
CSWYF3634076
1c0a2b055f
[Feature] console print statistical metrics ( #6339 )
...
* [Feature] console print statistical data
* [Feature] console print statistical data v2 dp_rank
* [Feature] console print statistical data v2 unittest
* [Feature] console print statistical data v3 unittest
2026-02-05 19:20:36 +08:00
chenjian
90db0bdd0d
[Optimize] Optimize ttft for ep ( #6098 )
...
* optimize ttft
* fix
* fix
* fix ci
* fix ci
* fix
* fix bug
* fix
* add comments
* fix ci
* fix
2026-02-04 15:03:29 +08:00