Commit Graph

87 Commits

Author SHA1 Message Date
qwes5s5 3b7507a4c2 test_abort (#6743) 2026-03-17 14:06:40 +08:00
Yonghua Li 7811eeccaa [fix] resolve get_save_output_v1 socket name conflicts between multiple instances (#6758) 2026-03-11 15:02:32 +08:00
freeliuzc cf7934a4b2 [Speculative Decoding] Unify Spec and non-spec branch (#6685)
* optimize spec-inference architecture

* delete debug log

* optimize spec_method usage  && fix unit_test

* add claude unit-test skill

* fix some ugly bug

* enhance robustness and bounds check

* unify method & spec_method to method to avoid bug

* activate CI

* fix unit test

* Unify logprobs computation for naive and speculative decoding, fix CUDA kernel

* fix logprob bug && optimize verify kernel

* fix exist_decode() judge
2026-03-10 23:58:44 -07:00
1 3a85ecf3bc [Others] Fix typos in log messages and comments (#6707)
Fix spelling errors in log messages, docstrings, and comments:
- 'occured' -> 'occurred' (8 instances)
- 'Recieve'/'recieved' -> 'Receive'/'received' (7 instances)
- 'happend' -> 'happened' (3 instances)
- 'expet_servic' -> 'expert_service' (2 instances)
- 'meas' -> 'means' (1 instance)

No functional changes. Only log strings, docstrings, and comments are affected.

Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com>
2026-03-09 10:26:25 +08:00
SunLei 5d9524fc3c [Models][Feature] Support new ERNIE reward model and add return_token_ids to reward API (#6638)
* reward model

* Add support for pooling-based inference in the reward model

* bugfix

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-03-06 18:51:00 +08:00
Yonghua Li fa1906bd6f [BugFix] Fix inaccurate cache hit rate and TTFT after request preemption (#6620)
* [chore] add has_been_rescheduled flag for requests

* [refactor] rename reschedule to preempted for accuracy and fix cache hit metrics

* [chore] add ttft_s
2026-03-05 16:25:02 +08:00
yzwu 6674131b0b [Iluvatar] Support CudaGraph and optimize flash_attn_unpadded and fused_neox_rope_embedding (#6553) 2026-03-02 14:07:17 +08:00
GoldPancake 2178f2829b [Speculative Decoding] Support suffix decoding (#6403)
* support suffix decoding
2026-02-26 11:42:05 +08:00
Jiang-Jia-Jun 4e06df520e [Feature] 统一请求完成日志格式并增强统计信息 (#6405)
将原来分散的两行日志合并为一行,同时增加更多统计信息展示。

主要变更:
- 整合原有的 "Request finished" 和 "token ratio" 两行日志为单行格式
- 新增 InputToken:输入token数量
- 新增 CachedDetail:缓存详情(包含CachedToken/GPU/CPU)
- 新增 OutputToken:输出token数量
- 新增 TTFT:首Token时延(秒)
- 新增 E2E:整句时延(秒)
- 保留 IsPrefill 和 RecoveryStop 标志

新日志格式示例:
Request=chatcmpl-xxx, InputToken=18, CachedDetail={"CachedToken": 0, "GPU": 0, "CPU": 0}, OutputToken=247, TokenRatio=315.77, TTFT=0.02, E2E=0.78, IsPrefill=False, RecoveryStop=False

Co-authored-by: Ducc <ducc@baidu.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-09 21:06:55 +08:00
chen f18f3b99ed fix zmq hung when sampled_token_id=0 (#6398) 2026-02-09 14:13:18 +08:00
CSWYF3634076 1c0a2b055f [Feature] console print statistical metrics (#6339)
* [Feature] console print statistical data

* [Feature] console print statistical data v2 dp_rank

* [Feature] console print statistical data v2 unittest

* [Feature] console print statistical data v3 unittest
2026-02-05 19:20:36 +08:00
chenjian 292bab7e6d [BugFix] Fix bug for enable output caching (#6226)
* [BugFix] Fix bug for enable output caching

* fix

* Fix

* fix

* fix ci
2026-01-30 10:55:36 +08:00
qwes5s5 38378415c7 add token ratio metrics (#6236) 2026-01-27 17:00:49 +08:00
wangyifei b7c5daa316 [RL] add pause, update_weights, resume interface for async RL (#6052)
* support dynamic run_control_request through zmq from apiserver to common_engine

* support pause/resume/is_paused/update_weights in apiserver->common_engine by common run_control_method

* change /is_puased from HTTP POST method to GET method

* add pause、resume、is_paused implementation

* support engine <==> worker communication(request&response)

* support sync weights through RDMA from checkpoint_transfer

* support specified version, rsync_config in update_weights rpc call

* add pause, update_weights, resume interface for async RL

* bug fix: update_weights support using default arguments

* fix typo

* typo fix

* typo fix

* typo fix

* add unitest for control request/response, localscheduler.get_inflight_requests, resource_manager_v1.preempted_all

* add "rsync" to LoadConfig.load_strategy Literal type hints

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* typo fix

* typo fix

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* check version/rsync params

* add error log when version.txt not exists

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* raise specified ValueError when paramters check failed

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* tp barrier after run_control_method

* encode 'engine_worker_queue_port' to unique name of worker2engine fmq queue

* typo fix

* typo fix

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-23 10:18:07 +08:00
qwes5s5 b2a2e11551 [Feature] Support stopping the inference for the corresponding request in the online service after a disconnection request. (#5320)
* request disconnect

* request disconnect

* fix bug

* fix bug--amend

---------

Co-authored-by: root <root@yq01-sys-rpm26xc1knu.yq01.baidu.com>
2026-01-16 11:46:13 +08:00
Daci e10b51b8c6 [Feature] get_output_kv_signal blocking read mode & send_first_token (#5836)
* get_output_kv_signal blocking read mode

* send first token before recycle

* xpu get_output_kv_signal blocking read mode

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-01-15 14:11:03 +08:00
chenjian 74d0f1c01f [Optim] Robust sync status when preempted happens (#5796)
* [Bug fix] Sync status for caching output cache

* fix

* fix

* fix bug

* fix

* fix

* support xpu

* fix

* fix

* fix

* fix

* fix

* fix ci

* fix ci

* fix xpu

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-14 12:07:33 +08:00
GoldPancake 3ca99ab170 [Speculative Decoding] Return accepted tokens per head in response (#5947)
* adjust log level

* add accepted tokens per head

* fix ut

* fix
2026-01-09 14:32:08 +08:00
GoldPancake 4e10ae5d99 [Speculative Decoding] Optimize draft logprob (#5842)
* optimize draft logprob

* fix ut
2025-12-31 13:35:56 +08:00
ddchenhao66 9e45ef7ca9 [XPU]MAX_BSZ aligns gpu settings and disable prefix cache in OCR VL (#5831) 2025-12-31 09:49:12 +08:00
chenjian 91a2b13676 [BugFix] Fix preemption out of real_bsz (#5805) 2025-12-29 09:52:36 +08:00
xiaolei373 a30b4da260 [Feature] Tracing: Fine-Grained Tracing for Request Latency Part1 (#5458) 2025-12-16 16:36:09 +08:00
chenjian 0100ee885f Fix bug for caching output when preempted (#5502) 2025-12-15 17:25:35 +08:00
chenjian 7b0fdf7055 add check health in FD (#5534) 2025-12-15 15:14:45 +08:00
GoldPancake 909059c60a [Feature] Support for request-level speculative decoding metrics monitoring. (#5518)
* support spec metrics monitor per request

* fix bug

* remove debug log

* fix ut bugs
2025-12-12 12:22:18 +08:00
Juncai 80efe98f8d [PD Disaggregation] Add timestamp for analyzing splitwise deployment (#5317)
* Add timestamp for analyzing splitwise deployment

* up

* up

* up

* up

* up

* up

* fix format

* fix
2025-12-08 10:08:44 +08:00
chenjian 3878a99b69 [Fearture] Support cache kv cache for output tokens (#4535)
* [Fearture] Support cache kv cache for output tokens

* fix bug

* fix ci bug

* improve coverage

* enable output caching by default

* fix ci

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-12-04 20:53:08 +08:00
Yonghua Li f4119d51b4 [PD Disaggregation] support DP via v1 router and decouple DP and EP (#5197)
* [fix] support DP via v1 router and decouple DP and EP

* [fix] fix scripts

* [fix] reset model path

* [fix] dp use get_output_ep, fix router port type, update scripts

* [merge] merge with latest code

* [chore] remove some debug log

* [fix] fix code style check

* [fix] fix test_multi_api_server for log_dir name

* [chore] reduce logs

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-12-04 15:38:43 +08:00
qw86972190 6048ea37bd [XPU]add enable_logprob (#5279)
* [XPU]Update document

* [XPU]Update documentation

* [XPU]add enable_logprob

* Fix code style issues

* “doc”

* “docs”

* “doc”

* Fix code style via pre-commit

---------

Co-authored-by: root <root@gajl-bbc-onlinec-com-1498354.gajl.baidu.com>
2025-12-02 15:32:28 +08:00
qwes5s5 117980dd4e [LogProbs]Enable prompt logprobs output and modify data transmission method for the online interface. (#5089)
* add prompt logprobs

* Merge prompt_logprobs_tensors and prompt_logprobs

* fix param check

* trigger ci

* fix unitest

* fix logprobs bug
2025-12-02 13:49:51 +08:00
cmcamdy 9f4977eb74 [xpu] support mtp for xpu(mix) (#5274)
* [XPU] support kernel for mtp(base)

* [XPU] support kernel for mtp(base)

* format

* format

* format

* fix gather next token

* fix step && add test

* fix

* mv pre/post process

* add adjust batch / gather next token for mtp

* fix code style

* fix mtp kenrel name

* fix mtp kernel test

* mv xpu pre/post process

* mv xpu pre/post process

* [xpu] support mtp

* fix code style
2025-12-01 11:03:14 +08:00
GoldPancake cfc5b0ccf9 [BugFix] fix mtp logprob bugs in chunk prefill (#5244)
* fix mtp logprob bugs in chunk prefill

* fix

* fix
2025-11-27 11:31:29 +08:00
SunLei c424e08dc5 [Speculative Decoding] split draft_tokens into standalone post-processing path (#5205)
* refactor(mtp): split draft_tokens into standalone post-processing path for MTP + logprobs

* Restore Request.__repr__ implementation

* ci

* add envs

* fix unittest
2025-11-27 11:22:41 +08:00
Yonghua Li cead6b26fa [Metrics] Update time_to_first_token to include tokenization & queue time, and remove redundant metrics (#4993)
* [update] update time_to_first_tokens to include queue time, and remove first_token_latency and infer_latency

* [doc] update docs

* [ci] fix test

* [chore] delete redundant code

---------

Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
2025-11-26 14:42:17 +08:00
GoldPancake ab3a2e45ff fix mtp reschedule (#5165) 2025-11-21 19:31:35 +08:00
chenjian 3ea1b44a58 [Optimization] Improve perf for fd response token with internal adapter (#4992)
* [Optimize] Improve perf for fd response token with internal adapter

* fix

* fix bug

* fix ci

* fix ci

* fix ci

* fix ci
2025-11-21 19:02:03 +08:00
SunLei d9f64adb0e fix: Fix block allocation issue when MTP and logprobs are enabled (#5077)
CE Compile Job / ce_job_pre_check (push) Has been cancelled
CE Compile Job / print_ce_job_pre_check_outputs (push) Has been cancelled
CE Compile Job / FD-Clone-Linux (push) Has been cancelled
CE Compile Job / Show Code Archive Output (push) Has been cancelled
CE Compile Job / BUILD_SM8090 (push) Has been cancelled
CE Compile Job / BUILD_SM8689 (push) Has been cancelled
CE Compile Job / CE_UPLOAD (push) Has been cancelled
Deploy GitHub Pages / deploy (push) Has been cancelled
Publish Job / publish_pre_check (push) Has been cancelled
Publish Job / print_publish_pre_check_outputs (push) Has been cancelled
Publish Job / FD-Clone-Linux (push) Has been cancelled
Publish Job / Show Code Archive Output (push) Has been cancelled
Publish Job / BUILD_SM8090 (push) Has been cancelled
Publish Job / BUILD_SM8689 (push) Has been cancelled
Publish Job / PADDLE_PYPI_UPLOAD_8090 (push) Has been cancelled
Publish Job / PADDLE_PYPI_UPLOAD_8689 (push) Has been cancelled
Publish Job / Run FD Image Build (push) Has been cancelled
Publish Job / Run FastDeploy Unit Tests and Coverage (push) Has been cancelled
Publish Job / Run FastDeploy LogProb Tests (push) Has been cancelled
Publish Job / Extracted partial CE model tasks to run in CI. (push) Has been cancelled
Publish Job / Run Base Tests (push) Has been cancelled
Publish Job / Run Accuracy Tests (push) Has been cancelled
Publish Job / Run Stable Tests (push) Has been cancelled
CI Images Build / FD-Clone-Linux (push) Has been cancelled
CI Images Build / Show Code Archive Output (push) Has been cancelled
CI Images Build / CI Images Build (push) Has been cancelled
CI Images Build / BUILD_SM8090 (push) Has been cancelled
CI Images Build / Run FastDeploy Unit Tests and Coverage (push) Has been cancelled
CI Images Build / Run FastDeploy LogProb Tests (push) Has been cancelled
CI Images Build / Extracted partial CE model tasks to run in CI. (push) Has been cancelled
CI Images Build / Run Base Tests (push) Has been cancelled
CI Images Build / Publish Docker Images Pre Check (push) Has been cancelled
2025-11-17 17:50:07 +08:00
qwes5s5 36216e62f0 [Log] Add trace log and add loggingInstrumentor tool (#4692)
* add trace logger and trace print

* trigger ci

* fix unittest

* translate notes and add copyright

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2025-11-17 11:08:57 +08:00
Juncai 36822fa49c [PD Disaggregation] remove splitwise deployment on single node and refine the code (#4891)
* remove splitwise deployment on single node and refine the code

* up

* up

* up

* add test

* up
2025-11-14 09:56:53 +08:00
Yonghua Li 6c5ab727c1 [BugFix] fix num_requests_running after clear_data (#4927)
* [BugFix] fix num_requests_running after clear_data

* [fix] fix tasks_list and stop flags not cleared when _free_blocks failed
2025-11-13 13:50:21 +08:00
qwes5s5 a2d06118e1 [Logprobs]Support prompt_logprobs and max_logprobs (#4897)
* add prompt logprobs

* trigger ci

* fix unitest

* Update fastdeploy/config.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update fastdeploy/entrypoints/llm.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update fastdeploy/engine/sampling_params.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tests/engine/test_sampling_params.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tests/engine/test_sampling_params.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix max_logprobs

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-12 19:29:48 +08:00
SunLei 3098aee05f [Perf] Support tensor transmission between work and engine with zero-copy to improve efficiency (#4839)
* feat(zmq): support tensor transmission with zero-copy for improved efficiency

* perf: zmq.send disable copy

* zmq recv data for debug

* convert logprobs tensor to cpu
2025-11-11 15:43:11 +08:00
chen 6871aad03d [BugFix] fix token_processor zmq (#4827)
* fix_token_processor_zmq

* check pooling_model's token_ids is None

* revert
2025-11-07 19:43:25 +08:00
Juncai 08ca0f6aea [Feature] [PD] add simple router and refine splitwise deployment (#4709)
* add simple router and refine splitwise deployment

* fix
2025-11-06 14:56:02 +08:00
chenjian cc8f5312f5 [Feature] Add timestamp for profiler (#4726)
* [Feature] Add timestamp for profiler

* fix bug for offine inference

* fix for ci

* fix

* fix ci
2025-11-05 12:04:59 +08:00
ApplEOFDiscord 14f8cddaf1 [Feature] add mm token usage (#4570)
* add mm token usage

* fix unit test

* fix unit test

* fix unit test

* fix model path

* fix unit test

* fix unit test

* fix unit test

* remove uncomment

* change var name

* fix code style

* fix code style

* fix code style

* fix code style

* fix unit test
2025-10-29 14:37:12 +08:00
freeliuzc c63361fd1d [Speculative Decoding][MTP]Support mtp in epdptp mode (#4614)
* support mtp many features

* support mtp reshard in rl mode

* fix function

* support mtp ep

* support mtp in hybird-dp-tp mode

* default open scheduler_v1 in mtp
2025-10-28 16:02:47 +08:00
SunLei 809c1ac7ec feat: add post-processing step for pool_output (#4462)
* feat: add post-processing step for pool_output

* bugfix

* fix: test_serving_embedding

* fix test_request_to_batch_dicts

* fix: code style
2025-10-21 20:24:26 +08:00
ltd0924 fb76cdfb4f [Fearture] Support mm model close prefix cache (#4459)
CE Compile Job / ce_job_pre_check (push) Has been cancelled
CE Compile Job / print_ce_job_pre_check_outputs (push) Has been cancelled
CE Compile Job / FD-Clone-Linux (push) Has been cancelled
CE Compile Job / Show Code Archive Output (push) Has been cancelled
CE Compile Job / BUILD_SM8090 (push) Has been cancelled
CE Compile Job / BUILD_SM8689 (push) Has been cancelled
CE Compile Job / CE_UPLOAD (push) Has been cancelled
Deploy GitHub Pages / deploy (push) Has been cancelled
* [Feature] support prefix cache in DP

* fix

* Update common_engine.py

* Update common_engine.py

* Update common_engine.py

* Update common_engine.py

* [BugFix] fix workers more than 1

* fix

* Update api_server.py

* fix

* Update api_server.py

* fix

* [Fearture] Support mm model close prefix cache

* Update api_server.py

* Update engine_client.py

* Update engine_client.py

* add test

* Update test_chat.py

* fix

* fix

* Update test_chat.py

* Update test_chat.py

---------

Co-authored-by: ltd0924 <luotingdan@baidu.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-10-21 15:37:59 +08:00
SunLei ee915220bd [Speculative Decoding] Add draft_logprobs Support for Speculative Decode MTP (#4467)
* feat: add draft_logprobs for Speculative Decode MTP

* feat: add draft_logprobs for Speculative Decode MTP

* feat: add draft_logprobs for Speculative Decode MTP

* fix: postprocess for speculative decode

* test: test_speculative_decoding_use_logprobs

* fix: test_completion_echo

* fix test_max_streaming_tokens

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-10-21 14:57:50 +08:00