Commit Graph

718 Commits

Author SHA1 Message Date
freeliuzc cf7934a4b2 [Speculative Decoding] Unify Spec and non-spec branch (#6685)
* optimize spec-inference architecture

* delete debug log

* optimize spec_method usage  && fix unit_test

* add claude unit-test skill

* fix some ugly bug

* enhance robustness and bounds check

* unify method & spec_method to method to avoid bug

* activate CI

* fix unit test

* Unify logprobs computation for naive and speculative decoding, fix CUDA kernel

* fix logprob bug && optimize verify kernel

* fix exist_decode() judge
2026-03-10 23:58:44 -07:00
gongweibao be36133db6 Remove Python-only mode documentation from installation guides (#6784)
Remove BUILD_WHEEL=2 related sections from nvidia_gpu and
kunlunxin_xpu installation docs (both en and zh).

Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 13:08:18 +08:00
mouxin 22d308a274 [Docs] Specify the default strategy (#6728)
* [Docs] Update the document

---------

Co-authored-by: mouxin <mouxin@baidu.com>
2026-03-10 13:16:31 +08:00
周周周 3cc09418f1 support dsv3 use flashmla (#6593) 2026-03-03 11:09:43 +08:00
yzwu 6674131b0b [Iluvatar] Support CudaGraph and optimize flash_attn_unpadded and fused_neox_rope_embedding (#6553) 2026-03-02 14:07:17 +08:00
YuBaoku bb51829bd5 [CI] Fix tests and docs to resolve failure (#6572) 2026-03-01 12:33:01 +08:00
kevin fa21fd95c4 [Docs] Update code overview documentation (#6568)
* [Docs] Update code overview documentation

- Add comprehensive FastDeploy code structure overview
- Include detailed module descriptions and development guides
- Add quick development guide for common tasks
- Update both English and Chinese versions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [Docs] Update code overview documentation format

- Convert file path links from [file](path) to `file` inline code format
- Add proper spacing for better readability in markdown tables
- Maintain consistent formatting across English and Chinese docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 16:37:01 +08:00
mouxin 049c807d86 [Docs] Update the document (#6539)
Co-authored-by: mouxin <mouxin@baidu.com>
2026-02-27 19:21:10 +08:00
周周周 1503443871 add dsv3 mixed deploy as EP16 TP8 (#6525) 2026-02-27 14:08:25 +08:00
gongweibao 2541462f7e [Feature][Docs] Add Python-only quick install mode (BUILD_WHEEL=2) to build.sh (#6503)
* add pythononly func

* add

* add more feature

* add safe check

* add rsync check

* add

* add

* refine docs

* add installation

* add installation
2026-02-26 16:17:41 +08:00
AIbin 47bfd45bb6 [Docs]add deepseek model doc (#6513)
* add deepseek model doc
2026-02-26 14:08:19 +08:00
MingkunZhang b56a4099c0 [Metax][Docs] update metax guidance documents (#6515) 2026-02-26 14:04:23 +08:00
GoldPancake 2178f2829b [Speculative Decoding] Support suffix decoding (#6403)
* support suffix decoding
2026-02-26 11:42:05 +08:00
jackyYang6 a29ee57e15 [Feature] Support ThinkingBudget Logits processor to control thinking content length (#6367)
* feat: add thinking budget logits processor

* add unittest

* fix pre-commit

* add unittest

* docs: clarify operator-level vs logits processor usage and conflict guidance

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-02-25 14:17:09 +08:00
bukejiyu 5bfc0938e2 [BugFix] PD reorder fix and add ut (#6375) 2026-02-09 04:42:48 -08:00
chenjian 35c24f3f71 Revert "[Optimize] Optimize ttft for ep (#6098)" (#6402)
This reverts commit 90db0bdd0d.
2026-02-09 19:01:23 +08:00
luukunn fd56d85346 add environment_variables (#6385) 2026-02-09 15:29:49 +08:00
chen 29a270bb38 [Docs] Add Doc for Online quantization (#6399)
* add doc for dynamic quant

* check
2026-02-08 22:09:18 -08:00
Jiang-Jia-Jun 18e79dd660 [Metrics] Support cpu-cache-block-num (#6390)
Co-authored-by: root <root@szzj-bcc-offline-1487319.szzj.baidu.com>
2026-02-09 10:27:56 +08:00
chenjian 90db0bdd0d [Optimize] Optimize ttft for ep (#6098)
* optimize ttft

* fix

* fix

* fix ci

* fix ci

* fix

* fix bug

* fix

* add comments

* fix ci

* fix
2026-02-04 15:03:29 +08:00
mouxin 6e96bd0bd2 [Feature] Fix counter release logic & update go-router download URL (#6280)
* [Doc] Update prerequisites in the documentation

* [Feature] Enhance Router with /v1/completions, docs, scripts, and version info

* [Feature] Enhance Router with /v1/completions, docs, scripts, and version info

* [Feature] Enhance Router with /v1/completions, docs, scripts, and version info

* [Feature] Fix counter release logic

* [Feature] Update go-router download URL

* [Feature] Update go-router download URL

* [Feature] Update go-router download URL

* [Feature] Update go-router download URL

* [Feature] Update token counter logic and docs

* [Feature] Update token counter logic and docs

---------

Co-authored-by: mouxin <mouxin@baidu.com>
2026-02-04 15:02:38 +08:00
Jiang-Jia-Jun 793dac0f9d Modify Nightly Build installation commands for fastdeploy
Update the installation instructions for the Nightly Build of fastdeploy to use the cu126 index for both SM86/89 and SM80/90 architectures.
2026-02-03 20:24:27 +08:00
Jiang-Jia-Jun 829139a5e5 Fix Nightly build installation URLs for fastdeploy-gpu
Updated installation instructions for the latest Nightly build of fastdeploy-gpu to use the correct URLs for CUDA 12.6.
2026-02-03 20:24:19 +08:00
mouxin 506f1545cd [Feature] Enhance Router with /v1/completions, docs, scripts, and version info (#5966)
* [Doc] Update prerequisites in the documentation

* [Feature] Enhance Router with /v1/completions, docs, scripts, and version info

* [Feature] Enhance Router with /v1/completions, docs, scripts, and version info

---------

Co-authored-by: mouxin <mouxin@baidu.com>
2026-01-30 10:28:48 +08:00
yuxuan 44b52701f6 [Feature] Support NVFP4 MoE on SM100 (#6003)
* fp4 dense

* [WIP] support nvfp4, dense part

* [wip] developing loading qwen model

* loading

* update

* dense fp4 OK, cudagraph error

* [WIP] moe forward part

* with flashinfer-backend

* qwen3_moe_fp4

* update

* support flashinfer-cutlass moe, qwen3-moe-fp4 OK

* support ernie4.5-fp4

* fix load error

* add some ut

* add docs

* fix CLA, test

* fix the apply() in ModelOptNvFp4FusedMoE

* fix CodeStyle

* del the PADDLE_COMPATIBLE_API

* fix broken url: nvidia_gpu.md

* fix docs

* fix token_ids

* fix CI in Hopper

* move flashinfer imports inside the function

* fix model_runner

Removed the logic for generating random padding IDs.

* Remove skip condition for CUDA version in nvfp4 test

* add test for nvfp4

* fix according to review

* Add Chinese translation link to NVFP4 documentation

* del flashinfer.py

* fix unittest

---------

Co-authored-by: zoooo0820 <zoooo0820@qq.com>
Co-authored-by: bukejiyu <395822456@qq.com>
2026-01-29 14:16:07 +08:00
qwes5s5 38378415c7 add token ratio metrics (#6236) 2026-01-27 17:00:49 +08:00
CSWYF3634076 08c411518f [Loader] support dummy load weight (#6169)
* [Loader] support dummy load weight

* [Loader] support dummy load weight v2

* [Loader] support dummy load weight unittest

* [Loader] support dummy load weight unittest v2

* [Loader] support dummy load weight v3 docs and fp8
2026-01-26 13:58:53 +08:00
wangyifei 53dc56f11b [Docs] add docs of /v1/pause, /v1/resume, /v1/is_paused (#6192)
* support dynamic run_control_request through zmq from apiserver to common_engine

* support pause/resume/is_paused/update_weights in apiserver->common_engine by common run_control_method

* change /is_paused from HTTP POST method to GET method

* add pause, resume, is_paused implementation

* support engine <==> worker communication(request&response)

* support sync weights through RDMA from checkpoint_transfer

* support specified version, rsync_config in update_weights rpc call

* add pause, update_weights, resume interface for async RL

* bug fix: update_weights support using default arguments

* fix typo

* typo fix

* typo fix

* typo fix

* add unittest for control request/response, localscheduler.get_inflight_requests, resource_manager_v1.preempted_all

* add "rsync" to LoadConfig.load_strategy Literal type hints

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* typo fix

* typo fix

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* check version/rsync params

* add error log when version.txt not exists

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* raise specified ValueError when parameters check failed

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* tp barrier after run_control_method

* encode 'engine_worker_queue_port' to unique name of worker2engine fmq queue

* typo fix

* typo fix

* update docs of /v1/pause, /v1/resume, /v1/is_paused

* add zh docs of pause, resume, is_paused

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-23 17:57:51 +08:00
Copilot 96b2cf2c20 [Docs] Update FastDeploy Docker image to 2.4.0 for Nvidia GPU installation (#6168)
* Initial plan

* Update Nvidia GPU Docker image version from 2.3.3 to 2.4.0

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-01-22 22:01:13 +08:00
Yonghua Li 8d27a523e7 [Feature] [KVCache] support attention_store kv cache backend (#5823)
* [feat] support attention_store kv cache backend

* [fix] fix codestyle

* [chore] optimize log

* [fix] fix write storage task

* [fix] fix read storage

* [fix] fix code conflict after merge develop

* [fix] fix cache bytes and read task token ids

* [chore] add model for cache transfer manager

* [chore] add some log

* [chore] remove launched_cache_manager_signal

* [fix] fix write_back_storage_task match_block_num condition

* [fix] fix swap_cost_time

* [ci] fix ci

* Update fastdeploy/engine/sched/resource_manager_v1.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update fastdeploy/cache_manager/cache_transfer_manager.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update fastdeploy/cache_manager/transfer_factory/mooncake_store/attention_store.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-22 21:01:23 +08:00
yangjianfengo1 bb635e0819 fix text (#6145) 2026-01-21 19:40:30 +08:00
yinwei 9536cd650b [XPU] update release doc (#6143) 2026-01-21 18:31:25 +08:00
Cheng Yanfei 9ee0156cc3 add HPU tensorwise_fp8 readme (#6091) 2026-01-21 11:48:22 +08:00
yinwei 5385d51808 [XPU]XPU FD Release/2.4 Note 2026-01-20 20:38:34 +08:00
luukunn 56e22a7ddc [Docs]fix doc (#6119)
* fix doc

* fix doc
2026-01-20 19:46:05 +08:00
jackyYang6 00a6a73431 docs: fix pre-commit error of markdown (#6100) 2026-01-20 19:32:05 +08:00
jackyYang6 988e0bc338 [Feature] Add PaddleFormers fallback backend (#5999)
* feat(paddleformers): add dense text model fallback backend

* docs(paddleformers): add user guide and fix code review issues

* add fallback unit test

* precommit format

* fix pre-commit

* fix: address code review feedback

* docs: add PaddleFormers backend documentation (EN) and simplify installation

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-19 21:50:50 +08:00
luukunn 93b7675a64 [Feature]Report FD statistical information (#5646)
* add usage commit

* update envs and xpu

* add requirements

* fix quantization value

* add unit test

* add unit test

* fix unit test

* add unit test

* add unit test

* add unit test

* add unit test

* add unit test

* add unit test

* fix FD_USAGE_STATS_SERVER

* fix

* fix

* add doc

* add doc

* add doc

* add doc

* add doc

* fix file name
2026-01-14 17:54:01 +08:00
MingkunZhang 32fb04703b [Metax][Doc] update metax gpu 'get_started' doc (#6035)
Co-authored-by: root <root@lt-wks-10-0-180-15.pub.metax-tech.com>
2026-01-14 16:11:43 +08:00
Copilot fe7588d8f0 [Docs] Update FastDeploy version to 2.3.3 in NVIDIA GPU installation documentation (#6010)
* Initial plan

* Update FastDeploy version from 2.3.2 to 2.3.3 in NVIDIA GPU installation docs

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-01-12 23:45:22 +08:00
Copilot 5c53193c4e [Docs] Update GPU version from 2.3.0 to 2.3.2 in installation documentation (#5894)
* Initial plan

* Update GPU version from 2.3.0 to 2.3.2 in NVIDIA GPU installation documentation

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-01-06 11:06:32 +08:00
jc 8d384f9fd8 [PD Disaggregation] Update usage of pd disaggregation and data parallel (#5742)
* Update usage of pd disaggregation

* up

* up

* up

* up

* up

* up

* up

* up

* up

* up dp docs

* up

* up

* up

* fix unittest
2026-01-05 17:51:29 +08:00
Copilot 7d5282e158 [APIServer][Feature] Add configurable worker health check timeout via FD_WORKER_ALIVE_TIMEOUT (#5865)
* Initial plan

* Add configurable FD_WORKER_ALIVE_TIMEOUT environment variable

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* Add test for FD_WORKER_ALIVE_TIMEOUT environment variable

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* Update docs/zh/usage/environment_variables.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update docs/usage/environment_variables.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Improve test coverage to validate integration with check_health calls

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* Remove test_worker_alive_timeout.py per reviewer feedback

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-05 09:47:12 +08:00
yzwu 7b6cc11952 [Iluvatar] Fix FD launch error when specifying CUDA_VISIBLE_DEVICES (#5735) 2025-12-26 14:01:27 +08:00
Copilot 5cec66adb8 [Docs] Update environment variable documentation to sync with the latest code (#5713)
* Initial plan

* Update environment variable documentation to match the latest code

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-12-23 19:49:20 +08:00
Copilot e9f5397bc9 [Docs] Update parameters documentation with latest code defaults and new parameters (#5709)
* Initial plan

* Update parameters documentation with correct default values and new parameters

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2025-12-23 17:31:44 +08:00
Divano c1aa66df02 Revert "[Optim] Remove limitation of number of kvcache blocks (#5612)" (#5702)
This reverts commit 9da89a374b.
2025-12-23 15:41:33 +08:00
Jiang-Jia-Jun 9da89a374b [Optim] Remove limitation of number of kvcache blocks (#5612)
* [Optim] Remove limitation of number of kvcache blocks

* Update fastdeploy/envs.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update fastdeploy/worker/iluvatar_worker.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Add docs

* Update fastdeploy/worker/worker_process.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix ci case

---------

Co-authored-by: Jiang-Jia-Jun <jiangjiajun@baidu.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-12-23 11:18:29 +08:00
yzwu ac013803f3 [Iluvatar] Support V1_KVCACHE_SCHEDULER and paddleocr-vl rope mode (#5555) 2025-12-18 02:14:25 -08:00
xiaolei373 a30b4da260 [Feature] Tracing: Fine-Grained Tracing for Request Latency Part1 (#5458) 2025-12-16 16:36:09 +08:00