Commit Graph

261 Commits

Author SHA1 Message Date
Jiang-Jia-Jun 33682c6749 [Docs] Update docs for release/2.5 (#7267)
* Update docs for release/2.5

* Update English docs for release/2.5

- Update README_EN.md: add v2.5 news entry, reformat v2.4 entry with release link
- Update docs/get_started/installation/nvidia_gpu.md:
  - Docker image: 2.4.0 -> 2.5.0, notice now shows SM80/86/89/90 support
  - paddlepaddle-gpu: 3.3.0 -> 3.3.1, add CUDA 12.9 alternatives
  - fastdeploy-gpu: 2.4.0 -> 2.5.0, unified arch install with CUDA 12.9 option
- Update docs/zh/get_started/installation/nvidia_gpu.md:
  - Fix remaining paddlepaddle-gpu==3.3.0 refs in sections 4&5 -> 3.3.1

Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/fa0be381-324e-4b0d-b7a6-e2c1fa12174f

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* Clarify --extra-index-url usage in installation docs

Add note explaining that --extra-index-url is only for downloading
fastdeploy-gpu dependencies; fastdeploy-gpu itself must be installed
from the Paddle source specified by -i. Applied to both Chinese and
English nvidia_gpu.md installation guides.

Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/9fa8b3c9-7555-4eae-b9b9-026cddd7e74c

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* Update nvidia_gpu.md

---------

Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
2026-04-09 16:07:18 +08:00
yinwei 334b02c12b [XPU][Docs] Update Release2.5 Note (#7187)
* update docs

* update

* update
2026-04-07 18:45:52 +08:00
lizexu123 5f612a348d [BugFix] fix flashinfer-cutedsl moe nvfp4 (#7120)
* fix nvfp4

* fix

* add document

* fix nvfp4

* support eb5

* support bka

* support eb5

* support xpu

* fix

* fix

* add import cutedsl

* fix

* fix

* fix test

* fix H卡

* update document

* fix

* update document

* update document

* fix
2026-04-03 15:43:19 +08:00
Jingfeng Wu 3b564116d5 [Docs] Add docs for disaggregated deployment (#6700)
* add docs for disaggregated deployment

* pre-commit run for style check

* update docs
2026-04-01 19:27:09 +08:00
mouxin 6cae9b1f50 [Feature] Config eviction_duration (#7125)
* [Feature] Config eviction_duration

* [Feature] Config eviction_duration

* [Feature] Config eviction_duration

* [Feature] Config eviction_duration

---------

Co-authored-by: mouxin <mouxin@baidu.com>
2026-04-01 16:46:21 +08:00
zhouchong 91c832f607 [Feature] Add logging parameters and error output to terminal (#7098) 2026-04-01 13:18:42 +08:00
qwes5s5 daa95244f7 abort requests (#6992) 2026-03-31 11:02:26 +08:00
chenjian 6727df8286 [Optimization] Optimize ttft for prefill pd (#6680)
* optimize ttft

* fix

* fix

* fix ci

* fix ci

* fix

* fix bug

* fix

* add comments

* fix ci

* fix

* fix ci

* fix format

* update according to review

* add comment

* fix

* fix format
2026-03-30 20:36:23 +08:00
jackyYang6 05f2d95729 [RL] Adapt async rollout checkpoint update flow (#7042)
* update checkpoint-transfer flow and control update_weights params

* test: add update_weights route validation
2026-03-30 19:19:34 +08:00
yzwu 8789329457 [Iluvatar] Support wi4a16 group_gemm (#7078) 2026-03-30 19:03:51 +08:00
Yonghua Li a7f52c300d [Feature] support v1 update/clear api for RL (#6761)
* [Feature] support v1 update/clear api for RL

* [fix] fix execute_model and add sleep/wakeup api

* [fix] fix mtp and key_prefix

* [chore] move _update_key_prefix to resume method

* [fix] make the interface safe to call multiple times

* [fix] fix some tiny bugs

* [chore] make small changes against pr review

* [docs] add docs for weight update

* [test] add some tests and update docs

* [style] fix code style check

* [test] fix ci

* [fix] fix stale control responses when control method timed out

* [chore] remove unused code

* [chore] fix code style

* [chore] optimize tags and key_prefix

* [test] fix ci

* [chore] fix code style

* [test] fix ci

* [fix] fix ep control

* [fix] fix ep control for engine cache queue
2026-03-25 19:18:46 +08:00
jackyYang6 634d23a38a [Bugfix] Align thinking_budget behavior with ERNIE reasoning flow (#6934)
* [Bugfix] Align thinking_budget behavior with ERNIE reasoning flow

* [Docs] Fix thinking_budget markdown formatting

* [Test] Align ernie thinking budget test with process_request_dict
2026-03-23 14:15:55 +08:00
mouxin 96b0ecea6b [Feature] Update Counter Release (#6943) 2026-03-20 10:51:37 +08:00
sunxin 33e01f22a8 [Feature][Sampling] Extend top-k_top-p sampling to all backends and unify greedy decoding with top_k=1 (#6894)
* update sampling

* fix

* fix

* fix mtp

* fix test
2026-03-19 01:43:10 -07:00
mouxin b61731bb96 [Feature][Docs] Adjust prefill release & expose load metrics (#6884) 2026-03-17 15:23:13 +08:00
jc 04fde3b227 [PD Disaggregation] Prefill and decode support cache storage (#6768)
* Prefill and decode support cache storage

* up

* up

* update docs and refine mooncake store

* up
2026-03-16 14:44:49 +08:00
mouxin 49fe68a518 [Docs] Update Golang Router FAQ (#6829) 2026-03-13 15:48:36 +08:00
yzwu 901b38c936 [Iluvatar] Optimize decode group_gemm and Support cuda graph for ernie (#6803) 2026-03-12 19:21:17 +08:00
freeliuzc cf7934a4b2 [Speculative Decoding] Unify Spec and non-spec branch (#6685)
* optimize spec-inference architecture

* delete debug log

* optimize spec_method usage  && fix unit_test

* add claude unit-test skill

* fix some ugly bug

* enhance robustness and bounds check

* unify method & spec_method to method to avoid bug

* activate CI

* fix unit test

* Unify logprobs computation for naive and speculative decoding, fix CUDA kernel

* fix logprob bug && optimize verify kernel

* fix exist_decode() judge
2026-03-10 23:58:44 -07:00
gongweibao be36133db6 Remove Python-only mode documentation from installation guides (#6784)
Remove BUILD_WHEEL=2 related sections from nvidia_gpu and
kunlunxin_xpu installation docs (both en and zh).

Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 13:08:18 +08:00
mouxin 22d308a274 [Docs] Specify the default strategy (#6728)
* [Docs] Update the document

---------

Co-authored-by: mouxin <mouxin@baidu.com>
2026-03-10 13:16:31 +08:00
周周周 3cc09418f1 support dsv3 use flashmla (#6593) 2026-03-03 11:09:43 +08:00
yzwu 6674131b0b [Iluvatar] Support CudaGraph and optimize flash_attn_unpadded and fused_neox_rope_embedding (#6553) 2026-03-02 14:07:17 +08:00
YuBaoku bb51829bd5 [CI] Fix tests and docs to resolve failure (#6572) 2026-03-01 12:33:01 +08:00
kevin fa21fd95c4 [Docs] Update code overview documentation (#6568)
* [Docs] Update code overview documentation

- Add comprehensive FastDeploy code structure overview
- Include detailed module descriptions and development guides
- Add quick development guide for common tasks
- Update both English and Chinese versions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [Docs] Update code overview documentation format

- Convert file path links from [file](path) to `file` inline code format
- Add proper spacing for better readability in markdown tables
- Maintain consistent formatting across English and Chinese docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 16:37:01 +08:00
mouxin 049c807d86 [Docs] Update the document (#6539)
Co-authored-by: mouxin <mouxin@baidu.com>
2026-02-27 19:21:10 +08:00
周周周 1503443871 add dsv3 mixed deploy as EP16 TP8 (#6525) 2026-02-27 14:08:25 +08:00
gongweibao 2541462f7e [Feature][Docs] Add Python-only quick install mode (BUILD_WHEEL=2) to build.sh (#6503)
* add pythononly func

* add

* add more feature

* add safe check

* add rsync check

* add

* add

* refine docs

* add installation

* add installation
2026-02-26 16:17:41 +08:00
AIbin 47bfd45bb6 [Docs]add deepseek model doc (#6513)
* add deepseek model doc
2026-02-26 14:08:19 +08:00
MingkunZhang b56a4099c0 [Metax][Docs] update metax guidance documents (#6515) 2026-02-26 14:04:23 +08:00
GoldPancake 2178f2829b [Speculative Decoding] Support suffix decoding (#6403)
* support suffix decoding
2026-02-26 11:42:05 +08:00
jackyYang6 a29ee57e15 [Feature] Support ThinkingBudget Logits processor to control thinking content length (#6367)
* feat: add thinking budget logits processor

* add unittest

* fix pre-commit

* add unittest

* docs: clarify operator-level vs logits processor usage and conflict guidance

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-02-25 14:17:09 +08:00
bukejiyu 5bfc0938e2 [BugFix] PD reorder fix and add ut (#6375) 2026-02-09 04:42:48 -08:00
chenjian 35c24f3f71 Revert "[Optimize] Optimize ttft for ep (#6098)" (#6402)
This reverts commit 90db0bdd0d.
2026-02-09 19:01:23 +08:00
luukunn fd56d85346 add environment_variables (#6385) 2026-02-09 15:29:49 +08:00
chen 29a270bb38 [Docs] Add Doc for Online quantification (#6399)
* add doc for dynamic quant

* check
2026-02-08 22:09:18 -08:00
Jiang-Jia-Jun 18e79dd660 [Metrics] Support cpu-cache-block-num (#6390)
Co-authored-by: root <root@szzj-bcc-offline-1487319.szzj.baidu.com>
2026-02-09 10:27:56 +08:00
chenjian 90db0bdd0d [Optimize] Optimize ttft for ep (#6098)
* optimize ttft

* fix

* fix

* fix ci

* fix ci

* fix

* fix bug

* fix

* add comments

* fix ci

* fix
2026-02-04 15:03:29 +08:00
mouxin 6e96bd0bd2 [Feature] Fix counter release logic & update go-router download URL (#6280)
* [Doc] Update prerequisites in the documentation

* [Feature] Enhance Router with /v1/completions, docs, scripts, and version info

* [Feature] Enhance Router with /v1/completions, docs, scripts, and version info

* [Feature] Enhance Router with /v1/completions, docs, scripts, and version info

* [Feature] Fix counter release logic

* [Feature] Update go-router download URL

* [Feature] Update go-router download URL

* [Feature] Update go-router download URL

* [Feature] Update go-router download URL

* [Feature] Update token counter logic and docs

* [Feature] Update token counter logic and docs

---------

Co-authored-by: mouxin <mouxin@baidu.com>
2026-02-04 15:02:38 +08:00
Jiang-Jia-Jun 793dac0f9d Modify Nightly Build installation commands for fastdeploy
Update the installation instructions for the Nightly Build of fastdeploy to use the cu126 index for both SM86/89 and SM80/90 architectures.
2026-02-03 20:24:27 +08:00
mouxin 506f1545cd [Feature] Enhance Router with /v1/completions, docs, scripts, and version info (#5966)
* [Doc] Update prerequisites in the documentation

* [Feature] Enhance Router with /v1/completions, docs, scripts, and version info

* [Feature] Enhance Router with /v1/completions, docs, scripts, and version info

---------

Co-authored-by: mouxin <mouxin@baidu.com>
2026-01-30 10:28:48 +08:00
yuxuan 44b52701f6 [Feature] Support NVFP4 MoE on SM100 (#6003)
* fp4 dense

* [WIP] support nvfp4, dense part

* [wip] developing loading qwen model

* loading

* update

* dense fp4 OK, cudagraph error

* [WIP] moe forward part

* with flashinfer-backend

* qwen3_moe_fp4

* update

* support flashinfer-cutlass moe, qwen3-moe-fp4 OK

* support ernie4.5-fp4

* fix load error

* add some ut

* add docs

* fix CLA, test

* fix the apply() in ModelOptNvFp4FusedMoE

* fix CodeStyle

* del the PADDLE_COMPATIBLE_API

* fix broken url: nvidia_gpu.md

* fix docs

* fix token_ids

* fix CI in Hopper

* move flashinfer imports inside the function

* fix model_runner

Removed the logic for generating random padding IDs.

* Remove skip condition for CUDA version in nvfp4 test

* add test for nvfp4

* fix according to review

* Add Chinese translation link to NVFP4 documentation

* del flashinfer.py

* fix unittest

---------

Co-authored-by: zoooo0820 <zoooo0820@qq.com>
Co-authored-by: bukejiyu <395822456@qq.com>
2026-01-29 14:16:07 +08:00
qwes5s5 38378415c7 add token ratio metrics (#6236) 2026-01-27 17:00:49 +08:00
wangyifei 53dc56f11b [Docs] add docs of /v1/pause、/v1/resume、/v1/is_paused (#6192)
* support dynamic run_control_request through zmq from apiserver to common_engine

* support pause/resume/is_paused/update_weights in apiserver->common_engine by common run_control_method

* change /is_puased from HTTP POST method to GET method

* add pause、resume、is_paused implementation

* support engine <==> worker communication(request&response)

* support sync weights through RDMA from checkpoint_transfer

* support specified version, rsync_config in update_weights rpc call

* add pause, update_weights, resume interface for async RL

* bug fix: update_weights support using default arguments

* fix typo

* typo fix

* typo fix

* typo fix

* add unitest for control request/response, localscheduler.get_inflight_requests, resource_manager_v1.preempted_all

* add "rsync" to LoadConfig.load_strategy Literal type hints

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* typo fix

* typo fix

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* check version/rsync params

* add error log when version.txt not exists

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* raise specified ValueError when paramters check failed

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* tp barrier after run_control_method

* encode 'engine_worker_queue_port' to unique name of worker2engine fmq queue

* typo fix

* typo fix

* update docs of /v1/pause, /v1/resume, /v1/is_paused

* add zh docs of pause、resume、is_paused

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-23 17:57:51 +08:00
Copilot 96b2cf2c20 [Docs] Update FastDeploy Docker image to 2.4.0 for Nvidia GPU installation (#6168)
* Initial plan

* Update Nvidia GPU Docker image version from 2.3.3 to 2.4.0

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-01-22 22:01:13 +08:00
Yonghua Li 8d27a523e7 [Feature] [KVCache] support attention_store kv cache backend (#5823)
* [feat] support attention_store kv cache backend

* [fix] fix codestyle

* [chore] optimize log

* [fix] fix write storage task

* [fix] fix read storage

* [fix] fix code conflict after merge develop

* [fix] fix cache bytes and read task token ids

* [chore] add model for cache transfer manager

* [chore] add some log

* [chore] remove launched_cache_manager_signal

* [fix] fix write_back_storage_task match_block_num condition

* [fix] fix swap_cost_time

* [ci] fix ci

* Update fastdeploy/engine/sched/resource_manager_v1.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update fastdeploy/cache_manager/cache_transfer_manager.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update fastdeploy/cache_manager/transfer_factory/mooncake_store/attention_store.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-22 21:01:23 +08:00
yangjianfengo1 bb635e0819 fix text (#6145) 2026-01-21 19:40:30 +08:00
yinwei 9536cd650b [XPU] update release doc (#6143) 2026-01-21 18:31:25 +08:00
Cheng Yanfei 9ee0156cc3 add HPU tensorwise_fp8 readme (#6091) 2026-01-21 11:48:22 +08:00
yinwei 5385d51808 [XPU]XPU FD Release/2.4 Note 2026-01-20 20:38:34 +08:00