mirror of https://github.com/PaddlePaddle/FastDeploy.git synced 2026-04-23 00:17:25 +08:00

T

gongweibao edd31e8849 [Feature] Add Deterministic Inference Support (#6476 )

* add

* [tests] Add Paddle attention determinism tests and refactor resource manager

Add comprehensive determinism tests for Paddle attention layer and refactor
resource manager for deterministic mode support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* add

* add

* add

* add

* add more

* add more

* fixsome

* fixsome

* fix bugs

* fix bugs

* only in gpu

* add docs

* fix comments

* fix some

* fix some

* fix comments

* add more

* fix potential problem

* remove not need

* remove not need

* remove no need

* fix bug

* fix bugs

* fix comments

* fix comments

* Update tests/ce/deterministic/test_determinism_verification.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tests/inter_communicator/test_ipc_signal.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tests/layers/test_paddle_attention_determinism.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tests/engine/test_sampling_params_determinism.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tests/layers/test_paddle_attention_determinism.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tests/layers/test_paddle_attention_determinism_standalone.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix comments

* fix import error

* fix a bug

* fix bugs

* fix bugs

* fix coverage

* refine codes

* refine code

* fix comments

* fix comments

* fix comments

* rm not need

* fix allreduce large tensor bug

* mv log files

* mv log files

* add files

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

2026-02-26 19:31:51 -08:00

.github

[CI] Optimize unittest and fix title format (#6464 )

2026-02-11 20:48:56 +08:00

benchmarks

[benchmark] update tool call (#6519 )

2026-02-26 17:06:54 +08:00

custom_ops

[Feature] Add Deterministic Inference Support (#6476 )

2026-02-26 19:31:51 -08:00

dockerfiles

fix text (#6145 )

2026-01-21 19:40:30 +08:00

docs

[Feature][Docs] Add Python-only quick install mode (BUILD_WHEEL=2) to build.sh (#6503 )

2026-02-26 16:17:41 +08:00

examples

[Feature] Fix counter release logic & update go-router download URL (#6280 )

2026-02-04 15:02:38 +08:00

fastdeploy

[Feature] Add Deterministic Inference Support (#6476 )

2026-02-26 19:31:51 -08:00

scripts

[CI] Add retry logic for pip install in iluvatar CI script (#6500 )

2026-02-25 16:01:41 +08:00

tests

[Feature] Add Deterministic Inference Support (#6476 )

2026-02-26 19:31:51 -08:00

tools

[CI] Add clang-format 13.0.0 recommendation to pre_commit.sh

2026-01-08 21:47:19 +08:00

.clang-format

c++ code format (#4527 )

2025-10-22 17:59:50 +08:00

.flake8

update flake8 version to support pre-commit in python3.12 (#3000 )

2025-07-24 01:43:31 -07:00

.gitignore

[Feature] Support KV Cache Storage (#5571 )

2025-12-25 16:30:35 +08:00

.gitmodules

add ignore=all for deepgemm (#4118 )

2025-09-15 21:52:00 +08:00

.pre-commit-config.yaml

[Models] Add Qwen3-VL Moe Model Support (#5913 )

2026-01-08 11:36:42 +08:00

build.sh

[Feature][Docs] Add Python-only quick install mode (BUILD_WHEEL=2) to build.sh (#6503 )

2026-02-26 16:17:41 +08:00

LICENSE

[LLM] First commit the llm deployment code

2025-06-09 19:20:15 +08:00

mkdocs.yml

[Feature] Support ThinkingBudget Logits processor to control thinking content length (#6367 )

2026-02-25 14:17:09 +08:00

pyproject.toml

Fix target_version (#3159 )

2025-08-28 14:17:54 +08:00

README_CN.md

Update FastDeploy release notes in README_CN.md

2026-02-10 20:32:03 +08:00

README_EN.md

Fix formatting in README_EN.md for v2.3 release

2026-02-10 20:32:15 +08:00

README.md

[Doc] Update docs for v2.3.0rc0 (#4828 )

2025-11-05 19:45:53 +08:00

requirements_dcu.txt

[Feature] Support stopping the inference for the corresponding request in the online service after a disconnection request. (#5320 )

2026-01-16 11:46:13 +08:00

requirements_guided_decoding.txt

[Feature] Guided Decoding add LLguidance backend (#5124 )

2025-12-03 20:23:57 +08:00

requirements_iluvatar.txt

[Feature] Support stopping the inference for the corresponding request in the online service after a disconnection request. (#5320 )

2026-01-16 11:46:13 +08:00

requirements_metaxgpu.txt

[Metax] adapt to the latest develop (#6282 )

2026-01-29 23:21:20 -08:00

requirements.txt

[Speculative Decoding] Support suffix decoding (#6403 )

2026-02-26 11:42:05 +08:00

setup.py

[Others] Update FASTDEPLOY_VERSION to 2.5.0-dev

2026-02-25 20:12:09 +08:00

README_EN.md

English | 简体中文

Installation | Quick Start | Supported Models

FastDeploy : Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle

News

[2026-01] FastDeploy v2.4 is released! Featuring PD-separated deployment for DeepSeek V3 and Qwen3-MoE, enhanced MTP speculative decoding, and comprehensive performance boosts for MoE inference and multi-modal Prefix Caching across various hardware backends. See the full v2.4 ReleaseNote for more details.

[2025-11] FastDeploy v2.3: It adds deployment support for two major models, ERNIE-4.5-VL-28B-A3B-Thinking and PaddleOCR-VL-0.9B, across multiple hardware platforms. It further optimizes comprehensive inference performance and brings more deployment features and usability enhancements. For all the upgrade details, refer to the v2.3 Release Note.

[2025-09] FastDeploy v2.2: It now offers compatibility with models in the HuggingFace ecosystem, has further optimized performance, and newly adds support for baidu/ERNIE-21B-A3B-Thinking!

About

FastDeploy is an inference and deployment toolkit for large language models and visual language models based on PaddlePaddle. It delivers production-ready, out-of-the-box deployment solutions with core acceleration technologies:

🚀 Load-Balanced PD Disaggregation: Industrial-grade solution featuring context caching and dynamic instance role switching. Optimizes resource utilization while balancing SLO compliance and throughput.
🔄 Unified KV Cache Transmission: Lightweight high-performance transport library with intelligent NVLink/RDMA selection.
🤝 OpenAI API Server and vLLM Compatible: One-command deployment with vLLM interface compatibility.
🧮 Comprehensive Quantization Format Support: W8A16, W8A8, W4A16, W4A8, W2A16, FP8, and more.
⏩ Advanced Acceleration Techniques: Speculative decoding, Multi-Token Prediction (MTP) and Chunked Prefill.
🖥️ Multi-Hardware Support: NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Iluvatar GPU, Enflame GCU, MetaX GPU, Intel Gaudi etc.

Requirements

OS: Linux
Python: 3.10 ~ 3.12

Installation

FastDeploy supports inference deployment on NVIDIA GPUs, Kunlunxin XPUs, Iluvatar GPUs, Enflame GCUs, Hygon DCUs and other hardware. For detailed installation instructions:

Get Started

Learn how to use FastDeploy through our documentation:

Supported Models

Learn how to download models, enable using the torch format, and more:

Full Supported Models List

Advanced Usage

Acknowledgement

FastDeploy is licensed under the Apache-2.0 open-source license. During development, portions of vLLM code were referenced and incorporated to maintain interface compatibility, for which we express our gratitude.

Description

⚡️An Easy-to-use and Fast Deep Learning Model Deployment Toolkit for ☁️Cloud 📱Mobile and 📹Edge. Including Image, Video, Text and Audio 20+ main stream scenarios and 150+ SOTA models with end-to-end optimization, multi-platform and multi-framework support.

android graphcore intel jetson kunlun object-detection onnx onnxruntime openvino picodet rockchip serving stable-diffusion tensorrt uie yolov5 yolov8

Readme Apache-2.0 324 MiB

Languages

Python 61.5%

C++ 19.1%

Cuda 17.6%

Go 0.9%

Shell 0.7%

Other 0.1%