* Update docs for release/2.5 * Update English docs for release/2.5 - Update README_EN.md: add v2.5 news entry, reformat v2.4 entry with release link - Update docs/get_started/installation/nvidia_gpu.md: - Docker image: 2.4.0 -> 2.5.0, notice now shows SM80/86/89/90 support - paddlepaddle-gpu: 3.3.0 -> 3.3.1, add CUDA 12.9 alternatives - fastdeploy-gpu: 2.4.0 -> 2.5.0, unified arch install with CUDA 12.9 option - Update docs/zh/get_started/installation/nvidia_gpu.md: - Fix remaining paddlepaddle-gpu==3.3.0 refs in sections 4&5 -> 3.3.1 Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/fa0be381-324e-4b0d-b7a6-e2c1fa12174f Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> * Clarify --extra-index-url usage in installation docs Add note explaining that --extra-index-url is only for downloading fastdeploy-gpu dependencies; fastdeploy-gpu itself must be installed from the Paddle source specified by -i. Applied to both Chinese and English nvidia_gpu.md installation guides. Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/9fa8b3c9-7555-4eae-b9b9-026cddd7e74c Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> * Update nvidia_gpu.md --------- Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
6.4 KiB
English | 简体中文
Installation
|
Quick Start
|
Supported Models
FastDeploy : Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
News
[2026-03] FastDeploy v2.5 is released! It adds deployment support for Qwen3-VL and Qwen3-VL MoE models, introduces the W4AFP8 quantization method, enhances reinforcement learning training capabilities, and includes 170+ bug fixes and performance optimizations. For all the upgrade details, refer to the v2.5 Release Note.
[2026-01] FastDeploy v2.4: Featuring PD-separated deployment for DeepSeek V3 and Qwen3-MoE, enhanced MTP speculative decoding, and comprehensive performance boosts for MoE inference and multi-modal Prefix Caching across various hardware backends. For all the upgrade details, refer to the v2.4 Release Note.
[2025-11] FastDeploy v2.3: It adds deployment support for two major models, ERNIE-4.5-VL-28B-A3B-Thinking and PaddleOCR-VL-0.9B, across multiple hardware platforms. It further optimizes comprehensive inference performance and brings more deployment features and usability enhancements. For all the upgrade details, refer to the v2.3 Release Note.
[2025-09] FastDeploy v2.2: It now offers compatibility with models in the HuggingFace ecosystem, has further optimized performance, and newly adds support for baidu/ERNIE-21B-A3B-Thinking!
About
FastDeploy is an inference and deployment toolkit for large language models and visual language models based on PaddlePaddle. It delivers production-ready, out-of-the-box deployment solutions with core acceleration technologies:
- 🚀 Load-Balanced PD Disaggregation: Industrial-grade solution featuring context caching and dynamic instance role switching. Optimizes resource utilization while balancing SLO compliance and throughput.
- 🔄 Unified KV Cache Transmission: Lightweight high-performance transport library with intelligent NVLink/RDMA selection.
- 🤝 OpenAI API Server and vLLM Compatible: One-command deployment with vLLM interface compatibility.
- 🧮 Comprehensive Quantization Format Support: W8A16, W8A8, W4A16, W4A8, W2A16, FP8, and more.
- ⏩ Advanced Acceleration Techniques: Speculative decoding, Multi-Token Prediction (MTP) and Chunked Prefill.
- 🖥️ Multi-Hardware Support: NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Iluvatar GPU, Enflame GCU, MetaX GPU, Intel Gaudi etc.
Requirements
- OS: Linux
- Python: 3.10 ~ 3.12
Installation
FastDeploy supports inference deployment on NVIDIA GPUs, Kunlunxin XPUs, Iluvatar GPUs, Enflame GCUs, Hygon DCUs and other hardware. For detailed installation instructions:
Get Started
Learn how to use FastDeploy through our documentation:
- 10-Minutes Quick Deployment
- ERNIE-4.5 Large Language Model Deployment
- ERNIE-4.5-VL Multimodal Model Deployment
- Offline Inference Development
- Online Service Deployment
- Best Practices
Supported Models
Learn how to download models, enable using the torch format, and more:
Advanced Usage
- Quantization
- PD Disaggregation Deployment
- Speculative Decoding
- Prefix Caching
- Chunked Prefill
- Load-Balancing Scheduling Router
- Global Cache Pooling
Acknowledgement
FastDeploy is licensed under the Apache-2.0 open-source license. During development, portions of vLLM code were referenced and incorporated to maintain interface compatibility, for which we express our gratitude.