6.1 KiB
English | 简体中文
Installation
|
Quick Start
|
Supported Models
FastDeploy : Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
News
[2026-01] FastDeploy v2.4 has been newly released! It now supports PD-disaggregated deployment of DeepSeek V3 and Qwen3-MoE models. The capability of the ERNIE-4.5 reasoning model parser and MTP speculative decoding performance have been improved. The MoE inference performance and multimodal prefix caching performance across multiple hardware platforms have been comprehensively optimized. For details on all upgrades, please refer to the v2.4 Release Note.
[2025-11] FastDeploy v2.3 It adds deployment support for two major models, ERNIE-4.5-VL-28B-A3B-Thinking and PaddleOCR-VL-0.9B, across multiple hardware platforms. It further optimizes comprehensive inference performance and brings more deployment features and usability enhancements. For all the upgrade details, refer to the v2.3 Release Note.
[2025-09] FastDeploy v2.2: It now offers compatibility with models in the HuggingFace ecosystem, has further optimized performance, and newly adds support for baidu/ERNIE-21B-A3B-Thinking!
About
FastDeploy is an inference and deployment toolkit for large language models and visual language models based on PaddlePaddle. It delivers production-ready, out-of-the-box deployment solutions with core acceleration technologies:
- 🚀 Load-Balanced PD Disaggregation: Industrial-grade solution featuring context caching and dynamic instance role switching. Optimizes resource utilization while balancing SLO compliance and throughput.
- 🔄 Unified KV Cache Transmission: Lightweight high-performance transport library with intelligent NVLink/RDMA selection.
- 🤝 OpenAI API Server and vLLM Compatible: One-command deployment with vLLM interface compatibility.
- 🧮 Comprehensive Quantization Format Support: W8A16, W8A8, W4A16, W4A8, W2A16, FP8, and more.
- ⏩ Advanced Acceleration Techniques: Speculative decoding, Multi-Token Prediction (MTP) and Chunked Prefill.
- 🖥️ Multi-Hardware Support: NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Iluvatar GPU, Enflame GCU, MetaX GPU, Intel Gaudi etc.
Requirements
- OS: Linux
- Python: 3.10 ~ 3.12
Installation
FastDeploy supports inference deployment on NVIDIA GPUs, Kunlunxin XPUs, Iluvatar GPUs, Enflame GCUs, Hygon DCUs and other hardware. For detailed installation instructions:
Get Started
Learn how to use FastDeploy through our documentation:
- 10-Minutes Quick Deployment
- ERNIE-4.5 Large Language Model Deployment
- ERNIE-4.5-VL Multimodal Model Deployment
- Offline Inference Development
- Online Service Deployment
- Best Practices
Supported Models
Learn how to download models, enable using the torch format, and more:
Advanced Usage
- Quantization
- PD Disaggregation Deployment
- Speculative Decoding
- Prefix Caching
- Chunked Prefill
- Load-Balancing Scheduling Router
Acknowledgement
FastDeploy is licensed under the Apache-2.0 open-source license. During development, portions of vLLM code were referenced and incorporated to maintain interface compatibility, for which we express our gratitude.