mirror of https://github.com/PaddlePaddle/FastDeploy.git, synced 2026-04-23 00:17:25 +08:00

[简体中文](../zh/best_practices/DeepSeek-V3.md)

# DeepSeek-V3/V3.1 Model

## I. Environment Preparation

### 1.1 Support Requirements

The minimum number of GPUs required to deploy each quantization precision of DeepSeek-V3/V3.1 on the listed hardware is as follows:

| | WINT4 |
|-----|-----|
|H800 80GB| 8 |

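As a rough sanity check on why eight 80 GB GPUs are listed as the minimum for WINT4: assuming roughly 671B total parameters for DeepSeek-V3 (a figure from the public model card, not from this document), the 4-bit weights alone occupy about 312 GiB, or about 39 GiB per GPU under 8-way tensor parallelism, leaving the remainder of each card for KV cache and activations. A minimal back-of-the-envelope sketch:

```shell
# Back-of-the-envelope weight-memory estimate for WINT4 (4 bits per weight).
# TOTAL_PARAMS is an assumed figure (~671B, from DeepSeek-V3's public model card).
TOTAL_PARAMS=671000000000
WEIGHT_BYTES=$(( TOTAL_PARAMS / 2 ))      # 4 bits = half a byte per parameter
WEIGHT_GIB=$(( WEIGHT_BYTES / 1024**3 ))  # integer GiB, rounded down
PER_GPU_GIB=$(( WEIGHT_GIB / 8 ))         # split across 8-way tensor parallelism
echo "weights: ${WEIGHT_GIB} GiB total, ~${PER_GPU_GIB} GiB per GPU"
```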
### 1.2 Installing FastDeploy

For the installation process, see [FastDeploy GPU Installation](../get_started/installation/nvidia_gpu.md).

## II. How to Use

### 2.1 Basics: Starting the Service

**Example 1:** Deploying a WINT4 quantized model with a 16K context on eight H800 GPUs

```shell
MODEL_PATH=/models/DeepSeek-V3.2-Exp-BF16

export FD_DISABLE_CHUNKED_PREFILL=1
export FD_ATTENTION_BACKEND="MLA_ATTN"
export FLAGS_flash_attn_version=3

python -m fastdeploy.entrypoints.openai.api_server \
    --model "$MODEL_PATH" \
    --port 8180 \
    --metrics-port 8181 \
    --engine-worker-queue-port 8182 \
    --cache-queue-port 8183 \
    --tensor-parallel-size 8 \
    --max-model-len 16384 \
    --max-num-seq 100 \
    --no-enable-prefix-caching \
    --quantization wint4
```
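Once the service reports ready, it can be queried over its OpenAI-compatible HTTP interface. A minimal sketch, assuming the service is reachable at `localhost:8180` as configured above and that the request body follows the OpenAI chat-completions convention; adjust host, port, and prompt to your deployment:

```shell
# Assumed endpoint: the api_server launched above listens on localhost:8180.
PAYLOAD='{"messages": [{"role": "user", "content": "Introduce yourself."}], "stream": false}'

curl -s http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || echo "request failed: is the service running?"
```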

[English](../../best_practices/DeepSeek-V3-V3.1.md)

# DeepSeek-V3/V3.1 Model

## I. Environment Preparation

### 1.1 Support Requirements

The minimum number of GPUs required to deploy each quantization precision of DeepSeek-V3/V3.1 on the listed hardware is as follows:

| | WINT4 |
|-----|-----|
|H800 80GB| 8 |

### 1.2 Installing FastDeploy

For the installation process, see [FastDeploy GPU Installation](../get_started/installation/nvidia_gpu.md).

## II. How to Use

### 2.1 Basics: Starting the Service

**Example 1:** Deploying a WINT4 quantized model with a 16K context on eight H800 GPUs

```shell
MODEL_PATH=/models/DeepSeek-V3.2-Exp-BF16

export FD_DISABLE_CHUNKED_PREFILL=1
export FD_ATTENTION_BACKEND="MLA_ATTN"
export FLAGS_flash_attn_version=3

python -m fastdeploy.entrypoints.openai.api_server \
    --model "$MODEL_PATH" \
    --port 8180 \
    --metrics-port 8181 \
    --engine-worker-queue-port 8182 \
    --cache-queue-port 8183 \
    --tensor-parallel-size 8 \
    --max-model-len 16384 \
    --max-num-seq 100 \
    --no-enable-prefix-caching \
    --quantization wint4
```