Mirror of https://github.com/PaddlePaddle/FastDeploy.git, synced 2026-04-23 00:17:25 +08:00
@@ -8,22 +8,23 @@

The minimum number of GPUs required to deploy DeepSeek-V3/V3.1 at each quantization precision on the following hardware is as follows:

| | WINT4 |
|-----|-----|
|H800 80GB| 8 |
### 1.2 Installing FastDeploy

Refer to the installation document [FastDeploy GPU Installation](../get_started/installation/nvidia_gpu.md)
## II. How to Use

### 2.1 Basics: Starting the Service

**Example 1:** Deploying a wint4 model with a 16K context on an H800 using eight GPUs
```shell
MODEL_PATH=/models/DeepSeek/DeepSeek-V3.1-Terminus-BF16

export FD_DISABLE_CHUNKED_PREFILL=1
export FD_ATTENTION_BACKEND="MLA_ATTN"
export FLAGS_flash_attn_version=3
@@ -41,3 +42,115 @@ python -m fastdeploy.entrypoints.openai.api_server \
  --quantization wint4
```

**Example 2:** Deploying a block_wise_fp8 model with a 16K context on an H800 setup with 16 GPUs

```shell
# Currently only configurations with tp_size 8 and ep_size 16 are supported
MODEL_PATH=/models/DeepSeek/DeepSeek-V3.1-Terminus-BF16

export FD_DISABLE_CHUNKED_PREFILL=1
export FD_ATTENTION_BACKEND="MLA_ATTN"
export FLAGS_flash_attn_version=3
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export FD_ENABLE_MULTI_API_SERVER=1

python -m fastdeploy.entrypoints.openai.multi_api_server \
  --ports "9811" \
  --num-servers 1 \
  --args --model "$MODEL_PATH" \
  --ips "10.95.247.24,10.95.244.147" \
  --no-enable-prefix-caching \
  --quantization block_wise_fp8 \
  --disable-sequence-parallel-moe \
  --tensor-parallel-size 8 \
  --num-gpu-blocks-override 1024 \
  --data-parallel-size 2 \
  --max-model-len 16384 \
  --enable-expert-parallel \
  --max-num-seqs 20 \
  --graph-optimization-config '{"use_cudagraph":true}'
```
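As a quick sanity check on the topology in Example 2 (a sketch; the node count comes from the two hosts in `--ips`, and eight visible devices per host from `CUDA_VISIBLE_DEVICES`), the product of the tensor- and data-parallel sizes must equal the total number of GPUs:

```python
# Sketch: verify tensor_parallel_size x data_parallel_size covers
# every GPU in the deployment (values taken from Example 2 above).
tensor_parallel_size = 8   # --tensor-parallel-size
data_parallel_size = 2     # --data-parallel-size
nodes = 2                  # two hosts listed in --ips
gpus_per_node = 8          # CUDA_VISIBLE_DEVICES=0..7 on each host

total_gpus = nodes * gpus_per_node
assert tensor_parallel_size * data_parallel_size == total_gpus
print(total_gpus)  # -> 16
```

The same check holds for Example 3 below, where `--tensor-parallel-size 1` and `--data-parallel-size 16` again multiply to 16.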

**Example 3:** Deploying a block_wise_fp8 model with a 16K context on an H800 setup with 16 GPUs

This example enables the FlashMLA operator for MLA computation.

```shell
MODEL_PATH=/models/DeepSeek/DeepSeek-V3.1-Terminus-BF16

export FD_DISABLE_CHUNKED_PREFILL=1
export FD_ATTENTION_BACKEND="MLA_ATTN"
export FLAGS_flash_attn_version=3
export USE_FLASH_MLA=1
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export FD_ENABLE_MULTI_API_SERVER=1

python -m fastdeploy.entrypoints.openai.multi_api_server \
  --ports "9811,9812,9813,9814,9815,9816,9817,9818" \
  --num-servers 8 \
  --args --model "$MODEL_PATH" \
  --ips "10.95.246.220,10.95.230.91" \
  --no-enable-prefix-caching \
  --quantization block_wise_fp8 \
  --disable-sequence-parallel-moe \
  --tensor-parallel-size 1 \
  --num-gpu-blocks-override 1024 \
  --data-parallel-size 16 \
  --max-model-len 16384 \
  --enable-expert-parallel \
  --max-num-seqs 20 \
  --graph-optimization-config '{"use_cudagraph":true}'
```
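When `--num-servers` grows, writing the `--ports` list by hand gets error-prone. A small helper (an illustration only, not part of FastDeploy) can generate the comma-separated value from a base port:

```python
def ports_arg(base_port: int, num_servers: int) -> str:
    """Build the comma-separated value for --ports, one port per server."""
    return ",".join(str(base_port + i) for i in range(num_servers))

# Reproduces the list used in Example 3 above.
print(ports_arg(9811, 8))  # -> 9811,9812,9813,9814,9815,9816,9817,9818
```

Used inline, e.g. `--ports "$(python -c '...')"`, this keeps the port list consistent with `--num-servers`.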

# DeepSeek-V3.2 Model

## I. Environment Preparation

### 1.1 Support Requirements

The minimum number of GPUs required to deploy the DeepSeek-V3.2 model with block_wise_fp8 quantization on the following hardware is as follows:

| | block_wise_fp8 |
|-----|-----|
|H800 80GB| 16 |

### 1.2 Installing FastDeploy

Refer to the installation document [FastDeploy GPU Installation](../get_started/installation/nvidia_gpu.md)

## II. How to Use

### 2.1 Basics: Starting the Service

**Example 1:** Deploying a block_wise_fp8 model with an 8K context on an H800 setup with 16 GPUs

```shell
MODEL_PATH=/models/DeepSeek-V3.2-Exp-BF16

export FD_DISABLE_CHUNKED_PREFILL=1
export FD_ATTENTION_BACKEND="DSA_ATTN"
export FD_ENABLE_MULTI_API_SERVER=1

python -m fastdeploy.entrypoints.openai.multi_api_server \
  --ports "8091,8092,8093,8094,8095,8096,8097,8098" \
  --num-servers 8 \
  --args --model "$MODEL_PATH" \
  --ips "10.95.246.79,10.95.239.17" \
  --no-enable-prefix-caching \
  --quantization block_wise_fp8 \
  --disable-sequence-parallel-moe \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.85 \
  --max-num-batched-tokens 8192 \
  --data-parallel-size 16 \
  --max-model-len 8192 \
  --enable-expert-parallel \
  --max-num-seqs 20 \
  --num-gpu-blocks-override 2048 \
  --graph-optimization-config '{"use_cudagraph":false}' \
  --no-enable-overlap-schedule
```
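Each port started above serves an OpenAI-compatible REST API. The snippet below builds a chat-completions request with the standard library (a sketch: the host, the port, the `/v1/chat/completions` path, and the `model` field are placeholders following OpenAI API conventions; adapt them to your deployment and uncomment the last line to actually send the request):

```python
import json
import urllib.request

# Placeholders -- point these at one of the ports started above.
url = "http://127.0.0.1:8091/v1/chat/completions"
payload = {
    "model": "default",  # hypothetical name; match your deployment's served model
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 32,
}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# response = json.load(urllib.request.urlopen(req))  # requires a running service
```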

@@ -18,7 +18,7 @@ The minimum number of GPUs required to deploy DeepSeek-V3/V3.1 at each quantization precision on the following hardware

### 2.1 Basics: Starting the Service

**Example 1:** Deploying a wint4 model with a 16K context on an H800 using eight GPUs

```shell
MODEL_PATH=/models/DeepSeek/DeepSeek-V3.1-Terminus-BF16

export FD_DISABLE_CHUNKED_PREFILL=1
export FD_ATTENTION_BACKEND="MLA_ATTN"

@@ -38,10 +38,10 @@ python -m fastdeploy.entrypoints.openai.api_server \
```

**Example 2:** Deploying a block_wise_fp8 model with a 16K context on an H800 setup with 16 GPUs

```shell
MODEL_PATH=/models/DeepSeek/DeepSeek-V3.1-Terminus-BF16

export FD_DISABLE_CHUNKED_PREFILL=1
export FD_ATTENTION_BACKEND="MLA_ATTN"

@@ -55,7 +55,7 @@ export FD_ENABLE_MULTI_API_SERVER=1
python -m fastdeploy.entrypoints.openai.multi_api_server \
  --ports "9811" \
  --num-servers 1 \
  --args --model "$MODEL_PATH" \
  --ips "10.95.247.24,10.95.244.147" \
  --no-enable-prefix-caching \
  --quantization block_wise_fp8 \

@@ -70,13 +70,12 @@ python -m fastdeploy.entrypoints.openai.multi_api_server \
```

**Example 3:** Deploying a block_wise_fp8 model with a 16K context on an H800 setup with 16 GPUs

This example enables the FlashMLA operator for MLA computation.

```shell
MODEL_PATH=/models/DeepSeek/DeepSeek-V3.1-Terminus-BF16

export FD_DISABLE_CHUNKED_PREFILL=1
export FD_ATTENTION_BACKEND="MLA_ATTN"
export FLAGS_flash_attn_version=3

@@ -87,7 +86,7 @@ export FD_ENABLE_MULTI_API_SERVER=1
python -m fastdeploy.entrypoints.openai.multi_api_server \
  --ports "9811,9812,9813,9814,9815,9816,9817,9818" \
  --num-servers 8 \
  --args --model "$MODEL_PATH" \
  --ips "10.95.246.220,10.95.230.91" \
  --no-enable-prefix-caching \
  --quantization block_wise_fp8 \

@@ -100,3 +99,46 @@ python -m fastdeploy.entrypoints.openai.multi_api_server \
  --max-num-seqs 20 \
  --graph-optimization-config '{"use_cudagraph":true}'
```

# DeepSeek-V3.2 Model

## I. Environment Preparation

### 1.1 Support Requirements

The minimum number of GPUs required to deploy the DeepSeek-V3.2 model with block_wise_fp8 quantization on the following hardware is as follows:

| | block_wise_fp8 |
|-----|-----|
|H800 80GB| 16 |

### 1.2 Installing FastDeploy

Refer to the installation document [FastDeploy GPU Installation](../get_started/installation/nvidia_gpu.md)

## II. How to Use

### 2.1 Basics: Starting the Service

**Example 1:** Deploying a block_wise_fp8 model with an 8K context on an H800 setup with 16 GPUs

```shell
MODEL_PATH=/models/DeepSeek-V3.2-Exp-BF16

export FD_DISABLE_CHUNKED_PREFILL=1
export FD_ATTENTION_BACKEND="DSA_ATTN"
export FD_ENABLE_MULTI_API_SERVER=1

python -m fastdeploy.entrypoints.openai.multi_api_server \
  --ports "8091,8092,8093,8094,8095,8096,8097,8098" \
  --num-servers 8 \
  --args --model "$MODEL_PATH" \
  --ips "10.95.246.79,10.95.239.17" \
  --no-enable-prefix-caching \
  --quantization block_wise_fp8 \
  --disable-sequence-parallel-moe \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.85 \
  --max-num-batched-tokens 8192 \
  --data-parallel-size 16 \
  --max-model-len 8192 \
  --enable-expert-parallel \
  --max-num-seqs 20 \
  --num-gpu-blocks-override 2048 \
  --graph-optimization-config '{"use_cudagraph":false}' \
  --no-enable-overlap-schedule
```