diff --git a/docs/best_practices/DeepSeek-V3.md b/docs/best_practices/DeepSeek-V3.md
new file mode 100644
index 0000000000..d7b9c1a02b
--- /dev/null
+++ b/docs/best_practices/DeepSeek-V3.md
@@ -0,0 +1,41 @@
+[简体中文](../zh/best_practices/DeepSeek-V3.md)
+
+# DeepSeek-V3/V3.1 Model
+
+## I. Environment Preparation
+
+### 1.1 Support Requirements
+The minimum number of GPUs required to deploy each quantization precision of DeepSeek-V3/V3.1 on the following hardware is:
+
+| | WINT4 |
+|-----|-----|
+|H800 80GB| 8 |
+
+### 1.2 Installing FastDeploy
+
+For installation instructions, see [FastDeploy GPU Installation](../get_started/installation/nvidia_gpu.md).
+
+## II. How to Use
+
+### 2.1 Basics: Starting the Service
+
+**Example 1:** Deploying a WINT4 model with a 16K context on eight H800 GPUs
+
+```shell
+MODEL_PATH=/models/DeepSeek-V3.2-Exp-BF16
+export FD_DISABLE_CHUNKED_PREFILL=1
+export FD_ATTENTION_BACKEND="MLA_ATTN"
+export FLAGS_flash_attn_version=3
+
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model "$MODEL_PATH" \
+    --port 8180 \
+    --metrics-port 8181 \
+    --engine-worker-queue-port 8182 \
+    --cache-queue-port 8183 \
+    --tensor-parallel-size 8 \
+    --max-model-len 16384 \
+    --max-num-seqs 100 \
+    --no-enable-prefix-caching \
+    --quantization wint4
+```
diff --git a/docs/zh/best_practices/DeepSeek-V3.md b/docs/zh/best_practices/DeepSeek-V3.md
new file mode 100644
index 0000000000..1271c11cfc
--- /dev/null
+++ b/docs/zh/best_practices/DeepSeek-V3.md
@@ -0,0 +1,39 @@
+[English](../../best_practices/DeepSeek-V3.md)
+
+# DeepSeek-V3/V3.1 Model
+
+## I. Environment Preparation
+### 1.1 Support Requirements
+The minimum number of GPUs required to deploy each quantization precision of DeepSeek-V3/V3.1 on the following hardware is:
+
+| | WINT4 |
+|-----|-----|
+|H800 80GB| 8 |
+
+### 1.2 Installing FastDeploy
+
+For installation instructions, see [FastDeploy GPU Installation](../get_started/installation/nvidia_gpu.md).
+
+## II. How to Use
+### 2.1 Basics: Starting the Service
+**Example 1:** Deploying a WINT4 model with a 16K context on eight H800 GPUs
+```shell
+MODEL_PATH=/models/DeepSeek-V3.2-Exp-BF16
+
+export FD_DISABLE_CHUNKED_PREFILL=1
+export FD_ATTENTION_BACKEND="MLA_ATTN"
+export FLAGS_flash_attn_version=3
+
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model "$MODEL_PATH" \
+    --port 8180 \
+    --metrics-port 8181 \
+    --engine-worker-queue-port 8182 \
+    --cache-queue-port 8183 \
+    --tensor-parallel-size 8 \
+    --max-model-len 16384 \
+    --max-num-seqs 100 \
+    --no-enable-prefix-caching \
+    --quantization wint4
+
+```
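The launch command above starts an OpenAI-compatible server, so once it is up a client can send chat requests to `/v1/chat/completions` on the configured port. Below is a minimal client sketch using only the Python standard library; the host/port (`localhost:8180`, matching `--port` in the example) and the `"model"` field value are assumptions for illustration, not values confirmed by this document.

```python
import json
import urllib.request

# Assumption: the api_server from the launch example is listening on port 8180.
API_URL = "http://localhost:8180/v1/chat/completions"


def build_request(prompt: str, max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-style chat completion payload."""
    return {
        "model": "default",  # hypothetical served-model name; adjust to your deployment
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }


def chat(prompt: str) -> str:
    """POST a chat request and return the first choice's message content."""
    payload = json.dumps(build_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        API_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires the server from Example 1 to be running):
#   print(chat("Introduce DeepSeek-V3 in one sentence."))
```

Because the server speaks the OpenAI protocol, the official `openai` Python client pointed at the same base URL would work equally well; the raw-`urllib` version is shown only to avoid extra dependencies.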