diff --git a/docs/usage/kunlunxin_xpu_deployment.md b/docs/usage/kunlunxin_xpu_deployment.md index 1f1936838c..abbabb461c 100644 --- a/docs/usage/kunlunxin_xpu_deployment.md +++ b/docs/usage/kunlunxin_xpu_deployment.md @@ -1,33 +1,11 @@ [简体中文](../zh/usage/kunlunxin_xpu_deployment.md) ## Supported Models - -| Model Name | Context Length | Quantization | XPUs Required | Applicable Version | -|------------|---------------|--------------|---------------|-------------------| -| ERNIE-4.5-300B-A47B | 32K | WINT8 | 8 | 2.5.0 | -| ERNIE-4.5-300B-A47B | 32K | WINT4 | 4 | 2.5.0 | -| ERNIE-4.5-300B-A47B | 32K | WINT4 | 8 | 2.5.0 | -| ERNIE-4.5-300B-A47B | 128K | WINT4 | 8 | 2.5.0 | -| ERNIE-4.5-21B-A3B | 32K | BF16 | 1 | 2.5.0 | -| ERNIE-4.5-21B-A3B | 32K | WINT8 | 1 | 2.5.0 | -| ERNIE-4.5-21B-A3B | 32K | WINT4 | 1 | 2.5.0 | -| ERNIE-4.5-21B-A3B | 128K | BF16 | 1 | 2.5.0 | -| ERNIE-4.5-21B-A3B | 128K | WINT8 | 1 | 2.5.0 | -| ERNIE-4.5-21B-A3B | 128K | WINT4 | 1 | 2.5.0 | -| ERNIE-4.5-0.3B | 32K | BF16 | 1 | 2.5.0 | -| ERNIE-4.5-0.3B | 32K | WINT8 | 1 | 2.5.0 | -| ERNIE-4.5-0.3B | 128K | BF16 | 1 | 2.5.0 | -| ERNIE-4.5-0.3B | 128K | WINT8 | 1 | 2.5.0 | -| ERNIE-4.5-300B-A47B-W4A8C8-TP4 | 32K | W4A8 | 4 | 2.5.0 | -| ERNIE-4.5-VL-28B-A3B | 32K | WINT8 | 1 | 2.5.0 | -| ERNIE-4.5-VL-424B-A47B | 32K | WINT8 | 8 | 2.5.0 | -| PaddleOCR-VL-0.9B | 32K | BF16 | 1 | 2.5.0 | -| ERNIE-4.5-VL-28B-A3B-Thinking | 128K | WINT8 | 1 | 2.5.0 | - +Note: The following models and deployment commands are valid only for version 2.5.0.
-ERNIE-4.5-300B-A47B (32K, WINT8, 8 XPUs) - Click to view deployment commands +ERNIE-4.5-300B-A47B (32K, WINT8, 8 XPUs) -**Quick Deployment:** +**Quick Launch:** ```bash export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python -m fastdeploy.entrypoints.openai.api_server \ @@ -40,11 +18,11 @@ python -m fastdeploy.entrypoints.openai.api_server \ --gpu-memory-utilization 0.9 ``` -**Best Deployment:** +**Optimal Performance:** ```bash export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" export BKCL_ENABLE_XDR=1 -export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # Consistent with your network card name +export BKCL_RDMA_NICS=mlx5_1,mlx5_1,mlx5_2,mlx5_2,mlx5_3,mlx5_3,mlx5_4,mlx5_4 # Check the RDMA network card names of the machine using `xpu-smi topo -m` export BKCL_TRACE_TOPO=1 export BKCL_PCIE_RING=1 export XSHMEM_MODE=1 @@ -64,14 +42,14 @@ python -m fastdeploy.entrypoints.openai.api_server \ --enable-expert-parallel \ --enable-prefix-caching \ --data-parallel-size 1 \ - --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}' + --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}' ```
-ERNIE-4.5-300B-A47B (32K, WINT4, 4 XPUs) - Click to view deployment commands +ERNIE-4.5-300B-A47B (32K, WINT4, 4 XPUs) -**Quick Deployment:** +**Quick Launch:** ```bash export XPU_VISIBLE_DEVICES="0,1,2,3" # or "4,5,6,7" python -m fastdeploy.entrypoints.openai.api_server \ @@ -84,11 +62,11 @@ python -m fastdeploy.entrypoints.openai.api_server \ --gpu-memory-utilization 0.9 ``` -**Best Deployment:** +**Optimal Performance:** ```bash export XPU_VISIBLE_DEVICES="0,1,2,3" # or "4,5,6,7" export BKCL_ENABLE_XDR=1 -export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # Consistent with your network card name +export BKCL_RDMA_NICS=mlx5_1,mlx5_1,mlx5_2,mlx5_2 # Check the RDMA network card names of the machine using `xpu-smi topo -m` export BKCL_TRACE_TOPO=1 export BKCL_PCIE_RING=1 export XSHMEM_MODE=1 @@ -108,14 +86,14 @@ python -m fastdeploy.entrypoints.openai.api_server \ --enable-expert-parallel \ --enable-prefix-caching \ --data-parallel-size 1 \ - --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}' + --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}' ```
-ERNIE-4.5-300B-A47B (32K, WINT4, 8 XPUs) - Click to view deployment commands +ERNIE-4.5-300B-A47B (32K, WINT4, 8 XPUs) -**Quick Deployment:** +**Quick Launch:** ```bash export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python -m fastdeploy.entrypoints.openai.api_server \ @@ -128,18 +106,18 @@ python -m fastdeploy.entrypoints.openai.api_server \ --gpu-memory-utilization 0.95 ``` -**Best Deployment:** +**Optimal Performance:** ```bash export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" export BKCL_ENABLE_XDR=1 -export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # Consistent with your network card name +export BKCL_RDMA_NICS=mlx5_1,mlx5_1,mlx5_2,mlx5_2,mlx5_3,mlx5_3,mlx5_4,mlx5_4 # Check the RDMA network card names of the machine using `xpu-smi topo -m` export BKCL_TRACE_TOPO=1 export BKCL_PCIE_RING=1 export XSHMEM_MODE=1 export XSHMEM_QP_NUM_PER_RANK=32 export BKCL_RDMA_VERBS=1 python -m fastdeploy.entrypoints.openai.api_server \ - --model /home/ERNIE-4.5-300B-A47B-Paddle \ + --model /PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ --port 8188 \ --engine-worker-queue-port 8124 \ --metrics-port 8125 \ @@ -152,14 +130,14 @@ python -m fastdeploy.entrypoints.openai.api_server \ --enable-expert-parallel \ --enable-prefix-caching \ --data-parallel-size 1 \ - --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}' + --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}' ```
-ERNIE-4.5-300B-A47B (128K, WINT4, 8 XPUs) - Click to view deployment commands +ERNIE-4.5-300B-A47B (128K, WINT4, 8 XPUs) -**Quick Deployment:** +**Quick Launch:** ```bash export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python -m fastdeploy.entrypoints.openai.api_server \ @@ -172,18 +150,18 @@ python -m fastdeploy.entrypoints.openai.api_server \ --gpu-memory-utilization 0.9 ``` -**Best Deployment:** +**Optimal Performance:** ```bash export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" export BKCL_ENABLE_XDR=1 -export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # Consistent with your network card name +export BKCL_RDMA_NICS=mlx5_1,mlx5_1,mlx5_2,mlx5_2,mlx5_3,mlx5_3,mlx5_4,mlx5_4 # Check the RDMA network card names of the machine using `xpu-smi topo -m` export BKCL_TRACE_TOPO=1 export BKCL_PCIE_RING=1 export XSHMEM_MODE=1 export XSHMEM_QP_NUM_PER_RANK=32 export BKCL_RDMA_VERBS=1 python -m fastdeploy.entrypoints.openai.api_server \ - --model /home/ERNIE-4.5-300B-A47B-Paddle \ + --model /PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ --port 8123 \ --engine-worker-queue-port 8124 \ --metrics-port 8125 \ @@ -196,14 +174,14 @@ python -m fastdeploy.entrypoints.openai.api_server \ --enable-expert-parallel \ --enable-prefix-caching \ --data-parallel-size 1 \ - --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}' + --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}' ```
-ERNIE-4.5-21B-A3B (32K, BF16, 1 XPU) - Click to view deployment commands +ERNIE-4.5-21B-A3B (32K, BF16, 1 XPU) -**Quick Deployment:** +**Quick Launch:** ```bash export XPU_VISIBLE_DEVICES="0" # Specify any card python -m fastdeploy.entrypoints.openai.api_server \ @@ -215,7 +193,7 @@ python -m fastdeploy.entrypoints.openai.api_server \ --gpu-memory-utilization 0.9 ``` -**Best Deployment:** +**Optimal Performance:** ```bash export XPU_VISIBLE_DEVICES="0" # Specify any card python -m fastdeploy.entrypoints.openai.api_server \ @@ -230,9 +208,9 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-21B-A3B (32K, WINT8, 1 XPU) - Click to view deployment commands +ERNIE-4.5-21B-A3B (32K, WINT8, 1 XPU) -**Quick Deployment:** +**Quick Launch:** ```bash export XPU_VISIBLE_DEVICES="0" # Specify any card python -m fastdeploy.entrypoints.openai.api_server \ @@ -245,7 +223,7 @@ python -m fastdeploy.entrypoints.openai.api_server \ --gpu-memory-utilization 0.9 ``` -**Best Deployment:** +**Optimal Performance:** ```bash export XPU_VISIBLE_DEVICES="0" # Specify any card python -m fastdeploy.entrypoints.openai.api_server \ @@ -261,9 +239,9 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-21B-A3B (32K, WINT4, 1 XPU) - Click to view deployment commands +ERNIE-4.5-21B-A3B (32K, WINT4, 1 XPU) -**Quick Deployment:** +**Quick Launch:** ```bash export XPU_VISIBLE_DEVICES="0" # Specify any card python -m fastdeploy.entrypoints.openai.api_server \ @@ -276,7 +254,7 @@ python -m fastdeploy.entrypoints.openai.api_server \ --gpu-memory-utilization 0.9 ``` -**Best Deployment:** +**Optimal Performance:** ```bash export XPU_VISIBLE_DEVICES="0" # Specify any card python -m fastdeploy.entrypoints.openai.api_server \ @@ -292,9 +270,9 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-21B-A3B (128K, BF16, 1 XPU) - Click to view deployment commands +ERNIE-4.5-21B-A3B (128K, BF16, 1 XPU) -**Quick Deployment:** +**Quick Launch:** ```bash export XPU_VISIBLE_DEVICES="0" # Specify any card python -m fastdeploy.entrypoints.openai.api_server \ @@ -306,7 +284,7 @@ python -m fastdeploy.entrypoints.openai.api_server \ --gpu-memory-utilization 0.9 ``` -**Best Deployment:** +**Optimal Performance:** ```bash export XPU_VISIBLE_DEVICES="0" # Specify any card python -m fastdeploy.entrypoints.openai.api_server \ @@ -321,9 +299,9 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-21B-A3B (128K, WINT8, 1 XPU) - Click to view deployment commands +ERNIE-4.5-21B-A3B (128K, WINT8, 1 XPU) -**Quick Deployment:** +**Quick Launch:** ```bash export XPU_VISIBLE_DEVICES="0" # Specify any card python -m fastdeploy.entrypoints.openai.api_server \ @@ -336,7 +314,7 @@ python -m fastdeploy.entrypoints.openai.api_server \ --gpu-memory-utilization 0.9 ``` -**Best Deployment:** +**Optimal Performance:** ```bash export XPU_VISIBLE_DEVICES="0" # Specify any card python -m fastdeploy.entrypoints.openai.api_server \ @@ -352,9 +330,9 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-21B-A3B (128K, WINT4, 1 XPU) - Click to view deployment commands +ERNIE-4.5-21B-A3B (128K, WINT4, 1 XPU) -**Quick Deployment:** +**Quick Launch:** ```bash export XPU_VISIBLE_DEVICES="0" # Specify any card python -m fastdeploy.entrypoints.openai.api_server \ @@ -367,7 +345,7 @@ python -m fastdeploy.entrypoints.openai.api_server \ --gpu-memory-utilization 0.9 ``` -**Best Deployment:** +**Optimal Performance:** ```bash export XPU_VISIBLE_DEVICES="0" # Specify any card python -m fastdeploy.entrypoints.openai.api_server \ @@ -383,7 +361,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-0.3B (32K, BF16, 1 XPU) - Click to view deployment commands +ERNIE-4.5-0.3B (32K, BF16, 1 XPU) ```bash export XPU_VISIBLE_DEVICES="0" # Specify any card @@ -398,7 +376,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-0.3B (32K, WINT8, 1 XPU) - Click to view deployment commands +ERNIE-4.5-0.3B (32K, WINT8, 1 XPU) ```bash export XPU_VISIBLE_DEVICES="0" # Specify any card @@ -414,7 +392,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-0.3B (128K, BF16, 1 XPU) - Click to view deployment commands +ERNIE-4.5-0.3B (128K, BF16, 1 XPU) ```bash export XPU_VISIBLE_DEVICES="0" # Specify any card @@ -429,7 +407,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-0.3B (128K, WINT8, 1 XPU) - Click to view deployment commands +ERNIE-4.5-0.3B (128K, WINT8, 1 XPU) ```bash export XPU_VISIBLE_DEVICES="0" # Specify any card @@ -445,7 +423,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-300B-A47B-W4A8C8-TP4 (32K, W4A8, 4 XPUs) - Click to view deployment commands +ERNIE-4.5-300B-A47B-W4A8C8-TP4 (32K, W4A8, 4 XPUs) ```bash export XPU_VISIBLE_DEVICES="0,1,2,3" # or "4,5,6,7" @@ -461,7 +439,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-VL-28B-A3B (32K, WINT8, 1 XPU) - Click to view deployment commands +ERNIE-4.5-VL-28B-A3B (32K, WINT8, 1 XPU) ```bash export XPU_VISIBLE_DEVICES="0" # Specify any card @@ -480,7 +458,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-VL-424B-A47B (32K, WINT8, 8 XPUs) - Click to view deployment commands +ERNIE-4.5-VL-424B-A47B (32K, WINT8, 8 XPUs) ```bash export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" @@ -500,7 +478,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
-PaddleOCR-VL-0.9B (32K, BF16, 1 XPU) - Click to view deployment commands +PaddleOCR-VL-0.9B (32K, BF16, 1 XPU) ```bash export FD_ENABLE_MAX_PREFILL=1 @@ -518,7 +496,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-VL-28B-A3B-Thinking (128K, WINT8, 1 XPU) - Click to view deployment commands +ERNIE-4.5-VL-28B-A3B-Thinking (128K, WINT8, 1 XPU) ```bash export XPU_VISIBLE_DEVICES="0" # Specify any card @@ -538,9 +516,9 @@ python -m fastdeploy.entrypoints.openai.api_server \ ```
-## Quick start +## Examples -### Deploy online serving based on ERNIE-4.5-300B-A47B-Paddle +### Running ERNIE-4.5-300B-A47B-Paddle #### Start service @@ -610,7 +588,7 @@ print('\n') For detailed OpenAI protocol specifications, see [OpenAI Chat Completion API](https://platform.openai.com/docs/api-reference/chat/create). Differences from the standard OpenAI protocol are documented in [OpenAI Protocol-Compatible API Server](../online_serving/README.md). -### Deploy online serving based on ERNIE-4.5-VL-28B-A3B-Paddle +### Running ERNIE-4.5-VL-28B-A3B-Paddle #### Start service Deploy the ERNIE-4.5-VL-28B-A3B-Paddle model with WINT8 precision and 32K context length on 1 XPU @@ -681,7 +659,7 @@ for chunk in response: print('\n') ``` -### Deploy online serving based on PaddleOCR-VL-0.9B +### Running PaddleOCR-VL-0.9B #### Start service @@ -752,7 +730,7 @@ for chunk in response: print('\n') ``` -### Deploy online serving based on ERNIE-4.5-VL-28B-A3B-Thinking +### Running ERNIE-4.5-VL-28B-A3B-Thinking #### Start service Deploy the ERNIE-4.5-VL-28B-A3B-Thinking model with WINT8 precision and 128K context length on 1 XPU diff --git a/docs/zh/usage/kunlunxin_xpu_deployment.md b/docs/zh/usage/kunlunxin_xpu_deployment.md index 17df4ef791..67e0600bb2 100644 --- a/docs/zh/usage/kunlunxin_xpu_deployment.md +++ b/docs/zh/usage/kunlunxin_xpu_deployment.md @@ -2,32 +2,11 @@ ## 支持的模型 -| 模型名称 | 上下文长度 | 量化 | 所需卡数 | 适用版本 | -|----------|-----------|------|---------|---------| -| ERNIE-4.5-300B-A47B | 32K | WINT8 | 8 | 2.5.0 | -| ERNIE-4.5-300B-A47B | 32K | WINT4 | 4 | 2.5.0 | -| ERNIE-4.5-300B-A47B | 32K | WINT4 | 8 | 2.5.0 | -| ERNIE-4.5-300B-A47B | 128K | WINT4 | 8 | 2.5.0 | -| ERNIE-4.5-21B-A3B | 32K | BF16 | 1 | 2.5.0 | -| ERNIE-4.5-21B-A3B | 32K | WINT8 | 1 | 2.5.0 | -| ERNIE-4.5-21B-A3B | 32K | WINT4 | 1 | 2.5.0 | -| ERNIE-4.5-21B-A3B | 128K | BF16 | 1 | 2.5.0 | -| ERNIE-4.5-21B-A3B | 128K | WINT8 | 1 | 2.5.0 | -| ERNIE-4.5-21B-A3B | 128K | WINT4 | 1 | 2.5.0 | -| ERNIE-4.5-0.3B | 
32K | BF16 | 1 | 2.5.0 | -| ERNIE-4.5-0.3B | 32K | WINT8 | 1 | 2.5.0 | -| ERNIE-4.5-0.3B | 128K | BF16 | 1 | 2.5.0 | -| ERNIE-4.5-0.3B | 128K | WINT8 | 1 | 2.5.0 | -| ERNIE-4.5-300B-A47B-W4A8C8-TP4 | 32K | W4A8 | 4 | 2.5.0 | -| ERNIE-4.5-VL-28B-A3B | 32K | WINT8 | 1 | 2.5.0 | -| ERNIE-4.5-VL-424B-A47B | 32K | WINT8 | 8 | 2.5.0 | -| PaddleOCR-VL-0.9B | 32K | BF16 | 1 | 2.5.0 | -| ERNIE-4.5-VL-28B-A3B-Thinking | 128K | WINT8 | 1 | 2.5.0 | - +注:以下模型及部署命令仅适用于 2.5.0 版本。
-ERNIE-4.5-300B-A47B (32K, WINT8, 8 卡) - 点击查看部署命令 +ERNIE-4.5-300B-A47B (32K, WINT8, 8 卡) -**快速部署:** +**快速启动:** ```bash export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python -m fastdeploy.entrypoints.openai.api_server \ @@ -40,11 +19,11 @@ python -m fastdeploy.entrypoints.openai.api_server \ --gpu-memory-utilization 0.9 ``` -**最优部署:** +**性能更优:** ```bash export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" export BKCL_ENABLE_XDR=1 -export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # 与线上网卡名一致 +export BKCL_RDMA_NICS=mlx5_1,mlx5_1,mlx5_2,mlx5_2,mlx5_3,mlx5_3,mlx5_4,mlx5_4 # 通过 `xpu-smi topo -m` 命令查看机器的RDMA网卡名称 export BKCL_TRACE_TOPO=1 export BKCL_PCIE_RING=1 export XSHMEM_MODE=1 @@ -64,14 +43,14 @@ python -m fastdeploy.entrypoints.openai.api_server \ --enable-expert-parallel \ --enable-prefix-caching \ --data-parallel-size 1 \ - --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}' + --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}' ```
-ERNIE-4.5-300B-A47B (32K, WINT4, 4 卡) - 点击查看部署命令 +ERNIE-4.5-300B-A47B (32K, WINT4, 4 卡) -**快速部署:** +**快速启动:** ```bash export XPU_VISIBLE_DEVICES="0,1,2,3" # 或 "4,5,6,7" python -m fastdeploy.entrypoints.openai.api_server \ @@ -84,11 +63,11 @@ python -m fastdeploy.entrypoints.openai.api_server \ --gpu-memory-utilization 0.9 ``` -**最优部署:** +**性能更优:** ```bash export XPU_VISIBLE_DEVICES="0,1,2,3" # 或 "4,5,6,7" export BKCL_ENABLE_XDR=1 -export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # 与线上网卡名一致 +export BKCL_RDMA_NICS=mlx5_1,mlx5_1,mlx5_2,mlx5_2 # 通过 `xpu-smi topo -m` 命令查看机器的RDMA网卡名称 export BKCL_TRACE_TOPO=1 export BKCL_PCIE_RING=1 export XSHMEM_MODE=1 @@ -108,14 +87,14 @@ python -m fastdeploy.entrypoints.openai.api_server \ --enable-expert-parallel \ --enable-prefix-caching \ --data-parallel-size 1 \ - --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}' + --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}' ```
-ERNIE-4.5-300B-A47B (32K, WINT4, 8 卡) - 点击查看部署命令 +ERNIE-4.5-300B-A47B (32K, WINT4, 8 卡) -**快速部署:** +**快速启动:** ```bash export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python -m fastdeploy.entrypoints.openai.api_server \ @@ -128,18 +107,18 @@ python -m fastdeploy.entrypoints.openai.api_server \ --gpu-memory-utilization 0.95 ``` -**最优部署:** +**性能更优:** ```bash export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" export BKCL_ENABLE_XDR=1 -export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # 与线上网卡名一致 +export BKCL_RDMA_NICS=mlx5_1,mlx5_1,mlx5_2,mlx5_2,mlx5_3,mlx5_3,mlx5_4,mlx5_4 # 通过 `xpu-smi topo -m` 命令查看机器的RDMA网卡名称 export BKCL_TRACE_TOPO=1 export BKCL_PCIE_RING=1 export XSHMEM_MODE=1 export XSHMEM_QP_NUM_PER_RANK=32 export BKCL_RDMA_VERBS=1 python -m fastdeploy.entrypoints.openai.api_server \ - --model /home/ERNIE-4.5-300B-A47B-Paddle \ + --model /PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ --port 8188 \ --engine-worker-queue-port 8124 \ --metrics-port 8125 \ @@ -152,14 +131,14 @@ python -m fastdeploy.entrypoints.openai.api_server \ --enable-expert-parallel \ --enable-prefix-caching \ --data-parallel-size 1 \ - --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}' + --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}' ```
-ERNIE-4.5-300B-A47B (128K, WINT4, 8 卡) - 点击查看部署命令 +ERNIE-4.5-300B-A47B (128K, WINT4, 8 卡) -**快速部署:** +**快速启动:** ```bash export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python -m fastdeploy.entrypoints.openai.api_server \ @@ -172,18 +151,18 @@ python -m fastdeploy.entrypoints.openai.api_server \ --gpu-memory-utilization 0.9 ``` -**最优部署:** +**性能更优:** ```bash export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" export BKCL_ENABLE_XDR=1 -export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # 与线上网卡名一致 +export BKCL_RDMA_NICS=mlx5_1,mlx5_1,mlx5_2,mlx5_2,mlx5_3,mlx5_3,mlx5_4,mlx5_4 # 通过 `xpu-smi topo -m` 命令查看机器的RDMA网卡名称 export BKCL_TRACE_TOPO=1 export BKCL_PCIE_RING=1 export XSHMEM_MODE=1 export XSHMEM_QP_NUM_PER_RANK=32 export BKCL_RDMA_VERBS=1 python -m fastdeploy.entrypoints.openai.api_server \ - --model /home/ERNIE-4.5-300B-A47B-Paddle \ + --model /PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ --port 8123 \ --engine-worker-queue-port 8124 \ --metrics-port 8125 \ @@ -196,14 +175,14 @@ python -m fastdeploy.entrypoints.openai.api_server \ --enable-expert-parallel \ --enable-prefix-caching \ --data-parallel-size 1 \ - --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}' + --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}' ```
-ERNIE-4.5-21B-A3B (32K, BF16, 1 卡) - 点击查看部署命令 +ERNIE-4.5-21B-A3B (32K, BF16, 1 卡) -**快速部署:** +**快速启动:** ```bash export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡 python -m fastdeploy.entrypoints.openai.api_server \ @@ -215,7 +194,7 @@ python -m fastdeploy.entrypoints.openai.api_server \ --gpu-memory-utilization 0.9 ``` -**最优部署:** +**性能更优:** ```bash export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡 python -m fastdeploy.entrypoints.openai.api_server \ @@ -230,9 +209,9 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-21B-A3B (32K, WINT8, 1 卡) - 点击查看部署命令 +ERNIE-4.5-21B-A3B (32K, WINT8, 1 卡) -**快速部署:** +**快速启动:** ```bash export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡 python -m fastdeploy.entrypoints.openai.api_server \ @@ -245,7 +224,7 @@ python -m fastdeploy.entrypoints.openai.api_server \ --gpu-memory-utilization 0.9 ``` -**最优部署:** +**性能更优:** ```bash export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡 python -m fastdeploy.entrypoints.openai.api_server \ @@ -261,9 +240,9 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-21B-A3B (32K, WINT4, 1 卡) - 点击查看部署命令 +ERNIE-4.5-21B-A3B (32K, WINT4, 1 卡) -**快速部署:** +**快速启动:** ```bash export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡 python -m fastdeploy.entrypoints.openai.api_server \ @@ -276,7 +255,7 @@ python -m fastdeploy.entrypoints.openai.api_server \ --gpu-memory-utilization 0.9 ``` -**最优部署:** +**性能更优:** ```bash export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡 python -m fastdeploy.entrypoints.openai.api_server \ @@ -292,9 +271,9 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-21B-A3B (128K, BF16, 1 卡) - 点击查看部署命令 +ERNIE-4.5-21B-A3B (128K, BF16, 1 卡) -**快速部署:** +**快速启动:** ```bash export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡 python -m fastdeploy.entrypoints.openai.api_server \ @@ -306,7 +285,7 @@ python -m fastdeploy.entrypoints.openai.api_server \ --gpu-memory-utilization 0.9 ``` -**最优部署:** +**性能更优:** ```bash export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡 python -m fastdeploy.entrypoints.openai.api_server \ @@ -321,9 +300,9 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-21B-A3B (128K, WINT8, 1 卡) - 点击查看部署命令 +ERNIE-4.5-21B-A3B (128K, WINT8, 1 卡) -**快速部署:** +**快速启动:** ```bash export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡 python -m fastdeploy.entrypoints.openai.api_server \ @@ -336,7 +315,7 @@ python -m fastdeploy.entrypoints.openai.api_server \ --gpu-memory-utilization 0.9 ``` -**最优部署:** +**性能更优:** ```bash export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡 python -m fastdeploy.entrypoints.openai.api_server \ @@ -352,9 +331,9 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-21B-A3B (128K, WINT4, 1 卡) - 点击查看部署命令 +ERNIE-4.5-21B-A3B (128K, WINT4, 1 卡) -**快速部署:** +**快速启动:** ```bash export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡 python -m fastdeploy.entrypoints.openai.api_server \ @@ -367,7 +346,7 @@ python -m fastdeploy.entrypoints.openai.api_server \ --gpu-memory-utilization 0.9 ``` -**最优部署:** +**性能更优:** ```bash export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡 python -m fastdeploy.entrypoints.openai.api_server \ @@ -383,7 +362,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-0.3B (32K, BF16, 1 卡) - 点击查看部署命令 +ERNIE-4.5-0.3B (32K, BF16, 1 卡) ```bash export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡 @@ -398,7 +377,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-0.3B (32K, WINT8, 1 卡) - 点击查看部署命令 +ERNIE-4.5-0.3B (32K, WINT8, 1 卡) ```bash export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡 @@ -414,7 +393,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-0.3B (128K, BF16, 1 卡) - 点击查看部署命令 +ERNIE-4.5-0.3B (128K, BF16, 1 卡) ```bash export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡 @@ -429,7 +408,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-0.3B (128K, WINT8, 1 卡) - 点击查看部署命令 +ERNIE-4.5-0.3B (128K, WINT8, 1 卡) ```bash export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡 @@ -445,7 +424,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-300B-A47B-W4A8C8-TP4 (32K, W4A8, 4 卡) - 点击查看部署命令 +ERNIE-4.5-300B-A47B-W4A8C8-TP4 (32K, W4A8, 4 卡) ```bash export XPU_VISIBLE_DEVICES="0,1,2,3" # 或 "4,5,6,7" @@ -461,7 +440,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-VL-28B-A3B (32K, WINT8, 1 卡) - 点击查看部署命令 +ERNIE-4.5-VL-28B-A3B (32K, WINT8, 1 卡) ```bash export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡 @@ -480,7 +459,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-VL-424B-A47B (32K, WINT8, 8 卡) - 点击查看部署命令 +ERNIE-4.5-VL-424B-A47B (32K, WINT8, 8 卡) ```bash export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" @@ -500,7 +479,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
-PaddleOCR-VL-0.9B (32K, BF16, 1 卡) - 点击查看部署命令 +PaddleOCR-VL-0.9B (32K, BF16, 1 卡) ```bash export FD_ENABLE_MAX_PREFILL=1 @@ -518,7 +497,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
-ERNIE-4.5-VL-28B-A3B-Thinking (128K, WINT8, 1 卡) - 点击查看部署命令 +ERNIE-4.5-VL-28B-A3B-Thinking (128K, WINT8, 1 卡) ```bash export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡 @@ -538,10 +517,9 @@ python -m fastdeploy.entrypoints.openai.api_server \ ```
-## 快速开始 - -### 基于ERNIE-4.5-300B-A47B-Paddle模型部署在线服务 +## 示例 +### 运行ERNIE-4.5-300B-A47B-Paddle #### 启动服务 基于 WINT4 精度和 32K 上下文部署 ERNIE-4.5-300B-A47B-Paddle 模型到 4 卡 P800 服务器 @@ -611,7 +589,7 @@ print('\n') OpenAI 协议的更多说明可参考文档 [OpenAI Chat Completion API](https://platform.openai.com/docs/api-reference/chat/create),以及与 OpenAI 协议的区别可以参考 [兼容 OpenAI 协议的服务化部署](../online_serving/README.md)。 -### 基于ERNIE-4.5-VL-28B-A3B-Paddle模型部署在线服务 +### 运行ERNIE-4.5-VL-28B-A3B-Paddle #### 启动服务 @@ -683,7 +661,7 @@ for chunk in response: print('\n') ``` -### 基于PaddleOCR-VL-0.9B模型部署在线服务 +### 运行PaddleOCR-VL-0.9B #### 启动服务 @@ -754,7 +732,7 @@ for chunk in response: print('\n') ``` -### 基于ERNIE-4.5-VL-28B-A3B-Thinking模型部署在线服务 +### 运行ERNIE-4.5-VL-28B-A3B-Thinking #### 启动服务
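A fix repeated throughout this diff rewrites the `--speculative-config` argument from `"${mtp_model_path}"` to `"'${mtp_model_path}'"`: inside single quotes the shell does not expand variables, so the closing/reopening quotes are what let `${mtp_model_path}` be substituted into the JSON. A minimal sketch of the difference (the variable name comes from the diff; the example value is an assumption):

```shell
#!/bin/sh
mtp_model_path=/models/mtp  # assumed example value; set this to your real MTP model path

# Before the fix: the entire argument is single-quoted, so the shell passes
# the literal text ${mtp_model_path} through to the JSON unexpanded.
before='{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}'

# After the fix: the single quotes are closed around the expansion, so the
# shell substitutes the variable's value before the JSON reaches the server.
after='{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'

echo "$before"
echo "$after"
```

The same quoting pattern applies to any other shell variable interpolated into a single-quoted JSON flag value.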