[简体中文](../zh/best_practices/Disaggregated.md)

# PD Disaggregated Deployment Best Practices

This document provides a comprehensive guide to FastDeploy's PD (Prefill-Decode) disaggregated deployment solution, covering both single-machine and cross-machine deployment modes with support for Tensor Parallelism (TP), Data Parallelism (DP), and Expert Parallelism (EP).

## 1. Deployment Overview and Environment Preparation

This guide demonstrates deployment practices using the ERNIE-4.5-300B-A47B-Paddle model on H100 80GB GPUs. Below are the minimum GPU requirements for different deployment configurations:

**Single-Machine Deployment (8 GPUs, Single Node)**

| Configuration | TP | DP | EP | GPUs Required |
|---------------|----|----|----|---------------|
| P:TP4DP1<br>D:TP4DP1 | 4 | 1 | - | 8 |
| P:TP1DP4EP4<br>D:TP1DP4EP4 | 1 | 4 | ✓ | 8 |

**Multi-Machine Deployment (16 GPUs, Cross-Node)**

| Configuration | TP | DP | EP | GPUs Required |
|---------------|----|----|----|---------------|
| P:TP8DP1<br>D:TP8DP1 | 8 | 1 | - | 16 |
| P:TP4DP2<br>D:TP4DP2 | 4 | 2 | - | 16 |
| P:TP1DP8EP8<br>D:TP1DP8EP8 | 1 | 8 | ✓ | 16 |

**Important Notes**:

1. **Quantization**: All configurations above use WINT4 quantization, specified via `--quantization wint4`
2. **EP Limitations**: When Expert Parallelism (EP) is enabled, only TP=1 is currently supported; multi-TP scenarios are not yet available
3. **Cross-Machine Network**: Cross-machine deployment requires RDMA network support for high-speed KV Cache transmission
4. **GPU Calculation**: Total GPUs = TP × DP × 2, with identical configurations for both Prefill and Decode instances
5. **CUDA Graph Capture**: Decode instances enable CUDA Graph capture by default for inference acceleration, while Prefill instances do not

### 1.1 Installing FastDeploy

Please refer to the [FastDeploy Installation Guide](https://paddlepaddle.github.io/FastDeploy/zh/install/) to set up your environment. For model downloads, please check the [Supported Models List](https://paddlepaddle.github.io/FastDeploy/zh/model_summary/).

### 1.2 Deployment Topology

**Single-Machine Deployment Topology**

```
┌──────────────────────────────┐
│  Single Machine 8×H100 80GB  │
│       ┌──────────────┐       │
│       │    Router    │       │
│       │ 0.0.0.0:8109 │       │
│       └──────────────┘       │
│              │               │
│         ┌────┴────┐          │
│         ▼         ▼          │
│   ┌─────────┐ ┌─────────┐    │
│   │ Prefill │ │ Decode  │    │
│   │ GPU 0-3 │ │ GPU 4-7 │    │
│   └─────────┘ └─────────┘    │
└──────────────────────────────┘
```

**Cross-Machine Deployment Topology**

```
┌─────────────────────┐                      ┌─────────────────────┐
│   Prefill Machine   │     RDMA Network     │   Decode Machine    │
│     8×H100 80GB     │◄────────────────────►│     8×H100 80GB     │
│                     │                      │                     │
│  ┌──────────────┐   │                      │                     │
│  │    Router    │   │                      │                     │
│  │ 0.0.0.0:8109 │───┼──────────────────────┼──────────┐          │
│  └──────────────┘   │                      │          │          │
│         │           │                      │          ▼          │
│         ▼           │                      │  ┌──────────────┐   │
│  ┌──────────────┐   │                      │  │ Decode Nodes │   │
│  │ Prefill Nodes│   │                      │  │ GPU 0-7      │   │
│  │ GPU 0-7      │   │                      │  └──────────────┘   │
│  └──────────────┘   │                      │                     │
└─────────────────────┘                      └─────────────────────┘
```

---

## 2. Single-Machine PD Disaggregated Deployment

### 2.1 Test Scenarios and Parallelism Configuration

This chapter demonstrates the **P:TP4DP1|D:TP4DP1** configuration test scenario:

- **Tensor Parallelism (TP)**: 4 — the model parameters are sharded across a group of 4 GPUs
- **Data Parallelism (DP)**: 1 — a single data parallelism group serves all requests
- **Expert Parallelism (EP)**: Not enabled

**To test other parallelism configurations, adjust parameters as follows:**

1. **TP Adjustment**: Modify `--tensor-parallel-size`
2. **DP Adjustment**: Modify `--data-parallel-size`; when launching with `multi_api_server`, keep `--ports` and `--num-servers` consistent with DP
3. **EP Toggle**: Add or remove `--enable-expert-parallel`
4. **GPU Allocation**: Control GPUs used by Prefill and Decode instances via `CUDA_VISIBLE_DEVICES`

### 2.2 Startup Scripts

#### Start Router

```bash
python -m fastdeploy.router.launch \
    --port 8109 \
    --splitwise
```

Note: This uses the Python version of the router. If needed, you can also use the high-performance [Golang version router](../online_serving/router.md).
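Before launching the Prefill and Decode nodes, it can be useful to confirm that the router started above is actually listening on its port. Below is a minimal, framework-agnostic sketch using only the Python standard library; the `wait_for_port` helper is our own illustration, not part of FastDeploy:

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0) -> bool:
    """Poll a TCP port until it accepts connections; return False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means the listener (e.g. the router) is up.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)
    return False


# Example: block until the router launched above is reachable.
# wait_for_port("127.0.0.1", 8109)
```

The same check works for the Prefill/Decode API ports (8188, 8200, ...) once those nodes are started.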
#### Start Prefill Nodes

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/ERNIE-4.5-300B-A47B-Paddle \
    --port 8188 \
    --splitwise-role "prefill" \
    --cache-transfer-protocol "rdma,ipc" \
    --router "0.0.0.0:8109" \
    --quantization wint4 \
    --tensor-parallel-size 4 \
    --data-parallel-size 1 \
    --max-model-len 8192 \
    --max-num-seqs 64
```

#### Start Decode Nodes

```bash
export CUDA_VISIBLE_DEVICES=4,5,6,7
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/ERNIE-4.5-300B-A47B-Paddle \
    --port 8200 \
    --splitwise-role "decode" \
    --cache-transfer-protocol "rdma,ipc" \
    --router "0.0.0.0:8109" \
    --quantization wint4 \
    --tensor-parallel-size 4 \
    --data-parallel-size 1 \
    --max-model-len 8192 \
    --max-num-seqs 64
```

### 2.3 Key Parameter Descriptions

| Parameter | Description |
|-----------|-------------|
| `--splitwise` | Enable PD disaggregated mode (router flag) |
| `--splitwise-role` | Node role: `prefill` or `decode` |
| `--cache-transfer-protocol` | KV Cache transfer protocol: `rdma` or `ipc` |
| `--router` | Router service address |
| `--quantization` | Quantization strategy (wint4/wint8/fp8, etc.) |
| `--tensor-parallel-size` | Tensor parallelism degree (TP) |
| `--data-parallel-size` | Data parallelism degree (DP) |
| `--max-model-len` | Maximum sequence length |
| `--max-num-seqs` | Maximum concurrent sequences |
| `--num-gpu-blocks-override` | GPU KV Cache block count override |

---

## 3. Cross-Machine PD Disaggregated Deployment

### 3.1 Deployment Principles

Cross-machine PD disaggregation deploys Prefill and Decode instances on different physical machines:

- **Prefill Machine**: Runs the Router and Prefill nodes, responsible for prefill computation over input sequences
- **Decode Machine**: Runs Decode nodes, communicates with the Prefill machine via RDMA network, responsible for autoregressive decoding generation

### 3.2 Test Scenarios and Parallelism Configuration

This chapter demonstrates the **P:TP1DP8EP8|D:TP1DP8EP8** cross-machine configuration (16 GPUs total):

- **Tensor Parallelism (TP)**: 1
- **Data Parallelism (DP)**: 8 — 8 GPUs per machine, totaling 8 Prefill instances and 8 Decode instances
- **Expert Parallelism (EP)**: Enabled — MoE layer experts are distributed across 8 GPUs for parallel computation

**To test other cross-machine parallelism configurations, adjust parameters as follows:**

1. **Inter-Machine Communication**: Ensure RDMA network connectivity between machines; the Prefill machine needs the `KVCACHE_RDMA_NICS` environment variable configured
2. **Router Address**: The `--router` parameter on the Decode machine must point to the actual IP address of the Prefill machine
3. **Port Configuration**: The number of ports in the `--ports` list must match `--num-servers` and `--data-parallel-size`
4. **GPU Visibility**: Each machine specifies its local GPUs via `CUDA_VISIBLE_DEVICES`

### 3.3 Prefill Machine Startup Scripts

#### Start Router

```bash
unset http_proxy && unset https_proxy
python -m fastdeploy.router.launch \
    --port 8109 \
    --splitwise
```

#### Start Prefill Nodes

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m fastdeploy.entrypoints.openai.multi_api_server \
    --ports 8198,8199,8200,8201,8202,8203,8204,8205 \
    --num-servers 8 \
    --args --model /path/to/ERNIE-4.5-300B-A47B-Paddle \
    --splitwise-role "prefill" \
    --cache-transfer-protocol "rdma,ipc" \
    --router "<prefill_machine_ip>:8109" \
    --quantization wint4 \
    --tensor-parallel-size 1 \
    --data-parallel-size 8 \
    --enable-expert-parallel \
    --max-model-len 8192 \
    --max-num-seqs 64
```

### 3.4 Decode Machine Startup Scripts

#### Start Decode Nodes

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m fastdeploy.entrypoints.openai.multi_api_server \
    --ports 8198,8199,8200,8201,8202,8203,8204,8205 \
    --num-servers 8 \
    --args --model /path/to/ERNIE-4.5-300B-A47B-Paddle \
    --splitwise-role "decode" \
    --cache-transfer-protocol "rdma,ipc" \
    --router "<prefill_machine_ip>:8109" \
    --quantization wint4 \
    --tensor-parallel-size 1 \
    --data-parallel-size 8 \
    --enable-expert-parallel \
    --max-model-len 8192 \
    --max-num-seqs 64
```

**Note**: Please replace `<prefill_machine_ip>` with the actual IP address of the Prefill machine.

## 4. Sending Test Requests

```bash
curl -X POST "http://localhost:8109/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": "Hello, please introduce yourself."}
        ],
        "max_tokens": 100,
        "stream": false
    }'
```

## 5. Frequently Asked Questions (FAQ)

If you encounter issues during use, please refer to the [FAQ](./FAQ.md) for solutions.
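Since the router exposes an OpenAI-compatible `/v1/chat/completions` endpoint, the curl request in section 4 can also be issued from Python with only the standard library. This is a sketch; the helper names `build_chat_payload` and `chat_once` are our own, and the response parsing assumes the standard OpenAI chat-completion schema:

```python
import json
import urllib.request


def build_chat_payload(content: str, max_tokens: int = 100, stream: bool = False) -> dict:
    """Build the JSON body for the OpenAI-compatible chat completions endpoint."""
    return {
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
        "stream": stream,
    }


def chat_once(base_url: str, content: str) -> dict:
    """Send one non-streaming chat request through the router, return the parsed response."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(content)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Example (requires a running deployment):
# reply = chat_once("http://localhost:8109", "Hello, please introduce yourself.")
# print(reply["choices"][0]["message"]["content"])
```

Point `base_url` at the router (port 8109 in this guide), not at an individual Prefill or Decode node, so that requests are scheduled across both roles.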