FastDeploy/docs/features/data_parallel_service.md
mouxin 049c807d86 [Docs] Update the document (#6539)
Co-authored-by: mouxin <mouxin@baidu.com>
2026-02-27 19:21:10 +08:00

[简体中文](../zh/features/data_parallel_service.md)
# Data Parallelism (DP)
Data Parallelism (DP) is a distributed inference approach in which incoming requests are distributed across multiple **identical model replicas**, with each replica independently handling inference for its assigned requests.
In practice, especially when deploying **Mixture-of-Experts (MoE)** models, **Data Parallelism (DP)** is often combined with **Expert Parallelism (EP)**.
Each DP service independently performs the Attention computation, while all DP services collaboratively participate in the MoE computation, thereby improving overall inference performance.
FastDeploy supports DP-based inference and provides the `multi_api_server` interface to launch multiple inference services simultaneously.
![Architecture](./images/no_scheduler_img.png)
---
## Launching FastDeploy Services
Taking the **ERNIE-4.5-300B** model as an example, the following command launches a service with **DP=8, TP=1, EP=8**:
```shell
export FD_ENABLE_MULTI_API_SERVER=1
python -m fastdeploy.entrypoints.openai.multi_api_server \
--num-servers 8 \
--ports "1811,1822,1833,1844,1855,1866,1877,1888" \
--metrics-ports "3101,3201,3301,3401,3501,3601,3701,3801" \
--args --model ERNIE-4_5-300B-A47B-FP8-Paddle \
--engine-worker-queue-port "25611,25621,25631,25641,25651,25661,25671,25681" \
--tensor-parallel-size 1 \
--data-parallel-size 8 \
--max-model-len 12288 \
--max-num-seqs 64 \
--num-gpu-blocks-override 256 \
--enable-expert-parallel
```
### Parameter Description
* `num-servers`: Number of DP service instances to launch.
* `ports`: API server ports for the DP service instances. The number of ports must match `num-servers`.
* `metrics-ports`: Metrics server ports for the DP service instances. The number must match `num-servers`. If not specified, available ports will be allocated automatically.
* `args`: Arguments passed to each DP service instance. Refer to the [parameter documentation](../parameters.md) for details.
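Since the three port lists must each contain exactly `num-servers` entries, it can be convenient to generate them programmatically. A minimal sketch (a hypothetical helper, not part of FastDeploy):

```python
# Hypothetical helper (not part of FastDeploy) that builds the
# comma-separated port lists passed to multi_api_server.
def make_port_list(base: int, count: int, stride: int) -> str:
    """Return `count` ports starting at `base`, spaced `stride` apart."""
    return ",".join(str(base + i * stride) for i in range(count))

num_servers = 8
ports = make_port_list(1811, num_servers, 11)           # "1811,1822,...,1888"
metrics_ports = make_port_list(3101, num_servers, 100)  # "3101,3201,...,3801"
assert len(ports.split(",")) == num_servers  # must match --num-servers
```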
---
## Request Scheduling
After launching multiple DP services using the data parallel strategy, incoming user requests must be distributed across the services by a scheduler to achieve load balancing.
### Web Server-Based Scheduling
Once the IP addresses and ports of the DP service instances are known, common web servers (such as **Nginx**) can be used to implement request scheduling. Details are omitted here.
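Full web server configuration is beyond the scope of this document, but a minimal round-robin sketch for the eight instances launched above might look like the following (adapt the upstream addresses to your deployment):

```nginx
upstream fastdeploy_dp {
    # One entry per DP API server port from the launch command above
    server 127.0.0.1:1811;
    server 127.0.0.1:1822;
    # ... remaining instances up to 127.0.0.1:1888
}

server {
    listen 8000;
    location / {
        proxy_pass http://fastdeploy_dp;
    }
}
```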
### FastDeploy Router
FastDeploy provides a Python-based [Router](https://github.com/PaddlePaddle/FastDeploy/tree/develop/fastdeploy/router) to handle request reception and scheduling.
A high-performance Router implementation is currently under development.
The usage and request scheduling workflow is as follows:
* Start the Router
* Start FastDeploy service instances (either single-DP or multi-DP), which register themselves with the Router
* User requests are sent to the Router
* The Router selects an appropriate service instance based on the global load status
* The Router forwards the request to the selected instance for inference
* The Router receives the generated result from the instance and returns it to the user
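The scheduling step in the workflow above can be illustrated with a small sketch: pick the registered instance with the fewest in-flight requests. This is illustrative only; the actual Router may use a different load-balancing policy.

```python
# Illustrative least-loaded selection; not the FastDeploy Router's
# actual implementation.
def pick_instance(load_by_instance: dict[str, int]) -> str:
    """Return the address of the instance with the fewest in-flight requests."""
    return min(load_by_instance, key=load_by_instance.get)

loads = {"127.0.0.1:1811": 3, "127.0.0.1:1822": 1, "127.0.0.1:1833": 2}
target = pick_instance(loads)  # "127.0.0.1:1822"
```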
---
## Quick Start Example
### Launching the Router
Start the Router service. Logs are written to `log_router/router.log`. `fd-router` installation instructions can be found in the [Router documentation](../online_serving/router.md).
```shell
export FD_LOG_DIR="log_router"
/usr/local/bin/fd-router \
--port 30000
```
### Launching DP Services with Router
Again using the **ERNIE-4.5-300B** model as an example, the following command launches **DP=8, TP=1, EP=8** services and registers them with the Router via the `--router` argument:
```shell
export FD_ENABLE_MULTI_API_SERVER=1
python -m fastdeploy.entrypoints.openai.multi_api_server \
--num-servers 8 \
--ports "1811,1822,1833,1844,1855,1866,1877,1888" \
--metrics-ports "3101,3201,3301,3401,3501,3601,3701,3801" \
--args --model ERNIE-4_5-300B-A47B-FP8-Paddle \
--tensor-parallel-size 1 \
--data-parallel-size 8 \
--max-model-len 12288 \
--max-num-seqs 64 \
--num-gpu-blocks-override 256 \
--enable-expert-parallel \
--router "0.0.0.0:30000"
```
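Once both the Router and the DP services are running, user requests go to the Router port (30000 here) rather than to any individual instance. For example, assuming the standard OpenAI-compatible chat completions endpoint:

```shell
curl -X POST http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ERNIE-4_5-300B-A47B-FP8-Paddle",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```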