[PD Disaggregation] Update usage of pd disaggregation and data parallel (#5742)

* Update usage of pd disaggregation

* up dp docs

* fix unittest
This commit is contained in:
jc
2026-01-05 17:51:29 +08:00
committed by GitHub
parent 690d4bcdb0
commit 8d384f9fd8
15 changed files with 441 additions and 385 deletions
+67 -119
@@ -1,78 +1,27 @@
[简体中文](../zh/features/data_parallel_service.md)
# Data Parallelism (DP)
Data Parallelism (DP) is a distributed inference approach in which incoming requests are distributed across multiple **identical model replicas**, with each replica independently handling inference for its assigned requests.
In practice, especially when deploying **Mixture-of-Experts (MoE)** models, DP is often combined with **Expert Parallelism (EP)**: EP distributes the expert workloads, while DP enables parallel request processing. Each DP service independently performs the Attention computation, while all DP services collaboratively participate in the MoE computation, thereby improving overall inference performance.
## Data Distribution Strategy
FastDeploy provides the splitwise scheduler, which monitors the load status of each DP instance and distributes incoming requests accordingly. The splitwise scheduler relies on Redis to store the DP load status.
### Expert Parallelism + Hybrid Deployment
The scheduling flow is shown below: users send requests to a random IP and port, load status is obtained via Redis, and requests are distributed to less-loaded DP instances for inference.
![Scheduling Architecture](./images/scheduler_img.png)
FastDeploy supports DP-based inference and provides the `multi_api_server` interface to launch multiple inference services simultaneously.
#### Offline Inference
```python
from fastdeploy import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "你好,请问今天是星期",
    "请写6个以数字开头的成语",
    "写一个300字的小说大纲,内容是李白穿越到现代,最后成为公司文职人员的故事",
    "我要采访一位科幻作家,创建一个包含5个问题的列表"
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
llm = LLM(
    model="ERNIE-4_5-300B-A47B-FP8-Paddle",
    tensor_parallel_size=1,
    data_parallel_size=8,
    max_model_len=8192,
    num_gpu_blocks_override=1024,
    engine_worker_queue_port="6077,6078,6079,6080,6081,6082,6083,6084",
    enable_expert_parallel=True,
    scheduler_name="splitwise",
    scheduler_host="127.0.0.1",
    scheduler_topic="test",
    scheduler_port=6379,
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs.text
    print("generated_text: ", generated_text)
    print("\n")
```
#### Online Inference
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4_5-300B-A47B-FP8-Paddle \
--port 8184 --metrics-port 8185 \
--engine-worker-queue-port "6077,6078,6079,6080,6081,6082,6083,6084" \
--data-parallel-size 8 --tensor-parallel-size 1 \
--enable-expert-parallel \
--scheduler-name "splitwise" \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-topic "test" \
--scheduler-ttl 9000
```
### User-Managed Scheduling
FastDeploy provides `multi_api_server`, which lets users launch multiple API servers and select a DP instance for each request themselves; in this case users can add their own load-balancing layer for scheduling. (Currently this mode only supports online inference.)
#### Online Inference
![Scheduling Architecture](./images/no_scheduler_img.png)
Taking the **ERNIE-4.5-300B** model as an example, the following command launches a service with **DP=8, TP=1, EP=8**:
```shell
export FD_ENABLE_MULTI_API_SERVER=1
python -m fastdeploy.entrypoints.openai.multi_api_server \
--num-servers 8 \
--ports "1811,1822,1833,1844,1855,1866,1877,1888" \
--metrics-ports "3101,3201,3301,3401,3501,3601,3701,3801" \
--args --model ERNIE-4_5-300B-A47B-FP8-Paddle \
--engine-worker-queue-port "25611,25621,25631,25641,25651,25661,25671,25681" \
@@ -85,68 +34,67 @@ python -m fastdeploy.entrypoints.openai.multi_api_server \
```
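The comma-separated `--ports` and `--metrics-ports` lists above must each contain exactly `--num-servers` entries. A minimal sketch of that consistency check (illustrative Python, not FastDeploy code):

```python
# Illustrative helper: verify that comma-separated port lists passed to
# multi_api_server line up with --num-servers and contain no duplicates.
def check_port_lists(num_servers: int, **port_lists: str) -> None:
    for name, csv in port_lists.items():
        ports = [int(p) for p in csv.split(",")]
        if len(ports) != num_servers:
            raise ValueError(f"{name}: expected {num_servers} ports, got {len(ports)}")
        if len(set(ports)) != len(ports):
            raise ValueError(f"{name}: duplicate ports")

check_port_lists(
    8,
    ports="1811,1822,1833,1844,1855,1866,1877,1888",
    metrics_ports="3101,3201,3301,3401,3501,3601,3701,3801",
)
print("port lists are consistent")
```
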
### Parameter Description
* `num-servers`: Number of DP service instances to launch.
* `ports`: API server ports for the DP service instances. The number of ports must match `num-servers`.
* `metrics-ports`: Metrics server ports for the DP service instances. The number must match `num-servers`. If not specified, available ports will be allocated automatically.
* `args`: Arguments passed to each DP service instance. Refer to the [parameter documentation](../parameters.md) for details.
### Data Parallelism + Disaggregated Deployment
Refer to [Disaggregated Deployment](disaggregated.md#multi-machine-disaggregated-deployment).
#### Online Inference
For multi-machine deployment, ensure the network cards support RDMA and that all cluster nodes are interconnected.
**Note**:
- `KVCACHE_RDMA_NICS` specifies the RDMA network cards of the current machine; separate multiple cards with commas.
- The repository provides an automatic RDMA NIC detection script: `bash scripts/get_rdma_nics.sh <device>`, where `<device>` can be `cpu` or `gpu`.
**Prefill Instance**
```bash
export FD_LOG_DIR="log_prefill"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
echo "set RDMA NICS"
export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4_5-300B-A47B-FP8-Paddle \
--port 8180 --metrics-port 8181 \
--engine-worker-queue-port "25611,25621,25631,25641,25651,25661,25671,25681" \
--cache-queue-port 8183 \
--tensor-parallel-size 1 \
--data-parallel-size 4 \
--enable-expert-parallel \
--cache-transfer-protocol "rdma,ipc" \
--rdma-comm-ports "7671,7672,7673,7674,7675,7676,7677,7678" \
--pd-comm-port "2334" \
--splitwise-role "prefill" \
--scheduler-name "splitwise" \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-topic "test" \
--scheduler-ttl 9000
```
## Request Scheduling
After launching multiple DP services using the data parallel strategy, incoming user requests must be distributed across the services by a scheduler to achieve load balancing.
### Web Server-Based Scheduling
Once the IP addresses and ports of the DP service instances are known, common web servers (such as **Nginx**) can be used to implement request scheduling. Details are omitted here.
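For the web-server route, a minimal Nginx sketch could look like the following. This is a hypothetical configuration, not something shipped with FastDeploy: the instance ports are taken from the `multi_api_server` example above, and `least_conn` is just one possible balancing policy.

```nginx
upstream fastdeploy_dp {
    least_conn;                 # send each request to the least-loaded instance
    server 127.0.0.1:1811;
    server 127.0.0.1:1822;
    server 127.0.0.1:1833;
    server 127.0.0.1:1844;
    server 127.0.0.1:1855;
    server 127.0.0.1:1866;
    server 127.0.0.1:1877;
    server 127.0.0.1:1888;
}

server {
    listen 8080;
    location / {
        proxy_pass http://fastdeploy_dp;  # forward to one DP instance
    }
}
```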
### FastDeploy Router
FastDeploy provides a Python-based [Router](https://github.com/PaddlePaddle/FastDeploy/tree/develop/fastdeploy/router) to handle request reception and scheduling.
A high-performance Router implementation is currently under development.
The usage and request-scheduling workflow are as follows:
* Start the Router
* Start FastDeploy service instances (either single-DP or multi-DP), which register themselves with the Router
* User requests are sent to the Router
* The Router selects an appropriate service instance based on the global load status
* The Router forwards the request to the selected instance for inference
* The Router receives the generated result from the instance and returns it to the user
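The selection step in the workflow above can be sketched as a least-loaded choice over registered instances. This is a toy model of the behavior described here, not the actual FastDeploy Router code:

```python
# Toy sketch of a router's selection step: pick the registered instance with
# the fewest in-flight requests and forward the request to it.
from dataclasses import dataclass

@dataclass
class Instance:
    address: str       # e.g. "127.0.0.1:1811"
    inflight: int = 0  # requests currently being processed

def select_instance(instances: list[Instance]) -> Instance:
    # Global load status: choose the least-loaded replica.
    return min(instances, key=lambda inst: inst.inflight)

instances = [Instance("127.0.0.1:1811", 3),
             Instance("127.0.0.1:1822", 1),
             Instance("127.0.0.1:1833", 2)]
chosen = select_instance(instances)
chosen.inflight += 1       # the router forwards the request here
print(chosen.address)      # → 127.0.0.1:1822
```
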
---
## Quick Start Example
### Launching the Router
Start the Router service. Logs are written to `log_router/router.log`.
```shell
export FD_LOG_DIR="log_router"
python -m fastdeploy.router.launch \
--host 0.0.0.0 \
--port 30000
```
**Decode Instance**
```bash
export FD_LOG_DIR="log_decode"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
echo "set RDMA NICS"
export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4_5-300B-A47B-FP8-Paddle \
--port 8184 --metrics-port 8185 \
--engine-worker-queue-port "25611,25621,25631,25641,25651,25661,25671,25681" \
--cache-queue-port 8187 \
--tensor-parallel-size 1 \
--data-parallel-size 4 \
--enable-expert-parallel \
--scheduler-name "splitwise" \
--cache-transfer-protocol "rdma,ipc" \
--rdma-comm-ports "7671,7672,7673,7674,7675,7676,7677,7678" \
--pd-comm-port "2334" \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-ttl 9000 \
--scheduler-topic "test" \
--splitwise-role "decode"
```
### Launching DP Services with Router
Again using the **ERNIE-4.5-300B** model as an example, the following command launches **DP=8, TP=1, EP=8** services and registers them with the Router via the `--router` argument:
```shell
export FD_ENABLE_MULTI_API_SERVER=1
python -m fastdeploy.entrypoints.openai.multi_api_server \
--num-servers 8 \
--ports "1811,1822,1833,1844,1855,1866,1877,1888" \
--metrics-ports "3101,3201,3301,3401,3501,3601,3701,3801" \
--args --model ERNIE-4_5-300B-A47B-FP8-Paddle \
--tensor-parallel-size 1 \
--data-parallel-size 8 \
--max-model-len 12288 \
--max-num-seqs 64 \
--num-gpu-blocks-override 256 \
--enable-expert-parallel \
--router "0.0.0.0:30000"
```
+167 -49
@@ -2,78 +2,195 @@
# Disaggregated Deployment
Large Language Model (LLM) inference consists of two phases: **Prefill** and **Decode**, which are compute-intensive and memory-bound, respectively.
* **Prefill Phase:** Processes all input tokens (such as the user's prompt), completes the model's forward pass, and generates the first token.
* **Decode Phase:** Starting from the first generated token and the cached KV Cache, autoregressively generates one token at a time until a stop token is reached. For a total output of N tokens, the Decode phase requires (N-1) forward passes, which must be executed serially; as generation proceeds, the number of tokens to attend to grows, and so does the computation.
Disaggregated deployment places Prefill and Decode on distinct computing resources, each with its optimal configuration. In certain scenarios this improves hardware utilization, increases throughput, and reduces end-to-end latency. To realize it, communication between Prefill and Decode must be considered: during inference, Prefill transmits the computed KV Cache to the Decode instance, which reads the KV Cache to continue generation.
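The (N-1) decode-pass count above can be sanity-checked with a toy autoregressive loop. The loop below is purely illustrative; token generation is a stand-in for the real model forward pass:

```python
# Toy autoregressive generation loop illustrating why producing N output
# tokens costs 1 prefill pass plus (N-1) serial decode passes.
def generate(prompt_tokens, n_output_tokens):
    passes = {"prefill": 0, "decode": 0}
    # Prefill: one forward pass over the whole prompt yields the first token.
    passes["prefill"] += 1
    generated = ["<tok0>"]
    # Decode: each further token needs one forward pass over the growing
    # context (prompt + tokens generated so far), executed serially.
    while len(generated) < n_output_tokens:
        passes["decode"] += 1
        generated.append(f"<tok{len(generated)}>")
    return generated, passes

tokens, passes = generate(["hello"], n_output_tokens=8)
print(passes)  # → {'prefill': 1, 'decode': 7}
```
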
<p align="center">
  <img src="../zh/features/images/mix_pd.png" width="50%">
</p>
Compared to mixed deployment, the core implementation differences of disaggregated deployment lie in **KV Cache transmission** and **request scheduling**.
## KV Cache Transmission
In disaggregated deployment, the KV Cache generated for a request on the Prefill instance must be transmitted to the Decode instance. FastDeploy provides two transmission methods, targeting intra-node and inter-node scenarios respectively.
**Intra-node transmission:** Uses `cudaMemcpyPeer` to transfer the KV Cache between two GPUs within a single node, offering low latency and high throughput.
**Inter-node transmission:** Uses a self-developed [RDMA transmission library](https://github.com/PaddlePaddle/FastDeploy/tree/develop/fastdeploy/cache_manager/transfer_factory/kvcache_transfer) to transfer the KV Cache between nodes over a high-speed RDMA network.
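As a toy illustration of the two transmission paths, a selection rule consistent with the description above might look like this. The function is hypothetical, not a FastDeploy API:

```python
# Hypothetical sketch of choosing a KV Cache transfer protocol, mirroring the
# documented behavior: "ipc" within a node, "rdma" across nodes.
def choose_protocol(prefill_host: str, decode_host: str,
                    supported=("ipc", "rdma")) -> str:
    if prefill_host == decode_host and "ipc" in supported:
        return "ipc"   # same node: GPU-to-GPU copy via cudaMemcpyPeer
    return "rdma"      # different nodes: RDMA network transfer

print(choose_protocol("10.0.0.1", "10.0.0.1"))  # → ipc
print(choose_protocol("10.0.0.1", "10.0.0.2"))  # → rdma
```
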
## PD Disaggregated Request Scheduling
![Splitwise Scheduler](./images/disaggregated.png)
In multi-instance scenarios, each incoming request must be assigned to Prefill and Decode instances according to a scheduling strategy. Through role separation (Prefill nodes handle request reception and processing; Decode nodes complete subsequent generation), resource allocation can be controlled at a finer granularity, improving throughput and GPU utilization.
For PD (Prefill-Decode) disaggregated deployment, FastDeploy provides a Python-based [Router](https://github.com/PaddlePaddle/FastDeploy/tree/develop/fastdeploy/router) to implement request reception and scheduling. The usage and scheduling flow are as follows:
* Start the Router.
* Start PD instances, the PD instances will register with the Router.
* User requests are sent to the Router.
* The Router selects a suitable PD instance pair based on the load conditions of the PD instances.
* The Router forwards the request to the selected PD instance.
* The Router receives the generation results from the PD instance and returns them to the user.
A high-performance version of the Router is currently under development. Stay tuned.
## Usage Instructions
### Router-based Disaggregated Deployment
#### Environment Preparation
Please refer to the [documentation](https://github.com/PaddlePaddle/FastDeploy/tree/develop/docs/zh/get_started/installation) to prepare the environment. Using Docker is recommended.
If you are setting up the runtime environment manually, ensure that the RDMA dependency packages (`librdmacm-dev`, `libibverbs-dev`, `iproute2`) and the [MLNX_OFED](https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/) driver are installed.
```bash
apt update --fix-missing
apt-get install -y librdmacm-dev libibverbs-dev iproute2
# Download and install MLNX_OFED
./mlnxofedinstall --user-space-only --skip-distro-check --without-fw-update --force --without-ucx-cuda
```
Pull the latest FastDeploy code, build, and install.
```bash
git clone https://github.com/PaddlePaddle/FastDeploy
cd FastDeploy
bash build.sh
```
#### Deploy Services
**Quick Start**
Start the Router service. The `--splitwise` parameter specifies the scheduling mode as disaggregated deployment. Log information is output to `log_router/router.log`.
```bash
export FD_LOG_DIR="log_router"
python -m fastdeploy.router.launch \
--host 0.0.0.0 \
--port 30000 \
--splitwise
```
Start the Prefill instance. Compared to single-node deployment, add the `--splitwise-role` parameter to specify the instance role as Prefill, and the `--router` parameter to specify the Router interface. Other parameters remain the same as mixed deployment.
```bash
export CUDA_VISIBLE_DEVICES=0
export FD_LOG_DIR="log_prefill"
python -m fastdeploy.entrypoints.openai.api_server \
--model "PaddlePaddle/ERNIE-4.5-0.3B-Paddle" \
--port 31000 \
--splitwise-role prefill \
--router "0.0.0.0:30000"
```
Start the Decode instance.
```bash
export CUDA_VISIBLE_DEVICES=1
export FD_LOG_DIR="log_decode"
python -m fastdeploy.entrypoints.openai.api_server \
--model "PaddlePaddle/ERNIE-4.5-0.3B-Paddle" \
--port 32000 \
--splitwise-role decode \
--router "0.0.0.0:30000"
```
After the Prefill and Decode instances are successfully started and registered with the Router, you can send requests.
```bash
curl -X POST "http://0.0.0.0:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "hello"}
],
"max_tokens": 100,
"stream": false
}'
```
**Detailed Description**
Parameter description for starting Prefill/Decode instances in disaggregated deployment:
* `--splitwise-role`: Specifies the instance role. Options are `prefill`, `decode`, and `mixed`. Default is `mixed`.
* `--cache-transfer-protocol`: Specifies the KV Cache transfer protocol. Options are `rdma` and `ipc`; the default is `rdma,ipc`. If the PD instances are on the same machine, both protocols are supported and `ipc` is preferred; across machines, only `rdma` is supported.
* `--rdma-comm-ports`: Specifies RDMA communication ports, separated by commas. The number of ports must equal `dp_size * tp_size`. If unspecified, FD will internally find free ports.
* `--pd-comm-port`: Specifies the communication ports between PD instances, separated by commas. The number of ports must equal `dp_size`. If unspecified, FD will internally find free ports.
* `--router`: Specifies the Router interface.
If the Prefill and Decode instances are deployed on different machines, RDMA network connectivity between the machines must be ensured.
To manually specify RDMA network interfaces, you can set the `KVCACHE_RDMA_NICS` environment variable. Multiple NICs should be separated by commas. FastDeploy provides a script to detect RDMA NICs automatically:
`bash FastDeploy/scripts/get_rdma_nics.sh <device>`, where `<device>` can be either `cpu` or `gpu`.
If the `KVCACHE_RDMA_NICS` environment variable is not set, FastDeploy will automatically detect available RDMA NICs internally.
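The port-count rules for `--rdma-comm-ports` and `--pd-comm-port` can be summarized in a small helper. This is illustrative only, not a FastDeploy API:

```python
# Illustrative helper computing how many ports --rdma-comm-ports and
# --pd-comm-port require for a given parallel configuration, per the
# rules above: dp_size * tp_size RDMA ports, dp_size PD ports.
def required_port_counts(dp_size: int, tp_size: int) -> dict:
    return {
        "rdma_comm_ports": dp_size * tp_size,  # one per worker
        "pd_comm_port": dp_size,               # one per DP rank
    }

counts = required_port_counts(dp_size=4, tp_size=2)
print(counts)  # → {'rdma_comm_ports': 8, 'pd_comm_port': 4}
```
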
**Examples**
PD disaggregated deployment supports features such as prefix caching, Tensor Parallelism (TP), and Data Parallelism (DP). For specific examples, please refer to [examples/splitwise](https://github.com/PaddlePaddle/FastDeploy/tree/develop/examples/splitwise).
### SplitwiseScheduler-based Disaggregated Deployment
**Note: Using SplitwiseScheduler is not recommended. It is recommended to use the Router for request scheduling.**
#### Environment Preparation
> **⚠️ Note**
> **Redis version requirement: 6.2.0 or higher.**
> Lower versions may not support the required commands.
* Install using `conda`
```bash
# Install
conda install redis
# Start
nohup redis-server > redis.log 2>&1 &
```
* Install using `apt`
```bash
# Install
sudo apt install redis-server -y
# Start
sudo systemctl start redis-server
```
* Install using `yum`
```bash
# Install
sudo yum install redis -y
# Start
sudo systemctl start redis
```
#### Deploy Services
For multi-node deployment, ensure that the network interface cards support RDMA and that all nodes in the cluster have network connectivity.
**Note**:
* `KVCACHE_RDMA_NICS` specifies the RDMA NICs of the current machine; separate multiple NICs with commas.
* The repository provides a script to automatically detect RDMA NICs: `bash scripts/get_rdma_nics.sh <device>`, where `<device>` can be `cpu` or `gpu`.
**Prefill Instance**
```bash
export FD_LOG_DIR="log_prefill"
export CUDA_VISIBLE_DEVICES=0,1,2,3
export ENABLE_V1_KVCACHE_SCHEDULER=0
@@ -82,8 +199,7 @@ export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4.5-300B-A47B-BF16 \
--port 8180 --metrics-port 8181 \
--engine-worker-queue-port 8182 \
--cache-queue-port 8183 \
--tensor-parallel-size 4 \
@@ -95,10 +211,12 @@ python -m fastdeploy.entrypoints.openai.api_server \
--scheduler-name "splitwise" \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-topic "test" \
--scheduler-ttl 9000
```
**Decode Instance**
```bash
export FD_LOG_DIR="log_decode"
@@ -109,8 +227,7 @@ export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4.5-300B-A47B-BF16 \
--port 8184 --metrics-port 8185 \
--engine-worker-queue-port 8186 \
--cache-queue-port 8187 \
--tensor-parallel-size 4 \
@@ -122,22 +239,23 @@ python -m fastdeploy.entrypoints.openai.api_server \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-ttl 9000 \
--scheduler-topic "test" \
--splitwise-role "decode"
```
### Parameter Description
* `--splitwise-role`: Specifies whether the current service is prefill or decode.
* `--cache-queue-port`: Specifies the cache service port used for communication between prefill and decode services.
#### Single-machine Parameters
* `--inner-prefill-ports`: Required only for the Decode instance; specifies the list of ports of the prefill instances to connect to.
#### Multi-machine Parameters
* `--cache-transfer-protocol`: Specifies the KV Cache transfer protocol; supports `ipc` and `rdma`. Defaults to `ipc`.
* `--scheduler-name`: Set to `splitwise` for PD disaggregation.
* `--scheduler-host`: The Redis address to connect to.
* `--scheduler-port`: The Redis port to connect to.
* `--scheduler-ttl`: Specifies the Redis TTL (Time To Live) in seconds.
* `--scheduler-topic`: Specifies the Redis topic.
* `--pd-comm-port`: Specifies the PD communication port.
* `--rdma-comm-ports`: Specifies the RDMA communication ports, separated by commas; the count must match the number of GPUs.
+48 -129
@@ -1,83 +1,22 @@
[English](../../features/data_parallel_service.md)
# Data Parallelism (DP)
Data Parallelism (DP) is a distributed inference approach in which different requests are distributed across multiple identical model replicas, with each replica completing inference for its assigned requests.
When deploying MoE models, Data Parallelism (DP) is usually combined with Expert Parallelism (EP): each DP service independently performs the Attention part of inference, while all DP services jointly perform the MoE part, improving overall inference performance.
## Data Distribution Strategy
FastDeploy provides the splitwise scheduler, which monitors the load status of each DP instance and schedules incoming requests. The splitwise scheduler relies on Redis to store the load status of each DP instance.
### Expert Parallelism + Hybrid Deployment
The scheduling flow is shown below: users send requests to a random IP and port, load status is obtained via Redis, and requests are distributed to less-loaded DP instances for inference.
![Scheduling Architecture](./images/scheduler_img.png)
FastDeploy supports DP-based inference and provides the `multi_api_server` interface to launch multiple inference services at once.
#### Offline Inference
```python
from fastdeploy import LLM, SamplingParams

prompts = [
"Hello, my name is",
"你好,请问今天是星期",
"请写6个以数字开头的成语",
"写一个300字的小说大纲,内容是李白穿越到现代,最后成为公司文职人员的故事",
"我要采访一位科幻作家,创建一个包含5个问题的列表"
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
llm = LLM(
model="ERNIE-4_5-300B-A47B-FP8-Paddle",
tensor_parallel_size=1,
data_parallel_size=8,
max_model_len=8192,
num_gpu_blocks_override=1024,
engine_worker_queue_port="6077,6078,6079,6080,6081,6082,6083,6084",
enable_expert_parallel=True,
scheduler_name="splitwise",
scheduler_host="127.0.0.1",
scheduler_topic="test",
scheduler_port=6379
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs.text
    print("generated_text: ", generated_text)
    print("\n")
```
#### Online Inference
```shell
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4_5-300B-A47B-FP8-Paddle \
--port 8184 --metrics-port 8185 \
--engine-worker-queue-port "6077,6078,6079,6080,6081,6082,6083,6084" \
--data-parallel-size 8 --tensor-parallel-size 1 \
--enable-expert-parallel \
--scheduler-name "splitwise" \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-topic "test" \
--scheduler-ttl 9000
```
### User-Managed Scheduling
FastDeploy provides `multi_api_server`, which lets users launch multiple API servers and choose DP instances for their requests themselves; in this mode users can add their own load-balancing layer for scheduling. (Currently this mode only supports online inference.)
#### Online Inference
![Scheduling Architecture](./images/no_scheduler_img.png)
Taking the ERNIE-4.5-300B model as an example, launch a service with DP=8, TP=1, EP=8:
```shell
export FD_ENABLE_MULTI_API_SERVER=1
python -m fastdeploy.entrypoints.openai.multi_api_server \
--num-servers 8 \
--ports "1811,1822,1833,1844,1855,1866,1877,1888" \
--metrics-ports "3101,3201,3301,3401,3501,3601,3701,3801" \
--args --model ERNIE-4_5-300B-A47B-FP8-Paddle \
--engine-worker-queue-port "25611,25621,25631,25641,25651,25661,25671,25681" \
@@ -89,74 +28,54 @@ python -m fastdeploy.entrypoints.openai.multi_api_server \
--enable-expert-parallel
```
### Parameter Description
- num-servers: Number of DP service instances to launch.
- ports: API server ports of the DP service instances; the count must match num-servers.
- metrics-ports: Metrics server ports of the DP service instances; the count must match num-servers. If empty, available ports are allocated internally.
- args: Arguments for the DP service instances; see the [documentation](../parameters.md). Any ports (other than `ports`) that are not set manually are assigned automatically from available ports.
### Data Parallelism + Disaggregated Deployment
See [Disaggregated Deployment](disaggregated.md#多机分离式部署) for details.
#### Online Inference
For multi-machine deployment, confirm that the NICs support RDMA and that all nodes in the cluster have network connectivity.
**Note**:
- `KVCACHE_RDMA_NICS` specifies the RDMA NICs of the current machine; separate multiple NICs with commas.
- The repository provides a script to automatically detect RDMA NICs: `bash scripts/get_rdma_nics.sh <device>`, where `<device>` can be `cpu` or `gpu`.
**Prefill Instance**
```bash
export FD_LOG_DIR="log_prefill"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
echo "set RDMA NICS"
export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4_5-300B-A47B-FP8-Paddle \
--port 8180 --metrics-port 8181 \
--engine-worker-queue-port "25611,25621,25631,25641,25651,25661,25671,25681" \
--cache-queue-port 8183 \
--tensor-parallel-size 1 \
--data-parallel-size 4 \
--enable-expert-parallel \
--cache-transfer-protocol "rdma,ipc" \
--rdma-comm-ports "7671,7672,7673,7674,7675,7676,7677,7678" \
--pd-comm-port "2334" \
--splitwise-role "prefill" \
--scheduler-name "splitwise" \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-topic "test" \
--scheduler-ttl 9000
```
**Decode Instance**
```bash
export FD_LOG_DIR="log_decode"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
echo "set RDMA NICS"
export $(bash scripts/get_rdma_nics.sh gpu)
echo "KVCACHE_RDMA_NICS ${KVCACHE_RDMA_NICS}"
python -m fastdeploy.entrypoints.openai.api_server \
--model ERNIE-4_5-300B-A47B-FP8-Paddle \
--port 8184 --metrics-port 8185 \
--engine-worker-queue-port "25611,25621,25631,25641,25651,25661,25671,25681" \
--cache-queue-port 8187 \
--tensor-parallel-size 1 \
--data-parallel-size 4 \
--enable-expert-parallel \
--cache-transfer-protocol "rdma,ipc" \
--rdma-comm-ports "7671,7672,7673,7674,7675,7676,7677,7678" \
--pd-comm-port "2334" \
--splitwise-role "decode" \
--scheduler-name "splitwise" \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-topic "test" \
--scheduler-ttl 9000
```
## Request Scheduling
After launching multiple DP services with the data parallel strategy, user requests must be distributed across the services by a scheduler to achieve load balancing.
### Web Server
Once the IPs and ports of the DP service instances are known, a common web server (such as Nginx) can be used to implement request scheduling; details are omitted here.
### FastDeploy Router
FastDeploy provides a Python-based [Router](https://github.com/PaddlePaddle/FastDeploy/tree/develop/fastdeploy/router) to implement request reception and scheduling. A high-performance Router is under development; stay tuned.
The usage and scheduling flow are as follows:
- Start the Router.
- Start FastDeploy service instances (single-DP or multi-DP services), which register with the Router.
- User requests are sent to the Router.
- The Router selects a suitable instance for each request based on the load of all instances.
- The Router forwards the request to the selected instance for inference.
- The Router receives the generated result from the instance and returns it to the user.
Quick start example:
- Start the Router service. Logs are written to `log_router/router.log`.
```
export FD_LOG_DIR="log_router"
python -m fastdeploy.router.launch \
--host 0.0.0.0 \
--port 30000
```
- Again using the ERNIE-4.5-300B model as an example, launch DP=8, TP=1, EP=8 services and register them with the Router via `--router`:
```shell
export FD_ENABLE_MULTI_API_SERVER=1
python -m fastdeploy.entrypoints.openai.multi_api_server \
--num-servers 8 \
--ports "1811,1822,1833,1844,1855,1866,1877,1888" \
--metrics-ports "3101,3201,3301,3401,3501,3601,3701,3801" \
--args --model ERNIE-4_5-300B-A47B-FP8-Paddle \
--tensor-parallel-size 1 \
--data-parallel-size 8 \
--max-model-len 12288 \
--max-num-seqs 64 \
--num-gpu-blocks-override 256 \
--enable-expert-parallel \
--router "0.0.0.0:30000"
```
+123 -25
@@ -2,36 +2,137 @@
# Disaggregated Deployment
LLM inference consists of two phases, Prefill and Decode, which are compute-intensive and memory-bound respectively.
* Prefill phase: processes all input tokens (such as the user's prompt), completes the model's forward computation, and generates the first token.
* Decode phase: starting from the first generated token and the cached KV Cache, autoregressively generates one token at a time until a stop token is produced. Assuming a total output of N tokens, the Decode phase must execute (N-1) forward computations, which can only run serially; as generation proceeds, more and more tokens must be attended to, so the computation gradually grows.
Disaggregated deployment places Prefill and Decode on different computing resources, each using its optimal configuration, which can improve hardware utilization, increase throughput, and reduce end-to-end latency.
<p align="center">
  <img src="images/mix_pd.png" width="50%">
</p>
Compared with mixed deployment, the core implementation differences of disaggregated deployment are KV Cache transmission and request scheduling.
## KV Cache Transmission
In disaggregated deployment, the KV Cache generated for a request on the Prefill instance must be transmitted to the Decode instance. FastDeploy provides two transmission methods, for the intra-node and inter-node scenarios respectively.
Intra-node transmission: uses cudaMemcpyPeer to transfer the KV Cache between two GPUs within a single node, with low latency and high throughput.
Inter-node transmission: uses a self-developed [RDMA transmission library](https://github.com/PaddlePaddle/FastDeploy/tree/develop/fastdeploy/cache_manager/transfer_factory/kvcache_transfer) to transfer the KV Cache between nodes over a high-speed RDMA network.
## PD Disaggregated Request Scheduling
![Splitwise Scheduler](images/disaggregated.png)
In multi-instance scenarios, each received request must be assigned to Prefill and Decode instances according to a scheduling strategy. Through role separation (prefill nodes receive and process requests; decode nodes complete subsequent generation), resource allocation can be controlled at a finer granularity, improving throughput and GPU utilization.
For PD disaggregated deployment, FastDeploy provides a Python-based [Router](https://github.com/PaddlePaddle/FastDeploy/tree/develop/fastdeploy/router) to implement request reception and scheduling. The usage and scheduling flow are as follows:
* Start the Router.
* Start the PD instances; they register with the Router.
* User requests are sent to the Router.
* The Router selects a suitable PD instance pair for each request based on the load of the PD instances.
* The Router forwards the request to the selected PD instances.
* After receiving the request, the P instance requests Cache Blocks from the D instance.
* The P instance runs inference to generate the first token while transferring the Cache to the D instance layer by layer; when done, it sends the first token to the Router and the D instance.
* After receiving the request and the first token, the D instance continues generating subsequent tokens and sends them to the Router.
* The Router receives the generated results from the PD instances and returns them to the user.
A high-performance Router is under development; stay tuned.
## Usage Instructions
### Router-Based Multi-Machine Disaggregated Deployment
#### Environment Preparation
Refer to the [documentation](https://github.com/PaddlePaddle/FastDeploy/tree/develop/docs/zh/get_started/installation) to prepare the environment; using Docker is recommended.
If you set up the runtime environment yourself, make sure the RDMA dependency packages (librdmacm-dev, libibverbs-dev, iproute2) and the [MLNX_OFED](https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/) driver are installed.
```
apt update --fix-missing
apt-get install -y librdmacm-dev libibverbs-dev iproute2
# Download and install MLNX_OFED
./mlnxofedinstall --user-space-only --skip-distro-check --without-fw-update --force --without-ucx-cuda
```
Pull the latest FastDeploy code, then build and install it (the latest 2.3 and 2.4 releases do not yet include the newest disaggregated deployment features).
```
git clone https://github.com/PaddlePaddle/FastDeploy
cd FastDeploy
bash build.sh
```
#### Deploy Services
**Quick Start**
Start the Router service. The `--splitwise` flag selects the disaggregated scheduling mode. Logs are written to `log_router/router.log`.
```
export FD_LOG_DIR="log_router"
python -m fastdeploy.router.launch \
--host 0.0.0.0 \
--port 30000 \
--splitwise
```
Launch a Prefill instance. Compared with single-node deployment, add the `--splitwise-role` flag to set the instance role to Prefill and the `--router` flag to point at the Router endpoint; the remaining arguments are the same as in single-node deployment.
```
export CUDA_VISIBLE_DEVICES=0
export FD_LOG_DIR="log_prefill"
python -m fastdeploy.entrypoints.openai.api_server \
--model "PaddlePaddle/ERNIE-4.5-0.3B-Paddle" \
--port 31000 \
--splitwise-role prefill \
--router "0.0.0.0:30000"
```
Launch a Decode instance.
```
export CUDA_VISIBLE_DEVICES=1
export FD_LOG_DIR="log_decode"
python -m fastdeploy.entrypoints.openai.api_server \
--model "PaddlePaddle/ERNIE-4.5-0.3B-Paddle" \
--port 32000 \
--splitwise-role decode \
--router "0.0.0.0:30000"
```
Once the Prefill and Decode instances have started and registered with the Router, you can send requests:
```
curl -X POST "http://0.0.0.0:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "hello"}
],
"max_tokens": 100,
"stream": false
}'
```
**Details**
The arguments for launching Prefill/Decode instances in disaggregated deployment are described below; the remaining settings are similar to mixed deployment, see the [documentation](../../zh/parameters.md).
* `--splitwise-role`: instance role, one of `prefill`, `decode`, or `mixed`; defaults to `mixed`.
* `--cache-transfer-protocol`: KV Cache transfer protocol, chosen from `rdma` and `ipc`; defaults to `rdma,ipc`. When the P and D instances are on the same machine, both protocols are supported and `ipc` is preferred; when they are on different machines, only `rdma` is supported. When using RDMA transfer, make sure the RDMA network is reachable between all machines.
* `--rdma-comm-ports`: RDMA communication ports, separated by commas; the number of ports must equal dp_size * tp_size. If not specified, FastDeploy picks free ports internally.
* `--pd-comm-port`: ports for P/D instance interaction, separated by commas; the number of ports must equal dp_size. If not specified, FastDeploy picks free ports internally.
* `--router`: the Router service address.
Notes:
* To specify RDMA NICs manually, set the `KVCACHE_RDMA_NICS` environment variable, with multiple NIC names separated by commas. FastDeploy provides a detection script, `bash FastDeploy/scripts/get_rdma_nics.sh <device>`, where `<device>` can be `cpu` or `gpu`. If `KVCACHE_RDMA_NICS` is not set, FastDeploy automatically detects the available RDMA NICs.
* Disaggregated deployment can also use the [benchmark](../../../benchmarks/) tool to send requests to the Router service; enabling the `--pd-metrics` flag collects additional analysis metrics.
* Tune request concurrency according to the Decode instances' GPU memory and maximum number of concurrent requests (`max_num_seqs`). If concurrency is high but Decode resources are insufficient, Prefill will keep requesting Decode resources for particular requests, leaving Prefill underutilized. Set `export PREFILL_CONTINUOUS_REQUEST_DECODE_RESOURCES=0` to disable this behavior, so that a request that cannot obtain Decode resources immediately returns an error to the Router.
* Disaggregated deployment supports multiple parallelism strategies; when using DP parallelism, the services must be launched with `python -m fastdeploy.entrypoints.openai.multi_api_server`.
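The port-count constraints on `--rdma-comm-ports` and `--pd-comm-port` can be checked up front. This is an illustrative validator based only on the rules stated above, not FastDeploy code:

```python
def validate_comm_ports(rdma_ports: str, pd_ports: str,
                        dp_size: int, tp_size: int) -> None:
    """Check the documented port-count rules for PD disaggregation.

    --rdma-comm-ports must carry dp_size * tp_size comma-separated ports,
    and --pd-comm-port must carry dp_size ports.
    """
    n_rdma = len(rdma_ports.split(","))
    n_pd = len(pd_ports.split(","))
    if n_rdma != dp_size * tp_size:
        raise ValueError(
            f"--rdma-comm-ports needs {dp_size * tp_size} ports, got {n_rdma}")
    if n_pd != dp_size:
        raise ValueError(f"--pd-comm-port needs {dp_size} ports, got {n_pd}")
```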
**Examples**
PD-disaggregated deployment supports prefix caching, TP parallelism, DP parallelism, and other features; for concrete examples, see [examples/splitwise](https://github.com/PaddlePaddle/FastDeploy/tree/develop/examples/splitwise).
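For the DP case, the launch command can be assembled programmatically. The sketch below is hypothetical: the `--num-servers`, `--ports`, and `--args` flag names are assumptions for illustration, only the `multi_api_server` module path and the `--splitwise-role`/`--router` flags come from this document:

```python
def build_dp_launch_cmd(model: str, role: str, router: str,
                        num_servers: int, base_port: int) -> list:
    """Assemble an illustrative multi_api_server command for DP deployment."""
    # One API server port per DP replica.
    ports = ",".join(str(base_port + i) for i in range(num_servers))
    return [
        "python", "-m", "fastdeploy.entrypoints.openai.multi_api_server",
        "--num-servers", str(num_servers),   # assumed flag name
        "--ports", ports,                    # assumed flag name
        "--args",                            # assumed flag name
        "--model", model,
        "--splitwise-role", role,
        "--router", router,
    ]
```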
### SplitwiseScheduler-Based Multi-Node Disaggregated Deployment
**Note: the SplitwiseScheduler is not recommended; use the Router for request scheduling instead.**
#### Environment Preparation
* Install with `conda`
> **⚠️ Note**
@@ -63,7 +164,7 @@ sudo yum install redis -y
sudo systemctl start redis
```
#### Deploy the Services
For multi-node deployment, confirm that the NICs support RDMA and that all nodes in the cluster can reach each other over the network.
@@ -126,16 +227,13 @@ python -m fastdeploy.entrypoints.openai.api_server \
--splitwise-role "decode"
```
### Parameter Description
* --splitwise-role: whether this service acts as prefill or decode
* --cache-queue-port: port of the cache service, used for communication between the prefill and decode services
#### Single-Node Parameters
* --inner-prefill-ports: required only on Decode instances; the list of ports of the prefill instances to connect to
#### Multi-Node Parameters
* --cache-transfer-protocol: KV Cache transfer protocol, supporting ipc and rdma; defaults to ipc
* --scheduler-name: set to splitwise for PD disaggregation
* --scheduler-host: address of the Redis server to connect to
+6 -32
@@ -1,36 +1,10 @@
# Run the Examples on NVIDIA CUDA GPU
## Prepare the Environment
Refer to [NVIDIA CUDA GPU Installation](https://paddlepaddle.github.io/FastDeploy/get_started/installation/nvidia_gpu/) to pull the docker image, such as:
```
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.3.0
```
For PD-disaggregated deployment, see the [usage documentation](../../docs/zh/features/disaggregated.md).
In the Docker container, [NVIDIA MLNX_OFED](https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/) and [Redis](https://redis.io/) are pre-installed.
For PD-disaggregated deployment, using the Router for request scheduling (the v1 mode) is recommended.
## Build and install FastDeploy
Launch scripts:
```
git clone https://github.com/PaddlePaddle/FastDeploy
cd FastDeploy
export ENABLE_FD_RDMA=1
# Argument 1: Whether to build wheel package (1 for yes, 0 for compile only)
# Argument 2: Python interpreter path
# Argument 3: Whether to compile CPU inference operators
# Argument 4: Target GPU architectures
bash build.sh 1 python false [80,90]
```
## Run the Examples
Run the shell scripts in this directory, `bash start_v0_tp1.sh` or `bash start_v1_tp1.sh`
Note that, there are two methods for splitwise deployment:
* v0: using splitwise_scheduler or dp_scheduler, in which the requests are scheduled in the engine.
* v1: using router, in which the requests are scheduled in the router.
# Run the Examples on Kunlunxin XPU
Coming soon...
* `start_v1_tp1.sh`: Router-based scheduling; the P and D instances use TP1.
* `start_v1_tp2.sh`: Router-based scheduling; the P and D instances use TP2.
* `start_v1_dp2.sh`: Router-based scheduling; the P and D instances use DP2 TP1.
+1 -1
@@ -3,7 +3,7 @@ set -e
# Test splitwise deployment
# There are two methods for splitwise deployment:
# v0: using splitwise_scheduler or dp_scheduler
# v0: using splitwise_scheduler or dp_scheduler (deprecated)
# v1: using local_scheduler + router
# prepare environment
-5
@@ -1,11 +1,6 @@
#!/bin/bash
set -e
# Test splitwise deployment
# There are two methods for splitwise deployment:
# v0: using splitwise_scheduler or dp_scheduler
# v1: using local_scheduler + router
MODEL_NAME="PaddlePaddle/ERNIE-4.5-0.3B-Paddle"
DATA_PARALLEL_SIZE=2
TENSOR_PARALLEL_SIZE=1
-5
@@ -1,11 +1,6 @@
#!/bin/bash
set -e
# Test splitwise deployment
# There are two methods for splitwise deployment:
# v0: using splitwise_scheduler or dp_scheduler
# v1: using local_scheduler + router
# prepare environment
export MODEL_NAME="PaddlePaddle/ERNIE-4.5-0.3B-Paddle"
export FD_DEBUG=1
-5
@@ -1,11 +1,6 @@
#!/bin/bash
set -e
# Test splitwise deployment
# There are two methods for splitwise deployment:
# v0: using splitwise_scheduler or dp_scheduler
# v1: using local_scheduler + router
# prepare environment
export MODEL_NAME="PaddlePaddle/ERNIE-4.5-0.3B-Paddle"
export FD_DEBUG=1
+22 -11
@@ -58,7 +58,7 @@ from fastdeploy.splitwise.internal_adapter_utils import InternalAdapter
from fastdeploy.splitwise.splitwise_connector import SplitwiseConnector
from fastdeploy.trace.constants import LoggingEventName
from fastdeploy.trace.trace_logger import print as trace_print
from fastdeploy.utils import EngineError, envs, get_logger, llm_logger
from fastdeploy.utils import EngineError, console_logger, envs, get_logger, llm_logger
try:
TokenProcessor = load_token_processor_plugins()
@@ -157,6 +157,7 @@ class EngineService:
def start(self, async_llm_pid=None):
self.running = True
console_logger.debug("Start engineService...")
if self.use_async_llm:
self.start_worker_service(async_llm_pid)
@@ -807,7 +808,7 @@ class EngineService:
# start async preprocess
self.resource_manager.apply_async_preprocess(task)
need_delete_tasks = []
if envs.FD_OFFLINE_PERF_TEST_FOR_PD:
if envs.PREFILL_CONTINUOUS_REQUEST_DECODE_RESOURCES:
for task in tasks:
# assure can allocate block ids in P
while not self.resource_manager.preallocate_resource_in_p(task):
@@ -1352,6 +1353,7 @@ class EngineService:
threading.Thread(target=decode_loop, daemon=True).start()
def start_cache_service(self, device_ids, ipc_signal_suffix):
console_logger.debug("Start cache manager...")
return self.resource_manager.cache_manager.launch_cache_manager(
cache_config=self.cfg.cache_config,
tensor_parallel_size=self.cfg.parallel_config.tensor_parallel_size,
@@ -1379,17 +1381,24 @@ class EngineService:
return False
def _register_to_router(self):
"""If use router, register this server to router"""
timeout = 5
sleep_seconds = 10
"""
Periodically send server information to the router for registration; it also serves
as a heartbeat signal.
"""
def _register():
timeout = 5
sleep_seconds = 5
is_registered = False
while True:
try:
api_server_host = self.cfg.router_config.api_server_host
api_server_port = self.cfg.router_config.api_server_port
api_server_url = f"http://{api_server_host}:{api_server_port}"
if not check_service_health(api_server_url):
time.sleep(sleep_seconds)
self.llm_logger.info("Wait for API service health and then register to router")
time.sleep(sleep_seconds)
continue
@@ -1401,20 +1410,22 @@ class EngineService:
)
if resp.ok:
self.llm_logger.info("Successfully registered to the router!")
break
if not is_registered:
is_registered = True
self.llm_logger.info("Register to router successfully")
else:
self.llm_logger.error(
f"Router registration failed: {resp.status_code}, "
f"Send server info to router failed: {resp.status_code}, "
f"{resp.text}, {self.cfg.register_info}"
)
time.sleep(sleep_seconds)
except requests.exceptions.RequestException as e:
self.llm_logger.error(f"Register to router request error: {e}")
except Exception as e:
self.llm_logger.exception(f"Unexpected error during router registration: {e}")
time.sleep(sleep_seconds)
if self.cfg.router_config.router is not None:
if self.cfg.router_config.router is None:
self.llm_logger.info("Router is not enabled, skip registering to router")
else:
register_thread = threading.Thread(target=_register, daemon=True)
register_thread.start()
+1
@@ -500,6 +500,7 @@ class LLMEngine:
start gpu worker service
"""
console_logger.debug("Start worker process...")
log_dir = os.getenv("FD_LOG_DIR", default="log")
command_prefix = self._setting_environ_variables()
current_file_path = os.path.abspath(__file__)
+4 -2
@@ -140,8 +140,10 @@ environment_variables: dict[str, Callable[[], Any]] = {
"ENCODE_FEATURE_BOS_SK": lambda: os.getenv("ENCODE_FEATURE_BOS_SK"),
# The ENDPOINT of bos storing the features while multi_modal infer
"ENCODE_FEATURE_ENDPOINT": lambda: os.getenv("ENCODE_FEATURE_ENDPOINT"),
# Enable offline perf test mode for PD disaggregation
"FD_OFFLINE_PERF_TEST_FOR_PD": lambda: int(os.getenv("FD_OFFLINE_PERF_TEST_FOR_PD", "0")),
# Whether the Prefill instance continuously requests Decode resources in PD disaggregation
"PREFILL_CONTINUOUS_REQUEST_DECODE_RESOURCES": lambda: int(
os.getenv("PREFILL_CONTINUOUS_REQUEST_DECODE_RESOURCES", "1")
),
"FD_ENABLE_E2W_TENSOR_CONVERT": lambda: int(os.getenv("FD_ENABLE_E2W_TENSOR_CONVERT", "0")),
"FD_ENGINE_TASK_QUEUE_WITH_SHM": lambda: int(os.getenv("FD_ENGINE_TASK_QUEUE_WITH_SHM", "0")),
"FD_FILL_BITMASK_BATCH": lambda: int(os.getenv("FD_FILL_BITMASK_BATCH", "4")),
+1 -1
@@ -214,7 +214,7 @@ class SplitwiseConnector:
time.sleep(0.001)
if time.time() - start_time > envs.FD_PREFILL_WAIT_DECODE_RESOURCE_SECONDS:
del self.current_request_ids[task.request_id]
return False, "timeout"
return False, "prefill waits for decode resource timeout"
msg = self.current_request_ids[task.request_id]
del self.current_request_ids[task.request_id]
+1 -1
@@ -493,7 +493,7 @@ def test_check_decode_allocated_times_out(monkeypatch):
monkeypatch.setattr("fastdeploy.splitwise.splitwise_connector.time.sleep", lambda *_: None)
ok, msg = connector.check_decode_allocated(task)
assert (ok, msg) == (False, "timeout")
assert (ok, msg) == (False, "prefill waits for decode resource timeout")
assert "req-timeout" not in connector.current_request_ids