[PD Disaggregation] Prefill and decode support cache storage (#6768)

* Prefill and decode support cache storage * up * up * update docs and refine mooncake store * up
2026-04-22 16:07:51 +08:00 · 2026-03-16 14:44:49 +08:00
parent 72ff7bf4cd
commit 04fde3b227
12 changed files with 1083 additions and 66 deletions
@@ -86,6 +86,7 @@ FastDeploy 支持在**英伟达（NVIDIA）GPU**、**昆仑芯（Kunlunxin）XPU
 - [前缀缓存](./docs/zh/features/prefix_caching.md)
 - [分块预填充](./docs/zh/features/chunked_prefill.md)
 - [负载均衡调度Router](./docs/zh/online_serving/router.md)
+- [全局Cache池化](./docs/zh/features/global_cache_pooling.md)

 ## 致谢

@@ -84,6 +84,7 @@ Learn how to download models, enable using the torch format, and more:
 - [Prefix Caching](./docs/features/prefix_caching.md)
 - [Chunked Prefill](./docs/features/chunked_prefill.md)
 - [Load-Balancing Scheduling Router](./docs/online_serving/router.md)
+- [Global Cache Pooling](./docs/features/global_cache_pooling.md)

 ## Acknowledgement

@@ -0,0 +1,368 @@
+[中文文档](../../zh/features/global_cache_pooling.md) | English
+
+# Global Cache Pooling
+
+This document describes how to use MooncakeStore as the KV Cache storage backend for FastDeploy, enabling **Global Cache Pooling** across multiple inference instances.
+
+## Overview
+
+### What is Global Cache Pooling?
+
+Global Cache Pooling allows multiple FastDeploy instances to share KV Cache through a distributed storage layer. This enables:
+
+- **Cross-instance cache reuse**: KV Cache computed by one instance can be reused by another
+- **PD Disaggregation optimization**: Prefill and Decode instances can share cache seamlessly
+- **Reduced computation**: Avoid redundant prefix computation across requests
+
+### Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                     Mooncake Master Server                       │
+│              (Metadata & Coordination Service)                   │
+└────────────────────────────┬────────────────────────────────────┘
+                             │
+         ┌───────────────────┼───────────────────┐
+         │                   │                   │
+         ▼                   ▼                   ▼
+┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
+│  FastDeploy     │ │  FastDeploy     │ │  FastDeploy     │
+│  Instance P     │ │  Instance D     │ │  Instance X     │
+│  (Prefill)      │ │  (Decode)       │ │  (Standalone)   │
+└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
+         │                   │                   │
+         └───────────────────┼───────────────────┘
+                             │
+                    ┌────────▼────────┐
+                    │  MooncakeStore  │
+                    │  (Shared KV     │
+                    │   Cache Pool)   │
+                    └─────────────────┘
+```
+
+## Example Scripts
+
+Ready-to-use example scripts are available in [examples/cache_storage/](../../../examples/cache_storage/).
+
+| Script | Scenario | Description |
+|--------|----------|-------------|
+| `run.sh` | Multi-Instance | Two standalone instances sharing cache |
+| `run_03b_pd_storage.sh` | PD Disaggregation | P+D instances with global cache pooling |
+
+## Prerequisites
+
+### Hardware Requirements
+
+- NVIDIA GPU with CUDA support
+- RDMA network (recommended for production) or TCP
+
+### Software Requirements
+
+- Python 3.8+
+- CUDA 11.8+
+- FastDeploy (see installation below)
+
+## Installation
+
+Refer to [NVIDIA CUDA GPU Installation](https://paddlepaddle.github.io/FastDeploy/get_started/installation/nvidia_gpu/) for FastDeploy installation.
+
+```bash
+# Option 1: Install from PyPI
+pip install fastdeploy-gpu
+
+# Option 2: Build from source
+bash build.sh
+pip install ./dist/fastdeploy*.whl
+```
+
+MooncakeStore is automatically installed when you install FastDeploy.
+
+## Configuration
+
+We support two ways to configure MooncakeStore: via configuration file `mooncake_config.json` or via environment variables.
+
+### Mooncake Configuration File
+
+Create a `mooncake_config.json` file:
+
+```json
+{
+    "metadata_server": "http://0.0.0.0:15002/metadata",
+    "master_server_addr": "0.0.0.0:15001",
+    "global_segment_size": 1000000000,
+    "local_buffer_size": 134217728,
+    "protocol": "rdma",
+    "rdma_devices": ""
+}
+```
+
+Set the `MOONCAKE_CONFIG_PATH` environment variable to enable the configuration:
+
+```bash
+export MOONCAKE_CONFIG_PATH=path/to/mooncake_config.json
+```
+
+Configuration parameters:
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `metadata_server` | HTTP metadata server URL | Required |
+| `master_server_addr` | Master server address | Required |
+| `global_segment_size` | Memory space each TP process shares to global shared memory (bytes) | 1GB |
+| `local_buffer_size` | Local buffer size for data transfer (bytes) | 128MB |
+| `protocol` | Transfer protocol: `rdma` or `tcp` | `rdma` |
+| `rdma_devices` | RDMA device names (comma-separated) | Auto-detect |
+
+### Environment Variables
+
+Mooncake can also be configured via environment variables:
+
+| Variable | Description |
+|----------|-------------|
+| `MOONCAKE_MASTER_SERVER_ADDR` | Master server address (e.g., `10.0.0.1:15001`) |
+| `MOONCAKE_METADATA_SERVER` | Metadata server URL |
+| `MOONCAKE_GLOBAL_SEGMENT_SIZE` | Memory space each TP process shares to global shared memory (bytes) |
+| `MOONCAKE_LOCAL_BUFFER_SIZE` | Local buffer size (bytes) |
+| `MOONCAKE_PROTOCOL` | Transfer protocol (`rdma` or `tcp`) |
+| `MOONCAKE_RDMA_DEVICES` | RDMA device names |
+
+## Usage Scenarios
+
+### Scenario 1: Multi-Instance Cache Sharing
+
+Run multiple FastDeploy instances sharing a global KV Cache pool.
+
+**Step 1: Start Mooncake Master**
+
+```bash
+mooncake_master \
+    --port=15001 \
+    --enable_http_metadata_server=true \
+    --http_metadata_server_host=0.0.0.0 \
+    --http_metadata_server_port=15002 \
+    --metrics_port=15003
+```
+
+**Step 2: Start FastDeploy Instances**
+
+Instance 0:
+```bash
+export MOONCAKE_CONFIG_PATH="./mooncake_config.json"
+export CUDA_VISIBLE_DEVICES=0
+
+python -m fastdeploy.entrypoints.openai.api_server \
+       --model ${MODEL_NAME} \
+       --port 52700 \
+       --max-model-len 32768 \
+       --max-num-seqs 32 \
+       --kvcache-storage-backend mooncake
+```
+
+Instance 1:
+```bash
+export MOONCAKE_CONFIG_PATH="./mooncake_config.json"
+export CUDA_VISIBLE_DEVICES=1
+
+python -m fastdeploy.entrypoints.openai.api_server \
+       --model ${MODEL_NAME} \
+       --port 52800 \
+       --max-model-len 32768 \
+       --max-num-seqs 32 \
+       --kvcache-storage-backend mooncake
+```
+
+**Step 3: Test Cache Reuse**
+
+Send the same prompt to both instances. The second instance should reuse the KV Cache computed by the first instance.
+
+```bash
+# Request to Instance 0
+curl -X POST "http://0.0.0.0:52700/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -d '{"messages": [{"role": "user", "content": "Hello, world!"}], "max_tokens": 50}'
+
+# Request to Instance 1 (should hit cached KV)
+curl -X POST "http://0.0.0.0:52800/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -d '{"messages": [{"role": "user", "content": "Hello, world!"}], "max_tokens": 50}'
+```
+
+### Scenario 2: PD Disaggregation with Global Cache
+
+This scenario combines **PD Disaggregation** with **Global Cache Pooling**, enabling:
+
+- Prefill instances to read Decode instances' output cache
+- Optimal multi-turn conversation performance
+
+**Architecture:**
+
+```
+         ┌──────────────────────────────────────────┐
+         │              Router                       │
+         │         (Load Balancer)                   │
+         └─────────────────┬────────────────────────┘
+                           │
+           ┌───────────────┴───────────────┐
+           │                               │
+           ▼                               ▼
+    ┌─────────────┐                 ┌─────────────┐
+    │   Prefill   │                 │   Decode    │
+    │  Instance   │◄───────────────►│  Instance   │
+    │             │   KV Transfer   │             │
+    └──────┬──────┘                 └──────┬──────┘
+           │                               │
+           └───────────────┬───────────────┘
+                           │
+                  ┌────────▼────────┐
+                  │  MooncakeStore  │
+                  │  (Global Cache) │
+                  └─────────────────┘
+```
+
+**Step 1: Start Mooncake Master**
+
+```bash
+mooncake_master \
+    --port=15001 \
+    --enable_http_metadata_server=true \
+    --http_metadata_server_host=0.0.0.0 \
+    --http_metadata_server_port=15002
+```
+
+**Step 2: Start Router**
+
+```bash
+python -m fastdeploy.router.launch \
+    --port 52700 \
+    --splitwise
+```
+
+**Step 3: Start Prefill Instance**
+
+```bash
+export MOONCAKE_MASTER_SERVER_ADDR="127.0.0.1:15001"
+export MOONCAKE_METADATA_SERVER="http://127.0.0.1:15002/metadata"
+export MOONCAKE_PROTOCOL="rdma"
+export CUDA_VISIBLE_DEVICES=0
+
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model ${MODEL_NAME} \
+    --port 52400 \
+    --max-model-len 32768 \
+    --max-num-seqs 32 \
+    --splitwise-role prefill \
+    --cache-transfer-protocol rdma \
+    --router "0.0.0.0:52700" \
+    --kvcache-storage-backend mooncake
+```
+
+**Step 4: Start Decode Instance**
+
+```bash
+export MOONCAKE_MASTER_SERVER_ADDR="127.0.0.1:15001"
+export MOONCAKE_METADATA_SERVER="http://127.0.0.1:15002/metadata"
+export MOONCAKE_PROTOCOL="rdma"
+export CUDA_VISIBLE_DEVICES=1
+
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model ${MODEL_NAME} \
+    --port 52500 \
+    --max-model-len 32768 \
+    --max-num-seqs 32 \
+    --splitwise-role decode \
+    --cache-transfer-protocol rdma \
+    --router "0.0.0.0:52700" \
+    --enable-output-caching \
+    --kvcache-storage-backend mooncake
+```
+
+**Step 5: Send Requests via Router**
+
+```bash
+curl -X POST "http://0.0.0.0:52700/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50}'
+```
+
+## FastDeploy Parameters for Mooncake
+
+| Parameter | Description |
+|-----------|-------------|
+| `--kvcache-storage-backend mooncake` | Enable Mooncake as KV Cache storage backend |
+| `--enable-output-caching` | Enable output token caching (Decode instance recommended) |
+| `--cache-transfer-protocol rdma` | Use RDMA for KV transfer between P and D |
+| `--splitwise-role prefill/decode` | Set instance role in PD disaggregation |
+| `--router` | Router address for PD disaggregation |
+
+## Verification
+
+### Check Cache Hit
+
+To verify cache hit in logs:
+
+```bash
+# For multi-instance scenario
+grep -E "storage_cache_token_num" log_*/api_server.log
+
+# For PD disaggregation scenario
+grep -E "storage_cache_token_num" log_prefill/api_server.log
+```
+
+If `storage_cache_token_num > 0`, the instance successfully read cached KV blocks from the global pool.
+
+### Monitor Mooncake Master
+
+```bash
+# Check master status
+curl http://localhost:15002/metadata
+
+# Check metrics (if metrics_port is configured)
+curl http://localhost:15003/metrics
+```
+
+## Troubleshooting
+
+### Common Issues
+
+**1. Port Already in Use**
+
+```bash
+# Check port usage
+ss -ltn | grep 15001
+
+# Kill existing process
+kill -9 $(lsof -t -i:15001)
+```
+
+**2. RDMA Connection Failed**
+
+- Verify RDMA devices: `ibv_devices`
+- Check RDMA network: `ibv_devinfo`
+- Fallback to TCP: Set `MOONCAKE_PROTOCOL=tcp`
+
+**3. Cache Not Being Shared**
+
+- Verify all instances connect to the same Mooncake master
+- Check metadata server URL is consistent
+- Verify `global_segment_size` is large enough
+
+**4. Permission Denied on /dev/shm**
+
+```bash
+# Clean up stale shared memory files
+find /dev/shm -type f -print0 | xargs -0 rm -f
+```
+
+### Debug Mode
+
+Enable debug logging:
+
+```bash
+export FD_DEBUG=1
+```
+
+## More Resources
+
+- [Mooncake Official Documentation](https://github.com/kvcache-ai/Mooncake)
+- [Mooncake Troubleshooting Guide](https://github.com/kvcache-ai/Mooncake/blob/main/docs/source/troubleshooting/troubleshooting.md)
+- [FastDeploy Documentation](https://paddlepaddle.github.io/FastDeploy/)
@@ -0,0 +1,367 @@
+[English](../../features/global_cache_pooling.md) | 中文文档
+
+# 全局缓存池化
+
+本文档介绍如何将 MooncakeStore 作为 FastDeploy 的 KV Cache 存储后端，实现多推理实例间的**全局缓存池化**。
+
+## 概述
+
+### 什么是全局缓存池化？
+
+全局缓存池化允许多个 FastDeploy 实例通过分布式存储层共享 KV Cache，具有以下优势：
+
+- **跨实例缓存复用**：一个实例计算的 KV Cache 可被其他实例复用
+- **PD 分离架构优化**：Prefill 和 Decode 实例可无缝共享缓存
+- **减少重复计算**：避免跨请求的重复前缀计算
+
+### 架构图
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                     Mooncake Master 服务                         │
+│              (元数据与协调服务)                                    │
+└────────────────────────────┬────────────────────────────────────┘
+                             │
+         ┌───────────────────┼───────────────────┐
+         │                   │                   │
+         ▼                   ▼                   ▼
+┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
+│  FastDeploy     │ │  FastDeploy     │ │  FastDeploy     │
+│  Instance P     │ │  Instance D     │ │  Instance X     │
+│  (Prefill)      │ │  (Decode)       │ │  (Standalone)   │
+└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
+         │                   │                   │
+         └───────────────────┼───────────────────┘
+                             │
+                    ┌────────▼────────┐
+                    │  MooncakeStore  │
+                    │  (共享 KV       │
+                    │   Cache 池)     │
+                    └─────────────────┘
+```
+
+## 示例脚本
+
+开箱即用的示例脚本位于 [examples/cache_storage/](../../../examples/cache_storage/)。
+
+| 脚本 | 场景 | 说明 |
+|------|------|------|
+| `run.sh` | 多实例缓存共享 | 两个独立实例共享缓存 |
+| `run_03b_pd_storage.sh` | PD 分离 | P+D 实例配合全局缓存池 |
+
+## 环境要求
+
+### 硬件要求
+
+- 支持 CUDA 的 NVIDIA GPU
+- RDMA 网络（生产环境推荐）或 TCP 网络
+
+### 软件要求
+
+- Python 3.8+
+- CUDA 11.8+
+- FastDeploy（见下方安装说明）
+
+## 安装步骤
+
+参考 [NVIDIA CUDA GPU 安装指南](https://paddlepaddle.github.io/FastDeploy/get_started/installation/nvidia_gpu/) 安装 FastDeploy。
+
+```bash
+# 方式一：从 PyPI 安装
+pip install fastdeploy-gpu
+
+# 方式二：从源码编译
+bash build.sh
+pip install ./dist/fastdeploy*.whl
+```
+
+安装FastDeploy后自动安装了MooncakeStore。
+
+## 配置说明
+
+我们支持两种方式配置MooncakeStore，一是通过配置文件`mooncake_config.json`，二是通过环境变量进行配置。
+
+### Mooncake 配置文件
+
+创建 `mooncake_config.json` 配置文件：
+
+```json
+{
+    "metadata_server": "http://0.0.0.0:15002/metadata",
+    "master_server_addr": "0.0.0.0:15001",
+    "global_segment_size": 1000000000,
+    "local_buffer_size": 134217728,
+    "protocol": "rdma",
+    "rdma_devices": ""
+}
+```
+
+设置MOONCAKE_CONFIG_PATH环境变量后，配置文件生效：
+```bash
+export MOONCAKE_CONFIG_PATH=path/to/mooncake_config.json
+```
+
+配置参数说明：
+
+| 参数 | 说明 | 默认值 |
+|------|------|--------|
+| `metadata_server` | HTTP 元数据服务地址 | 必填 |
+| `master_server_addr` | Master 服务地址 | 必填 |
+| `global_segment_size` | 每个TP进程给全局共享内存共享的内存空间（字节） | 1GB |
+| `local_buffer_size` | 数据传输本地缓冲区大小（字节） | 128MB |
+| `protocol` | 传输协议：`rdma` 或 `tcp` | `rdma` |
+| `rdma_devices` | RDMA 设备名称（逗号分隔） | 自动检测 |
+
+### 环境变量配置
+
+Mooncake 也支持通过环境变量进行配置：
+
+| 环境变量 | 说明 |
+|----------|------|
+| `MOONCAKE_MASTER_SERVER_ADDR` | Master 服务地址（如 `10.0.0.1:15001`） |
+| `MOONCAKE_METADATA_SERVER` | 元数据服务 URL |
+| `MOONCAKE_GLOBAL_SEGMENT_SIZE` | 每个TP进程给全局共享内存共享的内存空间（字节） |
+| `MOONCAKE_LOCAL_BUFFER_SIZE` | 本地缓冲区大小（字节） |
+| `MOONCAKE_PROTOCOL` | 传输协议（`rdma` 或 `tcp`） |
+| `MOONCAKE_RDMA_DEVICES` | RDMA 设备名称 |
+
+## 使用场景
+
+### 场景一：多实例缓存共享
+
+运行多个 FastDeploy 实例，共享全局 KV Cache 池。
+
+**步骤 1：启动 Mooncake Master**
+
+```bash
+mooncake_master \
+    --port=15001 \
+    --enable_http_metadata_server=true \
+    --http_metadata_server_host=0.0.0.0 \
+    --http_metadata_server_port=15002 \
+    --metrics_port=15003
+```
+
+**步骤 2：启动 FastDeploy 实例**
+
+实例 0：
+```bash
+export MOONCAKE_CONFIG_PATH="./mooncake_config.json"
+export CUDA_VISIBLE_DEVICES=0
+
+python -m fastdeploy.entrypoints.openai.api_server \
+       --model ${MODEL_NAME} \
+       --port 52700 \
+       --max-model-len 32768 \
+       --max-num-seqs 32 \
+       --kvcache-storage-backend mooncake
+```
+
+实例 1：
+```bash
+export MOONCAKE_CONFIG_PATH="./mooncake_config.json"
+export CUDA_VISIBLE_DEVICES=1
+
+python -m fastdeploy.entrypoints.openai.api_server \
+       --model ${MODEL_NAME} \
+       --port 52800 \
+       --max-model-len 32768 \
+       --max-num-seqs 32 \
+       --kvcache-storage-backend mooncake
+```
+
+**步骤 3：测试缓存复用**
+
+向两个实例发送相同的 prompt，第二个实例应能复用第一个实例计算的 KV Cache。
+
+```bash
+# 请求实例 0
+curl -X POST "http://0.0.0.0:52700/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -d '{"messages": [{"role": "user", "content": "Hello, world!"}], "max_tokens": 50}'
+
+# 请求实例 1（应命中缓存）
+curl -X POST "http://0.0.0.0:52800/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -d '{"messages": [{"role": "user", "content": "Hello, world!"}], "max_tokens": 50}'
+```
+
+### 场景二：PD 分离 + 全局缓存池
+
+此场景将 **PD 分离架构** 与 **全局缓存池化** 结合，实现：
+
+- Prefill 实例可读取 Decode 实例的输出缓存
+- 优化多轮对话性能
+
+**架构图：**
+
+```
+         ┌──────────────────────────────────────────┐
+         │              Router                       │
+         │           (负载均衡器)                    │
+         └─────────────────┬────────────────────────┘
+                           │
+           ┌───────────────┴───────────────┐
+           │                               │
+           ▼                               ▼
+    ┌─────────────┐                 ┌─────────────┐
+    │   Prefill   │                 │   Decode    │
+    │  Instance   │◄───────────────►│  Instance   │
+    │             │   KV Transfer   │             │
+    └──────┬──────┘                 └──────┬──────┘
+           │                               │
+           └───────────────┬───────────────┘
+                           │
+                  ┌────────▼────────┐
+                  │  MooncakeStore  │
+                  │  (全局缓存池)   │
+                  └─────────────────┘
+```
+
+**步骤 1：启动 Mooncake Master**
+
+```bash
+mooncake_master \
+    --port=15001 \
+    --enable_http_metadata_server=true \
+    --http_metadata_server_host=0.0.0.0 \
+    --http_metadata_server_port=15002
+```
+
+**步骤 2：启动 Router**
+
+```bash
+python -m fastdeploy.router.launch \
+    --port 52700 \
+    --splitwise
+```
+
+**步骤 3：启动 Prefill 实例**
+
+```bash
+export MOONCAKE_MASTER_SERVER_ADDR="127.0.0.1:15001"
+export MOONCAKE_METADATA_SERVER="http://127.0.0.1:15002/metadata"
+export MOONCAKE_PROTOCOL="rdma"
+export CUDA_VISIBLE_DEVICES=0
+
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model ${MODEL_NAME} \
+    --port 52400 \
+    --max-model-len 32768 \
+    --max-num-seqs 32 \
+    --splitwise-role prefill \
+    --cache-transfer-protocol rdma \
+    --router "0.0.0.0:52700" \
+    --kvcache-storage-backend mooncake
+```
+
+**步骤 4：启动 Decode 实例**
+
+```bash
+export MOONCAKE_MASTER_SERVER_ADDR="127.0.0.1:15001"
+export MOONCAKE_METADATA_SERVER="http://127.0.0.1:15002/metadata"
+export MOONCAKE_PROTOCOL="rdma"
+export CUDA_VISIBLE_DEVICES=1
+
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model ${MODEL_NAME} \
+    --port 52500 \
+    --max-model-len 32768 \
+    --max-num-seqs 32 \
+    --splitwise-role decode \
+    --cache-transfer-protocol rdma \
+    --router "0.0.0.0:52700" \
+    --enable-output-caching \
+    --kvcache-storage-backend mooncake
+```
+
+**步骤 5：通过 Router 发送请求**
+
+```bash
+curl -X POST "http://0.0.0.0:52700/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50}'
+```
+
+## FastDeploy Mooncake 相关参数
+
+| 参数 | 说明 |
+|------|------|
+| `--kvcache-storage-backend mooncake` | 启用 Mooncake 作为 KV Cache 存储后端 |
+| `--enable-output-caching` | 启用输出 token 缓存（推荐 Decode 实例开启） |
+| `--cache-transfer-protocol rdma` | P 和 D 之间使用 RDMA 进行 KV 传输 |
+| `--splitwise-role prefill/decode` | 设置实例在 PD 分离中的角色 |
+| `--router` | PD 分离场景下的 Router 地址 |
+
+## 验证方法
+
+### 检查缓存命中
+
+通过日志验证缓存命中情况：
+
+```bash
+# 多实例场景
+grep -E "storage_cache_token_num" log_*/api_server.log
+
+# PD 分离场景
+grep -E "storage_cache_token_num" log_prefill/api_server.log
+```
+
+如果 `storage_cache_token_num > 0`，表示实例成功从全局池读取了缓存的 KV 块。
+
+### 监控 Mooncake Master
+
+```bash
+# 检查 master 状态
+curl http://localhost:15002/metadata
+
+# 检查指标（如配置了 metrics_port）
+curl http://localhost:15003/metrics
+```
+
+## 故障排查
+
+### 常见问题
+
+**1. 端口被占用**
+
+```bash
+# 检查端口使用情况
+ss -ltn | grep 15001
+
+# 终止占用进程
+kill -9 $(lsof -t -i:15001)
+```
+
+**2. RDMA 连接失败**
+
+- 检查 RDMA 设备：`ibv_devices`
+- 检查 RDMA 网络：`ibv_devinfo`
+- 降级使用 TCP：设置 `MOONCAKE_PROTOCOL=tcp`
+
+**3. 缓存未共享**
+
+- 确认所有实例连接到同一个 Mooncake master
+- 检查元数据服务 URL 是否一致
+- 确认 `global_segment_size` 足够大
+
+**4. /dev/shm 权限不足**
+
+```bash
+# 清理残留的共享内存文件
+find /dev/shm -type f -print0 | xargs -0 rm -f
+```
+
+### 调试模式
+
+开启调试日志：
+
+```bash
+export FD_DEBUG=1
+```
+
+## 更多资源
+
+- [Mooncake 官方文档](https://github.com/kvcache-ai/Mooncake)
+- [Mooncake 故障排查指南](https://github.com/kvcache-ai/Mooncake/blob/main/docs/source/troubleshooting/troubleshooting.md)
+- [FastDeploy 文档](https://paddlepaddle.github.io/FastDeploy/)
@@ -1,57 +1,31 @@
-# MooncakeStore for FastDeploy
+# Global Cache Pooling Examples

-This document describes how to use MooncakeStore as the backend of FastDeploy.
+This directory contains example scripts for Global Cache Pooling with MooncakeStore.

-## Preparation
+## Documentation

-### Install FastDeploy
+- [English Documentation](../../docs/features/global_cache_pooling.md)
+- [中文文档](../../docs/zh/features/global_cache_pooling.md)

-Refer to [NVIDIA CUDA GPU Installation](https://paddlepaddle.github.io/FastDeploy/get_started/installation/nvidia_gpu/) for Fastdeploy installation.
-
-### Install MooncakeStore
+## Quick Start

 ```bash
-pip install mooncake-transfer-engine
-```
-
-## Run Examples
-
-The example script is provided in `run.sh`. You can run it directly:
-```
+# Multi-instance scenario
 bash run.sh
+
+# PD disaggregation scenario
+bash run_03b_pd_storage.sh
 ```

-In the example script, we will start a Mooncake master server and two FastDeploy server.
+## Scripts

-Launch Mooncake master server:
-```bash
-mooncake_master \
-    --port=15001 \
-    --enable_http_metadata_server=true  \
-    --http_metadata_server_host=0.0.0.0 \
-    --http_metadata_server_port=15002 \
-    --metrics_port=15003 \
-```
+| Script | Scenario | Description |
+|--------|----------|-------------|
+| `run.sh` | Multi-Instance | Two standalone instances sharing cache |
+| `run_03b_pd_storage.sh` | PD Disaggregation | P+D instances with global cache pooling |

-More parameter can be found in the [official guide](https://github.com/kvcache-ai/Mooncake/blob/main/docs/source/python-api-reference/transfer-engine.md).
+## Files

-Launch the Fastdeploy with Mooncake enabled.
-
-```bash
-export MOONCAKE_CONFIG_PATH="./mooncake_config.json"
-
-python -m fastdeploy.entrypoints.openai.api_server \
-       --model ${MODEL_NAME} \
-       --port ${PORT} \
-       --metrics-port $((PORT + 1)) \
-       --engine-worker-queue-port $((PORT + 2)) \
-       --cache-queue-port $((PORT + 3)) \
-       --max-model-len 32768 \
-       --max-num-seqs 32 \
-       --kvcache-storage-backend mooncake
-```
-
-## Troubleshooting
-
-For more details, please refer to:
-https://github.com/kvcache-ai/Mooncake/blob/main/docs/source/troubleshooting/troubleshooting.md
+- `mooncake_config.json` - Mooncake configuration file
+- `utils.sh` - Utility functions for scripts
+- `stop.sh` - Stop all running services
@@ -1,7 +1,6 @@
 {
-    "local_hostname":"localhost",
    "metadata_server":"http://0.0.0.0:15002/metadata",
-    "global_segment_size":8589934592,
+    "global_segment_size":1000000000,
    "local_buffer_size":134217728,
    "protocol":"rdma",
    "rdma_devices": "",
@@ -2,6 +2,7 @@
 set -e

 export MODEL_NAME="PaddlePaddle/ERNIE-4.5-0.3B-Paddle"
+# export MODEL_NAME="/work/models/PaddlePaddle/ERNIE-4.5-0.3B-Paddle"
 export MOONCAKE_CONFIG_PATH=./mooncake_config.json
 export FD_DEBUG=1

@@ -15,8 +16,8 @@ S0_PORT=52700
 S1_PORT=52800

 ports=(
-    $S0_PORT $((S0_PORT + 1)) $((S0_PORT + 2)) $((S0_PORT + 3))
-    $S1_PORT $((S1_PORT + 1)) $((S1_PORT + 2)) $((S1_PORT + 3))
+    $S0_PORT
+    $S1_PORT
 )
 check_ports "${ports[@]}" || {
    echo "❌ Some ports are in use. Please release them."
@@ -41,9 +42,6 @@ echo "server 0 port: ${S0_PORT}"
 nohup python -m fastdeploy.entrypoints.openai.api_server \
       --model ${MODEL_NAME} \
       --port ${S0_PORT} \
-       --metrics-port $((S0_PORT + 1)) \
-       --engine-worker-queue-port $((S0_PORT + 2)) \
-       --cache-queue-port $((S0_PORT + 3)) \
       --max-model-len 32768 \
       --max-num-seqs 32 \
       --kvcache-storage-backend mooncake \
@@ -58,9 +56,6 @@ echo "server 1 port: ${S1_PORT}"
 nohup python -m fastdeploy.entrypoints.openai.api_server \
       --model ${MODEL_NAME} \
       --port ${S1_PORT} \
-       --metrics-port $((S1_PORT + 1)) \
-       --engine-worker-queue-port $((S1_PORT + 2)) \
-       --cache-queue-port $((S1_PORT + 3)) \
       --max-model-len 32768 \
       --max-num-seqs 32 \
       --kvcache-storage-backend mooncake \
@@ -0,0 +1,219 @@
+#!/bin/bash
+set -e
+
+# =============================================================================
+# PD Disaggregation + Global Cache Pooling Test Script
+# Reference: start_v1_tp1.sh (PD Disaggregation) + run.sh (Mooncake Cache Pooling)
+# Note: Modify CUDA_VISIBLE_DEVICES environment variables for PD instances
+# =============================================================================
+
+# ======================== Environment Variables Configuration ========================
+export MODEL_NAME="/work/models/PaddlePaddle/ERNIE-4.5-0.3B-Paddle"
+export FD_DEBUG=1
+
+# Mooncake Configuration (using environment variables)
+master_ip="127.0.0.1"
+master_port=15001
+metadata_port=15002
+
+export MOONCAKE_MASTER_SERVER_ADDR="${master_ip}:${master_port}"
+export MOONCAKE_METADATA_SERVER="http://${master_ip}:${metadata_port}/metadata"
+export MOONCAKE_GLOBAL_SEGMENT_SIZE="50000000000"
+# export MOONCAKE_PROTOCOL="tcp"
+export MOONCAKE_PROTOCOL="rdma"
+# export MOONCAKE_RDMA_DEVICES="mlx5_0"
+
+# ======================== Port Configuration ========================
+P_PORT=52400
+D_PORT=52500
+ROUTER_PORT=52700
+LOG_DATE=$(date +%Y%m%d_%H%M%S)
+
+# ======================== Cleanup and Preparation ========================
+unset http_proxy && unset https_proxy
+rm -rf log_*
+
+source ./utils.sh
+
+# Check ports
+ports=($P_PORT $D_PORT $ROUTER_PORT $master_port $metadata_port)
+check_ports "${ports[@]}" || {
+    echo "❌ Some ports are in use. Please release them."
+    exit 1
+}
+
+# ======================== Start Mooncake Master ========================
+echo "=== Starting Mooncake Master ==="
+export FD_LOG_DIR="log_master"
+mkdir -p ${FD_LOG_DIR}
+
+nohup mooncake_master \
+    --port=${master_port} \
+    --enable_http_metadata_server=true \
+    --http_metadata_server_host=0.0.0.0 \
+    --http_metadata_server_port=${metadata_port} \
+    2>&1 > ${FD_LOG_DIR}/nohup &
+
+sleep 2  # Wait for Mooncake Master to start
+
+# ======================== Start Router ========================
+echo "=== Starting Router ==="
+export FD_LOG_DIR="log_router"
+mkdir -p ${FD_LOG_DIR}
+echo "Router log: ${FD_LOG_DIR}, port: ${ROUTER_PORT}"
+
+nohup python -m fastdeploy.router.launch \
+    --port ${ROUTER_PORT} \
+    --splitwise \
+    2>&1 > ${FD_LOG_DIR}/nohup &
+
+sleep 2  # Wait for Router to start
+
+# ======================== Start P Instance (Prefill) ========================
+echo "=== Starting Prefill Instance ==="
+export CUDA_VISIBLE_DEVICES=3
+export FD_LOG_DIR="log_prefill"
+mkdir -p ${FD_LOG_DIR}
+echo "Prefill log: ${FD_LOG_DIR}, port: ${P_PORT}, GPU: ${CUDA_VISIBLE_DEVICES}"
+
+nohup python -m fastdeploy.entrypoints.openai.api_server \
+    --model ${MODEL_NAME} \
+    --port ${P_PORT} \
+    --max-model-len 32768 \
+    --max-num-seqs 32 \
+    --splitwise-role prefill \
+    --cache-transfer-protocol rdma \
+    --router "0.0.0.0:${ROUTER_PORT}" \
+    --kvcache-storage-backend mooncake \
+    2>&1 > ${FD_LOG_DIR}/nohup &
+
+
+# ======================== Start D Instance (Decode) ========================
+echo "=== Starting Decode Instance ==="
+export CUDA_VISIBLE_DEVICES=7
+export FD_LOG_DIR="log_decode"
+mkdir -p ${FD_LOG_DIR}
+echo "Decode log: ${FD_LOG_DIR}, port: ${D_PORT}, GPU: ${CUDA_VISIBLE_DEVICES}"
+
+nohup python -m fastdeploy.entrypoints.openai.api_server \
+    --model ${MODEL_NAME} \
+    --port ${D_PORT} \
+    --max-model-len 32768 \
+    --max-num-seqs 32 \
+    --splitwise-role decode \
+    --cache-transfer-protocol rdma \
+    --router "0.0.0.0:${ROUTER_PORT}" \
+    --enable-output-caching \
+    --kvcache-storage-backend mooncake \
+    2>&1 > ${FD_LOG_DIR}/nohup &
+
+
+# ======================== Wait for Services to be Ready ========================
+echo "=== Waiting for services to be ready ==="
+wait_for_health ${P_PORT}
+wait_for_health ${D_PORT}
+
+# Wait for services to register to Router
+sleep 10
+echo "✅ All services are ready!"
+
+# ======================== Send Test Requests ========================
+# Test scenario: Multi-turn conversation, verify that output cache written by D instance can be read by P instance
+#
+# Flow:
+# 1. Request 1: Send first round question, D instance generates answer and writes to global cache (prompt + output)
+# 2. Request 2: Send second round conversation (first round Q&A + follow-up), P instance should hit global cache for first round's complete KV cache
+#
+echo ""
+echo "=== Multi-turn Conversation Test for Global Cache Pooling ==="
+
+# First round question
+msg1="深圳是中国经济实力最强的城市之一。近年来，深圳 GDP 持续稳步增长，2023 年突破 3.4 万亿元人民币，2024 年接近 3.7 万亿元，长期位居全国城市前列。深圳经济以第二产业和第三产业为主，高端制造业、电子信息产业和现代服务业发达，形成了以科技创新为核心的产业结构。依托华为、腾讯、大疆等龙头企业，深圳在数字经济、人工智能、新能源等领域具有显著优势。同时，深圳进出口总额常年位居全国城市第一，是中国对外开放和高质量发展的重要引擎。深圳2024年 GDP 是多少？"
+
+echo ""
+echo ">>> Request 1: First round question"
+echo "    Purpose: D instance generates output and writes to global cache (prompt + output)"
+echo ""
+
+# Send first round request and get response
+response1=$(curl -s -X POST "http://0.0.0.0:${ROUTER_PORT}/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -d "{
+    \"messages\": [
+      {\"role\": \"user\", \"content\": \"${msg1}\"}
+    ],
+    \"max_tokens\": 200,
+    \"min_tokens\": 130,
+    \"stream\": false,
+    \"top_p\": 0
+  }")
+
+echo "Response 1:"
+echo "${response1}" | python3 -m json.tool 2>/dev/null || echo "${response1}"
+
+# Extract first round response content
+assistant_reply=$(echo "${response1}" | python3 -c "import sys, json; data=json.load(sys.stdin); print(data['choices'][0]['message']['content'])" 2>/dev/null || echo "")
+
+if [ -z "${assistant_reply}" ]; then
+    echo "❌ Failed to get response from Request 1"
+    exit 1
+fi
+
+# JSON escape assistant_reply to prevent newlines, quotes, and other special characters from breaking JSON format
+assistant_reply_escaped=$(python3 -c "import json,sys; print(json.dumps(sys.stdin.read().strip()))" <<< "${assistant_reply}")
+
+echo ""
+echo "Assistant reply extracted: ${assistant_reply}..."
+
+# Wait for D instance to write output cache to global storage
+echo ""
+echo ">>> Waiting for D instance to write output cache to global storage..."
+sleep 5
+
+# Second round follow-up question
+msg2="那深圳2023年的GDP是多少？和2024年相比增长了多少？"
+
+echo ""
+echo ">>> Request 2: Second round (multi-turn conversation)"
+echo "    Purpose: P instance should hit global cache including D's output from Request 1"
+echo "    Check log_prefill/nohup for 'storage_match' to verify cache hit"
+echo ""
+
+# Send second round request (including complete multi-turn conversation history)
+response2=$(curl -s -X POST "http://0.0.0.0:${ROUTER_PORT}/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -d "{
+    \"messages\": [
+      {\"role\": \"user\", \"content\": \"${msg1}\"},
+      {\"role\": \"assistant\", \"content\": ${assistant_reply_escaped}},
+      {\"role\": \"user\", \"content\": \"${msg2}\"}
+    ],
+    \"max_tokens\": 100,
+    \"stream\": false,
+    \"top_p\": 0
+  }")
+
+echo "Response 2:"
+echo "${response2}" | python3 -m json.tool 2>/dev/null || echo "${response2}"
+
+# Extract second round response content and display
+assistant_reply2=$(echo "${response2}" | python3 -c "import sys, json; data=json.load(sys.stdin); print(data['choices'][0]['message']['content'])" 2>/dev/null || echo "")
+echo ""
+echo "Assistant reply 2: ${assistant_reply2}"
+
+echo ""
+echo ""
+echo "=== Test completed ==="
+echo ""
+echo "Verification Steps:"
+echo "1. Check log_prefill/nohup for Request 2's cache hit info:"
+echo "   grep -E 'storage_match|cache_hit|matched.*block' log_prefill/nohup"
+echo ""
+echo "2. If 'storage_match_token_num > 0' in Request 2, it means P instance"
+echo "   successfully read the output cache written by D instance from Request 1"
+echo ""
+echo "Log files:"
+echo "  - Prefill: log_prefill/nohup"
+echo "  - Decode:  log_decode/nohup"
+echo "  - Router:  log_router/nohup"
+echo "  - Master:  log_master/nohup"
@@ -1139,6 +1139,73 @@ class PrefixCacheManager:
        cost_time = time.time() - tic
        logger.info(f"finish write cache back to storage, req_id: {req_id}, cost_time: {cost_time:.6f}s")

+    def write_cache_to_storage_decode(self, request: Request):
+        """
+        D instance (Decode Node) simplified write method, does not rely on Radix Tree.
+
+        D instance does not maintain Radix Tree, so it cannot get keys through req_leaf_map.
+        Need to calculate cache keys directly based on token_ids.
+
+        Key generation algorithm is exactly the same as P instance (chained hash):
+        - Block 0: key_0 = get_hash_str(token_ids[0:block_size], [])
+        - Block 1: key_1 = get_hash_str(token_ids[block_size:2*block_size], [key_0])
+        - Block n: key_n = get_hash_str(token_ids[n*block_size:(n+1)*block_size], [key_{n-1}])
+
+        Incremental write logic is handled by CacheTransferManager.
+        """
+        if self.kvcache_storage_backend is None:
+            return
+
+        # 1. Get complete token_ids
+        token_ids = request.prompt_token_ids
+        if isinstance(token_ids, np.ndarray):
+            token_ids = token_ids.tolist()
+        else:
+            token_ids = list(token_ids)
+
+        if self.config.cache_config.enable_output_caching:
+            token_ids = token_ids + request.output_token_ids
+
+        # 2. Calculate cache keys using chained hash (consistent with P instance)
+        keys = []
+        prefix_block_key = []  # Initial is empty list
+        block_size = self.config.cache_config.block_size
+
+        for i in range(0, len(token_ids), block_size):
+            block_token_ids = token_ids[i : i + block_size]
+            if len(block_token_ids) < block_size:
+                break  # Do not cache incomplete block
+
+            # Calculate hash key for current block
+            key = get_hash_str(block_token_ids, prefix_block_key)
+            keys.append(key)
+
+            # Update prefix_block_key to current key (for next block)
+            prefix_block_key = [key]
+
+        if not keys:
+            return
+
+        # 3. Get corresponding gpu_block_ids
+        gpu_block_ids = request.block_tables[: len(keys)]
+
+        # 4. Construct WriteStorageTask and send
+        # Incremental logic is handled by CacheTransferManager.write_back_storage_task()
+        req_id = request.request_id
+        logger.info(f"[D instance] start write cache to storage, req_id: {req_id}, block num: {len(keys)}")
+
+        write_storage_task = WriteStorageTask(
+            task_id=req_id,
+            keys=keys,
+            token_ids=token_ids,
+            gpu_block_ids=gpu_block_ids,
+        )
+
+        tic = time.time()
+        self.issue_write_back_storage_task(write_storage_task, is_sync=True)
+        cost_time = time.time() - tic
+        logger.info(f"[D instance] finish write cache to storage, req_id: {req_id}, cost_time: {cost_time:.6f}s")
+
    def issue_write_back_storage_task(self, task: WriteStorageTask, is_sync=True):
        if self.kvcache_storage_backend is None:
            return
@@ -79,6 +79,11 @@ class MooncakeStoreConfig:
            logger.info(f"No RDMA devices specified, defaulting to all available devices: {rdma_devices}")
        if metadata_server is None or master_server_addr is None:
            raise ValueError("Both MOONCAKE_METADATA_SERVER and MOONCAKE_MASTER_SERVER_ADDR must be provided.")
+        if local_hostname == "localhost":
+            raise ValueError(
+                "The local hostname for mooncake must not be `localhost` to avoid "
+                "potential error, please do not manually set this param."
+            )

        return MooncakeStoreConfig(
            local_hostname=local_hostname,
@@ -558,8 +558,6 @@ class EngineArgs:

        if not self.tokenizer:
            self.tokenizer = self.model
-        if self.splitwise_role == "decode":
-            self.enable_prefix_caching = False
        if (
            not current_platform.is_cuda()
            and not current_platform.is_xpu()
@@ -694,7 +694,11 @@ class ResourceManagerV1(ResourceManager):
        return False

    def cache_output_tokens(self, request):
-        if self.config.cache_config.enable_prefix_caching and self.config.cache_config.enable_output_caching:
+        if (
+            self.config.cache_config.enable_prefix_caching
+            and self.config.cache_config.enable_output_caching
+            and self.config.scheduler_config.splitwise_role != "decode"
+        ):
            with self.lock:
                if request.num_computed_tokens >= request.need_prefill_tokens:  # request is decoding
                    self.cache_manager.cache_output_blocks(request, self.config.cache_config.block_size)
@@ -863,7 +867,10 @@ class ResourceManagerV1(ResourceManager):
                        scheduled_reqs.append(self._prepare_prefill_task(request, num_new_tokens))
                    token_budget -= num_new_tokens
                    request.num_computed_tokens += num_new_tokens
-                    if self.config.cache_config.enable_prefix_caching:
+                    if (
+                        self.config.cache_config.enable_prefix_caching
+                        and self.config.scheduler_config.splitwise_role != "decode"
+                    ):
                        self.cache_manager.update_cache_blocks(
                            request, self.config.cache_config.block_size, request.num_computed_tokens
                        )
@@ -939,7 +946,10 @@ class ResourceManagerV1(ResourceManager):
                            scheduled_reqs.append(self._prepare_prefill_task(request, num_new_tokens))
                            token_budget -= num_new_tokens
                            request.num_computed_tokens += num_new_tokens
-                            if self.config.cache_config.enable_prefix_caching:
+                            if (
+                                self.config.cache_config.enable_prefix_caching
+                                and self.config.scheduler_config.splitwise_role != "decode"
+                            ):
                                self.cache_manager.update_cache_blocks(
                                    request, self.config.cache_config.block_size, request.num_computed_tokens
                                )
@@ -959,7 +969,10 @@ class ResourceManagerV1(ResourceManager):
                        request.need_prefill_tokens = (
                            request.num_total_tokens
                        )  # Before preempted task rescheduled, preempted task has been sent to engine, no more tokens are output, here num_total_tokens should be static and correct
-                        if self.config.cache_config.enable_prefix_caching:
+                        if (
+                            self.config.cache_config.enable_prefix_caching
+                            and self.config.scheduler_config.splitwise_role != "decode"
+                        ):
                            if (
                                self.cache_manager.num_cpu_blocks > 0
                                or self.config.cache_config.kvcache_storage_backend
@@ -998,7 +1011,10 @@ class ResourceManagerV1(ResourceManager):
                            scheduled_reqs.append(self._prepare_prefill_task(request, num_new_tokens))
                            token_budget -= num_new_tokens
                            request.num_computed_tokens += num_new_tokens
-                            if self.config.cache_config.enable_prefix_caching:
+                            if (
+                                self.config.cache_config.enable_prefix_caching
+                                and self.config.scheduler_config.splitwise_role != "decode"
+                            ):
                                self.cache_manager.update_cache_blocks(
                                    request, self.config.cache_config.block_size, request.num_computed_tokens
                                )
@@ -1257,9 +1273,10 @@ class ResourceManagerV1(ResourceManager):
            ) // self.config.cache_config.block_size + self.config.cache_config.enc_dec_block_num  # consider for mtp, plus enc_dec_block_num
            if self.config.cache_config.enable_prefix_caching:
                # Enable prefix caching
-                if self.cache_manager.num_cpu_blocks > 0:
+                if self.cache_manager.num_cpu_blocks > 0 or self.config.cache_config.kvcache_storage_backend:
                    if not self.cache_manager.can_allocate_gpu_blocks(
-                        need_prealloc_prefill_blocks
+                        (request.need_prefill_tokens + self.config.cache_config.block_size - 1)
+                        // self.config.cache_config.block_size
                    ):  # to prevent block allocation for matching in hierarchical cache and cause dead lock
                        return False
                success = self.get_prefix_cached_blocks(request)
@@ -1372,7 +1389,7 @@ class ResourceManagerV1(ResourceManager):
            self.running.append(request)

    def _free_blocks(self, request: Request):
-        if self.config.cache_config.enable_prefix_caching:
+        if self.config.cache_config.enable_prefix_caching and self.config.scheduler_config.splitwise_role != "decode":
            self.cache_manager.release_block_ids(request)
            self.cache_manager.recycle_gpu_blocks(
                request.block_tables[request.num_cached_blocks :], request.request_id
@@ -1432,8 +1449,14 @@ class ResourceManagerV1(ResourceManager):
                        del self.req_dict[req_id]

            # Do not block the main thread here
+            # Write cache to storage if kvcache_storage_backend is enabled
            for req in need_postprocess_reqs:
-                self.cache_manager.write_cache_to_storage(req)
+                if self.config.scheduler_config.splitwise_role == "decode":
+                    # D instance uses simplified write method (does not rely on Radix Tree)
+                    self.cache_manager.write_cache_to_storage_decode(req)
+                else:
+                    # P instance / Mixed instance uses standard write method (relies on Radix Tree)
+                    self.cache_manager.write_cache_to_storage(req)

            with self.lock:
                for req in need_postprocess_reqs: