# Global Cache Pooling
This document describes how to use MooncakeStore as the KV Cache storage backend for FastDeploy, enabling Global Cache Pooling across multiple inference instances.
## Overview

### What is Global Cache Pooling?
Global Cache Pooling allows multiple FastDeploy instances to share KV Cache through a distributed storage layer. This enables:
- Cross-instance cache reuse: KV Cache computed by one instance can be reused by another
- PD Disaggregation optimization: Prefill and Decode instances can share cache seamlessly
- Reduced computation: Avoid redundant prefix computation across requests
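One common way to realize this kind of sharing is content-addressed cache keys: every instance derives the same key from the same token prefix, so a KV block written by one instance is discoverable by any other. A minimal illustrative sketch of the idea (the hashing scheme and block size here are assumptions, not FastDeploy's actual implementation):

```python
import hashlib

BLOCK_SIZE = 64  # tokens per KV block (illustrative value)

def block_keys(token_ids, block_size=BLOCK_SIZE):
    """Derive one content-addressed key per full block of the prefix.

    Each key covers all tokens up to and including its block, so two
    requests sharing a prefix produce identical leading keys on any instance.
    """
    keys = []
    for end in range(block_size, len(token_ids) + 1, block_size):
        digest = hashlib.sha256(repr(token_ids[:end]).encode()).hexdigest()
        keys.append(digest)
    return keys

# Two requests sharing their first 64 tokens map to the same first key,
# so the second request can fetch that block from the shared pool.
a = block_keys(list(range(128)))
b = block_keys(list(range(64)) + list(range(1000, 1064)))
assert a[0] == b[0] and a[1] != b[1]
```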
### Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                     Mooncake Master Server                      │
│                (Metadata & Coordination Service)                │
└────────────────────────────┬────────────────────────────────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
         ▼                   ▼                   ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│   FastDeploy    │ │   FastDeploy    │ │   FastDeploy    │
│   Instance P    │ │   Instance D    │ │   Instance X    │
│   (Prefill)     │ │    (Decode)     │ │  (Standalone)   │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
         │                   │                   │
         └───────────────────┼───────────────────┘
                             │
                    ┌────────▼────────┐
                    │  MooncakeStore  │
                    │   (Shared KV    │
                    │   Cache Pool)   │
                    └─────────────────┘
```
## Example Scripts

Ready-to-use example scripts are available in `examples/cache_storage/`.

| Script | Scenario | Description |
|---|---|---|
| `run.sh` | Multi-Instance | Two standalone instances sharing cache |
| `run_03b_pd_storage.sh` | PD Disaggregation | P+D instances with global cache pooling |
## Prerequisites

### Hardware Requirements

- NVIDIA GPU with CUDA support
- RDMA network (recommended for production) or TCP

### Software Requirements

- Python 3.8+
- CUDA 11.8+
- FastDeploy (see installation below)
## Installation

Refer to NVIDIA CUDA GPU Installation for FastDeploy installation.

```bash
# Option 1: Install from PyPI
pip install fastdeploy-gpu

# Option 2: Build from source
bash build.sh
pip install ./dist/fastdeploy*.whl
```
MooncakeStore is automatically installed when you install FastDeploy.
## Configuration

We support two ways to configure MooncakeStore: via the configuration file `mooncake_config.json` or via environment variables.

### Mooncake Configuration File

Create a `mooncake_config.json` file:

```json
{
    "metadata_server": "http://0.0.0.0:15002/metadata",
    "master_server_addr": "0.0.0.0:15001",
    "global_segment_size": 1000000000,
    "local_buffer_size": 134217728,
    "protocol": "rdma",
    "rdma_devices": ""
}
```

Set the `MOONCAKE_CONFIG_PATH` environment variable to enable the configuration:

```bash
export MOONCAKE_CONFIG_PATH=path/to/mooncake_config.json
```
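Before launching instances, a quick sanity check of the file that `MOONCAKE_CONFIG_PATH` points at can save a failed startup. A hedged helper sketch (key names are taken from the example above; the validation rules are our own, not Mooncake's):

```python
import json

REQUIRED_KEYS = ("metadata_server", "master_server_addr")
SIZE_KEYS = ("global_segment_size", "local_buffer_size")

def check_mooncake_config(path):
    """Return a list of problems found in a mooncake_config.json file."""
    with open(path) as f:
        cfg = json.load(f)
    problems = [f"missing required key: {k}" for k in REQUIRED_KEYS if k not in cfg]
    for k in SIZE_KEYS:
        value = cfg.get(k, 1)
        if not isinstance(value, int) or value <= 0:
            problems.append(f"{k} must be a positive integer (bytes)")
    if cfg.get("protocol", "rdma") not in ("rdma", "tcp"):
        problems.append("protocol must be 'rdma' or 'tcp'")
    return problems
```

For example, `check_mooncake_config("./mooncake_config.json")` returns an empty list when the file matches the shape shown above.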
Configuration parameters:
| Parameter | Description | Default |
|---|---|---|
| `metadata_server` | HTTP metadata server URL | Required |
| `master_server_addr` | Master server address | Required |
| `global_segment_size` | Memory each TP process contributes to the global shared pool (bytes) | 1GB |
| `local_buffer_size` | Local buffer size for data transfer (bytes) | 128MB |
| `protocol` | Transfer protocol: `rdma` or `tcp` | `rdma` |
| `rdma_devices` | RDMA device names (comma-separated) | Auto-detect |
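When choosing `global_segment_size`, a back-of-envelope estimate of how many tokens one segment can hold is useful. A sketch of the arithmetic (the layer count, KV head count, head dimension, and fp16 cache below are illustrative placeholders, not tied to any particular model):

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """KV cache footprint of one token: K and V tensors across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Illustrative 24-layer model with 8 KV heads of dim 128 and an fp16 cache:
per_token = kv_bytes_per_token(num_layers=24, num_kv_heads=8, head_dim=128)
tokens_per_segment = 1_000_000_000 // per_token  # with the 1GB default above
# per_token is 98304 bytes, so one 1GB segment holds roughly 10k tokens here
```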
### Environment Variables

Mooncake can also be configured via environment variables:

| Variable | Description |
|---|---|
| `MOONCAKE_MASTER_SERVER_ADDR` | Master server address (e.g., `10.0.0.1:15001`) |
| `MOONCAKE_METADATA_SERVER` | Metadata server URL |
| `MOONCAKE_GLOBAL_SEGMENT_SIZE` | Memory each TP process contributes to the global shared pool (bytes) |
| `MOONCAKE_LOCAL_BUFFER_SIZE` | Local buffer size (bytes) |
| `MOONCAKE_PROTOCOL` | Transfer protocol (`rdma` or `tcp`) |
| `MOONCAKE_RDMA_DEVICES` | RDMA device names |
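The table above can be mirrored in code; a hedged sketch of collecting these variables with the defaults from the file-based configuration table (FastDeploy's own parsing and precedence may differ):

```python
import os

DEFAULTS = {
    "MOONCAKE_GLOBAL_SEGMENT_SIZE": 1_000_000_000,  # 1GB, per the config table
    "MOONCAKE_LOCAL_BUFFER_SIZE": 134_217_728,      # 128MB
    "MOONCAKE_PROTOCOL": "rdma",
}

def mooncake_env(environ=None):
    """Collect Mooncake settings from the environment, filling in defaults."""
    env = os.environ if environ is None else environ
    return {
        "master_server_addr": env.get("MOONCAKE_MASTER_SERVER_ADDR"),
        "metadata_server": env.get("MOONCAKE_METADATA_SERVER"),
        "protocol": env.get("MOONCAKE_PROTOCOL", DEFAULTS["MOONCAKE_PROTOCOL"]),
        "global_segment_size": int(env.get("MOONCAKE_GLOBAL_SEGMENT_SIZE",
                                           DEFAULTS["MOONCAKE_GLOBAL_SEGMENT_SIZE"])),
        "local_buffer_size": int(env.get("MOONCAKE_LOCAL_BUFFER_SIZE",
                                         DEFAULTS["MOONCAKE_LOCAL_BUFFER_SIZE"])),
        "rdma_devices": env.get("MOONCAKE_RDMA_DEVICES", ""),
    }
```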
## Usage Scenarios

### Scenario 1: Multi-Instance Cache Sharing

Run multiple FastDeploy instances sharing a global KV Cache pool.

#### Step 1: Start Mooncake Master

```bash
mooncake_master \
    --port=15001 \
    --enable_http_metadata_server=true \
    --http_metadata_server_host=0.0.0.0 \
    --http_metadata_server_port=15002 \
    --metrics_port=15003
```
#### Step 2: Start FastDeploy Instances

Instance 0:

```bash
export MOONCAKE_CONFIG_PATH="./mooncake_config.json"
export CUDA_VISIBLE_DEVICES=0
python -m fastdeploy.entrypoints.openai.api_server \
    --model ${MODEL_NAME} \
    --port 52700 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --kvcache-storage-backend mooncake
```

Instance 1:

```bash
export MOONCAKE_CONFIG_PATH="./mooncake_config.json"
export CUDA_VISIBLE_DEVICES=1
python -m fastdeploy.entrypoints.openai.api_server \
    --model ${MODEL_NAME} \
    --port 52800 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --kvcache-storage-backend mooncake
```
#### Step 3: Test Cache Reuse

Send the same prompt to both instances. The second instance should reuse the KV Cache computed by the first instance.

```bash
# Request to Instance 0
curl -X POST "http://0.0.0.0:52700/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello, world!"}], "max_tokens": 50}'

# Request to Instance 1 (should hit cached KV)
curl -X POST "http://0.0.0.0:52800/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello, world!"}], "max_tokens": 50}'
```
### Scenario 2: PD Disaggregation with Global Cache

This scenario combines PD Disaggregation with Global Cache Pooling, enabling:

- Prefill instances to read Decode instances' output cache
- Optimal multi-turn conversation performance
Architecture:

```
┌──────────────────────────────────────────┐
│                  Router                  │
│              (Load Balancer)             │
└─────────────────┬────────────────────────┘
                  │
  ┌───────────────┴───────────────┐
  │                               │
  ▼                               ▼
┌─────────────┐             ┌─────────────┐
│   Prefill   │             │   Decode    │
│  Instance   │◄───────────►│  Instance   │
│             │ KV Transfer │             │
└──────┬──────┘             └──────┬──────┘
       │                           │
       └─────────────┬─────────────┘
                     │
            ┌────────▼────────┐
            │  MooncakeStore  │
            │  (Global Cache) │
            └─────────────────┘
```
#### Step 1: Start Mooncake Master

```bash
mooncake_master \
    --port=15001 \
    --enable_http_metadata_server=true \
    --http_metadata_server_host=0.0.0.0 \
    --http_metadata_server_port=15002
```
Step 2: Start Router
python -m fastdeploy.router.launch \
--port 52700 \
--splitwise
#### Step 3: Start Prefill Instance

```bash
export MOONCAKE_MASTER_SERVER_ADDR="127.0.0.1:15001"
export MOONCAKE_METADATA_SERVER="http://127.0.0.1:15002/metadata"
export MOONCAKE_PROTOCOL="rdma"
export CUDA_VISIBLE_DEVICES=0
python -m fastdeploy.entrypoints.openai.api_server \
    --model ${MODEL_NAME} \
    --port 52400 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --splitwise-role prefill \
    --cache-transfer-protocol rdma \
    --router "0.0.0.0:52700" \
    --kvcache-storage-backend mooncake
```
#### Step 4: Start Decode Instance

```bash
export MOONCAKE_MASTER_SERVER_ADDR="127.0.0.1:15001"
export MOONCAKE_METADATA_SERVER="http://127.0.0.1:15002/metadata"
export MOONCAKE_PROTOCOL="rdma"
export CUDA_VISIBLE_DEVICES=1
python -m fastdeploy.entrypoints.openai.api_server \
    --model ${MODEL_NAME} \
    --port 52500 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --splitwise-role decode \
    --cache-transfer-protocol rdma \
    --router "0.0.0.0:52700" \
    --enable-output-caching \
    --kvcache-storage-backend mooncake
```
#### Step 5: Send Requests via Router

```bash
curl -X POST "http://0.0.0.0:52700/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50}'
```
## FastDeploy Parameters for Mooncake

| Parameter | Description |
|---|---|
| `--kvcache-storage-backend mooncake` | Enable Mooncake as the KV Cache storage backend |
| `--enable-output-caching` | Enable output token caching (recommended on Decode instances) |
| `--cache-transfer-protocol rdma` | Use RDMA for KV transfer between P and D |
| `--splitwise-role prefill/decode` | Set the instance role in PD disaggregation |
| `--router` | Router address for PD disaggregation |
## Verification

### Check Cache Hit

To verify cache hits in the logs:

```bash
# For the multi-instance scenario
grep -E "storage_cache_token_num" log_*/api_server.log

# For the PD disaggregation scenario
grep -E "storage_cache_token_num" log_prefill/api_server.log
```

If `storage_cache_token_num` > 0, the instance successfully read cached KV blocks from the global pool.
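The same check can be automated; a small sketch (only the field name `storage_cache_token_num` comes from the logs — the surrounding line format, and therefore the tolerant regex, are assumptions):

```python
import re

# Match "storage_cache_token_num" followed by '=', ':' or whitespace, then digits.
PATTERN = re.compile(r"storage_cache_token_num[=:\s]+(\d+)")

def cache_hit_tokens(log_text):
    """Return every storage_cache_token_num value found in the log text."""
    return [int(m.group(1)) for m in PATTERN.finditer(log_text)]

def had_cache_hit(log_text):
    """True if any request read at least one cached token from the global pool."""
    return any(n > 0 for n in cache_hit_tokens(log_text))
```

For example, `had_cache_hit(open("log_prefill/api_server.log").read())` reports whether the Prefill instance ever hit the pool.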
### Monitor Mooncake Master

```bash
# Check master status
curl http://localhost:15002/metadata

# Check metrics (if metrics_port is configured)
curl http://localhost:15003/metrics
```
## Troubleshooting

### Common Issues

**1. Port Already in Use**

```bash
# Check port usage
ss -ltn | grep 15001

# Kill existing process
kill -9 $(lsof -t -i:15001)
```
**2. RDMA Connection Failed**

- Verify RDMA devices: `ibv_devices`
- Check RDMA network: `ibv_devinfo`
- Fallback to TCP: set `MOONCAKE_PROTOCOL=tcp`
**3. Cache Not Being Shared**

- Verify all instances connect to the same Mooncake master
- Check that the metadata server URL is consistent across instances
- Verify `global_segment_size` is large enough
**4. Permission Denied on /dev/shm**

```bash
# Clean up stale shared memory files
find /dev/shm -type f -print0 | xargs -0 rm -f
```
### Debug Mode

Enable debug logging:

```bash
export FD_DEBUG=1
```