# Global Cache Pooling
This document describes how to use MooncakeStore as the KV Cache storage backend for FastDeploy, enabling Global Cache Pooling across multiple inference instances.
## Overview

### What is Global Cache Pooling?
Global Cache Pooling allows multiple FastDeploy instances to share KV Cache through a distributed storage layer. This enables:
- Cross-instance cache reuse: KV Cache computed by one instance can be reused by another
- PD Disaggregation optimization: Prefill and Decode instances can share cache seamlessly
- Reduced computation: Avoid redundant prefix computation across requests
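One common way to realize this kind of sharing is content-addressed cache keys: every instance derives the same key from the same token prefix, so a KV block written by one instance is discoverable by any other. A minimal illustrative sketch of the idea (the hashing scheme and block size here are assumptions, not FastDeploy's actual implementation):

```python
import hashlib

BLOCK_SIZE = 64  # tokens per KV block (illustrative value)

def block_keys(token_ids, block_size=BLOCK_SIZE):
    """Derive one content-addressed key per full block of the prefix.

    Each key covers all tokens up to and including its block, so two
    requests sharing a prefix produce identical leading keys on any instance.
    """
    keys = []
    for end in range(block_size, len(token_ids) + 1, block_size):
        digest = hashlib.sha256(repr(token_ids[:end]).encode()).hexdigest()
        keys.append(digest)
    return keys

# Two requests sharing their first 64 tokens map to the same first key,
# so the second request can fetch that block from the shared pool.
a = block_keys(list(range(128)))
b = block_keys(list(range(64)) + list(range(1000, 1064)))
assert a[0] == b[0] and a[1] != b[1]
```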
### Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                     Mooncake Master Server                      │
│                (Metadata & Coordination Service)                │
└────────────────────────────┬────────────────────────────────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
         ▼                   ▼                   ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│   FastDeploy    │ │   FastDeploy    │ │   FastDeploy    │
│   Instance P    │ │   Instance D    │ │   Instance X    │
│   (Prefill)     │ │    (Decode)     │ │  (Standalone)   │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
         │                   │                   │
         └───────────────────┼───────────────────┘
                             │
                    ┌────────▼────────┐
                    │  MooncakeStore  │
                    │   (Shared KV    │
                    │   Cache Pool)   │
                    └─────────────────┘
```
## Example Scripts

Ready-to-use example scripts are available in `examples/cache_storage/`.

| Script | Scenario | Description |
|---|---|---|
| `run.sh` | Multi-Instance | Two standalone instances sharing cache |
| `run_03b_pd_storage.sh` | PD Disaggregation | P+D instances with global cache pooling |
## Prerequisites

### Hardware Requirements

- NVIDIA GPU with CUDA support
- RDMA network (recommended for production) or TCP

### Software Requirements

- Python 3.8+
- CUDA 11.8+
- FastDeploy (see installation below)
## Installation

Refer to NVIDIA CUDA GPU Installation for FastDeploy installation.

```bash
# Option 1: Install from PyPI
pip install fastdeploy-gpu

# Option 2: Build from source
bash build.sh
pip install ./dist/fastdeploy*.whl
```
MooncakeStore is automatically installed when you install FastDeploy.
## Configuration

We support two ways to configure MooncakeStore: via the configuration file `mooncake_config.json` or via environment variables.

### Mooncake Configuration File

Create a `mooncake_config.json` file:

```json
{
    "metadata_server": "http://0.0.0.0:15002/metadata",
    "master_server_addr": "0.0.0.0:15001",
    "global_segment_size": 1000000000,
    "local_buffer_size": 134217728,
    "protocol": "rdma",
    "rdma_devices": ""
}
```

Set the `MOONCAKE_CONFIG_PATH` environment variable to enable the configuration:

```bash
export MOONCAKE_CONFIG_PATH=path/to/mooncake_config.json
```
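Before launching instances, a quick sanity check of the file that `MOONCAKE_CONFIG_PATH` points at can save a failed startup. A hedged helper sketch (key names are taken from the example above; the validation rules are our own, not Mooncake's):

```python
import json

REQUIRED_KEYS = ("metadata_server", "master_server_addr")
SIZE_KEYS = ("global_segment_size", "local_buffer_size")

def check_mooncake_config(path):
    """Return a list of problems found in a mooncake_config.json file."""
    with open(path) as f:
        cfg = json.load(f)
    problems = [f"missing required key: {k}" for k in REQUIRED_KEYS if k not in cfg]
    for k in SIZE_KEYS:
        value = cfg.get(k, 1)
        if not isinstance(value, int) or value <= 0:
            problems.append(f"{k} must be a positive integer (bytes)")
    if cfg.get("protocol", "rdma") not in ("rdma", "tcp"):
        problems.append("protocol must be 'rdma' or 'tcp'")
    return problems
```

For example, `check_mooncake_config("./mooncake_config.json")` returns an empty list when the file matches the shape shown above.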
Configuration parameters:
| Parameter | Description | Default |
|---|---|---|
| `metadata_server` | HTTP metadata server URL | Required |
| `master_server_addr` | Master server address | Required |
| `global_segment_size` | Memory each TP process contributes to the global shared pool (bytes) | 1GB |
| `local_buffer_size` | Local buffer size for data transfer (bytes) | 128MB |
| `protocol` | Transfer protocol: `rdma` or `tcp` | `rdma` |
| `rdma_devices` | RDMA device names (comma-separated) | Auto-detect |
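When choosing `global_segment_size`, a back-of-envelope estimate of how many tokens one segment can hold is useful. A sketch of the arithmetic (the layer count, KV head count, head dimension, and fp16 cache below are illustrative placeholders, not tied to any particular model):

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """KV cache footprint of one token: K and V tensors across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Illustrative 24-layer model with 8 KV heads of dim 128 and an fp16 cache:
per_token = kv_bytes_per_token(num_layers=24, num_kv_heads=8, head_dim=128)
tokens_per_segment = 1_000_000_000 // per_token  # with the 1GB default above
# per_token is 98304 bytes, so one 1GB segment holds roughly 10k tokens here
```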
### Environment Variables

Mooncake can also be configured via environment variables:

| Variable | Description |
|---|---|
| `MOONCAKE_MASTER_SERVER_ADDR` | Master server address (e.g., `10.0.0.1:15001`) |
| `MOONCAKE_METADATA_SERVER` | Metadata server URL |
| `MOONCAKE_GLOBAL_SEGMENT_SIZE` | Memory each TP process contributes to the global shared pool (bytes) |
| `MOONCAKE_LOCAL_BUFFER_SIZE` | Local buffer size (bytes) |
| `MOONCAKE_PROTOCOL` | Transfer protocol (`rdma` or `tcp`) |
| `MOONCAKE_RDMA_DEVICES` | RDMA device names |
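The table above can be mirrored in code; a hedged sketch of collecting these variables with the defaults from the file-based configuration table (FastDeploy's own parsing and precedence may differ):

```python
import os

DEFAULTS = {
    "MOONCAKE_GLOBAL_SEGMENT_SIZE": 1_000_000_000,  # 1GB, per the config table
    "MOONCAKE_LOCAL_BUFFER_SIZE": 134_217_728,      # 128MB
    "MOONCAKE_PROTOCOL": "rdma",
}

def mooncake_env(environ=None):
    """Collect Mooncake settings from the environment, filling in defaults."""
    env = os.environ if environ is None else environ
    return {
        "master_server_addr": env.get("MOONCAKE_MASTER_SERVER_ADDR"),
        "metadata_server": env.get("MOONCAKE_METADATA_SERVER"),
        "protocol": env.get("MOONCAKE_PROTOCOL", DEFAULTS["MOONCAKE_PROTOCOL"]),
        "global_segment_size": int(env.get("MOONCAKE_GLOBAL_SEGMENT_SIZE",
                                           DEFAULTS["MOONCAKE_GLOBAL_SEGMENT_SIZE"])),
        "local_buffer_size": int(env.get("MOONCAKE_LOCAL_BUFFER_SIZE",
                                         DEFAULTS["MOONCAKE_LOCAL_BUFFER_SIZE"])),
        "rdma_devices": env.get("MOONCAKE_RDMA_DEVICES", ""),
    }
```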
## Usage Scenarios

### Scenario 1: Multi-Instance Cache Sharing

Run multiple FastDeploy instances sharing a global KV Cache pool.

#### Step 1: Start Mooncake Master

```bash
mooncake_master \
    --port=15001 \
    --enable_http_metadata_server=true \
    --http_metadata_server_host=0.0.0.0 \
    --http_metadata_server_port=15002 \
    --metrics_port=15003
```
#### Step 2: Start FastDeploy Instances

Instance 0:

```bash
export MOONCAKE_CONFIG_PATH="./mooncake_config.json"
export CUDA_VISIBLE_DEVICES=0
python -m fastdeploy.entrypoints.openai.api_server \
    --model ${MODEL_NAME} \
    --port 52700 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --kvcache-storage-backend mooncake
```

Instance 1:

```bash
export MOONCAKE_CONFIG_PATH="./mooncake_config.json"
export CUDA_VISIBLE_DEVICES=1
python -m fastdeploy.entrypoints.openai.api_server \
    --model ${MODEL_NAME} \
    --port 52800 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --kvcache-storage-backend mooncake
```
#### Step 3: Test Cache Reuse

Send the same prompt to both instances. The second instance should reuse the KV Cache computed by the first instance.

```bash
# Request to Instance 0
curl -X POST "http://0.0.0.0:52700/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello, world!"}], "max_tokens": 50}'

# Request to Instance 1 (should hit cached KV)
curl -X POST "http://0.0.0.0:52800/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello, world!"}], "max_tokens": 50}'
```
### Scenario 2: PD Disaggregation with Global Cache

This scenario combines PD Disaggregation with Global Cache Pooling, enabling:

- Prefill instances to read Decode instances' output cache
- Optimal multi-turn conversation performance
Architecture:

```
┌──────────────────────────────────────────┐
│                  Router                  │
│              (Load Balancer)             │
└─────────────────┬────────────────────────┘
                  │
  ┌───────────────┴───────────────┐
  │                               │
  ▼                               ▼
┌─────────────┐             ┌─────────────┐
│   Prefill   │             │   Decode    │
│  Instance   │◄───────────►│  Instance   │
│             │ KV Transfer │             │
└──────┬──────┘             └──────┬──────┘
       │                           │
       └─────────────┬─────────────┘
                     │
            ┌────────▼────────┐
            │  MooncakeStore  │
            │  (Global Cache) │
            └─────────────────┘
```
#### Step 1: Start Mooncake Master

```bash
mooncake_master \
    --port=15001 \
    --enable_http_metadata_server=true \
    --http_metadata_server_host=0.0.0.0 \
    --http_metadata_server_port=15002
```
Step 2: Start Router
python -m fastdeploy.router.launch \
--port 52700 \
--splitwise
#### Step 3: Start Prefill Instance

```bash
export MOONCAKE_MASTER_SERVER_ADDR="127.0.0.1:15001"
export MOONCAKE_METADATA_SERVER="http://127.0.0.1:15002/metadata"
export MOONCAKE_PROTOCOL="rdma"
export CUDA_VISIBLE_DEVICES=0
python -m fastdeploy.entrypoints.openai.api_server \
    --model ${MODEL_NAME} \
    --port 52400 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --splitwise-role prefill \
    --cache-transfer-protocol rdma \
    --router "0.0.0.0:52700" \
    --kvcache-storage-backend mooncake
```
#### Step 4: Start Decode Instance

```bash
export MOONCAKE_MASTER_SERVER_ADDR="127.0.0.1:15001"
export MOONCAKE_METADATA_SERVER="http://127.0.0.1:15002/metadata"
export MOONCAKE_PROTOCOL="rdma"
export CUDA_VISIBLE_DEVICES=1
python -m fastdeploy.entrypoints.openai.api_server \
    --model ${MODEL_NAME} \
    --port 52500 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --splitwise-role decode \
    --cache-transfer-protocol rdma \
    --router "0.0.0.0:52700" \
    --enable-output-caching \
    --kvcache-storage-backend mooncake
```
#### Step 5: Send Requests via Router

```bash
curl -X POST "http://0.0.0.0:52700/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50}'
```
## FastDeploy Parameters for Mooncake

| Parameter | Description |
|---|---|
| `--kvcache-storage-backend mooncake` | Enable Mooncake as the KV Cache storage backend |
| `--enable-output-caching` | Enable output token caching (recommended on Decode instances) |
| `--cache-transfer-protocol rdma` | Use RDMA for KV transfer between P and D |
| `--splitwise-role prefill/decode` | Set the instance role in PD disaggregation |
| `--router` | Router address for PD disaggregation |
## Verification

### Check Cache Hit

To verify cache hits in the logs:

```bash
# For the multi-instance scenario
grep -E "storage_cache_token_num" log_*/api_server.log

# For the PD disaggregation scenario
grep -E "storage_cache_token_num" log_prefill/api_server.log
```

If `storage_cache_token_num` > 0, the instance successfully read cached KV blocks from the global pool.
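The same check can be automated; a small sketch (only the field name `storage_cache_token_num` comes from the logs — the surrounding line format, and therefore the tolerant regex, are assumptions):

```python
import re

# Match "storage_cache_token_num" followed by '=', ':' or whitespace, then digits.
PATTERN = re.compile(r"storage_cache_token_num[=:\s]+(\d+)")

def cache_hit_tokens(log_text):
    """Return every storage_cache_token_num value found in the log text."""
    return [int(m.group(1)) for m in PATTERN.finditer(log_text)]

def had_cache_hit(log_text):
    """True if any request read at least one cached token from the global pool."""
    return any(n > 0 for n in cache_hit_tokens(log_text))
```

For example, `had_cache_hit(open("log_prefill/api_server.log").read())` reports whether the Prefill instance ever hit the pool.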
### Monitor Mooncake Master

```bash
# Check master status
curl http://localhost:15002/metadata

# Check metrics (if metrics_port is configured)
curl http://localhost:15003/metrics
```
## Troubleshooting

### Common Issues

**1. Port Already in Use**

```bash
# Check port usage
ss -ltn | grep 15001

# Kill existing process
kill -9 $(lsof -t -i:15001)
```
**2. RDMA Connection Failed**

- Verify RDMA devices: `ibv_devices`
- Check RDMA network: `ibv_devinfo`
- Fallback to TCP: set `MOONCAKE_PROTOCOL=tcp`
**3. Cache Not Being Shared**

- Verify all instances connect to the same Mooncake master
- Check that the metadata server URL is consistent across instances
- Verify `global_segment_size` is large enough
**4. Permission Denied on /dev/shm**

```bash
# Clean up stale shared memory files
find /dev/shm -type f -print0 | xargs -0 rm -f
```
### Debug Mode

Enable debug logging:

```bash
export FD_DEBUG=1
```