# Weight Clear and Update
FastDeploy supports dynamic weight clear and update for RL and RLHF rollout services. This capability is primarily intended to address the following two requirements:
- release GPU memory when the rollout engine is idle;
- refresh inference weights after the trainer produces a new checkpoint, without restarting the whole service.
This page describes the weight-control interfaces currently supported by FastDeploy, the semantics of each interface, and their typical usage in RLHF training.
## Prerequisites
In RLHF scenarios, FastDeploy mainly provides this capability through the online serving mode. Dynamic weight loading must be enabled when starting the service:
```shell
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/model \
    --dynamic-load-weight \
    --load_strategy ipc_snapshot
```
`--dynamic-load-weight` enables dynamic weight control, and `--load_strategy` specifies the concrete weight update mechanism. The currently supported update modes are listed below:
| Mode | `load_strategy` | Typical use | Notes |
|---|---|---|---|
| CUDA IPC | `ipc` | Training and inference processes on the same node share live tensors | Update source comes from IPC metadata produced by the training side. |
| IPC snapshot | `ipc_snapshot` | Rollout reloads a snapshot file produced by training | Used by current RL rollout examples. |
| RDMA / rsync | `rsync` | Trainer publishes a new version and rollout fetches it remotely | `POST /v1/update_weights` is the explicit API for this mode. |
## API Overview
### Compatibility APIs
In FastDeploy <= 2.5, the following simplified APIs are provided for compatibility with the legacy RL control flow.
| API | Method | Meaning | Availability |
|---|---|---|---|
| `/clear_load_weight` | GET | Clear or offload currently loaded weights | Requires `dynamic_load_weight=True` |
| `/update_model_weight` | GET | Reload weights after a clear/offload operation | Requires `dynamic_load_weight=True` |
### V1 control APIs
In FastDeploy >= 2.6, the underlying control-signal communication path was optimized and the V1 control APIs were introduced. Compared with the legacy APIs, the V1 APIs provide a more stable execution path, clearer semantics, and more flexible control:
| API | Method | Request params | Semantics |
|---|---|---|---|
| `/v1/pause` | POST | none | Pause request generation, abort running and inflight requests, reset scheduler state, and pause cache transfer if enabled. |
| `/v1/resume` | POST | none | Resume request generation and cache transfer. |
| `/v1/is_paused` | GET | none | Return `{"is_paused": bool}`. |
| `/v1/sleep` | POST | `?tags=weight,kv_cache` | Offload selected GPU memory objects. Supported tags are `weight` and `kv_cache`; if omitted, both are used. |
| `/v1/wakeup` | POST | `?tags=weight,kv_cache` | Reload previously offloaded weights and/or KV cache. On success, the engine resumes automatically. |
| `/v1/update_weights` | POST | JSON `{"version": "...", "rsync_config": {...}}` | Refresh weights in place through the worker control path. Intended for remote versioned updates, especially `load_strategy=rsync`. |
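For orchestration from a trainer script, the V1 endpoints above can be wrapped in a small client. The sketch below uses only the Python standard library; the class and method names are illustrative, not part of FastDeploy, and the transport is injectable so the sketch can be exercised without a live server.

```python
import json
import urllib.request


class WeightControlClient:
    """Illustrative wrapper over the V1 control endpoints (not a FastDeploy class)."""

    def __init__(self, base_url="http://127.0.0.1:8000", opener=urllib.request.urlopen):
        self.base_url = base_url.rstrip("/")
        self._open = opener  # injectable transport, keeps the sketch testable offline

    def _call(self, method, path, body=None):
        data = json.dumps(body).encode() if body is not None else None
        headers = {"Content-Type": "application/json"} if body is not None else {}
        req = urllib.request.Request(self.base_url + path, data=data,
                                     method=method, headers=headers)
        with self._open(req) as resp:
            return json.loads(resp.read() or b"{}")

    def pause(self):
        return self._call("POST", "/v1/pause")

    def resume(self):
        return self._call("POST", "/v1/resume")

    def is_paused(self):
        return self._call("GET", "/v1/is_paused")["is_paused"]

    def sleep(self, tags="weight,kv_cache"):
        return self._call("POST", "/v1/sleep?tags=" + tags)

    def wakeup(self, tags="weight,kv_cache"):
        return self._call("POST", "/v1/wakeup?tags=" + tags)

    def update_weights(self, version=None, rsync_config=None):
        body = {k: v for k, v in {"version": version,
                                  "rsync_config": rsync_config}.items()
                if v is not None}
        return self._call("POST", "/v1/update_weights", body)
```

Because the opener is injected, unit tests can drive the client against a fake transport instead of a running service.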
### Compatibility Notes
The optimized communication path also applies to the legacy APIs. By setting `FD_ENABLE_V1_UPDATE_WEIGHTS=1`, the legacy APIs can be switched to the new control path while keeping the original API form.
- `FD_ENABLE_V1_UPDATE_WEIGHTS=0`: use the legacy shared-memory-based control path.
- `FD_ENABLE_V1_UPDATE_WEIGHTS=1`: `/clear_load_weight` is effectively handled through `/v1/sleep`, and `/update_model_weight` is effectively handled through `/v1/wakeup`. The corresponding pause/resume actions are handled internally by `sleep` and `wakeup`.
Note: regardless of whether V1 is enabled, the legacy APIs are not the recommended standard interface for RLHF scenarios and may be gradually deprecated in future releases. The `/v1/*` control APIs are recommended.
## Interface Semantics
### `/v1/pause`

`/v1/pause` is the safe boundary before changing model state. It does the following:
- stops new request generation;
- aborts running and inflight requests;
- resets scheduler state;
- pauses cache transfer when multi-level cache or KV cache storage is enabled.
When a clear boundary is required between one rollout round and the next training stage, this API should be called first.
### `/v1/sleep`

`/v1/sleep` offloads selected runtime state from GPU memory.
Supported tags:
- `weight`: clear model weights from device memory; if enabled, communication groups and DeepEP buffers may also be released.
- `kv_cache`: clear the KV cache; the MTP cache is also cleared when speculative decoding uses MTP.
If the `tags` parameter is omitted, FastDeploy defaults to:

```
/v1/sleep?tags=weight,kv_cache
```
In the current implementation, `sleep` automatically performs a `pause` first. New integrations should not rely on this implicit behavior.
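Since the implicit pause is an implementation detail, a new integration can make the boundary explicit. A minimal sketch, assuming `post` is any callable that issues an HTTP POST to the given path (for example, a thin wrapper around `urllib` or `requests`):

```python
def safe_sleep(post, tags="weight,kv_cache"):
    # Pause explicitly instead of relying on the implicit pause inside /v1/sleep.
    post("/v1/pause")
    post("/v1/sleep?tags=" + tags)
```

Injecting the transport this way keeps the sequencing logic independent of the HTTP client used.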
### `/v1/wakeup`

`/v1/wakeup` restores the state offloaded by `/v1/sleep`.
Depending on `tags` and configuration, FastDeploy may:
- restart communication groups;
- recreate DeepEP buffers;
- reload model weights from the configured source;
- rebuild KV cache;
- recapture CUDA Graph.
After `wakeup` succeeds, FastDeploy automatically calls `resume`.
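Because the reload and the automatic resume can take some time, a caller may want to block until the engine reports running again. A hedged sketch that polls `/v1/is_paused`; `get` is an assumed callable returning the decoded JSON body, and the clock/pause hooks exist only so the sketch can be tested without waiting:

```python
import time


def wait_until_running(get, timeout=60.0, interval=0.5,
                       clock=time.monotonic, pause=time.sleep):
    # Poll /v1/is_paused until the engine reports running or the timeout expires.
    deadline = clock() + timeout
    while clock() < deadline:
        if not get("/v1/is_paused")["is_paused"]:
            return True
        pause(interval)
    return False
```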
### `/v1/update_weights`

`/v1/update_weights` refreshes model parameters directly, without unloading the GPU memory occupied by model weights.
Current request fields:
- `version`: optional string used to choose a target checkpoint version.
- `rsync_config`: optional dictionary; must contain `etcd_server` when provided.
Important semantics:

- the engine must already be paused, otherwise the request fails;
- the update is executed on workers only;
- this API is meant for explicit weight refresh, especially the `rsync` path;
- it does not implicitly call `resume`.
Recommended sequence:

1. `POST /v1/pause`
2. `POST /v1/update_weights`
3. `POST /v1/resume`
If GPU memory also needs to be reclaimed between rollout rounds, the `sleep` / `wakeup` workflow is more appropriate.
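The pause -> update_weights -> resume sequence can be sketched as a single helper. This is illustrative, not a FastDeploy API; `post` stands in for any function that issues an HTTP POST with an optional JSON body:

```python
def refresh_weights(post, version, etcd_server):
    # The engine must be paused before /v1/update_weights, or the request fails.
    post("/v1/pause")
    try:
        post("/v1/update_weights",
             {"version": version, "rsync_config": {"etcd_server": etcd_server}})
    finally:
        # /v1/update_weights does not resume implicitly, so resume explicitly.
        post("/v1/resume")
```

The `try/finally` ensures the service is resumed even if the update request raises, at the cost of potentially resuming with the old weights.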
## Example Requests
### Basic APIs
Pause the engine:

```shell
curl -X POST http://127.0.0.1:8000/v1/pause
```

Resume the engine:

```shell
curl -X POST http://127.0.0.1:8000/v1/resume
```
### Sleep / Wakeup APIs
#### Offload weights and KV cache

```shell
# Offload both weights and KV cache
curl -X POST "http://127.0.0.1:8000/v1/sleep?tags=weight,kv_cache"

# Offload only weights
curl -X POST "http://127.0.0.1:8000/v1/sleep?tags=weight"

# Omit the parameter; defaults to both
curl -X POST "http://127.0.0.1:8000/v1/sleep"
```
#### Restore weights and KV cache

```shell
# Restore both weights and KV cache
curl -X POST "http://127.0.0.1:8000/v1/wakeup?tags=weight,kv_cache"

# Restore only weights
curl -X POST "http://127.0.0.1:8000/v1/wakeup?tags=weight"

# Omit the parameter; defaults to both
curl -X POST "http://127.0.0.1:8000/v1/wakeup"
```
Note: when `use_cudagraph=True`, the KV cache must be restored before the weights, i.e. `/v1/wakeup` with the `kv_cache` tag must be called before `/v1/wakeup` with the `weight` tag. Restoring weights without the KV cache raises an error. It is recommended to keep the `tags` parameter consistent between `/v1/sleep` and `/v1/wakeup`.
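To avoid getting this ordering wrong, the wakeup calls can be issued one tag at a time in the required order. A small sketch; `post` is an assumed HTTP POST callable, not a FastDeploy helper:

```python
# KV cache must come back before weights when use_cudagraph=True.
WAKEUP_ORDER = ("kv_cache", "weight")


def ordered_wakeup(post, tags=("weight", "kv_cache")):
    # Issue one wakeup call per requested tag, in the safe order.
    for tag in WAKEUP_ORDER:
        if tag in tags:
            post("/v1/wakeup?tags=" + tag)
```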
### Update Weights API
Refresh to a new remotely published version:

```shell
curl -X POST http://127.0.0.1:8000/v1/update_weights \
  -H "Content-Type: application/json" \
  -d '{
    "version": "global_step_1200",
    "rsync_config": {
      "etcd_server": "127.0.0.1:2379"
    }
  }'
```
## RLHF Usage
### Recommended Rollout Service Setup
In RLHF scenarios, FastDeploy rollout services are typically configured as follows:
- `dynamic_load_weight=True`;
- `load_strategy=ipc_snapshot` for local snapshot-based refresh;
- or `load_strategy=rsync` for versioned remote refresh.
The rollout utilities in the repository already follow this pattern. A typical example is:
```python
from fastdeploy.rl.rollout_config import RolloutModelConfig
from fastdeploy.rl.rollout_model import RolloutModel

rollout_config = RolloutModelConfig(
    model_name_or_path=model_path,
    tensor_parallel_size=ranks,
    dynamic_load_weight=True,
    load_strategy="ipc_snapshot",
)
rollout_model = RolloutModel(rollout_config)
```
### Training-Side Integration Support
In addition to serving endpoints, FastDeploy provides the following training-side integration capabilities for RLHF:
- `RolloutModel.state_dict()`: exposes the rollout-side inference parameters.
- `RolloutModel.get_name_mappings_to_training()`: exposes the mapping from inference parameter names to training parameter names.
These interfaces can be used to align training checkpoints with rollout-side parameter layouts, especially when inference-side and training-side parameter names are not fully identical.
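For example, a trainer-side script could use the mapping to rename a training checkpoint into the rollout layout before publishing it. The helper below is a hypothetical sketch; `infer_to_train` is assumed to be the dict returned by `get_name_mappings_to_training()` (inference name -> training name):

```python
def remap_training_checkpoint(training_state, infer_to_train):
    # Build a rollout-layout state dict keyed by inference parameter names,
    # pulling values from the training-side state dict via the name mapping.
    rollout_state = {}
    for infer_name, train_name in infer_to_train.items():
        if train_name in training_state:
            rollout_state[infer_name] = training_state[train_name]
    return rollout_state
```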
### Common RLHF workflows
The following examples assume the service endpoint is `http://127.0.0.1:8000`.
#### Workflow 1: clear and restore
This workflow is suitable when the rollout service stays resident, but GPU memory should be released before training and restored afterward. The recommended sequence is `(pause) -> sleep -> wakeup -> (resume)`, where the steps in parentheses are optional.
```shell
# Optional: explicitly pause the engine to establish a clear transition boundary
curl -X POST http://127.0.0.1:8000/v1/pause

# Offload both weights and KV cache
curl -X POST "http://127.0.0.1:8000/v1/sleep?tags=weight,kv_cache"

# Restore both weights and KV cache after training completes
curl -X POST "http://127.0.0.1:8000/v1/wakeup?tags=weight,kv_cache"

# Optional: explicitly resume if required by the integration
curl -X POST http://127.0.0.1:8000/v1/resume
```
#### Workflow 2: in-place refresh to a new checkpoint
This workflow is suitable when the service remains resident and only needs to switch to a new checkpoint version. The recommended sequence is `pause -> update_weights -> resume`.
```shell
# Pause the engine first
curl -X POST http://127.0.0.1:8000/v1/pause

# Refresh to a new checkpoint version in place
curl -X POST http://127.0.0.1:8000/v1/update_weights \
  -H "Content-Type: application/json" \
  -d '{
    "version": "global_step_1200",
    "rsync_config": {
      "etcd_server": "127.0.0.1:2379"
    }
  }'

# Resume the service after the update completes
curl -X POST http://127.0.0.1:8000/v1/resume
```
#### Workflow 3: legacy compatibility APIs
Legacy RL clients can continue to use the compatibility flow `/clear_load_weight` -> `/update_model_weight`.
```shell
# Clear or offload the current weights
curl -X GET http://127.0.0.1:8000/clear_load_weight

# Reload weights after the trainer updates the checkpoint
curl -X GET http://127.0.0.1:8000/update_model_weight
```
For new integrations, the `/v1/*` APIs are recommended because their control path is more explicit and easier to trace.
## Other Related Configuration
### Communication Group Clear and Rebuild
FastDeploy provides `--shutdown-comm-group-if-worker-idle` and `--no-shutdown-comm-group-if-worker-idle` to explicitly control whether communication groups should also be torn down when weights are offloaded.
Keeping communication groups alive generally improves the stability of weight clearing and reloading. The tradeoff is that more GPU memory remains allocated after weights are offloaded, and `sleep` / `wakeup` may take longer to execute.
By default:
- in EP scenarios, communication groups are kept;
- in non-EP scenarios, communication groups are torn down.
### CPU Cache Clear and Rebuild
After `--swap-space` is enabled, the environment variable below controls whether the CPU-side cache is also cleared when `/v1/sleep` executes, in order to reduce host memory pressure during training.
By default, FastDeploy does not actively clear the CPU cache. To clear it together with `sleep`, set:
```shell
export FD_ENABLE_SWAP_SPACE_CLEARING=1
```