# Weight Clear and Update
FastDeploy supports dynamic weight clear and update for RL and RLHF rollout services. This capability is primarily intended to address the following two requirements:
- release GPU memory when the rollout engine is idle;
- refresh inference weights after the trainer produces a new checkpoint, without restarting the whole service.
This page describes the weight-control interfaces currently supported by FastDeploy, the semantics of each interface, and their typical usage in RLHF training.
## Prerequisites
In RLHF scenarios, FastDeploy mainly provides this capability through the online serving mode. Dynamic weight loading must be enabled when starting the service:
```shell
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/model \
    --dynamic-load-weight \
    --load_strategy ipc_snapshot
```
`--dynamic-load-weight` enables dynamic weight control, and `--load_strategy` specifies the concrete weight update mechanism. The currently supported update modes are listed below:
| Mode | `load_strategy` | Typical use | Notes |
|---|---|---|---|
| CUDA IPC | `ipc` | Training and inference processes on the same node share live tensors | Update source comes from IPC metadata produced by the training side. |
| IPC snapshot | `ipc_snapshot` | Rollout reloads a snapshot file produced by training | Used by current RL rollout examples. |
| RDMA / rsync | `rsync` | Trainer publishes a new version and rollout fetches it remotely | `POST /v1/update_weights` is the explicit API for this mode. |
## API Overview
### Compatibility APIs
In FastDeploy <= 2.5, the following simplified APIs are provided for compatibility with the legacy RL control flow.
| API | Method | Meaning | Availability |
|---|---|---|---|
| `/clear_load_weight` | GET | Clear or offload currently loaded weights | Requires `dynamic_load_weight=True` |
| `/update_model_weight` | GET | Reload weights after a clear/offload operation | Requires `dynamic_load_weight=True` |
### V1 control APIs
In FastDeploy >= 2.6, the underlying control-signal communication path was optimized and the V1 control APIs were introduced. Compared with the legacy APIs, the V1 APIs provide a more stable execution path, clearer semantics, and more flexible control:
| API | Method | Request params | Semantics |
|---|---|---|---|
| `/v1/pause` | POST | none | Pause request generation, abort running and inflight requests, reset scheduler state, and pause cache transfer if enabled. |
| `/v1/resume` | POST | none | Resume request generation and cache transfer. |
| `/v1/is_paused` | GET | none | Return `{"is_paused": bool}`. |
| `/v1/sleep` | POST | `?tags=weight,kv_cache` | Offload selected GPU memory objects. Supported tags are `weight` and `kv_cache`; if omitted, both are used. |
| `/v1/wakeup` | POST | `?tags=weight,kv_cache` | Reload previously offloaded weights and/or KV cache. On success, the engine resumes automatically. |
| `/v1/update_weights` | POST | JSON `{"version": "...", "rsync_config": {...}}` | Refresh weights in place through the worker control path. Intended for remote versioned updates, especially `load_strategy=rsync`. |
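For orchestration from a trainer script, the V1 endpoints above can be wrapped in a small client. The sketch below uses only the Python standard library; the class and method names are illustrative, not part of FastDeploy, and the transport is injectable so the sketch can be exercised without a live server.

```python
import json
import urllib.request


class WeightControlClient:
    """Illustrative wrapper over the V1 control endpoints (not a FastDeploy class)."""

    def __init__(self, base_url="http://127.0.0.1:8000", opener=urllib.request.urlopen):
        self.base_url = base_url.rstrip("/")
        self._open = opener  # injectable transport, keeps the sketch testable offline

    def _call(self, method, path, body=None):
        data = json.dumps(body).encode() if body is not None else None
        headers = {"Content-Type": "application/json"} if body is not None else {}
        req = urllib.request.Request(self.base_url + path, data=data,
                                     method=method, headers=headers)
        with self._open(req) as resp:
            return json.loads(resp.read() or b"{}")

    def pause(self):
        return self._call("POST", "/v1/pause")

    def resume(self):
        return self._call("POST", "/v1/resume")

    def is_paused(self):
        return self._call("GET", "/v1/is_paused")["is_paused"]

    def sleep(self, tags="weight,kv_cache"):
        return self._call("POST", "/v1/sleep?tags=" + tags)

    def wakeup(self, tags="weight,kv_cache"):
        return self._call("POST", "/v1/wakeup?tags=" + tags)

    def update_weights(self, version=None, rsync_config=None):
        body = {k: v for k, v in {"version": version,
                                  "rsync_config": rsync_config}.items()
                if v is not None}
        return self._call("POST", "/v1/update_weights", body)
```

Because the opener is injected, unit tests can drive the client against a fake transport instead of a running service.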
### Compatibility Notes
The optimized communication path also applies to the legacy APIs. By setting `FD_ENABLE_V1_UPDATE_WEIGHTS=1`, the legacy APIs can be switched to the new control path while keeping the original API form.
- `FD_ENABLE_V1_UPDATE_WEIGHTS=0`: use the legacy shared-memory-based control path.
- `FD_ENABLE_V1_UPDATE_WEIGHTS=1`: `/clear_load_weight` is effectively handled through `/v1/sleep`, and `/update_model_weight` is effectively handled through `/v1/wakeup`. The corresponding pause/resume actions are handled internally by `sleep` and `wakeup`.
Note: regardless of whether V1 is enabled, the legacy APIs are not the recommended standard interface for RLHF scenarios and may be gradually deprecated in future releases. The `/v1/*` control APIs are recommended.
## Interface Semantics
### `/v1/pause`

`/v1/pause` is the safe boundary before changing model state. It does the following:
- stops new request generation;
- aborts running and inflight requests;
- resets scheduler state;
- pauses cache transfer when multi-level cache or KV cache storage is enabled.
When a clear boundary is required between one rollout round and the next training stage, this API should be called first.
### `/v1/sleep`

`/v1/sleep` offloads selected runtime state from GPU memory.
Supported tags:
- `weight`: clear model weights from device memory; if enabled, communication groups and DeepEP buffers may also be released.
- `kv_cache`: clear the KV cache; the MTP cache is also cleared when speculative decoding uses MTP.
If the `tags` parameter is omitted, FastDeploy defaults to:

```
/v1/sleep?tags=weight,kv_cache
```
In the current implementation, `sleep` automatically performs a `pause` first. New integrations should not rely on this implicit behavior.
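Since the implicit pause is an implementation detail, a new integration can make the boundary explicit. A minimal sketch, assuming `post` is any callable that issues an HTTP POST to the given path (for example, a thin wrapper around `urllib` or `requests`):

```python
def safe_sleep(post, tags="weight,kv_cache"):
    # Pause explicitly instead of relying on the implicit pause inside /v1/sleep.
    post("/v1/pause")
    post("/v1/sleep?tags=" + tags)
```

Injecting the transport this way keeps the sequencing logic independent of the HTTP client used.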
### `/v1/wakeup`

`/v1/wakeup` restores the state offloaded by `/v1/sleep`.
Depending on `tags` and configuration, FastDeploy may:
- restart communication groups;
- recreate DeepEP buffers;
- reload model weights from the configured source;
- rebuild KV cache;
- recapture CUDA Graph.
After `wakeup` succeeds, FastDeploy automatically calls `resume`.
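Because the reload and the automatic resume can take some time, a caller may want to block until the engine reports running again. A hedged sketch that polls `/v1/is_paused`; `get` is an assumed callable returning the decoded JSON body, and the clock/pause hooks exist only so the sketch can be tested without waiting:

```python
import time


def wait_until_running(get, timeout=60.0, interval=0.5,
                       clock=time.monotonic, pause=time.sleep):
    # Poll /v1/is_paused until the engine reports running or the timeout expires.
    deadline = clock() + timeout
    while clock() < deadline:
        if not get("/v1/is_paused")["is_paused"]:
            return True
        pause(interval)
    return False
```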
### `/v1/update_weights`

`/v1/update_weights` refreshes model parameters directly, without unloading the GPU memory occupied by model weights.
Current request fields:
- `version`: optional string used to choose a target checkpoint version.
- `rsync_config`: optional dictionary; must contain `etcd_server` when provided.
Important semantics:

- the engine must already be paused, otherwise the request fails;
- the update is executed on workers only;
- this API is meant for explicit weight refresh, especially the `rsync` path;
- it does not implicitly call `resume`.
Recommended sequence:

1. `POST /v1/pause`
2. `POST /v1/update_weights`
3. `POST /v1/resume`
If GPU memory also needs to be reclaimed between rollout rounds, the `sleep` / `wakeup` workflow is more appropriate.
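The pause -> update_weights -> resume sequence can be sketched as a single helper. This is illustrative, not a FastDeploy API; `post` stands in for any function that issues an HTTP POST with an optional JSON body:

```python
def refresh_weights(post, version, etcd_server):
    # The engine must be paused before /v1/update_weights, or the request fails.
    post("/v1/pause")
    try:
        post("/v1/update_weights",
             {"version": version, "rsync_config": {"etcd_server": etcd_server}})
    finally:
        # /v1/update_weights does not resume implicitly, so resume explicitly.
        post("/v1/resume")
```

The `try/finally` ensures the service is resumed even if the update request raises, at the cost of potentially resuming with the old weights.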
## Example Requests
### Basic APIs
Pause the engine:

```shell
curl -X POST http://127.0.0.1:8000/v1/pause
```

Resume the engine:

```shell
curl -X POST http://127.0.0.1:8000/v1/resume
```
### Sleep / Wakeup APIs
#### Offload weights and KV cache

```shell
# Offload both weights and KV cache
curl -X POST "http://127.0.0.1:8000/v1/sleep?tags=weight,kv_cache"

# Offload only weights
curl -X POST "http://127.0.0.1:8000/v1/sleep?tags=weight"

# Omit the parameter; defaults to both
curl -X POST "http://127.0.0.1:8000/v1/sleep"
```
#### Restore weights and KV cache

```shell
# Restore both weights and KV cache
curl -X POST "http://127.0.0.1:8000/v1/wakeup?tags=weight,kv_cache"

# Restore only weights
curl -X POST "http://127.0.0.1:8000/v1/wakeup?tags=weight"

# Omit the parameter; defaults to both
curl -X POST "http://127.0.0.1:8000/v1/wakeup"
```
Note: when `use_cudagraph=True`, the KV cache must be restored before the weights, i.e. `/v1/wakeup` with the `kv_cache` tag must be called before `/v1/wakeup` with the `weight` tag. Restoring weights without the KV cache raises an error. It is recommended to keep the `tags` parameter consistent between `/v1/sleep` and `/v1/wakeup`.
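To avoid getting this ordering wrong, the wakeup calls can be issued one tag at a time in the required order. A small sketch; `post` is an assumed HTTP POST callable, not a FastDeploy helper:

```python
# KV cache must come back before weights when use_cudagraph=True.
WAKEUP_ORDER = ("kv_cache", "weight")


def ordered_wakeup(post, tags=("weight", "kv_cache")):
    # Issue one wakeup call per requested tag, in the safe order.
    for tag in WAKEUP_ORDER:
        if tag in tags:
            post("/v1/wakeup?tags=" + tag)
```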
### Update Weights API
Refresh to a new remotely published version:

```shell
curl -X POST http://127.0.0.1:8000/v1/update_weights \
  -H "Content-Type: application/json" \
  -d '{
    "version": "global_step_1200",
    "rsync_config": {
      "etcd_server": "127.0.0.1:2379"
    }
  }'
```
## RLHF Usage
### Recommended Rollout Service Setup
In RLHF scenarios, FastDeploy rollout services are typically configured as follows:
- `dynamic_load_weight=True`;
- `load_strategy=ipc_snapshot` for local snapshot-based refresh;
- or `load_strategy=rsync` for versioned remote refresh.
The rollout utilities in the repository already follow this pattern. A typical example is:
```python
from fastdeploy.rl.rollout_config import RolloutModelConfig
from fastdeploy.rl.rollout_model import RolloutModel

rollout_config = RolloutModelConfig(
    model_name_or_path=model_path,
    tensor_parallel_size=ranks,
    dynamic_load_weight=True,
    load_strategy="ipc_snapshot",
)
rollout_model = RolloutModel(rollout_config)
```
### Training-Side Integration Support
In addition to serving endpoints, FastDeploy provides the following training-side integration capabilities for RLHF:
- `RolloutModel.state_dict()`: exposes the rollout-side inference parameters.
- `RolloutModel.get_name_mappings_to_training()`: exposes the mapping from inference parameter names to training parameter names.
These interfaces can be used to align training checkpoints with rollout-side parameter layouts, especially when inference-side and training-side parameter names are not fully identical.
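For example, a trainer-side script could use the mapping to rename a training checkpoint into the rollout layout before publishing it. The helper below is a hypothetical sketch; `infer_to_train` is assumed to be the dict returned by `get_name_mappings_to_training()` (inference name -> training name):

```python
def remap_training_checkpoint(training_state, infer_to_train):
    # Build a rollout-layout state dict keyed by inference parameter names,
    # pulling values from the training-side state dict via the name mapping.
    rollout_state = {}
    for infer_name, train_name in infer_to_train.items():
        if train_name in training_state:
            rollout_state[infer_name] = training_state[train_name]
    return rollout_state
```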
### Common RLHF workflows
The following examples assume the service endpoint is `http://127.0.0.1:8000`.
#### Workflow 1: clear and restore
This workflow is suitable when the rollout service stays resident, but GPU memory should be released before training and restored afterward. The recommended sequence is `(pause) -> sleep -> wakeup -> (resume)`, where the steps in parentheses are optional.
```shell
# Optional: explicitly pause the engine to establish a clear transition boundary
curl -X POST http://127.0.0.1:8000/v1/pause

# Offload both weights and KV cache
curl -X POST "http://127.0.0.1:8000/v1/sleep?tags=weight,kv_cache"

# Restore both weights and KV cache after training completes
curl -X POST "http://127.0.0.1:8000/v1/wakeup?tags=weight,kv_cache"

# Optional: explicitly resume if required by the integration
curl -X POST http://127.0.0.1:8000/v1/resume
```
#### Workflow 2: in-place refresh to a new checkpoint
This workflow is suitable when the service remains resident and only needs to switch to a new checkpoint version. The recommended sequence is `pause -> update_weights -> resume`.
```shell
# Pause the engine first
curl -X POST http://127.0.0.1:8000/v1/pause

# Refresh to a new checkpoint version in place
curl -X POST http://127.0.0.1:8000/v1/update_weights \
  -H "Content-Type: application/json" \
  -d '{
    "version": "global_step_1200",
    "rsync_config": {
      "etcd_server": "127.0.0.1:2379"
    }
  }'

# Resume the service after the update completes
curl -X POST http://127.0.0.1:8000/v1/resume
```
#### Workflow 3: legacy compatibility APIs
Legacy RL clients can continue to use the compatibility flow `/clear_load_weight` -> `/update_model_weight`.
```shell
# Clear or offload the current weights
curl -X GET http://127.0.0.1:8000/clear_load_weight

# Reload weights after the trainer updates the checkpoint
curl -X GET http://127.0.0.1:8000/update_model_weight
```
For new integrations, the `/v1/*` APIs are recommended because their control path is more explicit and easier to trace.
## Other Related Configuration
### Communication Group Clear and Rebuild
FastDeploy provides `--shutdown-comm-group-if-worker-idle` and `--no-shutdown-comm-group-if-worker-idle` to explicitly control whether communication groups should also be torn down when weights are offloaded.
Keeping communication groups alive generally improves the stability of weight clearing and reloading. The tradeoff is that more GPU memory remains allocated after weights are offloaded, and `sleep` / `wakeup` may take longer to execute.
By default:
- in EP scenarios, communication groups are kept;
- in non-EP scenarios, communication groups are torn down.
### CPU Cache Clear and Rebuild
After `--swap-space` is enabled, the environment variable below controls whether the CPU-side cache is also cleared when `/v1/sleep` executes, in order to reduce host memory pressure during training.
By default, FastDeploy does not actively clear the CPU cache. To clear it together with `sleep`, set:
```shell
export FD_ENABLE_SWAP_SPACE_CLEARING=1
```