[简体中文](../zh/features/weight_update.md)
# Weight Clear and Update
FastDeploy supports dynamic weight clear and update for RL and RLHF rollout services. This capability is primarily intended to address two requirements:

- release GPU memory when the rollout engine is idle;
- refresh inference weights after the trainer produces a new checkpoint, without restarting the whole service.

This page describes the weight-control interfaces currently supported by FastDeploy, the semantics of each interface, and their typical usage in RLHF training.
## Prerequisites
In RLHF scenarios, FastDeploy provides this capability mainly through the online serving mode. Dynamic weight loading must be enabled when starting the service:

```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/model \
    --dynamic-load-weight \
    --load_strategy ipc_snapshot
```

`--dynamic-load-weight` enables dynamic weight control, and `--load_strategy` selects the concrete weight update mechanism. The currently supported update modes are listed below:
| Mode | `load_strategy` | Typical use | Notes |
| --- | --- | --- | --- |
| CUDA IPC | `ipc` | Training and inference processes on the same node share live tensors | The update source is IPC metadata produced by the training side. |
| IPC snapshot | `ipc_snapshot` | Rollout reloads a snapshot file produced by training | Used by the current RL rollout examples. |
| RDMA / rsync | `rsync` | The trainer publishes a new version and rollout fetches it remotely | `POST /v1/update_weights` is the explicit API for this mode. |
## API Overview
### Compatibility APIs
In FastDeploy <= 2.5, the following simplified APIs are provided for compatibility with the legacy RL control flow.

| API | Method | Meaning | Availability |
| --- | --- | --- | --- |
| `/clear_load_weight` | `GET` | Clear or offload the currently loaded weights | Requires `dynamic_load_weight=True` |
| `/update_model_weight` | `GET` | Reload weights after a clear/offload operation | Requires `dynamic_load_weight=True` |
### V1 control APIs
In FastDeploy >= 2.6, the underlying control-signal communication path has been optimized and V1 control APIs are introduced. Compared with the legacy APIs, the V1 APIs provide a more stable execution path, clearer semantics, and more flexible control:

| API | Method | Request params | Semantics |
| --- | --- | --- | --- |
| `/v1/pause` | `POST` | none | Pause request generation, abort running and inflight requests, reset scheduler state, and pause cache transfer if enabled. |
| `/v1/resume` | `POST` | none | Resume request generation and cache transfer. |
| `/v1/is_paused` | `GET` | none | Return `{"is_paused": bool}`. |
| `/v1/sleep` | `POST` | `?tags=weight,kv_cache` | Offload the selected GPU memory objects. Supported tags are `weight` and `kv_cache`; if omitted, both are used. |
| `/v1/wakeup` | `POST` | `?tags=weight,kv_cache` | Reload previously offloaded weights and/or KV cache. On success, the engine resumes automatically. |
| `/v1/update_weights` | `POST` | JSON `{"version": "...", "verify_checksum": false}` | Refresh weights in place through the worker control path. Intended for remote versioned updates, especially `load_strategy=rsync`. |
### Compatibility Notes
The optimized communication path also applies to the legacy APIs. By setting `FD_ENABLE_V1_UPDATE_WEIGHTS=1`, the legacy APIs can be switched to the new control path while keeping the original API form.
- `FD_ENABLE_V1_UPDATE_WEIGHTS=0`: use the legacy shared-memory-based control path.
- `FD_ENABLE_V1_UPDATE_WEIGHTS=1`: `/clear_load_weight` is effectively handled through `/v1/sleep`, and `/update_model_weight` through `/v1/wakeup`. The corresponding pause/resume actions are handled internally by `sleep` and `wakeup`.

**Note**: regardless of whether V1 is enabled, the legacy APIs are not the recommended standard interface for RLHF scenarios and may be gradually deprecated in future releases. Prefer the `/v1/*` control APIs.
## Interface Semantics
### `/v1/pause`
`/v1/pause` is the safe boundary before changing model state. It does the following:
- stops new request generation;
- aborts running and inflight requests;
- resets scheduler state;
- pauses cache transfer when multi-level cache or KV cache storage is enabled.

When a clear boundary is required between one rollout round and the next training stage, this API should be called first.
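As a sketch of how a controller might wait for the pause boundary to take effect, the helper below polls `/v1/is_paused` through a caller-supplied fetch function. The function name `wait_until_paused` and the injected `fetch_is_paused` callable are illustrative assumptions, not FastDeploy APIs; the callable is expected to issue `GET /v1/is_paused` and return the `is_paused` field of the JSON response.

```python
import time
from typing import Callable

def wait_until_paused(fetch_is_paused: Callable[[], bool],
                      timeout_s: float = 30.0,
                      interval_s: float = 0.5) -> bool:
    """Poll until the engine reports paused, or the timeout expires.

    `fetch_is_paused` wraps `GET /v1/is_paused` and returns the
    `is_paused` field of the JSON response.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if fetch_is_paused():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A controller would call this right after `POST /v1/pause` before proceeding to `sleep` or `update_weights`.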
### `/v1/sleep`
`/v1/sleep` offloads selected runtime state from GPU memory. Supported tags:
- `weight`: clear model weights from device memory; depending on configuration, communication groups and DeepEP buffers may also be released.
- `kv_cache`: clear the KV cache; the MTP cache is also cleared when speculative decoding uses MTP.

If the `tags` parameter is omitted, FastDeploy defaults to:
```bash
/v1/sleep?tags=weight,kv_cache
```
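For integrations that build these URLs programmatically, the tag rules can be encoded in a small helper. This is a minimal sketch: the function name and the client-side validation are illustrative, not part of FastDeploy.

```python
VALID_SLEEP_TAGS = ("weight", "kv_cache")

def build_sleep_url(base_url: str, tags=None) -> str:
    """Compose a /v1/sleep request URL.

    Omitting `tags` mirrors the server-side default of offloading
    both weights and KV cache.
    """
    if tags is None:
        return f"{base_url}/v1/sleep"
    for tag in tags:
        if tag not in VALID_SLEEP_TAGS:
            raise ValueError(f"unsupported tag: {tag!r}")
    return f"{base_url}/v1/sleep?tags={','.join(tags)}"
```

Rejecting unknown tags on the client side surfaces typos before they reach the serving endpoint.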
In the current implementation, `sleep` automatically performs a `pause` first. New integrations should not rely on this implicit behavior.
### `/v1/wakeup`
`/v1/wakeup` restores the state offloaded by `/v1/sleep`. Depending on tags and configuration, FastDeploy may:
- restart communication groups;
- recreate DeepEP buffers;
- reload model weights from the configured source;
- rebuild the KV cache;
- recapture CUDA Graph.

After `wakeup` succeeds, FastDeploy automatically calls `resume`.
### `/v1/update_weights`
`/v1/update_weights` refreshes model parameters directly, without unloading the GPU memory occupied by model weights.

Current request fields:
- `version`: optional string. Used to choose a target checkpoint version.
- `verify_checksum`: optional boolean, defaults to `false`. Set to `true` to verify data integrity during weight synchronization.

Important semantics:
- the engine must already be paused, otherwise the request fails;
- the update is executed on workers only;
- this API is meant for explicit weight refresh, especially the `rsync` path;
- it does not implicitly call `resume`.

Recommended sequence:
1. `POST /v1/pause`
2. `POST /v1/update_weights`
3. `POST /v1/resume`

If GPU memory also needs to be reclaimed between rollout rounds, the `sleep` / `wakeup` workflow is more appropriate.
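The recommended pause/update/resume sequence can be wrapped in a small client-side helper so the ordering and fail-fast behavior are explicit. The sketch below is illustrative, not a FastDeploy API; it takes a caller-supplied `post` function (e.g. wrapping `requests.post`) that sends a POST to the given path and returns the HTTP status code.

```python
from typing import Callable, Optional

def refresh_weights(post: Callable[[str, Optional[dict]], int],
                    version: str,
                    verify_checksum: bool = False) -> None:
    """Run pause -> update_weights -> resume, failing fast on any error.

    `post(path, json_body)` sends a POST request to the serving
    endpoint and returns the HTTP status code.
    """
    if post("/v1/pause", None) != 200:
        raise RuntimeError("pause failed")
    body = {"version": version, "verify_checksum": verify_checksum}
    if post("/v1/update_weights", body) != 200:
        raise RuntimeError("update_weights failed")
    # update_weights does not implicitly resume, so resume explicitly.
    if post("/v1/resume", None) != 200:
        raise RuntimeError("resume failed")
```

Injecting the HTTP call keeps the sequencing logic independent of the HTTP client and easy to test.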
## Example Requests
### Basic APIs
Pause the engine:

```bash
curl -X POST http://127.0.0.1:8000/v1/pause
```
Resume the engine:

```bash
curl -X POST http://127.0.0.1:8000/v1/resume
```
### Sleep / Wakeup APIs
**Offload weights and KV cache**
```bash
# Offload both weights and KV cache
curl -X POST "http://127.0.0.1:8000/v1/sleep?tags=weight,kv_cache"

# Offload only weights
curl -X POST "http://127.0.0.1:8000/v1/sleep?tags=weight"

# Omit the parameter to default to both
curl -X POST "http://127.0.0.1:8000/v1/sleep"
```
**Restore weights and KV cache**
```bash
# Restore both weights and KV cache
curl -X POST "http://127.0.0.1:8000/v1/wakeup?tags=weight,kv_cache"

# Restore only weights
curl -X POST "http://127.0.0.1:8000/v1/wakeup?tags=weight"

# Omit the parameter to default to both
curl -X POST "http://127.0.0.1:8000/v1/wakeup"
```
**Note**: when `use_cudagraph=True`, the KV cache must be restored before the weights: call `/v1/wakeup` with the `kv_cache` tag before calling it with the `weight` tag. Restoring weights without the KV cache raises an error. It is recommended to keep the `tags` parameter consistent between `/v1/sleep` and `/v1/wakeup`.
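When the two tags are restored in separate calls, the ordering constraint can be encoded explicitly. The helper below is an illustrative sketch, not a FastDeploy API: it returns the `/v1/wakeup` requests in a safe order under the `use_cudagraph` rule above.

```python
def wakeup_calls(tags, use_cudagraph: bool):
    """Return /v1/wakeup request paths in a safe order.

    With use_cudagraph=True and both tags being restored, the KV
    cache must come back before the weights, so the restore is split
    into two ordered calls.
    """
    tags = list(tags)
    if use_cudagraph and set(tags) == {"weight", "kv_cache"}:
        return ["/v1/wakeup?tags=kv_cache", "/v1/wakeup?tags=weight"]
    return ["/v1/wakeup?tags=" + ",".join(tags)]
```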
### Update Weights API
Refresh to a new remotely published version:
```bash
curl -X POST http://127.0.0.1:8000/v1/update_weights \
  -H "Content-Type: application/json" \
  -d '{
    "version": "global_step_1200",
    "verify_checksum": false
  }'
```
## RLHF Usage
### Recommended Rollout Service Setup
In RLHF scenarios, FastDeploy rollout services are typically configured as follows:
- `dynamic_load_weight=True`;
- `load_strategy=ipc_snapshot` for local snapshot-based refresh;
- or `load_strategy=rsync` for versioned remote refresh.

The rollout utilities in the repository already follow this pattern. A typical example is:
```python
from fastdeploy.rl.rollout_config import RolloutModelConfig
from fastdeploy.rl.rollout_model import RolloutModel

rollout_config = RolloutModelConfig(
    model_name_or_path=model_path,
    tensor_parallel_size=ranks,
    dynamic_load_weight=True,
    load_strategy="ipc_snapshot",
)
rollout_model = RolloutModel(rollout_config)
```
### Training-Side Integration Support
In addition to serving endpoints, FastDeploy provides the following training-side integration capabilities for RLHF:
- `RolloutModel.state_dict()`: exposes the rollout-side inference parameters.
- `RolloutModel.get_name_mappings_to_training()`: exposes the mapping from inference parameter names to training parameter names.

These interfaces can be used to align training checkpoints with rollout-side parameter layouts, especially when inference-side and training-side parameter names are not fully identical.
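As an illustration of how such a mapping might be consumed, the sketch below re-keys a training state dict into the rollout-side layout. The helper and the toy parameter names are assumptions for illustration, not FastDeploy APIs; only the mapping direction (inference name to training name, as returned by `RolloutModel.get_name_mappings_to_training()`) follows the description above.

```python
def remap_training_state(training_state: dict, infer_to_train: dict) -> dict:
    """Re-key training parameters into the rollout-side layout.

    `infer_to_train` maps inference parameter names to training
    parameter names.
    """
    remapped = {}
    for infer_name, train_name in infer_to_train.items():
        if train_name not in training_state:
            raise KeyError(f"missing training parameter: {train_name}")
        remapped[infer_name] = training_state[train_name]
    return remapped
```

Failing loudly on a missing training parameter catches mapping drift before a partial checkpoint is pushed to the rollout service.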
### Common RLHF workflows
The following examples assume the service endpoint is `http://127.0.0.1:8000`.

**Workflow 1: clear and restore**

This workflow is suitable when the rollout service stays resident but GPU memory should be released before training and restored afterward. The recommended sequence is `(pause) -> sleep -> wakeup -> (resume)`, where the steps in parentheses are optional.
```bash
# Optional: explicitly pause the engine to establish a clear transition boundary
curl -X POST http://127.0.0.1:8000/v1/pause

# Offload both weights and KV cache
curl -X POST "http://127.0.0.1:8000/v1/sleep?tags=weight,kv_cache"

# Restore both weights and KV cache after training completes
curl -X POST "http://127.0.0.1:8000/v1/wakeup?tags=weight,kv_cache"

# Optional: explicitly resume if required by the integration
curl -X POST http://127.0.0.1:8000/v1/resume
```
**Workflow 2: in-place refresh to a new checkpoint**

This workflow is suitable when the service remains resident and only needs to switch to a new checkpoint version. The recommended sequence is `pause -> update_weights -> resume`.
```bash
# Pause the engine first
curl -X POST http://127.0.0.1:8000/v1/pause

# Refresh to a new checkpoint version in place
curl -X POST http://127.0.0.1:8000/v1/update_weights \
  -H "Content-Type: application/json" \
  -d '{
    "version": "global_step_1200",
    "verify_checksum": false
  }'

# Resume the service after the update completes
curl -X POST http://127.0.0.1:8000/v1/resume
```
**Workflow 3: legacy compatibility APIs**

Legacy RL clients can continue to use the compatibility flow `clear_load_weight -> update_model_weight`.
```bash
# Clear or offload the current weights
curl -X GET http://127.0.0.1:8000/clear_load_weight

# Reload weights after the trainer updates the checkpoint
curl -X GET http://127.0.0.1:8000/update_model_weight
```
For new integrations, the `/v1/*` APIs are recommended because their control path is more explicit and easier to trace.
## Other Related Configuration
### Communication Group Clear and Rebuild
FastDeploy provides `--shutdown-comm-group-if-worker-idle` and `--no-shutdown-comm-group-if-worker-idle` to explicitly control whether communication groups are also torn down when weights are offloaded.

Keeping communication groups alive generally improves the stability of weight clearing and reloading. The tradeoff is that more GPU memory remains allocated after weight offload, and the execution time of `sleep` / `wakeup` may increase.

By default:
- in EP scenarios, communication groups are kept;
- in non-EP scenarios, communication groups are torn down.
### CPU Cache Clear and Rebuild
After `--swap-space` is enabled, the following environment variable controls whether the CPU-side cache is also cleared when `/v1/sleep` executes, in order to reduce memory pressure during training.

By default, FastDeploy does not actively clear the CPU cache. To clear it together with `sleep`, set:
```bash
export FD_ENABLE_SWAP_SPACE_CLEARING=1
```