[RL] Adapt async rollout checkpoint update flow (#7042)

* update checkpoint-transfer flow and control update_weights params

* test: add update_weights route validation
jackyYang6
2026-03-30 19:19:34 +08:00
committed by GitHub
parent 8789329457
commit 05f2d95729
9 changed files with 58 additions and 88 deletions
+4 -8
@@ -50,7 +50,7 @@ In FastDeploy >= 2.6, the underlying control-signal communication path is optimi
| `/v1/is_paused` | `GET` | none | Return `{"is_paused": bool}`. |
| `/v1/sleep` | `POST` | `?tags=weight,kv_cache` | Offload selected GPU memory objects. Supported tags are `weight` and `kv_cache`. If omitted, both are used. |
| `/v1/wakeup` | `POST` | `?tags=weight,kv_cache` | Reload previously offloaded weights and/or KV cache. On success, the engine resumes automatically. |
-| `/v1/update_weights` | `POST` | JSON `{"version":"...", "rsync_config": {...}}` | Refresh weights in place through the worker control path. This API is intended for remote versioned updates, especially `load_strategy=rsync`. |
+| `/v1/update_weights` | `POST` | JSON `{"version":"...", "verify_checksum": false}` | Refresh weights in place through the worker control path. This API is intended for remote versioned updates, especially `load_strategy=rsync`. |
### Compatibility Notes
@@ -114,7 +114,7 @@ After `wakeup` succeeds, FastDeploy automatically calls `resume`.
Current request fields:
- `version`: optional string. Used to choose a target checkpoint version.
-- `rsync_config`: optional dictionary. Must contain `etcd_server` when provided.
+- `verify_checksum`: optional boolean. Defaults to `false`. Set to `true` to verify data integrity during weight synchronization.
Important semantics:
@@ -186,9 +186,7 @@ curl -X POST http://127.0.0.1:8000/v1/update_weights \
-H "Content-Type: application/json" \
-d '{
"version": "global_step_1200",
-"rsync_config": {
-"etcd_server": "127.0.0.1:2379"
-}
+"verify_checksum": false
}'
```
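For readers who drive this endpoint from a script rather than `curl`, the updated request body can be sketched in Python as follows. This is a minimal sketch, not FastDeploy's own client: the helper names `build_update_weights_payload` and `update_weights` are hypothetical, the 300-second timeout is an arbitrary choice, and the assumption that the response body is JSON is not confirmed by this diff.

```python
import json
import urllib.request


def build_update_weights_payload(version=None, verify_checksum=False):
    """Build the JSON body for POST /v1/update_weights (hypothetical helper).

    Mirrors the documented fields: `version` is an optional string and
    `verify_checksum` is an optional boolean that defaults to false.
    """
    if version is not None and not isinstance(version, str):
        raise TypeError("version must be a string when provided")
    if not isinstance(verify_checksum, bool):
        raise TypeError("verify_checksum must be a boolean")
    payload = {"verify_checksum": verify_checksum}
    if version is not None:
        payload["version"] = version
    return payload


def update_weights(base_url, version=None, verify_checksum=False, timeout=300):
    """Send the request; assumes the server replies with a JSON document."""
    body = json.dumps(build_update_weights_payload(version, verify_checksum))
    request = urllib.request.Request(
        f"{base_url}/v1/update_weights",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `update_weights("http://127.0.0.1:8000", version="global_step_1200")` would send the same body as the `curl` example above. Rejecting a non-boolean `verify_checksum` on the client side is a design choice of this sketch, so that a string like `"false"` is never sent by mistake.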
@@ -261,9 +259,7 @@ curl -X POST http://127.0.0.1:8000/v1/update_weights \
-H "Content-Type: application/json" \
-d '{
"version": "global_step_1200",
-"rsync_config": {
-"etcd_server": "127.0.0.1:2379"
-}
+"verify_checksum": false
}'
# Resume the service after the update completes
+4 -8
@@ -50,7 +50,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
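| `/v1/is_paused` | `GET` | none | Return `{"is_paused": bool}`. |
| `/v1/sleep` | `POST` | `?tags=weight,kv_cache` | Offload the specified GPU memory objects. Supported tags are `weight` and `kv_cache`; if omitted, both are handled. |
| `/v1/wakeup` | `POST` | `?tags=weight,kv_cache` | Reload previously offloaded weights and/or KV cache. On success, `resume` is called automatically. |
-| `/v1/update_weights` | `POST` | JSON `{"version":"...", "rsync_config": {...}}` | Refresh model weights in place through the worker control path. This API is mainly intended for remote versioned updates with `load_strategy=rsync`. |
+| `/v1/update_weights` | `POST` | JSON `{"version":"...", "verify_checksum": false}` | Refresh model weights in place through the worker control path. This API is mainly intended for remote versioned updates with `load_strategy=rsync`. |
### Compatibility Notes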
@@ -113,7 +113,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
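Currently supported request fields:
- `version`: optional string. Specifies the target checkpoint version.
-- `rsync_config`: optional dictionary. If provided, it must contain `etcd_server`.
+- `verify_checksum`: optional boolean. Defaults to `false`. When set to `true`, data integrity is verified during weight synchronization.
Key semantics: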
@@ -185,9 +185,7 @@ curl -X POST http://127.0.0.1:8000/v1/update_weights \
-H "Content-Type: application/json" \
-d '{
"version": "global_step_1200",
-"rsync_config": {
-"etcd_server": "127.0.0.1:2379"
-}
+"verify_checksum": false
}'
```
@@ -260,9 +258,7 @@ curl -X POST http://127.0.0.1:8000/v1/update_weights \
-H "Content-Type: application/json" \
-d '{
"version": "global_step_1200",
-"rsync_config": {
-"etcd_server": "127.0.0.1:2379"
-}
+"verify_checksum": false
}'
# Resume the service after the update completes