[RL] add pause, update_weights, resume interface for async RL (#6052)

* support dynamic run_control_request through zmq from apiserver to common_engine

* support pause/resume/is_paused/update_weights in apiserver->common_engine by common run_control_method

* change /is_paused from HTTP POST method to GET method

* add pause, resume, is_paused implementation

* support engine <==> worker communication (request & response)

* support sync weights through RDMA from checkpoint_transfer

* support specified version, rsync_config in update_weights rpc call

* add pause, update_weights, resume interface for async RL

* bug fix: update_weights support using default arguments

* fix typo

* typo fix

* typo fix

* typo fix

* add unit tests for control request/response, localscheduler.get_inflight_requests, resource_manager_v1.preempted_all

* add "rsync" to LoadConfig.load_strategy Literal type hints

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* typo fix

* typo fix

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* check version/rsync params

* add error log when version.txt does not exist

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* raise specified ValueError when parameter check fails

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* tp barrier after run_control_method

* encode 'engine_worker_queue_port' to unique name of worker2engine fmq queue

* typo fix

* typo fix

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
wangyifei
2026-01-23 10:18:07 +08:00
committed by GitHub
parent 96b2cf2c20
commit b7c5daa316
18 changed files with 1170 additions and 16 deletions
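The call order implied by the commit title (pause, then update_weights, then resume) can be sketched with a toy in-process model. Everything below is hypothetical: `ToyEngine`, `sync_weights`, and the rule that weights may only change while paused are illustrative assumptions, not the FastDeploy implementation, which routes these calls from the API server to the engine and workers.

```python
import threading


class ToyEngine:
    """Hypothetical stand-in for an inference engine; not FastDeploy code."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._paused = False
        self.weights_version = "v0"

    def pause(self) -> None:
        # Stop accepting new work so weights can be swapped safely.
        with self._lock:
            self._paused = True

    def is_paused(self) -> bool:
        with self._lock:
            return self._paused

    def update_weights(self, version: str) -> None:
        with self._lock:
            # Assumption for this sketch: updates are only legal while paused.
            if not self._paused:
                raise RuntimeError("pause the engine before updating weights")
            if not version:
                raise ValueError("version must be a non-empty string")
            self.weights_version = version

    def resume(self) -> None:
        with self._lock:
            self._paused = False


def sync_weights(engine: ToyEngine, version: str) -> str:
    """Run one pause -> update_weights -> resume cycle and return the new version."""
    engine.pause()
    try:
        engine.update_weights(version)
    finally:
        # Always resume, even if the update failed, so the engine is not left paused.
        engine.resume()
    return engine.weights_version
```

A trainer in an async RL loop would call something like `sync_weights(engine, "v1")` after each policy update. The `try`/`finally` is a design choice in this sketch so that a failed update does not leave the engine paused.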
+10
@@ -195,6 +195,10 @@ class EngineArgs:
     """
     dynamic load weight strategy
     """
+    rsync_config: Optional[Dict[str, Any]] = None
+    """
+    rsync weights config info
+    """
     quantization: Optional[Dict[str, Any]] = None
     guided_decoding_backend: str = "off"
     """
@@ -812,6 +816,12 @@ class EngineArgs:
             default=EngineArgs.load_strategy,
             help="Flag to dynamic load strategy.",
         )
+        model_group.add_argument(
+            "--rsync-config",
+            type=json.loads,
+            default=EngineArgs.rsync_config,
+            help="Rsync weights config",
+        )
         model_group.add_argument(
             "--engine-worker-queue-port",
             type=lambda s: s.split(",") if s else None,
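The `--rsync-config` flag added above uses `type=json.loads`, so the command-line string is parsed as JSON before it reaches `EngineArgs.rsync_config`. A minimal standalone sketch of that behavior follows. The parser and the `parse_rsync_config` helper are illustrative only, and the `host` and `port` keys are hypothetical, since the diff does not show which keys the config expects.

```python
import argparse
import json

# Standalone parser mirroring only the flag from the diff; not the real EngineArgs parser.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--rsync-config",
    type=json.loads,  # argparse calls json.loads on the raw string
    default=None,
    help="Rsync weights config",
)


def parse_rsync_config(argv):
    """Return the parsed --rsync-config value, or None when the flag is absent."""
    return parser.parse_args(argv).rsync_config


# Hypothetical keys, for illustration only.
config = parse_rsync_config(["--rsync-config", '{"host": "10.0.0.1", "port": 873}'])
```

Two properties of this approach may matter. Invalid JSON raises `json.JSONDecodeError`, a `ValueError` subclass, which argparse reports as a usage error and exits. `json.loads` also accepts any JSON value, so a list or number would pass parsing even though the field is typed as `Optional[Dict[str, Any]]`; a separate type check would be needed to reject those.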