FastDeploy

mirror of https://github.com/PaddlePaddle/FastDeploy.git synced 2026-04-23 00:17:25 +08:00

Author	SHA1	Message	Date
Yonghua Li	a7f52c300d	[Feature] support v1 update/clear api for RL (#6761 ) * [Feature] support v1 update/clear api for RL * [fix] fix execute_model and add sleep/wakeup api * [fix] fix mtp and key_prefix * [chore] move _update_key_prefix to resume method * [fix] make the interface safe to call multiple times * [fix] fix some tiny bugs * [chore] make small changes against pr review * [docs] add docs for weight update * [test] add some tests and update docs * [style] fix code style check * [test] fix ci * [fix] fix stale control responses when control method timed out * [chore] remove unused code * [chore] fix code style * [chore] optimize tags and key_prefix * [test] fix ci * [chore] fix code style * [test] fix ci * [fix] fix ep control * [fix] fix ep control for engine cache queue	2026-03-25 19:18:46 +08:00
wikilsh	5e469fc901	[RL][BugFix][Optimization] Support chunked part files loading and fix model path format in IPC snapshot strategy (#6852 ) * [RL] Support chunked part files loading in IPC snapshot strategy ## Motivation When using IPC snapshot for elastic recovery in RL training, loading a single large pdparams file causes a significant memory spike. This PR refactors `_update_ipc_snapshot` to support loading chunked part files to avoid the memory spike. ## Modifications Refactored `_update_ipc_snapshot` in `fastdeploy/rl/dynamic_weight_manager.py` with a three-level loading priority: 1. Chunked part files (`model_state.tpR{id}.part{N}.pdparams`): Load multiple smaller shards sequentially, freeing memory between each chunk via `gc.collect()` to avoid memory spike. 2. Single full file (`model_state.tpR{id}.pdparams`): Legacy single-file loading path (preserved for backward compatibility). 3. Shared fallback directory (`/shared_ipc_meta/...`): Oldest legacy fallback path (preserved for backward compatibility). Also fixed the rank ID in the file name pattern from hardcoded `tp0` to dynamic `paddle.distributed.get_rank()`. Co-Authored-By: lishuaihui <lishuaihui@baidu.com> * [RL][BugFix] Fix ambiguous model path format and add legacy fallback in IPC snapshot ## Motivation The previous snapshot file naming `model_state.tp{rank}{id}` concatenated rank and id without a separator, causing ambiguity (e.g., rank=1, id=234 and rank=12, id=34 both produce `tp1234`). Additionally, after the naming format is updated, existing checkpoints saved in the old format would fail to load during elastic recovery, causing unnecessary failures. ## Modifications - Add dot separator between rank and id in snapshot file name: `model_state.tp{rank}{id}` → `model_state.tp{rank}.{id}` - Add Priority 3 legacy fallback to load old-format files (`model_state.tp0{id}.pdparams`) for backward compatibility during rolling upgrades - Update docstring and error message to reflect the new 4-level priority Co-Authored-By: lishuaihui <lishuaihui@baidu.com> * [RL][Test] Add unit tests for DynamicWeightManager._update_ipc_snapshot Cover all 4 loading priority branches (chunked part files, single full pdparams, legacy format, shared directory fallback) with mock-based tests to verify correct behavior without filesystem or GPU dependencies. Co-Authored-By: lishuaihui <lishuaihui@baidu.com> * [RL][Test] Remove unused import 'call' in test_update_ipc_snapshot.py Co-Authored-By: lishuaihui <lishuaihui@baidu.com> * Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * [RL] Fix snapshot part index to match filename numbering Parse part index from filename (e.g. .part0.) instead of using enumerate index, so that logs and src_type stay consistent with the actual file naming convention. Co-Authored-By: wikilsh <wiki_hui@qq.com> --------- Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-03-23 16:17:41 +08:00
bukejiyu	8b8f0c5659	fix update param (#6723 )	2026-03-10 11:35:00 +08:00
bukejiyu	598cce8545	[RL] Support SM100 FP8 quantization in RL (#6601 ) * RL SM100 Fix * update	2026-03-04 04:55:04 -08:00
wangyifei	b7c5daa316	[RL] add pause, update_weights, resume interface for async RL (#6052 ) * support dynamic run_control_request through zmq from apiserver to common_engine * support pause/resume/is_paused/update_weights in apiserver->common_engine by common run_control_method * change /is_paused from HTTP POST method to GET method * add pause, resume, is_paused implementation * support engine <==> worker communication (request and response) * support sync weights through RDMA from checkpoint_transfer * support specified version, rsync_config in update_weights rpc call * add pause, update_weights, resume interface for async RL * bug fix: update_weights support using default arguments * fix typo * typo fix * typo fix * typo fix * add unit tests for control request/response, localscheduler.get_inflight_requests, resource_manager_v1.preempted_all * add "rsync" to LoadConfig.load_strategy Literal type hints Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * typo fix * typo fix * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * check version/rsync params * add error log when version.txt does not exist Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * raise specified ValueError when parameters check failed Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * tp barrier after run_control_method * encode 'engine_worker_queue_port' to unique name of worker2engine fmq queue * typo fix * typo fix --------- Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-01-23 10:18:07 +08:00
Yonghua Li	456637002d	[BugFix] fix cache transfer manager updating/clearing (#5930 ) * [fix] fix cache transfer manager updating/clearing * [fix] fix code style * [fix] fix config * [fix] fix engine client * [fix] let worker update kv cache status signal * [fix] update worker process * [fix] fix clear/update for case if comm group is shutdown * [fix] update dynamic weight manager * [fix] fix port * [fix] add num_cpu_blocks arg for async_llm, and remove unnecessary waiting	2026-01-13 05:09:29 -08:00
Yonghua Li	0c01cccc32	[BugFix] fix double shutdown of comm group when rank0 clears weights slower than other ranks (#5715 )	2025-12-25 21:48:53 +08:00
Yonghua Li	4f830aa505	[RL] provide options for whether shutdown comm group after weights cleared (#5663 ) * [rl] provide options for whether shutdown comm group after weights cleared * [fix] fix args hardcode * [fix] change args type * [fix] add worker process args	2025-12-19 07:06:48 -08:00
gaoziyuan	5db08cc1d5	【NewFeature】support load fp8 weight (#5565 )	2025-12-16 11:23:57 +08:00
Yonghua Li	2ec76352da	[BugFix] fix instability after clearing weight (#5493 ) * [BugFix] fix instability after clearing weight * [chore] add todo	2025-12-11 10:22:35 +08:00
Yonghua Li	419b416376	[BugFix] [RL] remove shutdown_process_group/restart_process_group for RL (#5433 ) * [fix] remove shutdown_process_group/restart_process_group for RL * [chore] remove log * [chore] remove log * [chore] set log to debug level	2025-12-09 20:32:37 +08:00
Yonghua Li	b6f8069b36	[fix] update check_model_weights_status loop (#5249 )	2025-12-04 19:43:01 +08:00
freeliuzc	c63361fd1d	[Speculative Decoding][MTP]Support mtp in epdptp mode (#4614 ) * support mtp many features * support mtp reshard in rl mode * fix function * support mtp ep * support mtp in hybrid-dp-tp mode * default open scheduler_v1 in mtp	2025-10-28 16:02:47 +08:00
ltd0924	fbdb056de0	[BUGFIX] clear request #4286 (#4402 ) Co-authored-by: ltd0924 <luotingdan@baidu.com>	2025-10-15 17:43:28 +08:00
chen	81959c7d88	[NewFeature]custom_allreduce support cudagraph recapture (#4305 ) * custom_allreduce support cudagraph recapture * add shut_down/restart default group	2025-09-29 15:56:54 +08:00
李泳桦	6265f4385f	[feat] support prefix cache clearing when `/clear_load_weight` is called (#4008 ) * [feat] support clearing prefix cache (cherry-picked from release/2.1) * [fix] fix ipc suffix, use port instead * [fix] fix prefix caching not enabled * [fix] fix key/value_cache_scales indent * [fix] fix ep group all-reduce * [fix] fix clear/update lock not working when workers > 1 * [chore] add preemption triggered info log * [fix] fix code style * [fix] fix max_num_seqs config * [fix] do not force enable_prefix_caching=False in dynamic loading * [fix] fix ci * Revert "[fix] fix ci" This reverts commit `0bc6d55cc8`. * [fix] initialize available_gpu_block_num with max_gpu_block_num * [fix] fix config splitwise_role * [fix] fix clearing caches synchronization and add more logs * [chore] print cache_ready_signal in log * [fix] fix scheduler_config.splitwise_role * [fix] fix cache_messager cache_ready_signal create=True * [fix] stop cache messager from launching in mixed deployment	2025-09-28 19:42:53 +08:00
ltd0924	83720da79f	[Feature] support clear data (#3601 ) * [Feature] support clear data * update * fix * fix * fix * fix * fix * fix * fix	2025-09-23 10:20:02 +08:00
gaoziyuan	896e3bb606	[NewFeature]add ep rollout model init and update/clear ep buffer (#4039 ) * fix gid * merge * fix test * fix bug * fix * fix ci	2025-09-17 20:24:53 +08:00
co63oc	8466219ec8	fix typos (#3840 ) * fix typos * ci --------- Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>	2025-09-12 11:04:38 +08:00
RAM	d3e4ae3d49	[Executor] Adjust signal sending order in RL training (#3773 ) * Adjust processing order * fix bug * fix update_parameters bug * refine code	2025-09-10 13:24:20 +08:00
ltd0924	905d89e42f	[Feature] support model weight update in ep (#3765 ) * support model weight update in ep * support model weight update in ep * support model weight update in ep * support model weight update in ep * Update fused_moe_backend_base.py * Update worker_process.py * Update worker_process.py * Update dynamic_weight_manager.py	2025-09-02 17:16:03 +08:00
gaoziyuan	6fdd83da10	fix some bug (#3434 )	2025-08-18 14:39:13 +08:00
YuanRisheng	502ee92a0a	Unify server-side and model-side Config (Part3) (#3047 ) * merge model config * fix arch * fix rl	2025-07-29 17:07:44 +08:00
Zero Rains	25698d56d1	polish code with new pre-commit rule (#2923 )	2025-07-19 23:19:27 +08:00
Yuanle Liu	dda4a9f848	rl update (#2861 )	2025-07-16 00:33:10 -07:00
Jiang-Jia-Jun	9fd74f75bd	Update dynamic_weight_manager.py	2025-07-03 15:55:22 +08:00
Jiang-Jia-Jun	05c670e593	[Sync] Update to latest code (#2679 ) * [Sync] Update to latest code * Add new code files * Add new code files * update code * Try to fix build.sh * Try to fix build.sh * Update code * Update requirements.txt * Update code --------- Co-authored-by: Jiang-Jia-Jun <jiangjiajun@baidu.com>	2025-07-03 15:43:53 +08:00
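
The naming fix described in commit 5e469fc901 can be illustrated with a short sketch. The helper names below are hypothetical and are not taken from `fastdeploy/rl/dynamic_weight_manager.py`; only the two name formats and the "parse the part index from the filename" idea come from the commit message.

```python
import re

# Hypothetical helpers for illustration; the real code may differ.

def old_snapshot_name(rank: int, snapshot_id: int) -> str:
    """Old format: rank and id are concatenated with no separator."""
    return f"model_state.tp{rank}{snapshot_id}"


def new_snapshot_name(rank: int, snapshot_id: int) -> str:
    """New format: a dot separates rank and id, so the pair is unambiguous."""
    return f"model_state.tp{rank}.{snapshot_id}"


# Matches the ".part<N>." segment mentioned in the commit message.
_PART_RE = re.compile(r"\.part(\d+)\.")


def part_index(filename: str) -> int:
    """Read the part number from the filename instead of using a loop counter."""
    match = _PART_RE.search(filename)
    if match is None:
        raise ValueError(f"no .part<N>. segment in {filename!r}")
    return int(match.group(1))


if __name__ == "__main__":
    # Both (rank, id) pairs collapse to the same old-style name.
    print(old_snapshot_name(1, 234), old_snapshot_name(12, 34))
    # The new-style names stay distinct.
    print(new_snapshot_name(1, 234), new_snapshot_name(12, 34))
    # Sorting by the parsed index keeps part2 ahead of part10.
    files = ["m.part10.pdparams", "m.part2.pdparams"]
    print(sorted(files, key=part_index))
```

Sorting on the parsed integer rather than on the raw string avoids the lexicographic ordering in which `part10` would precede `part2`.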

27 Commits
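
The four-level loading priority that commit 5e469fc901 describes for `_update_ipc_snapshot` might be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function names, the `src_type` labels, and the `load_fn`/`apply_fn` callables are hypothetical, and the shared fallback path is left unresolved because the commit message elides it (`/shared_ipc_meta/...`).

```python
import gc
import re
from typing import Any, Callable, Iterable, List, Tuple

_PART_RE = re.compile(r"\.part(\d+)\.")


def pick_snapshot_files(
    existing: Iterable[str], rank: int, snapshot_id: int
) -> Tuple[str, List[str]]:
    """Return (src_type, files) following the priority order in the commit message."""
    names = set(existing)
    prefix = f"model_state.tp{rank}.{snapshot_id}"

    # Priority 1: chunked part files, ordered by the index in the filename.
    parts = [
        n for n in names
        if n.startswith(prefix + ".part") and n.endswith(".pdparams")
        and _PART_RE.search(n)
    ]
    if parts:
        parts.sort(key=lambda n: int(_PART_RE.search(n).group(1)))
        return "chunked", parts

    # Priority 2: a single full file in the new naming format.
    full = prefix + ".pdparams"
    if full in names:
        return "full", [full]

    # Priority 3: old-format file with the hardcoded tp0 and no separator.
    legacy = f"model_state.tp0{snapshot_id}.pdparams"
    if legacy in names:
        return "legacy", [legacy]

    # Priority 4: shared directory fallback; the path is elided in the
    # source, so this sketch returns no concrete file for it.
    return "shared", []


def load_in_chunks(
    paths: Iterable[str],
    load_fn: Callable[[str], Any],
    apply_fn: Callable[[Any], None],
) -> None:
    """Load one shard at a time and release it before loading the next."""
    for path in paths:
        state = load_fn(path)   # hypothetical loader, e.g. a checkpoint reader
        apply_fn(state)         # hypothetical consumer that copies weights in
        del state
        gc.collect()            # free the shard before the next one is read
```

Passing the directory listing in as `existing` keeps the selection logic free of filesystem access, in the same spirit as the mock-based tests mentioned in the commit.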