FastDeploy

mirror of https://github.com/PaddlePaddle/FastDeploy.git synced 2026-04-23 00:17:25 +08:00

Author	SHA1	Message	Date
wikilsh	5e469fc901	[RL][BugFix][Optimization] Support chunked part files loading and fix model path format in IPC snapshot strategy (#6852)
    * [RL] Support chunked part files loading in IPC snapshot strategy
      ## Motivation
      When using IPC snapshot for elastic recovery in RL training, loading a single large pdparams file causes a significant memory spike. This PR refactors `_update_ipc_snapshot` to support loading chunked part files to avoid the memory spike.
      ## Modifications
      Refactored `_update_ipc_snapshot` in `fastdeploy/rl/dynamic_weight_manager.py` with a three-level loading priority:
      1. Chunked part files (`model_state.tpR{id}.part{N}.pdparams`): Load multiple smaller shards sequentially, freeing memory between each chunk via `gc.collect()` to avoid memory spike.
      2. Single full file (`model_state.tpR{id}.pdparams`): Legacy single-file loading path (preserved for backward compatibility).
      3. Shared fallback directory (`/shared_ipc_meta/...`): Oldest legacy fallback path (preserved for backward compatibility).
      Also fixed the rank ID in the file name pattern from hardcoded `tp0` to dynamic `paddle.distributed.get_rank()`.
      ## Checklist
      - [ ] Add at least a tag in the PR title.
      - [ ] Format your code, run `pre-commit` before commit.
      - [ ] Add unit tests. Please write the reason in this PR if no unit tests.
      - [ ] Provide accuracy results.
      - [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
      Co-Authored-By: lishuaihui <lishuaihui@baidu.com>
    * [RL][BugFix] Fix ambiguous model path format and add legacy fallback in IPC snapshot
      ## Motivation
      The previous snapshot file naming `model_state.tp{rank}{id}` concatenated rank and id without a separator, causing ambiguity (e.g., rank=1, id=234 and rank=12, id=34 both produce `tp1234`). Additionally, after the naming format is updated, existing checkpoints saved in the old format would fail to load during elastic recovery, causing unnecessary failures.
      ## Modifications
      - Add dot separator between rank and id in snapshot file name: `model_state.tp{rank}{id}` → `model_state.tp{rank}.{id}`
      - Add Priority 3 legacy fallback to load old-format files (`model_state.tp0{id}.pdparams`) for backward compatibility during rolling upgrades
      - Update docstring and error message to reflect the new 4-level priority
      Co-Authored-By: lishuaihui <lishuaihui@baidu.com>
    * [RL][Test] Add unit tests for DynamicWeightManager._update_ipc_snapshot
      Cover all 4 loading priority branches (chunked part files, single full pdparams, legacy format, shared directory fallback) with mock-based tests to verify correct behavior without filesystem or GPU dependencies.
      Co-Authored-By: lishuaihui <lishuaihui@baidu.com>
    * [RL][Test] Remove unused import 'call' in test_update_ipc_snapshot.py
      Co-Authored-By: lishuaihui <lishuaihui@baidu.com>
    * Potential fix for pull request finding
      Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
    * [RL] Fix snapshot part index to match filename numbering
      Parse part index from filename (e.g. .part0.) instead of using enumerate index, so that logs and src_type stay consistent with the actual file naming convention.
      Co-Authored-By: wikilsh <wiki_hui@qq.com>
    ---------
    Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
    Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-03-23 16:17:41 +08:00
GoldPancake	183b8d325a	[RL] Support GLM MTP RL Model (#6267)	2026-02-04 20:14:35 +08:00
GoldPancake	fb374238e1	Revert "[RL] Support GLM MTP RL Model (#6223)" (#6301) This reverts commit `af6c84d48d`.	2026-02-02 14:08:13 +08:00
GoldPancake	af6c84d48d	[RL] Support GLM MTP RL Model (#6223) * support glm mtp rl model * fix * fix * fix ut * update baseline	2026-01-28 08:28:03 -08:00
kesmeey	d81341b9b3	[CI][Hackathon 9th Sprint No.14] Add unit tests for functional module fastdeploy/rl/rollout_model.py (#5552)
    * Add rollout model unit tests
    * test: update rl rollout_model tests
    * test: fix cache_type_branches unsupported platform case
    * test: fix rl rollout_model test indent
    * Delete tests/spec_decode/test_mtp_proposer.py
    * chore: format test_rollout_model
    * chore: translate rollout test comments to English
    * test: guard rollout_model import by disabling auto registry
    * chore: reorder imports in rl rollout test
    * test: isolate env for RL rollout tests
    * style: format rollout RL tests with black
    * update
    * test: remove RL rollout unit tests causing collection issues
    * test: add lightweight rollout_model RL unit tests
    * fix(coverage): filter test file paths and handle collection failures
      - Only extract real test file paths (tests/.../test_.py) from pytest collect output
      - Filter out ERROR/collecting prefixes to prevent garbage in failed_tests.log
      - Add proper error handling for pytest collection failures
      - Exit early if no test files can be extracted
      - Preserve collection error output for debugging
      update
    * style: fix code style issues in test_rollout_model.py
      - Remove unused 'os' import
      - Remove trailing blank lines
    ---------
    Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>	2025-12-18 10:57:53 +08:00
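
The `fix(coverage)` item in commit d81341b9b3 describes keeping only real test file paths from pytest's collection output and dropping `ERROR`/`collecting` lines. The sketch below illustrates that filtering idea in Python. It is not the repository's script: the function name `extract_test_files`, the regular expression, and the assumed input (the text printed by `pytest --collect-only -q`) are all assumptions made for illustration.

```python
import re
from typing import List

# Assumption: a "real" test file lives under tests/ and its basename starts
# with "test_". A line qualifies only if it begins with such a path, which
# rejects lines that start with "ERROR" or "ERROR collecting".
TEST_PATH_RE = re.compile(r"^(tests/(?:[\w.-]+/)*test_[\w.-]*\.py)(?:::|$)")


def extract_test_files(collect_output: str) -> List[str]:
    """Return unique test file paths in first-seen order (hypothetical helper)."""
    files: List[str] = []
    for raw_line in collect_output.splitlines():
        match = TEST_PATH_RE.match(raw_line.strip())
        if match and match.group(1) not in files:
            files.append(match.group(1))
    return files
```

A caller could treat an empty result as the "exit early" case the commit mentions, while still printing the raw collection output so the underlying error stays visible.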

5 Commits
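
The loading order described in commit 5e469fc901 can be sketched as below. This is an illustration under stated assumptions, not the code in `fastdeploy/rl/dynamic_weight_manager.py`: the helper names (`old_name`, `new_name`, `part_index`, `resolve_snapshot_source`, `apply_chunks`), the source-type labels, and the file name used inside the shared fallback directory are hypothetical, and the part-file name is assumed to be the dot-separated stem followed by `.part{N}.pdparams`. The shared directory is passed in as a parameter because its path is elided in the commit message.

```python
import gc
import re
from pathlib import Path

PART_RE = re.compile(r"\.part(\d+)\.pdparams$")


def old_name(rank, snapshot_id):
    # Old format: no separator, so (1, 234) and (12, 34) collide.
    return f"model_state.tp{rank}{snapshot_id}"


def new_name(rank, snapshot_id):
    # New format: a dot between rank and id removes the ambiguity.
    return f"model_state.tp{rank}.{snapshot_id}"


def part_index(path):
    """Read N from a file name ending in '.part{N}.pdparams'."""
    match = PART_RE.search(Path(path).name)
    if match is None:
        raise ValueError(f"not a part file: {path}")
    return int(match.group(1))


def resolve_snapshot_source(model_dir, shared_dir, rank, snapshot_id):
    """Return (label, paths) for the first priority level that has files."""
    model_dir, shared_dir = Path(model_dir), Path(shared_dir)
    stem = new_name(rank, snapshot_id)

    # Priority 1: chunked part files, ordered by the index in the file name.
    parts = [p for p in model_dir.glob(f"{stem}.part*.pdparams")
             if PART_RE.search(p.name)]
    if parts:
        return "chunked", sorted(parts, key=part_index)

    # Priority 2: single full file in the new naming format.
    full = model_dir / f"{stem}.pdparams"
    if full.is_file():
        return "full", [full]

    # Priority 3: legacy name with hardcoded tp0 and no separator.
    legacy = model_dir / f"{old_name(0, snapshot_id)}.pdparams"
    if legacy.is_file():
        return "legacy", [legacy]

    # Priority 4: shared fallback directory (file name here is an assumption).
    shared = shared_dir / f"{stem}.pdparams"
    if shared.is_file():
        return "shared", [shared]

    raise FileNotFoundError(f"no snapshot found for {stem}")


def apply_chunks(paths, load_fn, apply_fn):
    """Load one chunk at a time and release it before loading the next."""
    for path in paths:
        state = load_fn(path)   # e.g. a loader such as paddle.load
        apply_fn(state)
        del state
        gc.collect()            # free the chunk to limit peak memory
```

With these helpers, `old_name(1, 234)` and `old_name(12, 34)` both yield `model_state.tp1234`, while `new_name` gives `model_state.tp1.234` and `model_state.tp12.34`. Sorting by the parsed index rather than by file name keeps `part10` after `part2`.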