Commit Graph

5 Commits

Author SHA1 Message Date
wikilsh 5e469fc901 [RL][BugFix][Optimization] Support chunked part files loading and fix model path format in IPC snapshot strategy (#6852)
* [RL] Support chunked part files loading in IPC snapshot strategy

## Motivation

When using IPC snapshot for elastic recovery in RL training, loading a single large pdparams file causes a significant memory spike. This PR refactors `_update_ipc_snapshot` to support loading chunked part files to avoid the memory spike.

## Modifications

Refactored `_update_ipc_snapshot` in `fastdeploy/rl/dynamic_weight_manager.py` with a three-level loading priority:

1. **Chunked part files** (`model_state.tpR{id}.part{N}.pdparams`): Load multiple smaller shards sequentially, freeing memory between each chunk via `gc.collect()` to avoid memory spike.
2. **Single full file** (`model_state.tpR{id}.pdparams`): Legacy single-file loading path (preserved for backward compatibility).
3. **Shared fallback directory** (`/shared_ipc_meta/...`): Oldest legacy fallback path (preserved for backward compatibility).

Also fixed the rank ID in the file name pattern from hardcoded `tp0` to dynamic `paddle.distributed.get_rank()`.

## Checklist

- [ ] Add at least a tag in the PR title.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Co-Authored-By: lishuaihui <lishuaihui@baidu.com>

* [RL] Support chunked part files loading in IPC snapshot strategy

## Motivation

When using IPC snapshot for elastic recovery in RL training, loading a single large pdparams file causes a significant memory spike. This PR refactors `_update_ipc_snapshot` to support loading chunked part files to avoid the memory spike.

## Modifications

Refactored `_update_ipc_snapshot` in `fastdeploy/rl/dynamic_weight_manager.py` with a three-level loading priority:

1. **Chunked part files** (`model_state.tpR{id}.part{N}.pdparams`): Load multiple smaller shards sequentially, freeing memory between each chunk via `gc.collect()` to avoid memory spike.
2. **Single full file** (`model_state.tpR{id}.pdparams`): Legacy single-file loading path (preserved for backward compatibility).
3. **Shared fallback directory** (`/shared_ipc_meta/...`): Oldest legacy fallback path (preserved for backward compatibility).

Also fixed the rank ID in the file name pattern from hardcoded `tp0` to dynamic `paddle.distributed.get_rank()`.

## Checklist

- [ ] Add at least a tag in the PR title.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Co-Authored-By: lishuaihui <lishuaihui@baidu.com>

* [RL][BugFix] Fix ambiguous model path format and add legacy fallback in IPC snapshot

## Motivation
The previous snapshot file naming `model_state.tp{rank}{id}` concatenated
rank and id without a separator, causing ambiguity (e.g., rank=1, id=234
and rank=12, id=34 both produce `tp1234`). Additionally, after the naming
format is updated, existing checkpoints saved in the old format would fail
to load during elastic recovery, causing unnecessary failures.

## Modifications
- Add dot separator between rank and id in snapshot file name:
  `model_state.tp{rank}{id}` → `model_state.tp{rank}.{id}`
- Add Priority 3 legacy fallback to load old-format files
  (`model_state.tp0{id}.pdparams`) for backward compatibility during
  rolling upgrades
- Update docstring and error message to reflect the new 4-level priority

Co-Authored-By: lishuaihui <lishuaihui@baidu.com>

* [RL][Test] Add unit tests for DynamicWeightManager._update_ipc_snapshot

Cover all 4 loading priority branches (chunked part files, single full
pdparams, legacy format, shared directory fallback) with mock-based
tests to verify correct behavior without filesystem or GPU dependencies.

Co-Authored-By: lishuaihui <lishuaihui@baidu.com>

* [RL][Test] Remove unused import 'call' in test_update_ipc_snapshot.py

Co-Authored-By: lishuaihui <lishuaihui@baidu.com>

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* [RL] Fix snapshot part index to match filename numbering

Parse part index from filename (e.g. .part0.) instead of using
enumerate index, so that logs and src_type stay consistent with
the actual file naming convention.

Co-Authored-By: wikilsh <wiki_hui@qq.com>

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-23 16:17:41 +08:00
GoldPancake 183b8d325a [RL] Support GLM MTP RL Model (#6267) 2026-02-04 20:14:35 +08:00
GoldPancake fb374238e1 Revert "[RL] Support GLM MTP RL Model (#6223)" (#6301)
This reverts commit af6c84d48d.
2026-02-02 14:08:13 +08:00
GoldPancake af6c84d48d [RL] Support GLM MTP RL Model (#6223)
* support glm mtp rl model

* fix

* fix

* fix ut

* update baseline
2026-01-28 08:28:03 -08:00
kesmeey d81341b9b3 [CI]【Hackathon 9th Sprint No.14】功能模块 fastdeploy/rl/rollout_model.py 单测补充 (#5552)
* Add rollout model unit tests

* test: update rl rollout_model tests

* test: fix cache_type_branches unsupported platform case

* test: fix rl rollout_model test indent

* Delete tests/spec_decode/test_mtp_proposer.py

* chore: format test_rollout_model

* chore: translate rollout test comments to English

* test: guard rollout_model import by disabling auto registry

* chore: reorder imports in rl rollout test

* test: isolate env for RL rollout tests

* style: format rollout RL tests with black

* update

* test: remove RL rollout unit tests causing collection issues

* test: add lightweight rollout_model RL unit tests

* fix(coverage): filter test file paths and handle collection failures

- Only extract real test file paths (tests/.../test_*.py) from pytest collect output

- Filter out ERROR/collecting prefixes to prevent garbage in failed_tests.log

- Add proper error handling for pytest collection failures

- Exit early if no test files can be extracted

- Preserve collection error output for debugging

* update

* style: fix code style issues in test_rollout_model.py

- Remove unused 'os' import

- Remove trailing blank lines

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2025-12-18 10:57:53 +08:00