Cherry-pick from release/2.5: the original assertion only checked
`prefill_kvcache_block_num >= max_block_num_per_seq`, but for
encoder-decoder models the kvcache must also reserve blocks for the
encoder side (`enc_dec_block_num`). Without this check, the service
could silently allocate insufficient blocks for enc-dec sequences.
- `CacheConfig.postprocess`: tighten the assertion to
`prefill_kvcache_block_num >= max_block_num_per_seq + enc_dec_block_num`;
the error message guides the user to reduce `max_model_len` or increase
`num_gpu_blocks_override`
- `CacheConfig.reset`: the same tightened check; the error message guides
the user to reduce `max_model_len` or switch to GPUs with more memory
(the override is not applicable here)
No change to the launch command. If the assertion fires, adjust one of the following:
```bash
python -m fastdeploy.entrypoints.openai.api_server \
--max-model-len <smaller_value> ...
python -m fastdeploy.entrypoints.openai.api_server \
--num-gpu-blocks-override <larger_value> ...
```
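A minimal Python sketch of the tightened check. The variable names come from this message (`prefill_kvcache_block_num`, `max_block_num_per_seq`, `enc_dec_block_num`), but the real check lives inside FastDeploy's `CacheConfig` and differs in detail:

```python
def check_prefill_blocks(prefill_kvcache_block_num: int,
                         max_block_num_per_seq: int,
                         enc_dec_block_num: int) -> None:
    """Raise if the prefill kvcache cannot hold one sequence plus its
    encoder-side blocks (the enc-dec case the original assertion missed)."""
    required = max_block_num_per_seq + enc_dec_block_num
    if prefill_kvcache_block_num < required:
        raise AssertionError(
            f"prefill_kvcache_block_num ({prefill_kvcache_block_num}) < "
            f"max_block_num_per_seq + enc_dec_block_num ({required}); "
            "reduce max_model_len or increase num_gpu_blocks_override"
        )
```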
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* [RL][BugFix] Fix incorrect rank in IPC snapshot model path
## Motivation
During elastic recovery, each rank should load its own model shard.
The hardcoded `tp0` caused all ranks to load rank-0's shard, leading
to incorrect weight initialization in multi-process scenarios.
## Modifications
- Replace hardcoded `tp0` with `paddle.distributed.get_rank()` in both
the primary model path and the fallback `/shared_ipc_meta/` path
inside `_update_ipc_snapshot`, so each rank correctly loads its own
shard during elastic recovery.
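An illustrative sketch of the path fix. The real code calls `paddle.distributed.get_rank()` inside `_update_ipc_snapshot`; here the rank is passed in so the path logic can be shown without a Paddle runtime, and the file-name pattern is an assumption:

```python
def ipc_snapshot_path(base_dir: str, model_id: str, rank: int) -> str:
    # Before the fix the shard name was hardcoded to "tp0", so every rank
    # loaded rank-0's weights; using the caller's rank selects its own shard.
    return f"{base_dir}/tp{rank}.{model_id}"
```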
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [RL][BugFix] Fix incorrect rank in IPC snapshot model path
## Motivation
During elastic recovery, each rank should load its own model shard.
The hardcoded `tp0` caused all ranks to load rank-0's shard, leading
to incorrect weight initialization in multi-process scenarios.
## Modifications
- Replace hardcoded `tp0` with `paddle.distributed.get_rank()` in both
the primary model path and the fallback `/shared_ipc_meta/` path
inside `_update_ipc_snapshot`, so each rank correctly loads its own
shard during elastic recovery.
Co-Authored-By: lishuaihui <lishuaihui@baidu.com>
* [RL] Support chunked part-file loading in IPC snapshot to reduce memory spike
Refactor `_update_ipc_snapshot` with a 4-level loading priority:
1. Chunked part files (with `gc.collect()` per part to reduce peak memory)
2. Single full pdparams file (new naming: `tp{rank}.{id}`)
3. Legacy format (`tp0{id}`)
4. Shared fallback directory (`/shared_ipc_meta/`)
Add unit tests covering all priority branches and error path.
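A minimal sketch of the 4-level priority described above. The directory layout, file naming, and the injected `load` callable are assumptions for illustration, not FastDeploy's actual API:

```python
import gc
import os
import re

def load_snapshot(snapshot_dir: str, rank: int, model_id: str, load):
    """Resolve and load a rank's snapshot following the 4-level priority."""
    # 1. Chunked part files: load one part at a time, collecting garbage
    #    between parts so peak memory stays near a single part's size.
    part_re = re.compile(rf"tp{rank}\.{re.escape(model_id)}\.part\d+\.pdparams$")
    parts = sorted(
        (f for f in os.listdir(snapshot_dir) if part_re.search(f)),
        key=lambda f: int(re.search(r"\.part(\d+)\.", f).group(1)),
    )
    if parts:
        state = {}
        for part in parts:
            state.update(load(os.path.join(snapshot_dir, part)))
            gc.collect()  # reduce the memory spike between parts
        return state
    # 2. Single full file with the new per-rank naming tp{rank}.{id}
    full = os.path.join(snapshot_dir, f"tp{rank}.{model_id}.pdparams")
    if os.path.exists(full):
        return load(full)
    # 3. Legacy naming tp0{id}
    legacy = os.path.join(snapshot_dir, f"tp0{model_id}.pdparams")
    if os.path.exists(legacy):
        return load(legacy)
    # 4. Shared fallback directory
    shared = os.path.join("/shared_ipc_meta", f"tp{rank}.{model_id}.pdparams")
    if os.path.exists(shared):
        return load(shared)
    raise FileNotFoundError(f"no snapshot for rank {rank} in {snapshot_dir}")
```

Sorting parts by their parsed index (rather than lexicographically) keeps `part10` from loading before `part2`.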
Co-Authored-By: lishuaihui <lishuaihui@baidu.com>
* [RL] Fix snapshot part index and add validation for part file naming
- Parse the part index from the filename instead of using the `enumerate`
index, keeping logs and `src_type` consistent with the actual file naming.
- Add validation for the part-file naming pattern; skip and warn on
files that do not match the expected `.partN.` convention.
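A sketch of the filename parsing and validation. The `.partN.` convention comes from this commit; the exact regex and the skip-and-warn handling are assumptions:

```python
import re

# Matches the ".partN." segment embedded in a part-file name.
_PART_RE = re.compile(r"\.part(\d+)\.")

def parse_part_index(filename: str):
    """Return the part index embedded in the filename, or None when the
    name does not match the .partN. convention (caller skips and warns)."""
    m = _PART_RE.search(filename)
    return int(m.group(1)) if m else None
```

Using the parsed index rather than an `enumerate` counter keeps log lines correct even when the directory listing is unordered or has gaps in the part numbering.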
Co-Authored-By: wikilsh <wiki_hui@qq.com>
---------
Co-authored-by: zhangjie83 <zhangjie83@baidu.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Update PaddlePaddle installation command in script
* Update PaddlePaddle installation in run_xpu_ci_pytest.sh
Replace PaddlePaddle installation method with a direct wheel file.
When `is_dummy_run=True`, calling `empty_input_forward` can cause
unexpected behavior. Add an `and not is_dummy_run` guard to both the
`_propose_cuda` and `_propose_xpu` paths.
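A toy illustration of the guard. The names (`empty_input_forward`, `is_dummy_run`) follow this message, but the surrounding function is hypothetical, standing in for the `_propose_cuda`/`_propose_xpu` call sites:

```python
def propose(model, is_dummy_run: bool, has_empty_input: bool):
    # Skip empty_input_forward entirely during dummy runs: with
    # is_dummy_run=True the call can misbehave, so both proposer
    # paths carry the same "and not is_dummy_run" guard.
    if has_empty_input and not is_dummy_run:
        return model.empty_input_forward()
    return None
```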
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix mtp acceptance rate decline
* [BugFix] Fix AttributeError in recycle_gpu_blocks when prefix_tree_status_signal not initialized
- Add a `hasattr` check before accessing `prefix_tree_status_signal`
- The signal is only initialized in `launch_cache_messager`, not in `__init__`
- Fixes a CI test failure in `test_prefix_cache_manager.py`
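A minimal illustration of the `hasattr` guard. The real class is FastDeploy's cache manager; this stub only shows the attribute check, with the signal value and recycle body invented for the example:

```python
class CacheManagerSketch:
    # prefix_tree_status_signal is deliberately NOT set in __init__;
    # in the real code it is created only by launch_cache_messager.
    def recycle_gpu_blocks(self, blocks):
        # Guard: the signal attribute may not exist yet if the messager
        # was never launched, so check before touching it.
        if hasattr(self, "prefix_tree_status_signal"):
            self.prefix_tree_status_signal = "recycling"
        return list(blocks)
```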
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [BugFix] Reset prefix cache when model weights are updating
- Call `self.reset()` before setting the status to `NORMAL` in the `UPDATING` state
- Ensure cache consistency when model weights change
- Consistent with the `CLEARING` state handling
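A hedged sketch of the state handling. The state names (`NORMAL`, `UPDATING`, `CLEARING`) and `reset` come from this message; the enum, method names, and cache representation are assumptions:

```python
from enum import Enum, auto

class Status(Enum):
    NORMAL = auto()
    UPDATING = auto()
    CLEARING = auto()

class PrefixCacheSketch:
    def __init__(self):
        self.status = Status.NORMAL
        self.entries = {"stale": 1}  # cache built from old weights

    def reset(self):
        self.entries.clear()

    def on_signal(self):
        if self.status in (Status.UPDATING, Status.CLEARING):
            # Reset BEFORE returning to NORMAL so entries built from the
            # previous weights can never serve new requests.
            self.reset()
            self.status = Status.NORMAL
```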
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* fix mtp acceptance rate decline
* [BugFix][Scheduler] Fix can_schedule_block_num_threshold calculation
Fix the calculation of `can_schedule_block_num_threshold` in
`ResourceManagerV1`. The original formula, based on
`need_prefill_tokens`, could produce incorrect threshold values; the
threshold now uses `num_chunk_new_block` directly for accurate block
scheduling.
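A rough sketch of the corrected decision. Only the variable name `num_chunk_new_block` comes from this message; the comparison shown here is an assumed simplification of the scheduler's actual logic:

```python
def can_schedule(available_block_num: int, num_chunk_new_block: int) -> bool:
    # Fixed behavior: compare free blocks against the blocks this chunk
    # actually needs, rather than a threshold derived from
    # need_prefill_tokens.
    return available_block_num >= num_chunk_new_block
```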
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>