Commit Graph

862 Commits

Author SHA1 Message Date
fxyfxy777 9f3b3ce7f5 [Optimization] merge_allreduce (#7039) 2026-04-02 19:52:13 +08:00
Yuanle Liu 1af7f80811 Revert "[BugFix][Speculative Decoding] Correct index calculation in speculate…" (#7133)
This reverts commit ba1aa1edff.
2026-04-01 06:54:23 -07:00
luukunn fa7a84926d [Optimization]Fix tool parser (#7079)
* fix tool parser
2026-04-01 21:20:34 +08:00
lonelygsh ba1aa1edff [BugFix][Speculative Decoding] Correct index calculation in speculate decoding operators (#7121)
- Fix accept_idx calculation in spec_set_value_by_stop_seqs
- Fix condition check from < to <= for token matching
- Fix accept_tokens indexing logic
- Remove unnecessary -1 in current_step comparison for max_think_len

Co-authored-by: guanshihui <guanshihui@baidu.com>
2026-04-01 05:36:53 -07:00
cmcamdy 7a2e33098f [XPU] Refactor pre process (#6993)
* [XPU] support speculate_pre_process

* merge develop

* fix code style

* fix mtp, support cu_seqlens_q_output

* fix mtp, support cu_seqlens_q_output

* fix test

---------

Co-authored-by: lizan1999 <lizan03@baidu.com>
2026-04-01 20:29:55 +08:00
luukunn fdfc908e2f [Others] reuse unit test (#7127) 2026-04-01 18:36:00 +08:00
sunxin c29e86fc9d [Feature] Support mtp overlap schedule (#7001) 2026-04-01 14:24:26 +08:00
YuBaoku c6f0c5c3a6 [CI] Optimize test execution with single-GPU parallelism (#7085)
* [CI] Optimize test execution with single-GPU parallelism and log collection

* remove export CUDA_VISIBLE_DEVICES

* fix path error

* fix log_* path and debug

* [CI] Optimize test execution with single-GPU parallelism and log collection
2026-04-01 14:18:40 +08:00
zhouchong 91c832f607 [Feature] Add logging parameters and error output to terminal (#7098) 2026-04-01 13:18:42 +08:00
luukunn 3651113ee5 [DataProcessor]Remove ENABLE_V1_DATA_PROCESSOR (#7052)
* remove ENABLE_V1_DATA_PROCESSOR

* fix unit test

* fix unit test
2026-04-01 09:53:41 +08:00
qwes5s5 ee2b965f5f adjust config info (#7054) 2026-03-31 21:26:05 +08:00
cloudforge1 5c5dc66aa7 [CI]【Hackathon 10th Spring No.34】Add unit tests for async_expert_loader (#6731)
* [CI]【Hackathon 10th Spring No.34】Add unit tests for async_expert_loader

* [CI]【Hackathon 10th Spring No.34】Add unit tests for async_expert_loader
---------

Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-03-31 15:29:35 +08:00
qwes5s5 daa95244f7 abort requests (#6992) 2026-03-31 11:02:26 +08:00
Yonghua Li 6d9739f360 [BugFix] fix speculative gauge metrics in multi api server (#7082) 2026-03-31 10:52:50 +08:00
chenjian 6727df8286 [Optimization] Optimize ttft for prefill pd (#6680)
* optimize ttft

* fix

* fix

* fix ci

* fix ci

* fix

* fix bug

* fix

* add comments

* fix ci

* fix

* fix ci

* fix format

* update according to review

* add comment

* fix

* fix format
2026-03-30 20:36:23 +08:00
jackyYang6 05f2d95729 [RL] Adapt async rollout checkpoint update flow (#7042)
* update checkpoint-transfer flow and control update_weights params

* test: add update_weights route validation
2026-03-30 19:19:34 +08:00
yzwu 8789329457 [Iluvatar] Support wi4a16 group_gemm (#7078) 2026-03-30 19:03:51 +08:00
kevin 18062c55bb [BugFix][KVCache] Fix mm hash boundary comparison in get_block_hash_extra_keys (#6929)
* [BugFix][KVCache] Fix mm hash boundary comparison in get_block_hash_extra_keys

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [BugFix][KVCache] Fix test_get_block_hash_extra_keys_boundary_cases assertions

## Motivation

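In the test case `test_get_block_hash_extra_keys_boundary_cases`, the call for
Block [4,8) incorrectly passed `mm_idx=1`, which skipped img0[2,5). However, img0
covers token 4, and token 4 belongs to block [4,8), so img0 should be included in
hash_keys. In addition, every assertEqual checked only hash_keys and did not
check the returned mm_idx cursor.

## Modifications

- `test_get_block_hash_extra_keys_boundary_cases`:
  - Switch to chained calls that pass the mm_idx returned by the previous call into the next one, simulating the real call loop
  - Block [4,8): change the input from `mm_idx=1` to the previously returned `mm_idx=0`, and change the expected value from `[]` to `["hash-0"]`
  - Change all assertions to `assertEqual((mm_idx, hash_keys), (...))` so that the cursor is checked as well
- `test_get_block_hash_extra_keys_no_overlap_at_boundaries`:
  - Case B: change the input from `mm_idx=1` to `mm_idx=0` (iterate from the start; img-a takes the continue branch)
  - Add mm_idx checks to all assertions
- `test_get_block_hash_extra_keys_image_crosses_block_boundary`:
  - Add mm_idx checks to all assertions
- `test_get_block_hash_extra_keys_no_mm_inputs`:
  - Add an mm_idx check to the assertion
- `test_get_block_hash_extra_keys_handles_multimodal_segments`:
  - Add mm_idx checks to the call2 and call3 assertions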

## Usage or Command

```bash
python -m pytest tests/cache_manager/test_prefix_cache_manager.py::TestPrefixCacheManagerCoverage -v -k "get_block_hash_extra_keys"
```
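The boundary behaviour described in the motivation can be illustrated with a minimal sketch. This is not the FastDeploy implementation of `get_block_hash_extra_keys`; the function name `block_extra_keys`, the `(start, end, hash)` tuple layout, and the second image span are assumptions made for illustration. Only img0[2,5), block [4,8), and the `"hash-0"` expectation come from the commit message.

```python
def block_extra_keys(block_start, block_end, mm_spans, mm_idx=0):
    """Return (next_mm_idx, hash_keys) for the token block [block_start, block_end).

    mm_spans is a list of (start, end, mm_hash) tuples sorted by start,
    each covering the half-open token range [start, end). (Hypothetical layout.)
    """
    hash_keys = []
    i = mm_idx
    while i < len(mm_spans):
        start, end, mm_hash = mm_spans[i]
        if end <= block_start:
            # Item lies entirely before this block: skip it and advance.
            i += 1
            continue
        if start >= block_end:
            # Item lies entirely after this block: a later block handles it.
            break
        hash_keys.append(mm_hash)
        if end > block_end:
            # Item spills into the next block: keep the cursor on it.
            break
        i += 1
    return i, hash_keys


# Chained calls: each call receives the cursor returned by the previous one.
spans = [(2, 5, "hash-0"), (8, 10, "hash-1")]  # second span is an assumed example
cursor = 0
results = []
for block in [(0, 4), (4, 8), (8, 12)]:
    cursor, keys = block_extra_keys(block[0], block[1], spans, cursor)
    results.append(keys)
```

Starting block [4,8) from `mm_idx=1` instead of the chained cursor would skip img0 and return an empty key list, which is the mistake the corrected test guards against.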

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: chengyanfu <chengyanfu@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 17:13:31 +08:00
luukunn b9f8873367 [Optimization]Merge Text processor (#7030)
* merge text processor

* update

* fix unit test

* merge messages2ids

* fix unit test

* remove duplicate code

* remove redundant code

* delete code

* fix unit test
2026-03-30 15:02:35 +08:00
mpgemm 1a1d048774 [Feature] Support NVFP4 Flashinfer-cutedsl MoE on SM100 (#6963) 2026-03-30 11:37:04 +08:00
Longzhi Wang 2eea6fa97a [BugFix] Fix kv cache int8 dynamic quant on flash and flash_mask backend (#7028)
* [BugFix] Fix kv cache int8 dynamic quant on flash and flash_mask backend

* add constexpr and code style clean

* add test

* fix code style

* fix test
2026-03-30 11:17:15 +08:00
mpgemm 7a20eaebe8 [Feature] Support cute cpp Encoder FA4 (#7016)
* add cute cpp fa4

* remove comments

* fix merge error

* move sm_version into the function

* fix CI error
2026-03-30 10:54:56 +08:00
YuBaoku 842c60809a [CI] Align with Paddle layer_norm kernel update (#7056) 2026-03-27 22:58:01 +08:00
cloudforge1 11ad95ba91 [CI]【Hackathon 10th Spring No.43】Add unit tests for ernie4_5_mtp (#6738)
* [CI]【Hackathon 10th Spring No.43】Add unit tests for ernie4_5_mtp

* [CI]【Hackathon 10th Spring No.43】add mapping and forward branch coverage

---------

Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com>
Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-03-27 17:15:53 +08:00
YuBaoku 10c59f78d6 [CI] disable tests/e2e/test_Qwen3VLMoe_serving.py in unit_test (#7044) 2026-03-27 14:15:14 +08:00
Jiaxin Sui c3ed7db28d [XPU] [CI] Fix xpu ci bug (#7014)
* fix xpu ci bug

* Remove unnecessary blank line in conftest.py

* Update upload-artifact action to version 6

* Update _xpu_8cards_case_test.yml

* fix ci bug

* Change exit code on test failure to 1

* fix ci bug

* fix ci bug

* fix ci bug

* fix ci bug

* Update conftest.py
2026-03-27 10:29:34 +08:00
Zhang Yulong a31d4bfbdf [CI] update mtp case (#7031) 2026-03-27 10:21:37 +08:00
huicongyao 25d64efdc4 [Speculative Decoding] Refactor Eagle MTP hidden states copy (#6812)
* reformat eagle_get_hidden_states & eagle_get_self_hidden_states

* readability

* fix xpu bug

* fix coverage failure

* change launch params & parallelize position_map compute

* Fix MTP-related bugs in FastDeploy centralized inference

* fix

* refactor mtp hidden_states process

* fix

* add unittest & optimize kernel

* remove useless code

* fix
2026-03-25 22:54:31 -07:00
YuBaoku 61ebac49ef [CI] Fix test_communication.py and add port cleanup (#7021) 2026-03-26 10:56:40 +08:00
luukunn e6804ba97d [Optimization]Streaming requests return complete special tokens. (#6998)
* return special token

* add completions

* update

* fix

* add prompt_token_ids & completion_token_ids=None

* fix unit test
2026-03-26 09:49:43 +08:00
luukunn d5cb2767d7 [Optimization] Deduplicate shared image/video utilities across VL processors (#6988)
* step1~3

* fix import path

* remove duplicate code

* remove duplicate code

* remove duplicate code

* fix import path

* update

* fix import path

* add unit test

* fix

* update

* fix unit test
2026-03-26 09:49:33 +08:00
YuBaoku b8bb34c7dd [CI] disable tests/distributed/test_communication.py in unit_test (#7019) 2026-03-25 20:54:55 +08:00
Yonghua Li a7f52c300d [Feature] support v1 update/clear api for RL (#6761)
* [Feature] support v1 update/clear api for RL

* [fix] fix execute_model and add sleep/wakeup api

* [fix] fix mtp and key_prefix

* [chore] move _update_key_prefix to resume method

* [fix] make the interface safe to call multiple times

* [fix] fix some tiny bugs

* [chore] make small changes against pr review

* [docs] add docs for weight update

* [test] add some tests and update docs

* [style] fix code style check

* [test] fix ci

* [fix] fix stale control responses when control method timed out

* [chore] remove unused code

* [chore] fix code style

* [chore] optimize tags and key_prefix

* [test] fix ci

* [chore] fix code style

* [test] fix ci

* [fix] fix ep control

* [fix] fix ep control for engine cache queue
2026-03-25 19:18:46 +08:00
gongweibao 48cfb608aa [FDConfig] Reduce FD_CUSTOM_AR_MAX_SIZE_MB default from 64 to 8 (#6997)
Most single-GPU and small-model deployments do not need 64MB custom
all-reduce buffers. Lowering the default to 8MB reduces unnecessary
shared memory allocation. Tests that require larger buffers now
explicitly set the value.
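
As a rough illustration of what the new default means in bytes, here is a minimal sketch. How FastDeploy actually reads `FD_CUSTOM_AR_MAX_SIZE_MB` is not shown in this log; the helper name `custom_ar_max_size_bytes` and the lookup logic are assumptions.

```python
import os

DEFAULT_CUSTOM_AR_MAX_SIZE_MB = 8  # default after this commit; previously 64


def custom_ar_max_size_bytes(environ=os.environ):
    """Return the custom all-reduce buffer size in bytes (hypothetical helper)."""
    size_mb = int(environ.get("FD_CUSTOM_AR_MAX_SIZE_MB", DEFAULT_CUSTOM_AR_MAX_SIZE_MB))
    return size_mb * 1024 * 1024


# A test that needs the old, larger buffer can pass the value explicitly.
default_bytes = custom_ar_max_size_bytes({})
large_bytes = custom_ar_max_size_bytes({"FD_CUSTOM_AR_MAX_SIZE_MB": "64"})
```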

Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-25 17:40:01 +08:00
freeliuzc 7a6c28781b [Speculative Decoding] Optimize attn_mask_offset and fix mtp bug (#7005)
* optimize attn_mask_offset and optimize mtp usage

* delete useless branch

* fix kernel format

* fix kernel runner
2026-03-25 01:52:06 -07:00
YuBaoku aee293be0f [CI] Optimize: add vl swap_test and remove useless code (#7000) 2026-03-25 11:33:56 +08:00
YuBaoku 4e8d503e3c Revert "add deepep precision test (#6984)" (#7004)
This reverts commit 522d12c25a.
2026-03-25 10:50:40 +08:00
周周周 522d12c25a add deepep precision test (#6984) 2026-03-24 19:51:33 +08:00
SUN Dong 6cff780fdb [RL] Support moe_topk_select using Paddle native operators and Add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization and swiglu-fp8-quant op for DeepGemmFusedMoE for training alignment (#6850)
* [RL] Add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization

* update

* update

* update

* support custom topk in DeepGemmFusedMoeMethod apply_tp

* apply_ep_prefill support moe_topk_select

* update

* add ut

* add ut

* add ut

* modify doc

* fix env and docs

* add ut

---------

Co-authored-by: zhanghonggeng <zhanghonggeng@baidu.com>
2026-03-24 11:12:39 +08:00
freeliuzc e87ce4b8cd [Speculative Decoding] refactor MTP and optimize spec-decoding postprocess (#6973)
* support new mtp

* refactor(speculate_decoding and mtp): optimize mtp structure logic. Update spec-branch status-process

* fix cuda-graph for spec-decoding

* fix xpu mtp and fix some note

* fix unittest and optimize note

* fix model status update in eos-branch
2026-03-24 10:19:01 +08:00
bukejiyu c62f6b4ea5 [Others] Fix PD reorder for MTP (#6792)
* fix pd reorder in mtp

* add ut

* update

* fix mtp
2026-03-23 21:10:22 +08:00
wikilsh 5e469fc901 [RL][BugFix][Optimization] Support chunked part files loading and fix model path format in IPC snapshot strategy (#6852)
* [RL] Support chunked part files loading in IPC snapshot strategy

## Motivation

When using IPC snapshot for elastic recovery in RL training, loading a single large pdparams file causes a significant memory spike. This PR refactors `_update_ipc_snapshot` to support loading chunked part files to avoid the memory spike.

## Modifications

Refactored `_update_ipc_snapshot` in `fastdeploy/rl/dynamic_weight_manager.py` with a three-level loading priority:

1. **Chunked part files** (`model_state.tpR{id}.part{N}.pdparams`): Load multiple smaller shards sequentially, freeing memory between each chunk via `gc.collect()` to avoid memory spike.
2. **Single full file** (`model_state.tpR{id}.pdparams`): Legacy single-file loading path (preserved for backward compatibility).
3. **Shared fallback directory** (`/shared_ipc_meta/...`): Oldest legacy fallback path (preserved for backward compatibility).

Also fixed the rank ID in the file name pattern from hardcoded `tp0` to dynamic `paddle.distributed.get_rank()`.
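
The first two loading priorities can be sketched as follows. This is a minimal illustration, not the code in `fastdeploy/rl/dynamic_weight_manager.py`: the names `find_snapshot_source` and `load_snapshot`, the `stem` argument, and the injected `load_fn` / `apply_fn` callables are all assumptions, and the shared fallback directory is left out. Part files are ordered by the number parsed from the filename, so that `part10` sorts after `part2`.

```python
import gc
import re
from pathlib import Path

# Matches the ".part{N}.pdparams" suffix named in the commit message.
PART_RE = re.compile(r"\.part(\d+)\.pdparams$")


def find_snapshot_source(snapshot_dir, stem):
    """Pick a source: ("parts", [paths]), ("full", path) or ("missing", None)."""
    root = Path(snapshot_dir)
    parts = [p for p in root.glob(f"{stem}.part*.pdparams") if PART_RE.search(p.name)]
    if parts:
        # Sort numerically by the part index embedded in the filename.
        parts.sort(key=lambda p: int(PART_RE.search(p.name).group(1)))
        return "parts", parts
    full = root / f"{stem}.pdparams"
    if full.exists():
        return "full", full
    return "missing", None


def load_snapshot(snapshot_dir, stem, load_fn, apply_fn):
    """Load chunked parts one at a time, else the single full file."""
    kind, src = find_snapshot_source(snapshot_dir, stem)
    if kind == "parts":
        for path in src:
            state = load_fn(path)   # e.g. a framework-specific loader (assumed)
            apply_fn(state)         # copy the weights into the live model (assumed)
            del state
            gc.collect()            # release the chunk before loading the next one
    elif kind == "full":
        apply_fn(load_fn(src))
    return kind
```

Passing the loader and the apply step as callables keeps the sketch free of framework and GPU dependencies, similar in spirit to the mock-based tests added later in this pull request.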

## Checklist

- [ ] Add at least a tag in the PR title.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Co-Authored-By: lishuaihui <lishuaihui@baidu.com>

* [RL][BugFix] Fix ambiguous model path format and add legacy fallback in IPC snapshot

## Motivation
The previous snapshot file naming `model_state.tp{rank}{id}` concatenated
rank and id without a separator, causing ambiguity (e.g., rank=1, id=234
and rank=12, id=34 both produce `tp1234`). Additionally, after the naming
format is updated, existing checkpoints saved in the old format would fail
to load during elastic recovery, causing unnecessary failures.

## Modifications
- Add dot separator between rank and id in snapshot file name:
  `model_state.tp{rank}{id}` → `model_state.tp{rank}.{id}`
- Add Priority 3 legacy fallback to load old-format files
  (`model_state.tp0{id}.pdparams`) for backward compatibility during
  rolling upgrades
- Update docstring and error message to reflect the new 4-level priority
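
The ambiguity described in the motivation is easy to reproduce. The helper names `old_name` and `new_name` below are hypothetical; the two format strings and the example values are taken from the commit message.

```python
def old_name(rank, snapshot_id):
    """Old format: rank and id concatenated without a separator."""
    return f"model_state.tp{rank}{snapshot_id}"


def new_name(rank, snapshot_id):
    """New format: a dot separates rank and id."""
    return f"model_state.tp{rank}.{snapshot_id}"


# Two different (rank, id) pairs collide under the old format...
assert old_name(1, 234) == old_name(12, 34) == "model_state.tp1234"
# ...and stay distinct under the new one.
assert new_name(1, 234) == "model_state.tp1.234"
assert new_name(12, 34) == "model_state.tp12.34"
```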

Co-Authored-By: lishuaihui <lishuaihui@baidu.com>

* [RL][Test] Add unit tests for DynamicWeightManager._update_ipc_snapshot

Cover all 4 loading priority branches (chunked part files, single full
pdparams, legacy format, shared directory fallback) with mock-based
tests to verify correct behavior without filesystem or GPU dependencies.

Co-Authored-By: lishuaihui <lishuaihui@baidu.com>

* [RL][Test] Remove unused import 'call' in test_update_ipc_snapshot.py

Co-Authored-By: lishuaihui <lishuaihui@baidu.com>

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* [RL] Fix snapshot part index to match filename numbering

Parse part index from filename (e.g. .part0.) instead of using
enumerate index, so that logs and src_type stay consistent with
the actual file naming convention.

Co-Authored-By: wikilsh <wiki_hui@qq.com>

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-23 16:17:41 +08:00
jc bb881c2c0a [PD Disaggregation] pd + cache_storage support vl model (#6906)
* pd + cache_storage support vl model

* support vl model

* fix test
2026-03-23 15:35:20 +08:00
jackyYang6 634d23a38a [Bugfix] Align thinking_budget behavior with ERNIE reasoning flow (#6934)
* [Bugfix] Align thinking_budget behavior with ERNIE reasoning flow

* [Docs] Fix thinking_budget markdown formatting

* [Test] Align ernie thinking budget test with process_request_dict
2026-03-23 14:15:55 +08:00
YuBaoku 0b4c1cba9b [CI] Change 21b ep4 to tp1_dp4 in 4_cards_tests (#6745)
* [CI] Change 21b ep4 to tp1_dp4 in 4_cards_tests
2026-03-20 20:42:23 +08:00
jackyYang6 00eb12f656 [BugFix][Models] Unify PaddleFormers fused QKV TP loading and stabilize fallback TP path (#6555)
* [BugFix][Models] avoid custom all-reduce in PaddleFormers fallback TP path and tighten TP-aware layout matching

* [BugFix][Models] unify PaddleFormers fused QKV TP loading and align fallback tests
2026-03-20 16:37:58 +08:00
AIbin bf7e2424d0 [Optimization][Feature]Supports multiple batches of DSK-DSA. (#6930)
* support DSA_MUTI_BATCH

* update test topk

* update dsk-dsa
2026-03-20 15:59:22 +08:00
cloudforge1 aca733b95c [CI]【Hackathon 10th Spring No.32】load_weight_utils unit test (#6740)
* 【Hackathon 10th Spring No.32】Unit test for load_weight_utils.py

* [CI]【Hackathon 10th Spring No.32】rewrite load_weight_utils unit test

* [CI]【Hackathon 10th Spring No.32】improve load_weight_utils coverage to 83%

- Add test_load_ep_checkpoint_basic: exercises EP checkpoint loading with minimal fixture
- Add test_composite_ep_branch: covers EP path in load_composite_checkpoint
- Add test_get_weight_iterator_unordered: covers unordered sharded safetensors path

* [CI]【Hackathon 10th Spring No.32】align load_weight_utils test with gold standard (tmp_path, split tests)

* [CI]【Hackathon 10th Spring No.32】add coverage tests for load_weight_utils

- Add test_is_layers_grouped: test layers_are_grouped() with grouped, interleaved, and no-layer keys
- Add test_save_model_bf16_cache: exercise save_model decorator with is_checkpoint_bf16=True
- Add test_composite_checkpoint_ep: test load_composite_checkpoint use_ep=True branch
- Add test_composite_checkpoint_rank_mismatch: test tp_size != rank_dirs ValueError
- Add test_composite_checkpoint_kv_quant: test float8_e4m3fn kv_cache path
- Add __main__ block for direct execution

* [CI]【Hackathon 10th Spring No.32】raise load_weight_utils test delta

* [CI]【Hackathon 10th Spring No.32】cover TP sequence-parallel MoE load branches

* test: add load_reordered_experts, pre-sharded, and empty-state tests


---------

Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com>
2026-03-20 13:14:30 +08:00
luukunn f4a79d4c00 [Optimization]Unified data processing for online and offline (#6891)
* remove process_request

* fix chat

* fix unit test

* remove process response

* fix unit test

* fix offline decode

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* fix sampling_params

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-19 21:56:09 +08:00
luukunn c3d8db85c4 [Optimization] Update ZMQ server (#6735)
* add batch zmq send response

* update

* Revert "update"

This reverts commit 0234a25b47.

* update

* remove lock

* fix unit test

* add unit test

* add unit test

* pre commit

* add unit test

* fix unit test

* add unit test

* fix worker>1

* update zmq_worker_pid

* fix unit test

* fix unit test

* fix unit test

* add unit test

* fix unit test

* fix first token time

* fix logprobs

* add unit test

* op

* remove debug log

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-03-19 21:53:16 +08:00