Commit Graph

4274 Commits

Author SHA1 Message Date
YuBaoku f5e3d75ca2 [BugFix] fix tool call parser (#7369) (#7417)
* fix tool call parser

* add unit test

* fix unit test

* add unit test

Co-authored-by: luukunn <981429396@qq.com>
2026-04-16 17:14:50 +08:00
Zero Rains 2a0bfdc8a6 [KSM] fix the bug in topk (#7395) 2026-04-14 20:36:29 -07:00
Yuanle Liu 3a1449318c Revert "[KSM] fix logz when top_k (#7225)" (#7385)
This reverts commit f83673daac.
2026-04-14 02:59:42 -07:00
YuBaoku 19b0038234 [Cherry-Pick][CI] Sync dev optimizations to 2.4(#7335) (#7346)
* [Cherry-Pick][CI] Sync dev optimizations to 2.4(#7335)
2026-04-12 20:21:17 +08:00
YuBaoku cdc5fce1b6 [BugFix] Fix Async D2H copy bug & flash mask atten cache V out-of-bound bug (#7221) (#7293)
Co-authored-by: ming1753 <61511741+ming1753@users.noreply.github.com>
2026-04-12 14:01:00 +08:00
Yuanle Liu f83673daac [KSM] fix logz when top_k (#7225) 2026-04-07 23:14:27 -07:00
luukunn 6955182309 [Cherry-Pick][Optimization]Fix tool parser (#7079) (#7149)
* [Optimization]Fix tool parser (#7079)

* fix tool parser

* fix unit test
2026-04-07 16:52:10 +08:00
RAM 93d7cdc061 [RL][Cherry-Pick] Fix the out-of-bounds issue caused by int32 in the R3 kernel (#7158)
* Fix int32 overflow

* refine code
2026-04-02 04:19:47 -07:00
YuBaoku 92abc143d5 [Cherry-Pick][CI] Remove skip logic for *.txt-only changes (#7104) (#7117) 2026-04-02 16:41:56 +08:00
Nyakku Shigure 57b97d3a1a [Cherry-Pick][Optimization] Use a separate driver when using Triton with Paddle (#6897) (#7114)
---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-04-01 15:22:34 +08:00
Yuanle Liu 6051d12385 [KSM] fix sampling mask (#7106) v2.4.1_20260331_0 2026-03-30 23:35:26 -07:00
kevin 02e2918500 [Cherry-Pick][BugFix][KVCache] Add enc_dec_block_num to prefill_kvcache_block_num check (#7090)
Cherry-pick from release/2.5: the original assertion only checked
`prefill_kvcache_block_num >= max_block_num_per_seq`, but for
encoder-decoder models the kvcache must also reserve blocks for the
encoder side (`enc_dec_block_num`). Without this check, the service
could silently allocate insufficient blocks for enc-dec sequences.

- `CacheConfig.postprocess`: tighten assertion to
  `prefill_kvcache_block_num >= max_block_num_per_seq + enc_dec_block_num`,
  error message guides user to reduce `max_model_len` or increase
  `num_gpu_blocks_override`
- `CacheConfig.reset`: same tightening, error message guides user to
  reduce `max_model_len` or replace with larger GPU cards (override
  is not applicable here)
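The tightened check described above can be sketched roughly as follows (a minimal illustration; only the parameter names come from this commit message, the helper itself is hypothetical):

```python
def check_prefill_blocks(prefill_kvcache_block_num: int,
                         max_block_num_per_seq: int,
                         enc_dec_block_num: int) -> None:
    # Encoder-decoder models must also reserve blocks for the encoder
    # side, so the old check against max_block_num_per_seq alone could
    # silently under-allocate.
    required = max_block_num_per_seq + enc_dec_block_num
    assert prefill_kvcache_block_num >= required, (
        f"prefill_kvcache_block_num={prefill_kvcache_block_num} < {required}; "
        "reduce max_model_len or increase num_gpu_blocks_override"
    )
```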

No change to launch command. If the assertion fires, adjust:

```bash
python -m fastdeploy.entrypoints.openai.api_server \
  --max-model-len <smaller_value> ...

python -m fastdeploy.entrypoints.openai.api_server \
  --num-gpu-blocks-override <larger_value> ...
```

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-31 14:31:52 +08:00
kevin ed71a6028d [Cherry-Pick][Refactor] Replace skip_mm_profiling with deploy_modality=text to skip mm profiling (#ORIG) (#7087)
## Motivation

The original `--skip-mm-profiling` flag semantically overlapped with the existing `deploy_modality` parameter: when deploying in text-only mode (`deploy_modality=text`), there is no need to reserve GPU memory for multimodal tokens in the first place. Introducing a separate flag added configuration complexity; reusing `deploy_modality` is more intuitive and consistent.

release/2.4 adaptation note: this branch originally had neither `DeployModality` nor `deploy_modality`, so the cherry-pick also introduces the `DeployModality` enum class and the `FDConfig.deploy_modality` attribute.

## Modifications

- `fastdeploy/config.py`
  - Add the `DeployModality` enum class (TEXT / MIXED)
  - `FDConfig.__init__` gains a `deploy_modality` parameter, defaulting to MIXED
  - In `get_max_chunk_tokens`, change the condition to
    `self.deploy_modality != DeployModality.TEXT`,
    so that when deploy_modality is text the method returns `max_num_batched_tokens` directly, skipping the mm-token addition
- `fastdeploy/engine/args_utils.py`: remove the `EngineArgs.skip_mm_profiling` field and the
  `--skip-mm-profiling` launch flag

## Usage or Command

```bash
# Deploy in text mode, skipping the mm-token profiling overhead (replaces the former --skip-mm-profiling)
python -m fastdeploy.entrypoints.openai.api_server \
  --deploy-modality text \
  --model /path/to/model \
  ...
```
2026-03-30 20:17:45 -07:00
YuBaoku 5ac7a6a51e [Cherry-Pick][CI] Sync develop fix and optimizations to 2.4(#6975) (#7066)
* [Cherry-Pick][CI] Sync develop fix and optimizations to 2.4(#6975)
2026-03-29 21:57:29 +08:00
YuBaoku 936c3a05e0 [Cherry-Pick][Optimization]Streaming requests return complete special tokens.(#6998) (#7041)
* [Optimization]Streaming requests return complete special tokens. (#6998)

* return special token

* add completions

* update

* fix

* add prompt_token_ids & completion_token_ids=None

* fix unit test

* fix completion_tokens error

---------

Co-authored-by: luukunn <83932082+luukunn@users.noreply.github.com>
2026-03-27 19:35:59 +08:00
luukunn f35f0c1a3f [Cherry-Pick]Allows tools and content to coexist.(#6656) (#6996)
* Support glm tool and content appearing together

* fix unit test
2026-03-25 15:00:08 +08:00
Siming Dai 4516c58b10 [KSM][Optimization] renormalized logprobs when using keep sampling mask (#6966) 2026-03-23 05:55:48 -07:00
wikilsh 2948c2e06d [Cherry-Pick][RL][BugFix][Optimization] Support chunked part files loading and fix model path format in IPC snapshot strategy(#6852) (#6909)
* [RL][BugFix] Fix incorrect rank in IPC snapshot model path

## Motivation
During elastic recovery, each rank should load its own model shard.
The hardcoded `tp0` caused all ranks to load rank-0's shard, leading
to incorrect weight initialization in multi-process scenarios.

## Modifications
- Replace hardcoded `tp0` with `paddle.distributed.get_rank()` in both
  the primary model path and the fallback `/shared_ipc_meta/` path
  inside `_update_ipc_snapshot`, so each rank correctly loads its own
  shard during elastic recovery.
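The rank-aware path fix above amounts to building the shard path from the caller's rank instead of a hardcoded `tp0`. A minimal sketch, assuming the `tp{rank}` naming from this commit message (the helper name and path layout are illustrative):

```python
def ipc_snapshot_path(base_dir: str, rank: int, snapshot_id: int) -> str:
    # Each rank resolves its own shard; previously every rank loaded
    # the hardcoded "tp0" shard, corrupting multi-process recovery.
    return f"{base_dir}/tp{rank}.{snapshot_id}"
```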

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* [RL][BugFix] Fix incorrect rank in IPC snapshot model path

## Motivation
During elastic recovery, each rank should load its own model shard.
The hardcoded `tp0` caused all ranks to load rank-0's shard, leading
to incorrect weight initialization in multi-process scenarios.

## Modifications
- Replace hardcoded `tp0` with `paddle.distributed.get_rank()` in both
  the primary model path and the fallback `/shared_ipc_meta/` path
  inside `_update_ipc_snapshot`, so each rank correctly loads its own
  shard during elastic recovery.

Co-Authored-By: lishuaihui <lishuaihui@baidu.com>

* [RL] Support chunked part-file loading in IPC snapshot to reduce memory spike

Refactor _update_ipc_snapshot with 4-level loading priority:
1. Chunked part files (with gc.collect per part to reduce peak memory)
2. Single full pdparams file (new naming: tp{rank}.{id})
3. Legacy format (tp0{id})
4. Shared fallback directory (/shared_ipc_meta/)

Add unit tests covering all priority branches and error path.

Co-Authored-By: lishuaihui <lishuaihui@baidu.com>

* [RL] Fix snapshot part index and add validation for part file naming

- Parse part index from filename instead of using enumerate index,
  keeping logs and src_type consistent with actual file naming.
- Add validation for part file naming pattern; skip and warn on
  files that do not match the expected .partN. convention.
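The part-index parsing and validation described above can be sketched as follows (a hypothetical helper; only the `.partN.` naming convention comes from the commit message):

```python
import re

# Expected ".partN." convention from the part-file naming scheme.
_PART_PATTERN = re.compile(r"\.part(\d+)\.")

def parse_part_index(filename: str):
    """Return the part index parsed from the filename itself (not from
    an enumerate counter), or None when the name does not follow the
    expected `.partN.` convention so callers can warn and skip it."""
    match = _PART_PATTERN.search(filename)
    if match is None:
        return None
    return int(match.group(1))
```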

Co-Authored-By: wikilsh <wiki_hui@qq.com>

---------

Co-authored-by: zhangjie83 <zhangjie83@baidu.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-23 16:17:52 +08:00
YuBaoku 7ec091f688 fix code type (#6951) (#6953)
Co-authored-by: SunLei <sunlei5788@gmail.com>
2026-03-20 22:30:44 +08:00
Yuanle Liu 02d8e1a930 [KSM] fix mtp support top_k (#6911) 2026-03-18 07:26:05 -07:00
Yuanle Liu 7f5f2113c2 Support keep sampling mask (#6725)
* naive version

* return list(int)

* fix bug: first_token's sampling mask miss

* pre-commit

* support mtp

* pre-commit

* fix ut

* fix zmq name conflicts

* fix ut

* add ut

* fix ut timeout

* optimize performance

* fix

* support top_k mask

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* update comment

* update comment

* update comment

---------

Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-17 20:07:31 -07:00
Copilot a714c1f8d4 [Cherry-Pick][Optim] Simplify available_blocks assignment logic (#6819) (#6873)
* Initial plan

* [Cherry-Pick][Optim] Simplify available_blocks assignment logic (#6819)

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* Update common_engine.py

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-03-17 14:45:23 +08:00
Yonghua Li 6db6fd7567 [Cherry-Pick] [BugFix] resolve get_save_output_v1 socket name conflicts between multiple instances (#6758) (#6767)
* [fix] resolve get_save_output_v1 socket name conflicts between multiple instances

* [fix] fix engine_worker_queue_port

* [fix] fix engine_worker_queue_port
2026-03-12 15:18:56 +08:00
Jiaxin Sui 1dca05fd00 [XPU][CI]Update PaddlePaddle installation command in script (#6794)
* Update PaddlePaddle installation command in script

* Update PaddlePaddle installation in run_xpu_ci_pytest.sh

Replace PaddlePaddle installation method with a direct wheel file.
2026-03-12 10:17:22 +08:00
RAM 001c8539f2 [RL] Fix R3 Empty bug with TP=1 (#6777) 2026-03-10 23:37:26 -07:00
RAM 754c94f9c8 [RL]Perf: Optimize batch delete prefix and fused put in R3 (#6604)
* Optimize batch delete and fused put

* refine code

* refine code

* refine code

* Support suspend r3
2026-03-10 20:17:55 -07:00
Yuanle Liu 7fb8af4318 [BugFix][MTP] Skip empty_input_forward during dummy run (#6655)
When `is_dummy_run=True`, calling `empty_input_forward` can cause
unexpected behavior. Add `and not is_dummy_run` guard for both
`_propose_cuda` and `_propose_xpu` paths.
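The guard above can be sketched minimally as follows; apart from `empty_input_forward` and `is_dummy_run`, which appear in the commit message, the class and flag names are illustrative:

```python
class MTPProposer:
    def __init__(self):
        self.calls = []

    def empty_input_forward(self):
        self.calls.append("empty_input_forward")

    def _propose_cuda(self, batch_is_empty: bool, is_dummy_run: bool):
        # Guard: skip the empty-input forward entirely during dummy runs,
        # where calling it can cause unexpected behavior.
        if batch_is_empty and not is_dummy_run:
            self.empty_input_forward()
        self.calls.append("propose")
```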

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-04 23:07:37 -08:00
GoldPancake f3991a859d [Cherry-Pick][BugFix] fix mtp_config in rl (#6595) (#6597)
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-03-03 16:29:46 +08:00
Yonghua Li ad1ea43d7b [Cherry-Pick] [BugFix] fix prefix tree updating timeout (#6615) (#6617) 2026-03-03 14:35:54 +08:00
RAM 7d76fd4398 [RL] Clear Requests status of R3 (#6569) 2026-03-01 23:16:59 -08:00
kevin f26e3de077 [Cherry-Pick][BugFix] Fix AttributeError in recycle_gpu_blocks when prefix_tree_status_signal not initialized(#6531) (#6559)
* fix mtp acceptance rate decline

* [BugFix] Fix AttributeError in recycle_gpu_blocks when prefix_tree_status_signal not initialized

- Add hasattr check before accessing prefix_tree_status_signal
- The signal is only initialized in launch_cache_messager, not in __init__
- Fixes CI test failure in test_prefix_cache_manager.py
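The hasattr guard above can be sketched as follows; only the signal name comes from the commit message, the rest is a hypothetical illustration:

```python
class PrefixCacheManager:
    def launch_cache_messager(self):
        # The signal is only created here, never in __init__, so other
        # methods cannot assume it exists.
        self.prefix_tree_status_signal = "NORMAL"

    def recycle_gpu_blocks(self, blocks):
        # hasattr guard: tolerate the messager never having been launched
        # instead of raising AttributeError.
        if (hasattr(self, "prefix_tree_status_signal")
                and self.prefix_tree_status_signal != "NORMAL"):
            return []  # defer recycling while the tree is updating/clearing
        return list(blocks)
```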

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [BugFix] Reset prefix cache when model weights are updating

- Call self.reset() before setting status to NORMAL in UPDATING state
- Ensure cache consistency when model weights change
- Consistent with CLEARING state handling

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 13:12:04 +08:00
Copilot b12713a67f [Cherry-Pick][BugFix][APIServer] Enable control socket disable option in API server (#6551) (#6554)
* Initial plan

* [BugFix][APIServer] Add control_socket_disable to gunicorn options (cherry-pick of #6551)

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-03-01 13:47:55 +08:00
kevin dc095eaa56 [Cherry-Pick][BugFix][Scheduler] Fix can_schedule_block_num_threshold calculation(#6541) (#6540)
* fix mtp acceptance rate decline

* [BugFix][Scheduler] Fix can_schedule_block_num_threshold calculation

Fix the calculation of can_schedule_block_num_threshold in
ResourceManagerV1. The original formula using need_prefill_tokens
could lead to incorrect threshold values. Now directly use
num_chunk_new_block for accurate block scheduling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 11:53:27 +08:00
Jiang-Jia-Jun db8e39f48a Set GPU flags for paddle in cache transfer manager (#6534) 2026-02-28 10:27:37 +08:00
Yuanle Liu 2b79d971f1 [Cherry-Pick][OP][Feature] Unify the limit_thinking_content_length CUDA operator, supporting reply length limits and injected sequences (#6506)
* Initial plan

* feat: migrate core PR6493 changes to release 2.4

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* fix ci

* fix ci

* fix ci

* fix ci

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>
2026-02-25 18:02:01 -08:00
kevin 2bd6263f82 fix mtp acceptance rate decline (#6469) 2026-02-12 19:56:36 +08:00
sunxin 88fd9bac27 [Cherry-Pick 2.4][BugFix] Fix get_padding_offset in empty run (#6461)
* fix empty get_padding_offset

* fix mtp padding
2026-02-11 20:19:56 +08:00
chen 436846e89c check (#6376) 2026-02-10 19:13:33 +08:00
GoldPancake ef316a6080 [Cherry-Pick][BugFix] Fix rebuild padding bug (#6422) (#6414) 2026-02-09 21:49:22 -08:00
RAM 041df5d8c0 [RL] R3 Fix the bug for determining the end of a request (#6388)
* 1.move put routing to postprocess 2.extend async put task queue

* fix speculate eos token bug

* delete code

* delete code

* refine code

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-02-10 13:36:01 +08:00
kevin 3cecb83765 [Cherry-Pick][Feature] consider multimodal model when dummy run(#6045) (#6349)
* add mm profile

* update code

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-02-10 12:59:02 +08:00
Yonghua Li 4e49b890f6 [fix] fix prefix_cache_status_signal usage (#6378)
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-02-09 21:39:10 +08:00
Copilot 7a7f54b354 [Cherry-Pick][Metrics] Support cpu-cache-block-num (#6390) (#6391)
* Initial plan

* [Metrics] Support cpu-cache-block-num

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: root <root@szzj-bcc-offline-1487319.szzj.baidu.com>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-02-09 21:25:33 +08:00
YuBaoku 603aab0489 [CI] Fix import issue caused by flash_mask in shared CI env (#6406) 2026-02-09 18:59:21 +08:00
周周周 ee6f02e4c6 commit (#6386)
Co-authored-by: K11OntheBoat <ruianmaidanglao@163.com>
2026-02-07 12:10:00 +08:00
luukunn 5ed439cf88 [Cherry-Pick]FD statistical#5646 (#6277)
* cherry FD statistical

* add envs

* fix unit test
2026-02-06 14:52:51 +08:00
YuBaoku 252a7945e0 [Cherry-Pick][CI] Update stable_test/ce_job/run.sh script and CI_TEAM_MEMBERS(#6352) (#6371) 2026-02-06 10:41:06 +08:00
Yonghua Li 1d1194e15b [BugFix] fix cache manager hang when clearing prefix cache (#6239)
* [fix] fix cache manager hang when clearing prefix cache

* [fix] fix list_proxy has no clear method

* [fix] fix barrier

* [fix] add barrier0

* [fix] add cache_task_is_paused_signal

* [fix] fix condition

* [fix] fix cache transfer sync and delay prefix cache tree clearing

* [fix] fix typo

* [chore] polish code

* [fix] revert only rank0 write kv_cache_status_signal

* [fix] fix thread pool and prefix cache manager hang

* [fix] add timeout for task_swapping_event

* [fix] tolerate prefix cache manager error while prefix tree is cleared

* [chore] add more log
2026-02-05 21:24:42 +08:00
kevin ca99a6a005 add reset dummy run when update weight (#6329) 2026-02-05 21:00:30 +08:00
K11OntheBoat f649af9f43 [Cherry-Pick]Support Norm before Rope(#6332) (#6369)
* Support norm before rope

* [Optimization] Accelerate Qwen3 QK RMSNorm via Fused Triton Kernel (#5880)

* qk rmsnorm fused

* inplace

* glm

* fix

* add qknorm layer

* fix

* update

* fix qwen3 vl

* update rl baseline

* fix qwen3 vl moe

* test

* fix qwen vl moe rl

* fix

* fix opt qknorm (#6080)

* Support Norm before Rope

* remove extra file

---------

Co-authored-by: K11OntheBoat <ruianmaidanglao@163.com>
Co-authored-by: sunxin <68891411+Sunny-bot1@users.noreply.github.com>
2026-02-05 20:28:32 +08:00