Cherry-pick from release/2.5: the original assertion only checked
`prefill_kvcache_block_num >= max_block_num_per_seq`, but for
encoder-decoder models the kvcache must also reserve blocks for the
encoder side (`enc_dec_block_num`). Without this check, the service
could silently allocate insufficient blocks for enc-dec sequences.
- `CacheConfig.postprocess`: tighten the assertion to
`prefill_kvcache_block_num >= max_block_num_per_seq + enc_dec_block_num`;
the error message guides the user to reduce `max_model_len` or increase
`num_gpu_blocks_override`
- `CacheConfig.reset`: the same tightened check; the error message guides
the user to reduce `max_model_len` or switch to GPUs with more memory
(the override is not applicable here)
No change to the launch command. If the assertion fires, adjust one of the following:
```bash
python -m fastdeploy.entrypoints.openai.api_server \
--max-model-len <smaller_value> ...
python -m fastdeploy.entrypoints.openai.api_server \
--num-gpu-blocks-override <larger_value> ...
```
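A minimal Python sketch of the tightened check. The variable names come from this message (`prefill_kvcache_block_num`, `max_block_num_per_seq`, `enc_dec_block_num`), but the real check lives inside FastDeploy's `CacheConfig` and differs in detail:

```python
def check_prefill_blocks(prefill_kvcache_block_num: int,
                         max_block_num_per_seq: int,
                         enc_dec_block_num: int) -> None:
    """Raise if the prefill kvcache cannot hold one sequence plus its
    encoder-side blocks (the enc-dec case the original assertion missed)."""
    required = max_block_num_per_seq + enc_dec_block_num
    if prefill_kvcache_block_num < required:
        raise AssertionError(
            f"prefill_kvcache_block_num ({prefill_kvcache_block_num}) < "
            f"max_block_num_per_seq + enc_dec_block_num ({required}); "
            "reduce max_model_len or increase num_gpu_blocks_override"
        )
```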
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* [RL][BugFix] Fix incorrect rank in IPC snapshot model path
## Motivation
During elastic recovery, each rank should load its own model shard.
The hardcoded `tp0` caused all ranks to load rank-0's shard, leading
to incorrect weight initialization in multi-process scenarios.
## Modifications
- Replace hardcoded `tp0` with `paddle.distributed.get_rank()` in both
the primary model path and the fallback `/shared_ipc_meta/` path
inside `_update_ipc_snapshot`, so each rank correctly loads its own
shard during elastic recovery.
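An illustrative sketch of the path fix. The real code calls `paddle.distributed.get_rank()` inside `_update_ipc_snapshot`; here the rank is passed in so the path logic can be shown without a Paddle runtime, and the file-name pattern is an assumption:

```python
def ipc_snapshot_path(base_dir: str, model_id: str, rank: int) -> str:
    # Before the fix the shard name was hardcoded to "tp0", so every rank
    # loaded rank-0's weights; using the caller's rank selects its own shard.
    return f"{base_dir}/tp{rank}.{model_id}"
```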
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [RL][BugFix] Fix incorrect rank in IPC snapshot model path
## Motivation
During elastic recovery, each rank should load its own model shard.
The hardcoded `tp0` caused all ranks to load rank-0's shard, leading
to incorrect weight initialization in multi-process scenarios.
## Modifications
- Replace hardcoded `tp0` with `paddle.distributed.get_rank()` in both
the primary model path and the fallback `/shared_ipc_meta/` path
inside `_update_ipc_snapshot`, so each rank correctly loads its own
shard during elastic recovery.
Co-Authored-By: lishuaihui <lishuaihui@baidu.com>
* [RL] Support chunked part-file loading in IPC snapshot to reduce memory spike
Refactor `_update_ipc_snapshot` with a 4-level loading priority:
1. Chunked part files (with `gc.collect()` per part to reduce peak memory)
2. Single full pdparams file (new naming: `tp{rank}.{id}`)
3. Legacy format (`tp0{id}`)
4. Shared fallback directory (`/shared_ipc_meta/`)
Add unit tests covering all priority branches and error path.
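A minimal sketch of the 4-level priority described above. The directory layout, file naming, and the injected `load` callable are assumptions for illustration, not FastDeploy's actual API:

```python
import gc
import os
import re

def load_snapshot(snapshot_dir: str, rank: int, model_id: str, load):
    """Resolve and load a rank's snapshot following the 4-level priority."""
    # 1. Chunked part files: load one part at a time, collecting garbage
    #    between parts so peak memory stays near a single part's size.
    part_re = re.compile(rf"tp{rank}\.{re.escape(model_id)}\.part\d+\.pdparams$")
    parts = sorted(
        (f for f in os.listdir(snapshot_dir) if part_re.search(f)),
        key=lambda f: int(re.search(r"\.part(\d+)\.", f).group(1)),
    )
    if parts:
        state = {}
        for part in parts:
            state.update(load(os.path.join(snapshot_dir, part)))
            gc.collect()  # reduce the memory spike between parts
        return state
    # 2. Single full file with the new per-rank naming tp{rank}.{id}
    full = os.path.join(snapshot_dir, f"tp{rank}.{model_id}.pdparams")
    if os.path.exists(full):
        return load(full)
    # 3. Legacy naming tp0{id}
    legacy = os.path.join(snapshot_dir, f"tp0{model_id}.pdparams")
    if os.path.exists(legacy):
        return load(legacy)
    # 4. Shared fallback directory
    shared = os.path.join("/shared_ipc_meta", f"tp{rank}.{model_id}.pdparams")
    if os.path.exists(shared):
        return load(shared)
    raise FileNotFoundError(f"no snapshot for rank {rank} in {snapshot_dir}")
```

Sorting parts by their parsed index (rather than lexicographically) keeps `part10` from loading before `part2`.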
Co-Authored-By: lishuaihui <lishuaihui@baidu.com>
* [RL] Fix snapshot part index and add validation for part file naming
- Parse the part index from the filename instead of using the `enumerate`
index, keeping logs and `src_type` consistent with the actual file naming.
- Add validation for the part-file naming pattern; skip and warn on
files that do not match the expected `.partN.` convention.
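A sketch of the filename parsing and validation. The `.partN.` convention comes from this commit; the exact regex and the skip-and-warn handling are assumptions:

```python
import re

# Matches the ".partN." segment embedded in a part-file name.
_PART_RE = re.compile(r"\.part(\d+)\.")

def parse_part_index(filename: str):
    """Return the part index embedded in the filename, or None when the
    name does not match the .partN. convention (caller skips and warns)."""
    m = _PART_RE.search(filename)
    return int(m.group(1)) if m else None
```

Using the parsed index rather than an `enumerate` counter keeps log lines correct even when the directory listing is unordered or has gaps in the part numbering.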
Co-Authored-By: wikilsh <wiki_hui@qq.com>
---------
Co-authored-by: zhangjie83 <zhangjie83@baidu.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Update PaddlePaddle installation command in script
* Update PaddlePaddle installation in run_xpu_ci_pytest.sh
Replace PaddlePaddle installation method with a direct wheel file.
When `is_dummy_run=True`, calling `empty_input_forward` can cause
unexpected behavior. Add an `and not is_dummy_run` guard to both the
`_propose_cuda` and `_propose_xpu` paths.
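A toy illustration of the guard. The names (`empty_input_forward`, `is_dummy_run`) follow this message, but the surrounding function is hypothetical, standing in for the `_propose_cuda`/`_propose_xpu` call sites:

```python
def propose(model, is_dummy_run: bool, has_empty_input: bool):
    # Skip empty_input_forward entirely during dummy runs: with
    # is_dummy_run=True the call can misbehave, so both proposer
    # paths carry the same "and not is_dummy_run" guard.
    if has_empty_input and not is_dummy_run:
        return model.empty_input_forward()
    return None
```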
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix mtp acceptance rate decline
* [BugFix] Fix AttributeError in recycle_gpu_blocks when prefix_tree_status_signal not initialized
- Add a `hasattr` check before accessing `prefix_tree_status_signal`
- The signal is only initialized in `launch_cache_messager`, not in `__init__`
- Fixes a CI test failure in `test_prefix_cache_manager.py`
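A minimal illustration of the `hasattr` guard. The real class is FastDeploy's cache manager; this stub only shows the attribute check, with the signal value and recycle body invented for the example:

```python
class CacheManagerSketch:
    # prefix_tree_status_signal is deliberately NOT set in __init__;
    # in the real code it is created only by launch_cache_messager.
    def recycle_gpu_blocks(self, blocks):
        # Guard: the signal attribute may not exist yet if the messager
        # was never launched, so check before touching it.
        if hasattr(self, "prefix_tree_status_signal"):
            self.prefix_tree_status_signal = "recycling"
        return list(blocks)
```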
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [BugFix] Reset prefix cache when model weights are updating
- Call `self.reset()` before setting the status to `NORMAL` in the `UPDATING` state
- Ensure cache consistency when model weights change
- Consistent with the `CLEARING` state handling
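A hedged sketch of the state handling. The state names (`NORMAL`, `UPDATING`, `CLEARING`) and `reset` come from this message; the enum, method names, and cache representation are assumptions:

```python
from enum import Enum, auto

class Status(Enum):
    NORMAL = auto()
    UPDATING = auto()
    CLEARING = auto()

class PrefixCacheSketch:
    def __init__(self):
        self.status = Status.NORMAL
        self.entries = {"stale": 1}  # cache built from old weights

    def reset(self):
        self.entries.clear()

    def on_signal(self):
        if self.status in (Status.UPDATING, Status.CLEARING):
            # Reset BEFORE returning to NORMAL so entries built from the
            # previous weights can never serve new requests.
            self.reset()
            self.status = Status.NORMAL
```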
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* fix mtp acceptance rate decline
* [BugFix][Scheduler] Fix can_schedule_block_num_threshold calculation
Fix the calculation of `can_schedule_block_num_threshold` in
`ResourceManagerV1`. The original formula, based on
`need_prefill_tokens`, could produce incorrect threshold values; the
threshold now uses `num_chunk_new_block` directly for accurate block
scheduling.
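A rough sketch of the corrected decision. Only the variable name `num_chunk_new_block` comes from this message; the comparison shown here is an assumed simplification of the scheduler's actual logic:

```python
def can_schedule(available_block_num: int, num_chunk_new_block: int) -> bool:
    # Fixed behavior: compare free blocks against the blocks this chunk
    # actually needs, rather than a threshold derived from
    # need_prefill_tokens.
    return available_block_num >= num_chunk_new_block
```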
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>