Commit Graph

5000 Commits

Author SHA1 Message Date
copilot-swe-agent[bot] 46e14f88f9 Merge origin/release/2.6 and resolve worker_process conflict
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-04-16 11:01:28 +00:00
YuBaoku 72ce56b10b [BugFix] fix tool call parser (#7369) (#7419)
* fix tool call parser

* add unit test

* fix unit test

* add unit test

Co-authored-by: luukunn <981429396@qq.com>
2026-04-16 17:15:03 +08:00
jc b8e8a6253f PD deployment support without router (#7412) (#7424) 2026-04-16 14:02:10 +08:00
GoldPancake 26674bbbb6 [Cherry-Pick][RL] Add clear_graph_opt_backend for glm4_mtp (#7378) (#7379)
* add clear_grpah func

* fix spell
2026-04-15 19:45:09 +08:00
Bingoo 61bfe6e5b3 modify flashmask version (#7414) 2026-04-15 18:19:21 +08:00
chen 2ee1cc3d0a check init_flash_attn_version log (#7401) 2026-04-15 11:05:20 +08:00
sunxin 5f7524eb85 fix rl moe gate type (#7394) 2026-04-14 20:04:09 +08:00
freeliuzc f6c066fb9d Revert "[Optimization] Optimize ttft for prefill pd (#6680)" (#7386)
* Revert "[Optimization] Optimize ttft for prefill pd (#6680)"

This reverts commit 6727df8286.

* fix revert pr
2026-04-14 20:01:39 +08:00
YuBaoku 8a8beca548 [BugFix][PD Disaggregation][KVCache] Fix low cache hit rate in PD split scenario (#7364) (#7387)
## Motivation

In the PD-disaggregated scenario, after the decode node receives a request forwarded by the prefill node, it does not promptly update the cache block hit information,
which leads to a low prefix cache hit rate and degrades inference performance.

## Modifications

1. In the `_free_blocks_when_stop` method, additionally exclude the prefill node (`splitwise_role == "prefill"`)
   from cache block updates, so the prefill node does not update the cache twice and corrupt its state.
2. After the decode node successfully allocates a request (`_alloc_requests_with_cache`), proactively call
   `update_cache_blocks` with `need_prefill_tokens` to update the cache block information,
   ensuring the decode node correctly registers the already-hit prefix cache.

Co-authored-by: kevin <chengyf112@gmail.com>
2026-04-14 19:25:12 +08:00
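The two modifications described in the commit above can be sketched in Python. Everything below except the method names `_free_blocks_when_stop`, `_alloc_requests_with_cache`, `update_cache_blocks`, the `splitwise_role` field, and the `need_prefill_tokens` count (all taken from the commit message) is an assumption about the scheduler's shape, not FastDeploy's actual code.

```python
# Hypothetical sketch of the two fixes from the commit message above.
# Class layout and helper bodies are assumptions; only the method names,
# splitwise_role, and need_prefill_tokens come from the commit.

class CacheScheduler:
    def __init__(self, splitwise_role):
        self.splitwise_role = splitwise_role  # "prefill" | "decode" | "mixed"
        self.hit_info = {}  # request_id -> number of tokens known to be cached

    def update_cache_blocks(self, request_id, num_tokens):
        # Record that the first num_tokens of this request hit the cache.
        self.hit_info[request_id] = num_tokens

    def _free_blocks_when_stop(self, request_id):
        # Fix 1: the prefill node must not touch cache-hit bookkeeping here,
        # otherwise it updates the cache twice and corrupts its state.
        if self.splitwise_role == "prefill":
            return
        self.hit_info.pop(request_id, None)

    def _alloc_requests_with_cache(self, request_id, need_prefill_tokens):
        allocated = True  # actual block allocation is elided in this sketch
        if allocated and self.splitwise_role == "decode":
            # Fix 2: after a successful allocation on the decode node,
            # proactively record the prefix tokens forwarded by prefill,
            # so subsequent requests see the correct prefix-cache hits.
            self.update_cache_blocks(request_id, need_prefill_tokens)
        return allocated
```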
lonelygsh e7c8dc2fe9 [Speculate Decoding] Fix step_idx semantics in limit_thinking and set_stop_value kernels (#7370)
- speculate_limit_thinking_content_length: update current_base_step to
  step_idx+1 (step_idx now records history count before current round);
  remove incorrect step_idx decrement on accept_num truncation; mark
  step_idx param as const.
- speculate_set_stop_value_multi_seqs: fix can_stop gate to use
  step_idx_now+accept_num>=min_token_limit; fix skip check and pre_ids_idx
  formula (remove stale -accept_num offset); use <= condition so accept_idx
  maps directly to the accepted token that ends the stop sequence; fix
  accept_tokens index (remove -1).
- Update unit tests for speculate_set_stop_value_multi_seqs kernel.
2026-04-14 12:54:22 +08:00
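The corrected semantics above can be restated as a small Python model. This is an illustrative sketch of the rules named in the commit message, not the CUDA kernels themselves; the standalone functions below are hypothetical.

```python
# Illustrative model of the corrected stop logic; the real implementation is
# a CUDA kernel, and these free functions are hypothetical restatements.

def can_stop(step_idx_now: int, accept_num: int, min_token_limit: int) -> bool:
    # Fixed gate from the commit: stopping is allowed only once the history
    # token count plus this round's accepted draft tokens reaches the limit.
    return step_idx_now + accept_num >= min_token_limit

def next_base_step(step_idx: int) -> int:
    # Fixed semantics: step_idx records the history count *before* the
    # current round, so the next round's current_base_step is step_idx + 1.
    return step_idx + 1
```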
chen 144dc17b14 update attn_mask_q 2 (#7373) 2026-04-13 23:06:16 +08:00
JYChen 9823d63220 remove fa4 requirements (#7354) 2026-04-13 19:24:24 +08:00
chenjian d9a008f3c8 [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1 (#7159) (#7351)
* [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1

* [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1

* fix
2026-04-13 15:24:01 +08:00
sunxin b2997f3aad fix overlap mtp empty run (#7314) 2026-04-13 15:20:11 +08:00
liuruyan 9cb82d79a0 [Cherry-Pick][TI-consistent] support quant use pow2scale (#7308) (#7310)
* support quant use pow2scale

* fix

* fix
2026-04-13 00:02:08 -07:00
Jiang-Jia-Jun 6ee354f2c8 Update worker_process.py 2026-04-12 06:03:21 +00:00
Jiang-Jia-Jun 19b3b203d5 Update envs.py 2026-04-12 06:03:21 +00:00
jiang-jia-jun 63eaccd6c2 [Optim] Remove IPCLock between CacheManager and WorkerProcess 2026-04-12 06:03:21 +00:00
YuBaoku 9e8ea7db14 [Cherry-Pick][CI] Sync dev optimizations to 2.6 (#7335) (#7343) 2026-04-12 13:22:52 +08:00
chen 7446665676 [Cherry-Pick][RL] moe bf16 ep support paddle batch_gemm (#7337) (#7339)
* moe bf16 ep support paddle batch_gemm
2026-04-11 21:51:26 +08:00
JYChen 42b0f59b9e [Cherry-Pick][RL] change glm rope_emb calculation #7316 (#7318)
* change glm rope_emb calculation

* glm without EnforceFmulRN

* fix ci
2026-04-11 18:38:37 +08:00
YuBaoku 65c6e726f5 [Cherry-Pick][Docs] Update Release Note(#7302) (#7341) 2026-04-11 16:48:06 +08:00
YuBaoku 2ac9b89409 [XPU][CI] Update xtdk version in download_dependencies.sh (#7320) (#7322)
Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
2026-04-11 00:27:54 +08:00
GoldPancake c7560383ab [Cherry-Pick][FDConfig] Auto-scale CUDA Graph Capture & CLI Quantization Params + CUDAGraph Validation (#7215,#7281) (#7301)
* refactor cudagraph args

* refactor quant cli param

* fix

* fix

* tmp skip xpu

* fix
2026-04-10 16:10:31 +08:00
zhangbo9674 4f36346e14 [Cherry-Pick] change rms norm for glm #7269 (#7276)
* fix

* refine code

* refine code

* refine code

* refine code

* refine code
2026-04-10 01:03:00 -07:00
YuBaoku dd0863b076 [BugFix] Fix Async D2H copy bug & flash mask attention cache V out-of-bound bug (#7221) (#7296)
Co-authored-by: ming1753 <61511741+ming1753@users.noreply.github.com>
2026-04-10 13:54:02 +08:00
fxyfxy777 dea9d35171 [OP]Unify MoE op with moe_permute path for bf16 GLM (#7164) (#7279) 2026-04-09 21:37:42 +08:00
YuBaoku 921a0ae60b [Docs] Update docs for release/2.5 (#7267) (#7277)
* Update docs for release/2.5

* Update English docs for release/2.5

- Update README_EN.md: add v2.5 news entry, reformat v2.4 entry with release link
- Update docs/get_started/installation/nvidia_gpu.md:
  - Docker image: 2.4.0 -> 2.5.0, notice now shows SM80/86/89/90 support
  - paddlepaddle-gpu: 3.3.0 -> 3.3.1, add CUDA 12.9 alternatives
  - fastdeploy-gpu: 2.4.0 -> 2.5.0, unified arch install with CUDA 12.9 option
- Update docs/zh/get_started/installation/nvidia_gpu.md:
  - Fix remaining paddlepaddle-gpu==3.3.0 refs in sections 4&5 -> 3.3.1

Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/fa0be381-324e-4b0d-b7a6-e2c1fa12174f

* Clarify --extra-index-url usage in installation docs

Add note explaining that --extra-index-url is only for downloading
fastdeploy-gpu dependencies; fastdeploy-gpu itself must be installed
from the Paddle source specified by -i. Applied to both Chinese and
English nvidia_gpu.md installation guides.

Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/9fa8b3c9-7555-4eae-b9b9-026cddd7e74c

* Update nvidia_gpu.md

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
2026-04-09 21:03:19 +08:00
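The `--extra-index-url` clarification in the commit above can be illustrated with a pip invocation; the index URLs below are placeholders (assumptions), not the real Paddle package index from the docs.

```shell
# Sketch only: both index URLs are placeholders, not the real indexes.
# -i sets the primary index, which must host fastdeploy-gpu itself;
# --extra-index-url is consulted only for fastdeploy-gpu's dependencies.
python -m pip install fastdeploy-gpu==2.5.0 \
    -i https://example.com/paddle-package-index/ \
    --extra-index-url https://pypi.org/simple/
```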
Jiaxin Sui 6fcc25f3f6 Update ci_metax.yml (#7286) 2026-04-09 17:31:20 +08:00
Bingoo 849eb3df65 [Cherry-Pick][Optimization] merge matmul and add (#6986) (#7191)
* merge matmul and add

* modify format

* using paddle.nn.functional.linear

* using _C_ops.linear

* using paddle.nn.functional.linear

* add FLAGS_use_legacy_linear env var in test case

* fix format

* add assert and remove env

* modify format

* using matmul for no bias

* modify accurate baseline
2026-04-09 14:15:43 +08:00
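The series above folds a matmul followed by a bias add into a single linear call. As an assumption-level illustration (NumPy here, not the Paddle `paddle.nn.functional.linear` / `_C_ops.linear` path the commit actually uses), the two forms compute the same result:

```python
import numpy as np

def matmul_then_add(x, w, b):
    # Unfused form: two ops and an extra intermediate tensor.
    return np.matmul(x, w) + b

def fused_linear(x, w, b=None):
    # Stand-in for a fused linear kernel: one call computes x @ w (+ b).
    out = np.matmul(x, w)
    return out if b is None else out + b
```

The no-bias case matching the last bullet ("using matmul for no bias") is covered by passing `b=None`, which reduces to a plain matmul.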
YuBaoku 098dd2c251 [XPU][CI] lock xvllm version for fix bug (#7264) (#7266)
* Remove duplicate NICs from environment variables

* Update version for xvllm in download_dependencies.sh

Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
2026-04-09 12:46:13 +08:00
xiaoxiaohehe001 5fd8020363 [Cherry-Pick][BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn (#7216) 2026-04-09 11:05:43 +08:00
JYChen 9c65655cb3 [Cherry-Pick][RL] support moe-topk use topk_reduce_func #7218 (#7256)
* support moe-topk use topk_reduce_func

* fix ep error

* fix ut

* fix ut
2026-04-09 11:01:10 +08:00
Bingoo 01818844b4 support moe for sm103 (#7240) 2026-04-08 20:56:23 +08:00
YuBaoku 84d62712c9 [Feature] distinguish whl version (#7204) (#7224)
* [Feature]whl version

* [Feature]whl version,set root_is_pure = false

* [Feature]code style

Co-authored-by: ChowMingSing <610208940@qq.com>
2026-04-08 17:32:38 +08:00
YuBaoku 6b78981dde Split enable_mm (#7183) (#7233)
Co-authored-by: K11OntheBoat <ruianmaidanglao@163.com>
Co-authored-by: liuruian <liuruian@MacBook-Pro.local>
2026-04-08 16:32:04 +08:00
GoldPancake 403ce139c7 remove arctic_inference deps (#7236) 2026-04-08 15:25:21 +08:00
huicongyao 36909bf27d [Cherry-Pick][BugFix] fix MTP bugs in TP and overlap(#7172) (#7192)
* fix MTP bugs in TP and overlap

* fix
2026-04-08 10:24:38 +08:00
YuBaoku 7ab48c4760 [Cherry-Pick][CI] Use GPU-Build-RL runner for _build_linux_rl.yml (#7186) (#7195) 2026-04-03 20:55:53 +08:00
Yonghua Li 55dbc83310 [Cherry-Pick][BugFix] prevent requests from entering running state without a slot(#7141) (#7181)
* [BugFix] Set MC_MAX_MR_SIZE to avoid register hang (#7163)

* Set MC_MAX_MR_SIZE to avoid register hang

* up

* [fix] prevent requests from entering running state without a slot

* [fix] count abort set

* [fix] count preempted task in waiting list

---------

Co-authored-by: jc <52520497+juncaipeng@users.noreply.github.com>
2026-04-03 17:46:13 +08:00
Jiang-Jia-Jun b24765a746 Update setup.py 2026-04-03 11:29:22 +08:00
jackyYang6 e3aed6de2f fix OOM bug, optimize async weight loading, and update read step via yaml (#7171) 2026-04-03 11:05:24 +08:00
jc 1cc0cf23c2 [BugFix] Set MC_MAX_MR_SIZE by default to avoid register hang (#7161)
* Set MC_MAX_MR_SIZE to avoid register hang

* Set MC_MAX_MR_SIZE to avoid register hang
2026-04-03 10:51:15 +08:00
chenjian 2632e6cf32 [Feature] Support chunk prefill disabled in scheduler v1 (#7152) 2026-04-03 10:18:14 +08:00
luukunn 562fa31791 [BugFix] fix extract_tool_calls (#7154)
* fix extract_tool_calls
2026-04-02 21:18:37 +08:00
Yonghua Li 98f3fc9267 [RL] [KVCache] let cache transfer managers update key prefix after weight update and add unit tests (#7083)
* [test] add a few unit tests

* [feat] update key prefix when model weights are updated

* [test] try to fix test_worker_process
2026-04-02 19:58:41 +08:00
fxyfxy777 9f3b3ce7f5 [Optimization] merge_allreduce (#7039) 2026-04-02 19:52:13 +08:00
bukejiyu f142b486c9 update (#7101) 2026-04-02 16:07:26 +08:00
Longzhi Wang 938e7dd881 [Other] support video_fps args for video bench (#7077) 2026-04-02 10:40:15 +08:00
YuBaoku 7aa213bba9 [CI] Replace ipc=host with shm-size and sysctl configuration (#7138) 2026-04-02 10:33:55 +08:00