Commit Graph

5010 Commits

Author SHA1 Message Date
qwes5s5 9c91ecb1ec [Cherry-Pick][BugFix] Fix bugs in /v1/abort_requests interface from PR(#6992) (#7176) (#7551)
* abort api bug fix

* bug fix

* bug fix
2026-04-22 15:49:51 +08:00
GoldPancake 2961400190 [Cherry-Pick][BugFix] Fix clear_parameters hang issue in MTP during weight cleanup in RL (#7522) (#7523)
* fix mtp clear graph bugs in rl
2026-04-22 15:24:10 +08:00
Jiang-Jia-Jun b0fde163a6 Enable output caching by default 2026-04-22 11:01:54 +08:00
Jiang-Jia-Jun 86df2a9e86 Update args_utils.py (#7549) 2026-04-22 10:59:52 +08:00
jc d5518463ce Mooncake storage register local buffer by chunk (#7416) (#7540) 2026-04-22 10:46:57 +08:00
YuBaoku 13034ef0ca [BugFix] Fix skip_x_record_stream incompatibility across deep_ep versions (#7542) (#7546)
* fix skip_x_record_stream

* fix

* optim

Co-authored-by: Yuanle Liu <yuanlehome@163.com>
2026-04-21 06:31:45 -07:00
chen be2fd17e7d add m_grouped_bf16_gemm_nn_contiguous(#7536) 2026-04-21 20:20:03 +08:00
RAM 74ddb20a73 [RL][Cherry-Pick] Fix the out-of-bounds issue caused by int32 in the R3 kernel (#7496)
* [RL]Perf: Optimize batch delete prefix and fused put in R3 (#6604)

* Optimize batch delete and fused put

* refine code

* refine code

* refine code

* Support suspend r3

* [RL] Fix R3 Empty bug with TP=1 (#6777)

* Fix int32 overflow

* refine code

* fix seq_lens_decoder bug

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-04-21 01:51:45 -07:00
zhouchong 95261f098b Unify num_experts_per_tok to moe_k in ModelConfig for MoE model compatibility (#7517) 2026-04-21 15:21:47 +08:00
YuBaoku f4f7760925 [CI] Temporarily pin paddlepaddle-gpu to 3.5.0.dev20260417 (#7486) (#7519) 2026-04-20 21:09:21 +08:00
jackyYang6 fc801f8387 [Bugfix][RL] fix control request timeout in async update weights pipeline (#7470) 2026-04-20 11:23:44 +08:00
freeliuzc 56b761de3f [Cherry-Pick][Speculative Decoding][BugFix] Fix apply repeat times penalty kernel and change spec default verify strategy(#7467) (#7468)
* fix repeat_time kernel and change default spec verify strategy

* fix unit_test
2026-04-18 00:07:34 +08:00
GoldPancake 650d1e49aa [Cherry-Pick][Speculative Decoding] Add MTP logprob support for PD disaggregation (#7442) (#7464)
* support mtp logprob in pd

* fix

* fix

* fix

* fix xpu bugs
2026-04-17 21:37:42 +08:00
freeliuzc 185708b566 [Cherry-Pick][BugFix] Fix real token exceeding max_batched_tokens limit(#7438) (#7439)
* fix max_num_batched_tokens error compute

* add temporary solution

* fix bug
2026-04-17 16:17:59 +08:00
YuBaoku 72ce56b10b [BugFix] fix tool call parser (#7369) (#7419)
* fix tool call parser

* add unit test

* fix unit test

* add unit test

Co-authored-by: luukunn <981429396@qq.com>
2026-04-16 17:15:03 +08:00
jc b8e8a6253f PD deployment support without router (#7412) (#7424) 2026-04-16 14:02:10 +08:00
GoldPancake 26674bbbb6 [Cherry-Pick][RL] Add clear_graph_opt_backend for glm4_mtp (#7378) (#7379)
* add clear_graph func

* fix spell
2026-04-15 19:45:09 +08:00
Bingoo 61bfe6e5b3 modify flashmask version (#7414) 2026-04-15 18:19:21 +08:00
chen 2ee1cc3d0a check init_flash_attn_version log (#7401) 2026-04-15 11:05:20 +08:00
sunxin 5f7524eb85 fix rl moe gate type (#7394) 2026-04-14 20:04:09 +08:00
freeliuzc f6c066fb9d Revert "[Optimization] Optimize ttft for prefill pd (#6680)" (#7386)
* Revert "[Optimization] Optimize ttft for prefill pd (#6680)"

This reverts commit 6727df8286.

* fix revert pr
2026-04-14 20:01:39 +08:00
YuBaoku 8a8beca548 [BugFix][PD Disaggregation][KVCache] Fix low cache hit rate in PD split scenario (#7364) (#7387)
## Motivation

In the PD-disaggregated scenario, after the decode node receives a request forwarded by the prefill node, it does not promptly update the cache-block hit information, resulting in a low prefix-cache hit rate and degraded inference performance.

## Modifications

1. In the `_free_blocks_when_stop` method, additionally exclude the prefill node (`splitwise_role == "prefill"`) from cache-block updates, so the prefill node does not update the cache twice and corrupt its state.
2. After the decode node successfully allocates a request (`_alloc_requests_with_cache`), proactively call `update_cache_blocks` with `need_prefill_tokens` to update the cache-block information, ensuring the decode node correctly tracks the prefix-cache hits.

Co-authored-by: kevin <chengyf112@gmail.com>
2026-04-14 19:25:12 +08:00
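The two modifications in the commit above can be sketched as follows. This is a hypothetical simplification, not FastDeploy's actual scheduler: the `CacheManager` class and its bookkeeping are stand-ins, and only the method and field names (`update_cache_blocks`, `splitwise_role`, `need_prefill_tokens`) are taken from the commit message.

```python
# Hypothetical sketch of the decode-side cache fix described above.
# CacheManager and its block bookkeeping are simplified stand-ins,
# not FastDeploy's real classes.

class CacheManager:
    def __init__(self, splitwise_role: str):
        self.splitwise_role = splitwise_role  # "prefill" or "decode"
        self.cached_tokens = {}  # request_id -> tokens recorded as cached

    def update_cache_blocks(self, request_id: str, num_tokens: int) -> None:
        # Record how many prefix tokens of this request are now in cache.
        self.cached_tokens[request_id] = num_tokens

    def free_blocks_when_stop(self, request_id: str) -> None:
        # Fix 1: the prefill node must not touch cache-hit bookkeeping
        # here, otherwise it updates the cache twice and corrupts state.
        if self.splitwise_role == "prefill":
            return
        self.cached_tokens.pop(request_id, None)

    def alloc_request_with_cache(self, request_id: str,
                                 need_prefill_tokens: int) -> bool:
        # ... block allocation would happen here ...
        # Fix 2: after a successful allocation on the decode node,
        # proactively record the prefix-cache hit information.
        self.update_cache_blocks(request_id, need_prefill_tokens)
        return True

decode = CacheManager(splitwise_role="decode")
decode.alloc_request_with_cache("req-0", need_prefill_tokens=128)
print(decode.cached_tokens["req-0"])  # 128
```

With this guard in place, only the decode node records hits, which is what restores the prefix-cache hit rate described in the motivation.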
lonelygsh e7c8dc2fe9 [Speculate Decoding] Fix step_idx semantics in limit_thinking and set_stop_value kernels (#7370)
- speculate_limit_thinking_content_length: update current_base_step to
  step_idx+1 (step_idx now records history count before current round);
  remove incorrect step_idx decrement on accept_num truncation; mark
  step_idx param as const.
- speculate_set_stop_value_multi_seqs: fix can_stop gate to use
  step_idx_now+accept_num>=min_token_limit; fix skip check and pre_ids_idx
  formula (remove stale -accept_num offset); use <= condition so accept_idx
  maps directly to the accepted token that ends the stop sequence; fix
  accept_tokens index (remove -1).
- Update unit tests for speculate_set_stop_value_multi_seqs kernel.
2026-04-14 12:54:22 +08:00
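The corrected `can_stop` gate from the commit above can be written out as a small check. The real logic lives in the `speculate_set_stop_value_multi_seqs` CUDA kernel; this stand-alone Python function is only a sketch of the condition stated in the message, with simplified scalar inputs.

```python
# Sketch of the fixed can_stop gate in speculate_set_stop_value_multi_seqs.
# step_idx_now counts history tokens before the current round (per the
# new step_idx semantics); the current round contributes accept_num newly
# accepted draft tokens. Stopping is allowed only once their sum reaches
# the minimum token limit.

def can_stop(step_idx_now: int, accept_num: int, min_token_limit: int) -> bool:
    return step_idx_now + accept_num >= min_token_limit

print(can_stop(step_idx_now=3, accept_num=2, min_token_limit=5))  # True
print(can_stop(step_idx_now=3, accept_num=1, min_token_limit=5))  # False
```

Including `accept_num` in the sum is the point of the fix: a sequence that crosses the limit mid-round via accepted draft tokens is now allowed to stop in that same round.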
chen 144dc17b14 update attn_mask_q 2 (#7373) 2026-04-13 23:06:16 +08:00
JYChen 9823d63220 remove fa4 requirements (#7354) 2026-04-13 19:24:24 +08:00
chenjian d9a008f3c8 [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1 (#7159) (#7351)
* [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1

* [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1

* fix
2026-04-13 15:24:01 +08:00
sunxin b2997f3aad fix overlap mtp empty run (#7314) 2026-04-13 15:20:11 +08:00
liuruyan 9cb82d79a0 [Cherry-Pick][TI-consistent] support quant use pow2scale(#7308) (#7310)
* support quant use pow2scale

* fix

* fix
2026-04-13 00:02:08 -07:00
YuBaoku 9e8ea7db14 [Cherry-Pick][CI] Sync dev optimizations to 2.6(#7335) (#7343) 2026-04-12 13:22:52 +08:00
chen 7446665676 [Cherry-Pick][RL]moe bf16 ep support paddle batch_gemm(#7337) (#7339)
* moe bf16 ep support paddle batch_gemm
2026-04-11 21:51:26 +08:00
JYChen 42b0f59b9e [Cherry-Pick][RL] change glm rope_emb calculation #7316 (#7318)
* change glm rope_emb calculation

* glm without EnforceFmulRN

* fix ci
2026-04-11 18:38:37 +08:00
YuBaoku 65c6e726f5 [Cherry-Pick][Docs] Update Release Note(#7302) (#7341) 2026-04-11 16:48:06 +08:00
YuBaoku 2ac9b89409 [XPU][CI]Update xtdk version in download_dependencies.sh (#7320) (#7322)
Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
2026-04-11 00:27:54 +08:00
GoldPancake c7560383ab [Cherry-Pick][FDConfig] Auto-scale CUDA Graph Capture & CLI Quantization Params + CUDAGraph Validation (#7215,#7281) (#7301)
* refactor cudagraph args

* refactor quant cli param

* fix

* fix

* tmp skip xpu

* fix
2026-04-10 16:10:31 +08:00
zhangbo9674 4f36346e14 [Cherry-Pick] change rms norm for glm #7269 (#7276)
* fix

* refine code

* refine code

* refine code

* refine code

* refine code
2026-04-10 01:03:00 -07:00
YuBaoku dd0863b076 [BugFix] Fix Async D2H copy bug & flash mask attn cache V out-of-bounds bug (#7221) (#7296)
Co-authored-by: ming1753 <61511741+ming1753@users.noreply.github.com>
2026-04-10 13:54:02 +08:00
fxyfxy777 dea9d35171 [OP]Unify MoE op with moe_permute path for bf16 GLM (#7164) (#7279) 2026-04-09 21:37:42 +08:00
YuBaoku 921a0ae60b [Docs] Update docs for release/2.5 (#7267) (#7277)
* Update docs for release/2.5

* Update English docs for release/2.5

- Update README_EN.md: add v2.5 news entry, reformat v2.4 entry with release link
- Update docs/get_started/installation/nvidia_gpu.md:
  - Docker image: 2.4.0 -> 2.5.0, notice now shows SM80/86/89/90 support
  - paddlepaddle-gpu: 3.3.0 -> 3.3.1, add CUDA 12.9 alternatives
  - fastdeploy-gpu: 2.4.0 -> 2.5.0, unified arch install with CUDA 12.9 option
- Update docs/zh/get_started/installation/nvidia_gpu.md:
  - Fix remaining paddlepaddle-gpu==3.3.0 refs in sections 4&5 -> 3.3.1

Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/fa0be381-324e-4b0d-b7a6-e2c1fa12174f

* Clarify --extra-index-url usage in installation docs

Add note explaining that --extra-index-url is only for downloading
fastdeploy-gpu dependencies; fastdeploy-gpu itself must be installed
from the Paddle source specified by -i. Applied to both Chinese and
English nvidia_gpu.md installation guides.

Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/9fa8b3c9-7555-4eae-b9b9-026cddd7e74c

* Update nvidia_gpu.md

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
2026-04-09 21:03:19 +08:00
Jiaxin Sui 6fcc25f3f6 Update ci_metax.yml (#7286) 2026-04-09 17:31:20 +08:00
Bingoo 849eb3df65 [Cherry-Pick][Optimization] merge matmul and add (#6986) (#7191)
* merge matmul and add

* modify format

* using paddle.nn.functional.linear

* using _C_ops.linear

* using paddle.nn.functional.linear

* add FLAGS_use_legacy_linear env var in test case

* fix format

* add assert and remove env

* modify format

* using matmul for no bias

* modify accurate baseline
2026-04-09 14:15:43 +08:00
YuBaoku 098dd2c251 [XPU][CI] lock xvllm version for fix bug (#7264) (#7266)
* Remove duplicate NICs from environment variables

* Update version for xvllm in download_dependencies.sh

Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
2026-04-09 12:46:13 +08:00
xiaoxiaohehe001 5fd8020363 [Cherry-Pick][BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn (#7216) 2026-04-09 11:05:43 +08:00
JYChen 9c65655cb3 [Cherry-Pick][RL] support moe-topk use topk_reduce_func #7218 (#7256)
* support moe-topk use topk_reduce_func

* fix ep error

* fix ut

* fix ut
2026-04-09 11:01:10 +08:00
Bingoo 01818844b4 support moe for sm103 (#7240) 2026-04-08 20:56:23 +08:00
YuBaoku 84d62712c9 [Feature]distinguish whl version (#7204) (#7224)
* [Feature]whl version

* [Feature]whl version,set root_is_pure = false

* [Feature]code style

Co-authored-by: ChowMingSing <610208940@qq.com>
2026-04-08 17:32:38 +08:00
YuBaoku 6b78981dde Split enable_mm (#7183) (#7233)
Co-authored-by: K11OntheBoat <ruianmaidanglao@163.com>
Co-authored-by: liuruian <liuruian@MacBook-Pro.local>
2026-04-08 16:32:04 +08:00
GoldPancake 403ce139c7 remove arctic_inference deps (#7236) 2026-04-08 15:25:21 +08:00
huicongyao 36909bf27d [Cherry-Pick][BugFix] fix MTP bugs in TP and overlap(#7172) (#7192)
* fix MTP bugs in TP and overlap

* fix
2026-04-08 10:24:38 +08:00
YuBaoku 7ab48c4760 [Cherry-Pick][CI] Use GPU-Build-RL runner for _build_linux_rl.yml (#7186) (#7195) 2026-04-03 20:55:53 +08:00
Yonghua Li 55dbc83310 [Cherry-Pick][BugFix] prevent requests from entering running state without a slot(#7141) (#7181)
* [BugFix] Set MC_MAX_MR_SIZE to avoid register hang (#7163)

* Set MC_MAX_MR_SIZE to avoid register hang

* up

* [fix] prevent requests from entering running state without a slot

* [fix] count abort set

* [fix] count preempted task in waiting list

---------

Co-authored-by: jc <52520497+juncaipeng@users.noreply.github.com>
2026-04-03 17:46:13 +08:00