qwes5s5
9c91ecb1ec
[Cherry-Pick][BugFix] Fix bugs in /v1/abort_requests interface from PR( #6992 ) ( #7176 ) ( #7551 )
...
* abort api bug fix
* bug fix
* bug fix
2026-04-22 15:49:51 +08:00
GoldPancake
2961400190
[Cherry-Pick][BugFix] Fix clear_parameters hang issue in MTP during weight cleanup in RL ( #7522 ) ( #7523 )
...
* fix mtp clear graph bugs in rl
2026-04-22 15:24:10 +08:00
Jiang-Jia-Jun
b0fde163a6
Enable output caching by default
2026-04-22 11:01:54 +08:00
Jiang-Jia-Jun
86df2a9e86
Update args_utils.py ( #7549 )
2026-04-22 10:59:52 +08:00
jc
d5518463ce
Mooncake storage register local buffer by chunk ( #7416 ) ( #7540 )
2026-04-22 10:46:57 +08:00
YuBaoku
13034ef0ca
[BugFix] Fix skip_x_record_stream incompatibility across deep_ep versions ( #7542 ) ( #7546 )
...
* fix skip_x_record_stream
* fix
* optim
Co-authored-by: Yuanle Liu <yuanlehome@163.com >
2026-04-21 06:31:45 -07:00
chen
be2fd17e7d
add m_grouped_bf16_gemm_nn_contiguous( #7536 )
2026-04-21 20:20:03 +08:00
RAM
74ddb20a73
[RL][Cherry-Pick] Fix the out-of-bounds issue caused by int32 in the R3 kernel ( #7496 )
...
* [RL]Perf: Optimize batch delete prefix and fused put in R3 (#6604 )
* Optimize batch delete and fused put
* refine code
* refine code
* refine code
* Support suspend r3
* [RL] Fix R3 Empty bug with TP=1 (#6777 )
* Fix int32 overflow
* refine code
* fix seq_lens_decoder bug
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2026-04-21 01:51:45 -07:00
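The int32 out-of-bounds fix above is a classic index-overflow bug: a flat offset computed as a product of batch, sequence, and hidden sizes can exceed INT32_MAX and wrap negative. A minimal NumPy illustration (not the actual R3 kernel code; the sizes and helper are hypothetical):

```python
import numpy as np

def flat_offset(batch_idx, seq_len, hidden_dim, dtype):
    """Compute batch_idx * seq_len * hidden_dim in the given integer dtype.

    NumPy integer arrays wrap silently on overflow, mimicking what happens
    to an int32 index inside a CUDA kernel.
    """
    out = np.array([batch_idx], dtype=dtype) * seq_len * hidden_dim
    return int(out[0])

# 300 * 8192 * 1024 = 2,516,582,400 > 2**31 - 1, so int32 wraps negative,
# producing an out-of-bounds (negative) index; int64 stays correct.
print(flat_offset(300, 8192, 1024, np.int32))  # wrapped past INT32_MAX: negative
print(flat_offset(300, 8192, 1024, np.int64))  # correct value
```

Widening the accumulator to int64 (as the fix does for the kernel's index math) removes the wraparound.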
zhouchong
95261f098b
Unify num_experts_per_tok to moe_k in ModelConfig for MoE model compatibility ( #7517 )
2026-04-21 15:21:47 +08:00
YuBaoku
f4f7760925
[CI] Temporarily pin paddlepaddle-gpu to 3.5.0.dev20260417 ( #7486 ) ( #7519 )
2026-04-20 21:09:21 +08:00
jackyYang6
fc801f8387
[Bugfix][RL] fix control request timeout in async update weights pipeline ( #7470 )
2026-04-20 11:23:44 +08:00
freeliuzc
56b761de3f
[Cherry-Pick][Speculative Decoding][BugFix] Fix apply repeat times penalty kernel and change spec default verify strategy( #7467 ) ( #7468 )
...
* fix repeat_time kernel and change default spec verify strategy
* fix unit_test
2026-04-18 00:07:34 +08:00
GoldPancake
650d1e49aa
[Cherry-Pick][Speculative Decoding] Add MTP logprob support for PD disaggregation ( #7442 ) ( #7464 )
...
* support mtp logprob in pd
* fix
* fix
* fix
* fix xpu bugs
2026-04-17 21:37:42 +08:00
freeliuzc
185708b566
[Cherry-Pick][BugFix] Fix real token exceeding max_batched_tokens limit( #7438 ) ( #7439 )
...
* fix incorrect max_num_batched_tokens computation
* add temporary solution
* fix bug
2026-04-17 16:17:59 +08:00
YuBaoku
72ce56b10b
[BugFix] fix tool call parser ( #7369 ) ( #7419 )
...
* fix tool call parser
* add unit test
* fix unit test
* add unit test
Co-authored-by: luukunn <981429396@qq.com >
2026-04-16 17:15:03 +08:00
jc
b8e8a6253f
PD deployment support without router ( #7412 ) ( #7424 )
2026-04-16 14:02:10 +08:00
GoldPancake
26674bbbb6
[Cherry-Pick][RL] Add clear_graph_opt_backend for glm4_mtp ( #7378 ) ( #7379 )
...
* add clear_graph func
* fix spell
2026-04-15 19:45:09 +08:00
Bingoo
61bfe6e5b3
modify flashmask version ( #7414 )
2026-04-15 18:19:21 +08:00
chen
2ee1cc3d0a
check init_flash_attn_version log ( #7401 )
2026-04-15 11:05:20 +08:00
sunxin
5f7524eb85
fix rl moe gate type ( #7394 )
2026-04-14 20:04:09 +08:00
freeliuzc
f6c066fb9d
Revert "[Optimization] Optimize ttft for prefill pd ( #6680 )" ( #7386 )
...
* Revert "[Optimization] Optimize ttft for prefill pd (#6680 )"
This reverts commit 6727df8286 .
* fix revert pr
2026-04-14 20:01:39 +08:00
YuBaoku
8a8beca548
[BugFix][PD Disaggregation][KVCache] Fix low cache hit rate in PD split scenario ( #7364 ) ( #7387 )
...
## Motivation
In the PD-disaggregated scenario, after the decode node receives a request forwarded by the prefill node, it does not promptly update the cache-block hit information, so the prefix-cache hit rate stays low and inference performance suffers.
## Modifications
1. In `_free_blocks_when_stop`, additionally exclude the prefill node (`splitwise_role == "prefill"`)
from the cache-block update, preventing the prefill node from updating the cache redundantly and corrupting its state.
2. After the decode node successfully allocates a request (`_alloc_requests_with_cache`), proactively call
`update_cache_blocks` with `need_prefill_tokens` to update the cache-block information,
so the decode node correctly recognizes which prefix-cache blocks were hit.
Co-authored-by: kevin <chengyf112@gmail.com >
2026-04-14 19:25:12 +08:00
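The decode-side fix in this commit amounts to recording allocated blocks back into the prefix cache so later requests with a shared prefix can hit them. A toy sketch of that idea (the class, block size, and method names are illustrative, not FastDeploy's real API; only `update` plays the role of the `update_cache_blocks` call added by the fix):

```python
class ToyPrefixCache:
    """Toy block cache keyed by token-ID prefixes, block_size tokens per block."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}  # prefix tuple -> block id

    def match(self, tokens):
        """Count how many leading full blocks of `tokens` are already cached."""
        hits = 0
        full = len(tokens) - len(tokens) % self.block_size
        for i in range(0, full, self.block_size):
            if tuple(tokens[: i + self.block_size]) in self.blocks:
                hits += 1
            else:
                break
        return hits

    def update(self, tokens):
        """Record full blocks of `tokens` -- the step the decode node was
        skipping; without it, later shared prefixes never hit."""
        full = len(tokens) - len(tokens) % self.block_size
        for i in range(0, full, self.block_size):
            key = tuple(tokens[: i + self.block_size])
            self.blocks.setdefault(key, len(self.blocks))

cache = ToyPrefixCache(block_size=4)
req1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
assert cache.match(req1) == 0        # nothing cached yet
cache.update(req1)                   # the fix: update after allocation
req2 = [1, 2, 3, 4, 5, 6, 7, 8, 42]  # shares the first two blocks with req1
print(cache.match(req2))             # 2 cached blocks hit
```

If `update` is never called after allocation, `match(req2)` stays 0, which is the low hit rate the commit describes.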
lonelygsh
e7c8dc2fe9
[Speculate Decoding] Fix step_idx semantics in limit_thinking and set_stop_value kernels ( #7370 )
...
- speculate_limit_thinking_content_length: update current_base_step to
step_idx+1 (step_idx now records history count before current round);
remove incorrect step_idx decrement on accept_num truncation; mark
step_idx param as const.
- speculate_set_stop_value_multi_seqs: fix can_stop gate to use
step_idx_now+accept_num>=min_token_limit; fix skip check and pre_ids_idx
formula (remove stale -accept_num offset); use <= condition so accept_idx
maps directly to the accepted token that ends the stop sequence; fix
accept_tokens index (remove -1).
- Update unit tests for speculate_set_stop_value_multi_seqs kernel.
2026-04-14 12:54:22 +08:00
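The corrected `can_stop` gate described above can be written out directly. The variable names follow the commit message; the standalone function is an illustrative sketch, not the kernel itself:

```python
def can_stop(step_idx_now: int, accept_num: int, min_token_limit: int) -> bool:
    """Stop-sequence gate from the fix: a request may only stop once the
    tokens generated so far (step_idx_now, i.e. history before this round)
    plus the tokens accepted this round reach min_token_limit."""
    return step_idx_now + accept_num >= min_token_limit

# With 3 prior steps and 2 accepted tokens, only 5 of 8 required tokens
# exist, so the stop sequence must not fire yet.
print(can_stop(step_idx_now=3, accept_num=2, min_token_limit=8))  # False
print(can_stop(step_idx_now=6, accept_num=2, min_token_limit=8))  # True
```

The old gate ignored `accept_num`, so multi-token speculative rounds could stop one round late or early depending on how many draft tokens were accepted.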
chen
144dc17b14
update attn_mask_q 2 ( #7373 )
2026-04-13 23:06:16 +08:00
JYChen
9823d63220
remove fa4 requirements ( #7354 )
2026-04-13 19:24:24 +08:00
chenjian
d9a008f3c8
[Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1 ( #7159 ) ( #7351 )
...
* [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1
* [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1
* fix
2026-04-13 15:24:01 +08:00
sunxin
b2997f3aad
fix overlap mtp empty run ( #7314 )
2026-04-13 15:20:11 +08:00
liuruyan
9cb82d79a0
[Cherry-Pick][TI-consistent] support quant use pow2scale( #7308 ) ( #7310 )
...
* support quant use pow2scale
* fix
* fix
2026-04-13 00:02:08 -07:00
YuBaoku
9e8ea7db14
[Cherry-Pick][CI] Sync dev optimizations to 2.6( #7335 ) ( #7343 )
2026-04-12 13:22:52 +08:00
chen
7446665676
[Cherry-Pick][RL]moe bf16 ep support paddle batch_gemm( #7337 ) ( #7339 )
...
* moe bf16 ep support paddle batch_gemm
2026-04-11 21:51:26 +08:00
JYChen
42b0f59b9e
[Cherry-Pick][RL] change glm rope_emb calculation #7316 ( #7318 )
...
* change glm rope_emb calculation
* glm without EnforceFmulRN
* fix ci
2026-04-11 18:38:37 +08:00
YuBaoku
65c6e726f5
[Cherry-Pick][Docs] Update Release Note( #7302 ) ( #7341 )
2026-04-11 16:48:06 +08:00
YuBaoku
2ac9b89409
[XPU][CI]Update xtdk version in download_dependencies.sh ( #7320 ) ( #7322 )
...
Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com >
2026-04-11 00:27:54 +08:00
GoldPancake
c7560383ab
[Cherry-Pick][FDConfig] Auto-scale CUDA Graph Capture & CLI Quantization Params + CUDAGraph Validation (#7215,#7281) ( #7301 )
...
* refactor cudagraph args
* refactor quant cli param
* fix
* fix
* tmp skip xpu
* fix
2026-04-10 16:10:31 +08:00
zhangbo9674
4f36346e14
[Cherry-Pick] change rms norm for glm #7269 ( #7276 )
...
* fix
* refine code
* refine code
* refine code
* refine code
* refine code
2026-04-10 01:03:00 -07:00
YuBaoku
dd0863b076
[BugFix] Fix Async D2H copy bug & flash mask attn cache V out-of-bounds bug ( #7221 ) ( #7296 )
...
Co-authored-by: ming1753 <61511741+ming1753@users.noreply.github.com >
2026-04-10 13:54:02 +08:00
fxyfxy777
dea9d35171
[OP]Unify MoE op with moe_permute path for bf16 GLM ( #7164 ) ( #7279 )
2026-04-09 21:37:42 +08:00
YuBaoku
921a0ae60b
[Docs] Update docs for release/2.5 ( #7267 ) ( #7277 )
...
* Update docs for release/2.5
* Update English docs for release/2.5
- Update README_EN.md: add v2.5 news entry, reformat v2.4 entry with release link
- Update docs/get_started/installation/nvidia_gpu.md:
- Docker image: 2.4.0 -> 2.5.0, notice now shows SM80/86/89/90 support
- paddlepaddle-gpu: 3.3.0 -> 3.3.1, add CUDA 12.9 alternatives
- fastdeploy-gpu: 2.4.0 -> 2.5.0, unified arch install with CUDA 12.9 option
- Update docs/zh/get_started/installation/nvidia_gpu.md:
- Fix remaining paddlepaddle-gpu==3.3.0 refs in sections 4&5 -> 3.3.1
Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/fa0be381-324e-4b0d-b7a6-e2c1fa12174f
* Clarify --extra-index-url usage in installation docs
Add note explaining that --extra-index-url is only for downloading
fastdeploy-gpu dependencies; fastdeploy-gpu itself must be installed
from the Paddle source specified by -i. Applied to both Chinese and
English nvidia_gpu.md installation guides.
Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/9fa8b3c9-7555-4eae-b9b9-026cddd7e74c
* Update nvidia_gpu.md
---------
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com >
Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com >
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
2026-04-09 21:03:19 +08:00
Jiaxin Sui
6fcc25f3f6
Update ci_metax.yml ( #7286 )
2026-04-09 17:31:20 +08:00
Bingoo
849eb3df65
[Cherry-Pick][Optimization] merge matmul and add (#6986) ( #7191 )
...
* merge matmul and add
* modify format
* using paddle.nn.functional.linear
* using _C_ops.linear
* using paddle.nn.functional.linear
* add FLAGS_use_legacy_linear env var in test case
* fix format
* add assert and remove env
* modify format
* using matmul for no bias
* modify accurate baseline
2026-04-09 14:15:43 +08:00
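The optimization above replaces a separate matmul plus elementwise add with a single fused linear call (`paddle.nn.functional.linear` / `_C_ops.linear`), falling back to plain matmul when there is no bias. A NumPy stand-in showing the equivalence the change relies on (the fusion itself happens inside the framework kernel; this sketch only checks the math):

```python
import numpy as np

def linear(x, weight, bias=None):
    """Stand-in for a fused linear: x @ weight (+ bias).

    In the real kernel the bias add is folded into the GEMM epilogue,
    saving a launch and an intermediate tensor; with no bias, the commit
    uses a plain matmul.
    """
    out = x @ weight
    if bias is not None:
        out = out + bias
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))
w = rng.standard_normal((8, 4))
b = rng.standard_normal(4)

fused = linear(x, w, b)
unfused = (x @ w) + b          # the pre-optimization two-op form
assert np.allclose(fused, unfused)
```

Because the two forms are bitwise-close but not always bitwise-identical on GPU, the PR also had to "modify accurate baseline" in its tests.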
YuBaoku
098dd2c251
[XPU][CI] lock xvllm version for fix bug ( #7264 ) ( #7266 )
...
* Remove duplicate NICs from environment variables
* Update version for xvllm in download_dependencies.sh
Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com >
2026-04-09 12:46:13 +08:00
xiaoxiaohehe001
5fd8020363
[Cherry-Pick][BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn ( #7216 )
2026-04-09 11:05:43 +08:00
JYChen
9c65655cb3
[Cherry-Pick][RL] support moe-topk use topk_reduce_func #7218 ( #7256 )
...
* support moe-topk use topk_reduce_func
* fix ep error
* fix ut
* fix ut
2026-04-09 11:01:10 +08:00
Bingoo
01818844b4
support moe for sm103 ( #7240 )
2026-04-08 20:56:23 +08:00
YuBaoku
84d62712c9
[Feature]distinguish whl version ( #7204 ) ( #7224 )
...
* [Feature]whl version
* [Feature]whl version,set root_is_pure = false
* [Feature]code style
Co-authored-by: ChowMingSing <610208940@qq.com >
2026-04-08 17:32:38 +08:00
YuBaoku
6b78981dde
Split enable_mm ( #7183 ) ( #7233 )
...
Co-authored-by: K11OntheBoat <ruianmaidanglao@163.com >
Co-authored-by: liuruian <liuruian@MacBook-Pro.local >
2026-04-08 16:32:04 +08:00
GoldPancake
403ce139c7
remove arctic_inference deps ( #7236 )
2026-04-08 15:25:21 +08:00
huicongyao
36909bf27d
[Cherry-Pick][BugFix] fix MTP bugs in TP and overlap( #7172 ) ( #7192 )
...
* fix MTP bugs in TP and overlap
* fix
2026-04-08 10:24:38 +08:00
YuBaoku
7ab48c4760
[Cherry-Pick][CI] Use GPU-Build-RL runner for _build_linux_rl.yml ( #7186 ) ( #7195 )
2026-04-03 20:55:53 +08:00
Yonghua Li
55dbc83310
[Cherry-Pick][BugFix] prevent requests from entering running state without a slot( #7141 ) ( #7181 )
...
* [BugFix] Set MC_MAX_MR_SIZE to avoid register hang (#7163 )
* Set MC_MAX_MR_SIZE to avoid register hang
* up
* [fix] prevent requests from entering running state without a slot
* [fix] count abort set
* [fix] count preempted task in waiting list
---------
Co-authored-by: jc <52520497+juncaipeng@users.noreply.github.com >
2026-04-03 17:46:13 +08:00
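The scheduler fixes in this last entry (never enter the running state without a slot; count preempted tasks back into the waiting list) can be sketched as a guard in a toy scheduler. Class and method names are hypothetical, not FastDeploy's scheduler API:

```python
from collections import deque

class ToyScheduler:
    """Toy request scheduler illustrating the two fixes above."""

    def __init__(self, max_slots: int):
        self.max_slots = max_slots
        self.running = []
        self.waiting = deque()

    def try_schedule(self, req) -> bool:
        # The guard: a request enters RUNNING only if a slot is free.
        if len(self.running) < self.max_slots:
            self.running.append(req)
            return True
        self.waiting.append(req)
        return False

    def preempt(self, req):
        # A preempted task is counted back into the waiting list
        # (at the front, so it is rescheduled first).
        self.running.remove(req)
        self.waiting.appendleft(req)

s = ToyScheduler(max_slots=1)
assert s.try_schedule("r1")      # slot free: runs
assert not s.try_schedule("r2")  # no slot: queued, never "running"
s.preempt("r1")
print(len(s.waiting))            # 2: preempted r1 plus queued r2
```

Without the guard, `r2` would be appended to `running` regardless of capacity, which is exactly the "running state without a slot" condition the fix prevents.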