Commit Graph

5010 Commits

Author SHA1 Message Date
qwes5s5 9c91ecb1ec [Cherry-Pick][BugFix] Fix bugs in /v1/abort_requests interface from PR(#6992) (#7176) (#7551)
* abort api bug fix

* bug fix

* bug fix
2026-04-22 15:49:51 +08:00
GoldPancake 2961400190 [Cherry-Pick][BugFix] Fix clear_parameters hang issue in MTP during weight cleanup in RL (#7522) (#7523)
* fix mtp clear graph bugs in rl
2026-04-22 15:24:10 +08:00
Jiang-Jia-Jun b0fde163a6 Enable output caching by default 2026-04-22 11:01:54 +08:00
Jiang-Jia-Jun 86df2a9e86 Update args_utils.py (#7549) 2026-04-22 10:59:52 +08:00
jc d5518463ce Mooncake storage register local buffer by chunk (#7416) (#7540) 2026-04-22 10:46:57 +08:00
YuBaoku 13034ef0ca [BugFix] Fix skip_x_record_stream incompatibility across deep_ep versions (#7542) (#7546)
* fix skip_x_record_stream

* fix

* optim

Co-authored-by: Yuanle Liu <yuanlehome@163.com>
2026-04-21 06:31:45 -07:00
chen be2fd17e7d add m_grouped_bf16_gemm_nn_contiguous(#7536) 2026-04-21 20:20:03 +08:00
RAM 74ddb20a73 [RL][Cherry-Pick] Fix the out-of-bounds issue caused by int32 in the R3 kernel (#7496)
* [RL]Perf: Optimize batch delete prefix and fused put in R3 (#6604)

* Optimize batch delete and fused put

* refine code

* refine code

* refine code

* Support suspend r3

* [RL] Fix R3 Empty bug with TP=1 (#6777)

* Fix int32 overflow

* refine code

* fix seq_lens_decoder bug

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-04-21 01:51:45 -07:00
zhouchong 95261f098b Unify num_experts_per_tok to moe_k in ModelConfig for MoE model compatibility (#7517) 2026-04-21 15:21:47 +08:00
YuBaoku f4f7760925 [CI] Temporarily pin paddlepaddle-gpu to 3.5.0.dev20260417 (#7486) (#7519) 2026-04-20 21:09:21 +08:00
jackyYang6 fc801f8387 [Bugfix][RL] fix control request timeout in async update weights pipeline (#7470) 2026-04-20 11:23:44 +08:00
freeliuzc 56b761de3f [Cherry-Pick][Speculative Decoding][BugFix] Fix apply repeat times penalty kernel and change spec default verify strategy(#7467) (#7468)
* fix repeat_time kernel and change default spec verify strategy

* fix unit_test
2026-04-18 00:07:34 +08:00
GoldPancake 650d1e49aa [Cherry-Pick][Speculative Decoding] Add MTP logprob support for PD disaggregation (#7442) (#7464)
* support mtp logprob in pd

* fix

* fix

* fix

* fix xpu bugs
2026-04-17 21:37:42 +08:00
freeliuzc 185708b566 [Cherry-Pick][BugFix] Fix real token exceeding max_batched_tokens limit(#7438) (#7439)
* fix max_num_batched_tokens error compute

* add temporary solution

* fix bug
2026-04-17 16:17:59 +08:00
YuBaoku 72ce56b10b [BugFix] fix tool call parser (#7369) (#7419)
* fix tool call parser

* add unit test

* fix unit test

* add unit test

Co-authored-by: luukunn <981429396@qq.com>
2026-04-16 17:15:03 +08:00
jc b8e8a6253f PD deployment support without router (#7412) (#7424) 2026-04-16 14:02:10 +08:00
GoldPancake 26674bbbb6 [Cherry-Pick][RL] Add clear_graph_opt_backend for glm4_mtp (#7378) (#7379)
* add clear_graph func

* fix spell
2026-04-15 19:45:09 +08:00
Bingoo 61bfe6e5b3 modify flashmask version (#7414) 2026-04-15 18:19:21 +08:00
chen 2ee1cc3d0a check init_flash_attn_version log (#7401) 2026-04-15 11:05:20 +08:00
sunxin 5f7524eb85 fix rl moe gate type (#7394) 2026-04-14 20:04:09 +08:00
freeliuzc f6c066fb9d Revert "[Optimization] Optimize ttft for prefill pd (#6680)" (#7386)
* Revert "[Optimization] Optimize ttft for prefill pd (#6680)"

This reverts commit 6727df8286.

* fix revert pr
2026-04-14 20:01:39 +08:00
YuBaoku 8a8beca548 [BugFix][PD Disaggregation][KVCache] Fix low cache hit rate in PD split scenario (#7364) (#7387)
## Motivation

In the PD-disaggregated scenario, after the decode node receives a request forwarded by the prefill node, it does not promptly update the cache-block hit information, resulting in a low prefix-cache hit rate and degraded inference performance.

## Modifications

1. In the `_free_blocks_when_stop` method, additionally exclude the prefill node (`splitwise_role == "prefill"`) from cache-block updates, so the prefill node does not update the cache twice and corrupt its state.
2. After the decode node successfully allocates a request (`_alloc_requests_with_cache`), proactively call `update_cache_blocks` with `need_prefill_tokens` to update the cache-block information, ensuring the decode node correctly tracks the prefix-cache hits.

Co-authored-by: kevin <chengyf112@gmail.com>
2026-04-14 19:25:12 +08:00
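The two modifications in the commit above can be sketched as follows. This is a hypothetical simplification, not FastDeploy's actual scheduler: the `CacheManager` class and its bookkeeping are stand-ins, and only the method and field names (`update_cache_blocks`, `splitwise_role`, `need_prefill_tokens`) are taken from the commit message.

```python
# Hypothetical sketch of the decode-side cache fix described above.
# CacheManager and its block bookkeeping are simplified stand-ins,
# not FastDeploy's real classes.

class CacheManager:
    def __init__(self, splitwise_role: str):
        self.splitwise_role = splitwise_role  # "prefill" or "decode"
        self.cached_tokens = {}  # request_id -> tokens recorded as cached

    def update_cache_blocks(self, request_id: str, num_tokens: int) -> None:
        # Record how many prefix tokens of this request are now in cache.
        self.cached_tokens[request_id] = num_tokens

    def free_blocks_when_stop(self, request_id: str) -> None:
        # Fix 1: the prefill node must not touch cache-hit bookkeeping
        # here, otherwise it updates the cache twice and corrupts state.
        if self.splitwise_role == "prefill":
            return
        self.cached_tokens.pop(request_id, None)

    def alloc_request_with_cache(self, request_id: str,
                                 need_prefill_tokens: int) -> bool:
        # ... block allocation would happen here ...
        # Fix 2: after a successful allocation on the decode node,
        # proactively record the prefix-cache hit information.
        self.update_cache_blocks(request_id, need_prefill_tokens)
        return True

decode = CacheManager(splitwise_role="decode")
decode.alloc_request_with_cache("req-0", need_prefill_tokens=128)
print(decode.cached_tokens["req-0"])  # 128
```

With this guard in place, only the decode node records hits, which is what restores the prefix-cache hit rate described in the motivation.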
lonelygsh e7c8dc2fe9 [Speculate Decoding] Fix step_idx semantics in limit_thinking and set_stop_value kernels (#7370)
- speculate_limit_thinking_content_length: update current_base_step to
  step_idx+1 (step_idx now records history count before current round);
  remove incorrect step_idx decrement on accept_num truncation; mark
  step_idx param as const.
- speculate_set_stop_value_multi_seqs: fix can_stop gate to use
  step_idx_now+accept_num>=min_token_limit; fix skip check and pre_ids_idx
  formula (remove stale -accept_num offset); use <= condition so accept_idx
  maps directly to the accepted token that ends the stop sequence; fix
  accept_tokens index (remove -1).
- Update unit tests for speculate_set_stop_value_multi_seqs kernel.
2026-04-14 12:54:22 +08:00
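The corrected `can_stop` gate from the commit above can be written out as a small check. The real logic lives in the `speculate_set_stop_value_multi_seqs` CUDA kernel; this stand-alone Python function is only a sketch of the condition stated in the message, with simplified scalar inputs.

```python
# Sketch of the fixed can_stop gate in speculate_set_stop_value_multi_seqs.
# step_idx_now counts history tokens before the current round (per the
# new step_idx semantics); the current round contributes accept_num newly
# accepted draft tokens. Stopping is allowed only once their sum reaches
# the minimum token limit.

def can_stop(step_idx_now: int, accept_num: int, min_token_limit: int) -> bool:
    return step_idx_now + accept_num >= min_token_limit

print(can_stop(step_idx_now=3, accept_num=2, min_token_limit=5))  # True
print(can_stop(step_idx_now=3, accept_num=1, min_token_limit=5))  # False
```

Including `accept_num` in the sum is the point of the fix: a sequence that crosses the limit mid-round via accepted draft tokens is now allowed to stop in that same round.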
chen 144dc17b14 update attn_mask_q 2 (#7373) 2026-04-13 23:06:16 +08:00
JYChen 9823d63220 remove fa4 requirements (#7354) 2026-04-13 19:24:24 +08:00
chenjian d9a008f3c8 [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1 (#7159) (#7351)
* [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1

* [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1

* fix
2026-04-13 15:24:01 +08:00
sunxin b2997f3aad fix overlap mtp empty run (#7314) 2026-04-13 15:20:11 +08:00
liuruyan 9cb82d79a0 [Cherry-Pick][TI-consistent] support quant use pow2scale(#7308) (#7310)
* support quant use pow2scale

* fix

* fix
2026-04-13 00:02:08 -07:00
YuBaoku 9e8ea7db14 [Cherry-Pick][CI] Sync dev optimizations to 2.6(#7335) (#7343) 2026-04-12 13:22:52 +08:00
chen 7446665676 [Cherry-Pick][RL]moe bf16 ep support paddle batch_gemm(#7337) (#7339)
* moe bf16 ep support paddle batch_gemm
2026-04-11 21:51:26 +08:00
JYChen 42b0f59b9e [Cherry-Pick][RL] change glm rope_emb calculation #7316 (#7318)
* change glm rope_emb calculation

* glm without EnforceFmulRN

* fix ci
2026-04-11 18:38:37 +08:00
YuBaoku 65c6e726f5 [Cherry-Pick][Docs] Update Release Note(#7302) (#7341) 2026-04-11 16:48:06 +08:00
YuBaoku 2ac9b89409 [XPU][CI]Update xtdk version in download_dependencies.sh (#7320) (#7322)
Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
2026-04-11 00:27:54 +08:00
GoldPancake c7560383ab [Cherry-Pick][FDConfig] Auto-scale CUDA Graph Capture & CLI Quantization Params + CUDAGraph Validation (#7215,#7281) (#7301)
* refactor cudagraph args

* refactor quant cli param

* fix

* fix

* tmp skip xpu

* fix
2026-04-10 16:10:31 +08:00
zhangbo9674 4f36346e14 [Cherry-Pick] change rms norm for glm #7269 (#7276)
* fix

* refine code

* refine code

* refine code

* refine code

* refine code
2026-04-10 01:03:00 -07:00
YuBaoku dd0863b076 [BugFix] Fix Async D2H copy bug & flash mask attn cache V out-of-bounds bug (#7221) (#7296)
Co-authored-by: ming1753 <61511741+ming1753@users.noreply.github.com>
2026-04-10 13:54:02 +08:00
fxyfxy777 dea9d35171 [OP]Unify MoE op with moe_permute path for bf16 GLM (#7164) (#7279) 2026-04-09 21:37:42 +08:00
YuBaoku 921a0ae60b [Docs] Update docs for release/2.5 (#7267) (#7277)
* Update docs for release/2.5

* Update English docs for release/2.5

- Update README_EN.md: add v2.5 news entry, reformat v2.4 entry with release link
- Update docs/get_started/installation/nvidia_gpu.md:
  - Docker image: 2.4.0 -> 2.5.0, notice now shows SM80/86/89/90 support
  - paddlepaddle-gpu: 3.3.0 -> 3.3.1, add CUDA 12.9 alternatives
  - fastdeploy-gpu: 2.4.0 -> 2.5.0, unified arch install with CUDA 12.9 option
- Update docs/zh/get_started/installation/nvidia_gpu.md:
  - Fix remaining paddlepaddle-gpu==3.3.0 refs in sections 4&5 -> 3.3.1

Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/fa0be381-324e-4b0d-b7a6-e2c1fa12174f

* Clarify --extra-index-url usage in installation docs

Add note explaining that --extra-index-url is only for downloading
fastdeploy-gpu dependencies; fastdeploy-gpu itself must be installed
from the Paddle source specified by -i. Applied to both Chinese and
English nvidia_gpu.md installation guides.

Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/9fa8b3c9-7555-4eae-b9b9-026cddd7e74c

* Update nvidia_gpu.md

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
2026-04-09 21:03:19 +08:00
Jiaxin Sui 6fcc25f3f6 Update ci_metax.yml (#7286) 2026-04-09 17:31:20 +08:00
Bingoo 849eb3df65 [Cherry-Pick][Optimization] merge matmul and add (#6986) (#7191)
* merge matmul and add

* modify format

* using paddle.nn.functional.linear

* using _C_ops.linear

* using paddle.nn.functional.linear

* add FLAGS_use_legacy_linear env var in test case

* fix format

* add assert and remove env

* modify format

* using matmul for no bias

* modify accurate baseline
2026-04-09 14:15:43 +08:00
YuBaoku 098dd2c251 [XPU][CI] lock xvllm version for fix bug (#7264) (#7266)
* Remove duplicate NICs from environment variables

* Update version for xvllm in download_dependencies.sh

Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
2026-04-09 12:46:13 +08:00
xiaoxiaohehe001 5fd8020363 [Cherry-Pick][BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn (#7216) 2026-04-09 11:05:43 +08:00
JYChen 9c65655cb3 [Cherry-Pick][RL] support moe-topk use topk_reduce_func #7218 (#7256)
* support moe-topk use topk_reduce_func

* fix ep error

* fix ut

* fix ut
2026-04-09 11:01:10 +08:00
Bingoo 01818844b4 support moe for sm103 (#7240) 2026-04-08 20:56:23 +08:00
YuBaoku 84d62712c9 [Feature]distinguish whl version (#7204) (#7224)
* [Feature]whl version

* [Feature]whl version,set root_is_pure = false

* [Feature]code style

Co-authored-by: ChowMingSing <610208940@qq.com>
2026-04-08 17:32:38 +08:00
YuBaoku 6b78981dde Split enable_mm (#7183) (#7233)
Co-authored-by: K11OntheBoat <ruianmaidanglao@163.com>
Co-authored-by: liuruian <liuruian@MacBook-Pro.local>
2026-04-08 16:32:04 +08:00
GoldPancake 403ce139c7 remove arctic_inference deps (#7236) 2026-04-08 15:25:21 +08:00
huicongyao 36909bf27d [Cherry-Pick][BugFix] fix MTP bugs in TP and overlap(#7172) (#7192)
* fix MTP bugs in TP and overlap

* fix
2026-04-08 10:24:38 +08:00
YuBaoku 7ab48c4760 [Cherry-Pick][CI] Use GPU-Build-RL runner for _build_linux_rl.yml (#7186) (#7195) 2026-04-03 20:55:53 +08:00
Yonghua Li 55dbc83310 [Cherry-Pick][BugFix] prevent requests from entering running state without a slot(#7141) (#7181)
* [BugFix] Set MC_MAX_MR_SIZE to avoid register hang (#7163)

* Set MC_MAX_MR_SIZE to avoid register hang

* up

* [fix] prevent requests from entering running state without a slot

* [fix] count abort set

* [fix] count preempted task in waiting list

---------

Co-authored-by: jc <52520497+juncaipeng@users.noreply.github.com>
2026-04-03 17:46:13 +08:00