FastDeploy

mirror of https://github.com/PaddlePaddle/FastDeploy.git synced 2026-04-23 00:17:25 +08:00

Author	SHA1	Message	Date
RichardWooSJTU	d2d633b05c	allow parallel dp starting (#7426 )	2026-04-16 18:43:09 +08:00
RichardWooSJTU	420a8c1af5	fix deep gemm import (#7425 )	2026-04-16 17:56:56 +08:00
ddchenhao66	e9527208d9	[BugFix][XPU] Fix kv_cache management bug (#7420 )	2026-04-16 15:45:45 +08:00
zhouchong	6e16438a57	[Feature] implement log channel separation and request log level system (#7190 ) * feat: implement log channel separation and request log level system * fix: log system improvements based on review * add request_id to error logs, use RequestLogLevel enum, and unify logger implementation from utils to logger module	2026-04-16 15:13:05 +08:00
Jiajun Ji	29495b2cf1	[XPU] Unify Spec and non-spec branch.(#6947 ) (#7180 ) * [XPU] cherry-pick PR-6947 * [XPU] use unified_update_model_status. * refactor xpu_model_runner. * refactor sampler. * fix codestyle. * Fix XPU speculative decoding: rename output tensors to cu_seqlens_q_output/batch_id_per_token_output, correct WRAPPER_CHECK_PTR types, and fix dynamic gather shape in verify_draft_tokens path. * fix codestyle. * replace output_padding_offset with is_speculative flag in gather_next_token. * rename hiddden_states. * unify cu_seqlens_q_output and batch_id_per_token_output init. --------- Co-authored-by: cmcamdy <1027740945@qq.com>	2026-04-16 14:58:38 +08:00
YuBaoku	17002edc47	[CI] Add approval check for logging-related modifications (#7429 )	2026-04-16 14:50:22 +08:00
RuohengMa	de0c5e68fb	[XPU] Split the block_attn operator into smaller operators (#6798 ) * spliced block_attn * adapt to latest vllm * fix unit tests * delete mtp+cudagraph 4 cards test * fix vl model * fix mtp * fix slot mapping	2026-04-16 14:28:40 +08:00
Bingoo	6b891da02b	[Optimization] enable trtllm_all_reduce fusion kernel in glm model (#6660 ) * enable trtllm_all_reduce fusion kernel in glm model * fix conflict * format update * fix a bug * modify test * modify test * support empty tensor and modify test * fix test_linear config issues * modify test name * add edge test case * modify format * fix conflict * modify default max token num in trtllm_allreduce_fusion * add max token num branch for trtllm_allreduce_fusion * fix format * fix rmsnorm config issue * modify 2025 to 2026 * using compat grard * Lazily import flashinfer.comm and fix test config issue * fix test issues * add flashinfer cache dir clean machine * fix some issues	2026-04-16 14:10:19 +08:00
jc	e53f5184ac	PD deployment support without router (#7412 )	2026-04-15 20:13:07 +08:00
GoldPancake	a498720a75	[RL] Add clear_graph_opt_backend for glm4_mtp (#7378 ) * add clear_grpah func * fix spell	2026-04-15 19:44:15 +08:00
RichardWooSJTU	dec0b060fc	[Optimization] Auto set num_max_dispatch_tokens_per_rank (#7237 ) * auto set num_max_dispatch_tokens_per_rank * fix ci * fix ci * fix ci	2026-04-15 19:13:38 +08:00
luukunn	3f84d8d893	[DataProcessor] Refactor multimodal processor: extract encoding strategies and unify MM processing pipeline (#7298 ) * merge mm processor	2026-04-15 19:01:06 +08:00
Bingoo	a218d29488	modify flash_mask version (#7413 )	2026-04-15 18:16:58 +08:00
luukunn	14d556692b	[BugFix] fix tool call parser (#7369 ) * fix tool call parser * add unit test * fix unit test * add unit test	2026-04-15 16:21:46 +08:00
AIbin	8eebbcaf15	[BugFix][Scheduler]Fix FD_DISABLE_CHUNKED_PREFILL max_num_batched_tokens limit (#7407 ) * fix FD_DISABLE_CHUNKED_PREFILL max_num_batched_tokens=max_model_len * fix FD_DISABLE_CHUNKED_PREFILL max_num_batched_tokens=max_model_len	2026-04-15 15:55:11 +08:00
周周周	5e54770b2e	[Feature] Add latent mode support for MoE layers (#7382 )	2026-04-15 13:57:07 +08:00
lonelygsh	f7a2418ce2	[Speculate Decoding] Fix reasoning_phase_token_constraint call args in SpeculativeSampler (#7402 )	2026-04-15 12:45:23 +08:00
AIbin	8995a38fa4	fix dsa indexer norm to layernorm (#7398 )	2026-04-15 11:42:45 +08:00
AIbin	bb30f88f1a	[Models] support MLA gate attention (#7404 ) * support mla gate attn * support mla gate attn	2026-04-15 11:42:34 +08:00
chen	616b29ce08	check init_flash_attn_version log (#7399 )	2026-04-15 11:05:10 +08:00
cmcamdy	13b9fe7299	[XPU] add verify draft tokens (#6947 ) * [XPU] add verify draft tokens * fix test * fix code style * use sync cpy * fix code style * fix kernel check * fix ramdom seed * fix test * fix check * fix eos set * fix verify * fix verify	2026-04-15 10:18:33 +08:00
lonelygsh	e0a1653b26	[Speculate Decoding] Fix bug of reasoning_phase_token_constraint kernel (#7349 ) Co-authored-by: guanshihui] <guanshihui@baidu.com>	2026-04-14 20:57:11 +08:00
sunxin	7b0baced17	fix rl moe gate type (#7393 )	2026-04-14 20:04:04 +08:00
Echo-Nie	8819a039c9	[Others] Fix typo (#7280 ) * typo * typo * typo * typo	2026-04-14 17:28:22 +08:00
luukunn	9d9d79c457	[DataProcessor] add strict (#7307 ) * add strict * fix	2026-04-14 17:25:38 +08:00
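kevin	ff47701f31	[BugFix][PD Disaggregation][KVCache] Fix low cache hit rate in PD split scenario (#7364 ) ## Motivation In the PD disaggregation scenario, after receiving a request forwarded by the prefill node, the decode node did not update the cache block hit information in time, which led to a low prefix cache hit rate and hurt inference performance. ## Modifications 1. In the `_free_blocks_when_stop` method, additionally exclude prefill nodes (`splitwise_role == "prefill"`) from cache block updates, so that prefill nodes do not update the cache repeatedly and corrupt its state. 2. After the decode node successfully allocates a request (`_alloc_requests_with_cache`), proactively call `update_cache_blocks` with `need_prefill_tokens` to update the cache block information, so that the decode node correctly recognizes the prefix cache that has already been hit.	2026-04-14 16:15:43 +08:00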
Bingoo	9c23e6154c	[Others] replace tool_helpers to fast_dataindex (#7353 ) * replace tool_helpers to fast_dataindex * modify others requirement	2026-04-14 15:13:54 +08:00
xiaoxiaohehe001	abba29b348	[BugFix] fix mm rope (#7274 )	2026-04-14 11:36:08 +08:00
Yuanle Liu	8f21c9caa6	[BugFix] fix gitignore claude (#7381 )	2026-04-13 20:32:45 -07:00
zhupengyang	27b00cf385	[XPU] glm-4.5-air (#7071 )	2026-04-14 11:31:49 +08:00
chen	26c47c2afc	update attn_mask_q 2 (#7371 )	2026-04-13 23:06:04 +08:00
Yuanle Liu	0ddb6e461c	[Optimization] Remove the upper limit on num_blocks (#7241 )	2026-04-13 07:07:41 -07:00
lonelygsh	e83d45833f	[Speculate Decoding] Fix step_idx semantics in limit_thinking and set_stop_value kernels (#7166 ) - speculate_limit_thinking_content_length: update current_base_step to step_idx+1 (step_idx now records history count before current round); remove incorrect step_idx decrement on accept_num truncation; mark step_idx param as const. - speculate_set_stop_value_multi_seqs: fix can_stop gate to use step_idx_now+accept_num>=min_token_limit; fix skip check and pre_ids_idx formula (remove stale -accept_num offset); use <= condition so accept_idx maps directly to the accepted token that ends the stop sequence; fix accept_tokens index (remove -1). - Update unit tests for speculate_set_stop_value_multi_seqs kernel.	2026-04-13 20:53:42 +08:00
周周周	73bd4ab318	[Feature] Add explicit hidden_size parameter support to FusedMoE (#7361 )	2026-04-13 20:24:58 +08:00
YuBaoku	1e08ee74e5	[CI] Modify 4-card container startup config and move test case (#7363 )	2026-04-13 05:23:49 -07:00
freeliuzc	31e2a8bbad	[Speculative Decoding] Support mtp super ultra overlap in pd-split mode with insert_task overlap (#7323 ) * support mtp overlap in pd-split mode with insert_task overlap	2026-04-13 19:41:17 +08:00
JYChen	5ddd1af756	remove fa4 requirements (#7143 )	2026-04-13 19:24:20 +08:00
AIbin	1fb8194191	[OP][Models][Optimization] Optimize the RoPE CUDA kernel and update the DeepSeek V3 config (#7359 ) * dsk del prefill mask * dsk support 1M+ seq_len rope * update rope tests * Replace max_position_embeddings with max_model_len * 1D grid: gridDim.x has a maximum size of 2^31-1, far exceeding the actual number of tokens.	2026-04-13 19:12:36 +08:00
Zhang Yulong	738c658c54	[Benchmark] Update seed argument handling in benchmark_serving.py (#7356 )	2026-04-13 16:05:50 +08:00
周周周	a6f0055d51	add ips check (#7352 ) * commit * commit --------- Co-authored-by: “liuruian” <liuruian@baidu.com>	2026-04-13 15:24:22 +08:00
liuruyan	b34708604c	[TI-consistent] support quant use pow2scale (#7308 ) * support quant use pow2scale * fix * fix	2026-04-13 00:01:53 -07:00
AIbin	6213ad5340	[Docs][BugFix] fix mla log (#7243 ) * [Docs] Fix Chinese punctuation issues	2026-04-13 12:15:43 +08:00
Nyako Shigure	d659099415	[Cleanup] Replace torch proxy alias with public compat API (#7348 )	2026-04-13 11:43:26 +08:00
Jiajun Ji	cb03958b52	[XPU] Refactor get_padding_offset to single kernel. (#7029 ) * [XPU] Refactor get_padding_offset to single kernel. * add unittest. * fix codestyle. * remove cum_offsets_now. * remove max_len.	2026-04-13 11:04:50 +08:00
Jiang-Jia-Jun	26d6a20c2f	[Optim] Remove IPCLock between CacheManager and WorkerProcess (#7299 ) * [Optim] Remove IPCLock between CacheManager and WorkerProcess * Update envs.py * Update worker_process.py --------- Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com>	2026-04-12 13:59:34 +08:00
周周周	225fc8d222	use self.hidden_size not use self.fd_config.model_config.hidden_size (#7340 )	2026-04-11 22:39:43 +08:00
chen	4982aa000e	[RL]moe bf16 ep support paddle batch_gemm (#7337 ) * moe bf16 ep support paddle batch_gemm	2026-04-11 21:51:12 +08:00
AIbin	ba01d7a823	[Optimization] [OP] [Models] dsk del prefill mask (#7313 ) * dsk del prefill mask * dsk support 1M+ seq_len rope * update rope tests	2026-04-11 19:32:27 +08:00
JYChen	076ab07528	[RL] change glm rope_emb calculation (#7316 ) * change glm rope_emb calculation * glm without EnforceFmulRN * fix ci	2026-04-11 18:36:28 +08:00
YuBaoku	fcf8b1336d	[CI] Fix nightly test error and add container cleanup in build_rl (#7335 ) * [CI] Fix nightly test error and add container cleanup in build_rl	2026-04-11 12:14:46 +08:00

5069 Commits