Commit Graph

2024 Commits

Author SHA1 Message Date
jc 6847891241 Mooncake storage register local buffer by chunk (#7416) 2026-04-17 10:39:34 +08:00
AIbin 6ce4854714 [Feature] Support MOE Cutlass backend for latent MOE (#7428)
* support moe cutlass backend for latent moe
2026-04-16 22:11:49 +08:00
ShaneGZhu 2d8338f9e4 [Optimization][DeepSeekV3.2] Reducing slot_mapping compute frequency from twice per layer to a single pre-processing step. (#7367) 2026-04-16 19:54:12 +08:00
RichardWooSJTU d2d633b05c allow parallel dp starting (#7426) 2026-04-16 18:43:09 +08:00
RichardWooSJTU 420a8c1af5 fix deep gemm import (#7425) 2026-04-16 17:56:56 +08:00
ddchenhao66 e9527208d9 [BugFix][XPU] Fix kv_cache management bug (#7420) 2026-04-16 15:45:45 +08:00
zhouchong 6e16438a57 [Feature] implement log channel separation and request log level system (#7190)
* feat: implement log channel separation and request log level system

* fix: log system improvements based on review

* add request_id to error logs, use RequestLogLevel enum, and unify logger implementation from utils to logger module
2026-04-16 15:13:05 +08:00
Jiajun Ji 29495b2cf1 [XPU] Unify Spec and non-spec branch. (#6947) (#7180)
* [XPU] cherry-pick PR-6947

* [XPU] use unified_update_model_status.

* refactor xpu_model_runner.

* refactor sampler.

* fix codestyle.

* Fix XPU speculative decoding: rename output tensors to cu_seqlens_q_output/batch_id_per_token_output, correct
  WRAPPER_CHECK_PTR types, and fix dynamic gather shape in verify_draft_tokens path.

* fix codestyle.

* replace output_padding_offset with is_speculative flag in gather_next_token.

* rename hidden_states.

* unify cu_seqlens_q_output and batch_id_per_token_output init.

---------

Co-authored-by: cmcamdy <1027740945@qq.com>
2026-04-16 14:58:38 +08:00
RuohengMa de0c5e68fb [XPU] Split the block_attn operator into smaller operators (#6798)
* split block_attn

* adapt to latest vllm

* fix unit tests

* delete mtp+cudagraph 4 cards test

* fix vl model

* fix mtp

* fix slot mapping
2026-04-16 14:28:40 +08:00
Bingoo 6b891da02b [Optimization] enable trtllm_all_reduce fusion kernel in glm model (#6660)
* enable trtllm_all_reduce fusion kernel in glm model

* fix conflict

* format update

* fix a bug

* modify test

* modify test

* support empty tensor and modify test

* fix test_linear config issues

* modify test name

* add edge test case

* modify format

* fix conflict

* modify default max token num in trtllm_allreduce_fusion

* add max token num branch for trtllm_allreduce_fusion

* fix format

* fix rmsnorm config issue

* modify 2025 to 2026

* use compat guard

* Lazily import flashinfer.comm and fix test config issue

* fix test issues

* add flashinfer cache dir clean machine

* fix some issues
2026-04-16 14:10:19 +08:00
jc e53f5184ac PD deployment support without router (#7412) 2026-04-15 20:13:07 +08:00
GoldPancake a498720a75 [RL] Add clear_graph_opt_backend for glm4_mtp (#7378)
* add clear_graph func

* fix spell
2026-04-15 19:44:15 +08:00
RichardWooSJTU dec0b060fc [Optimization] Auto set num_max_dispatch_tokens_per_rank (#7237)
* auto set num_max_dispatch_tokens_per_rank

* fix ci

* fix ci

* fix ci
2026-04-15 19:13:38 +08:00
luukunn 3f84d8d893 [DataProcessor] Refactor multimodal processor: extract encoding strategies and unify MM processing pipeline (#7298)
* merge mm processor
2026-04-15 19:01:06 +08:00
luukunn 14d556692b [BugFix] fix tool call parser (#7369)
* fix tool call parser

* add unit test

* fix unit test

* add unit test
2026-04-15 16:21:46 +08:00
AIbin 8eebbcaf15 [BugFix][Scheduler] Fix FD_DISABLE_CHUNKED_PREFILL max_num_batched_tokens limit (#7407)
* fix FD_DISABLE_CHUNKED_PREFILL max_num_batched_tokens=max_model_len

* fix FD_DISABLE_CHUNKED_PREFILL max_num_batched_tokens=max_model_len
2026-04-15 15:55:11 +08:00
周周周 5e54770b2e [Feature] Add latent mode support for MoE layers (#7382) 2026-04-15 13:57:07 +08:00
lonelygsh f7a2418ce2 [Speculate Decoding] Fix reasoning_phase_token_constraint call args in SpeculativeSampler (#7402) 2026-04-15 12:45:23 +08:00
AIbin 8995a38fa4 fix dsa indexer norm to layernorm (#7398) 2026-04-15 11:42:45 +08:00
AIbin bb30f88f1a [Models] support MLA gate attention (#7404)
* support mla gate attn

* support mla gate attn
2026-04-15 11:42:34 +08:00
chen 616b29ce08 check init_flash_attn_version log (#7399) 2026-04-15 11:05:10 +08:00
sunxin 7b0baced17 fix rl moe gate type (#7393) 2026-04-14 20:04:04 +08:00
Echo-Nie 8819a039c9 [Others] Fix typo (#7280)
* typo

* typo

* typo

* typo
2026-04-14 17:28:22 +08:00
luukunn 9d9d79c457 [DataProcessor] add strict (#7307)
* add strict

* fix
2026-04-14 17:25:38 +08:00
kevin ff47701f31 [BugFix][PD Disaggregation][KVCache] Fix low cache hit rate in PD split scenario (#7364)
## Motivation

In the PD-disaggregated scenario, after the decode node receives a request forwarded by the prefill node,
it does not promptly update the cache block hit information, resulting in a low prefix cache hit rate and degraded inference performance.

## Modifications

1. In the `_free_blocks_when_stop` method, additionally exclude cache block updates on prefill nodes
   (`splitwise_role == "prefill"`), so prefill nodes do not redundantly update the cache and corrupt its state.
2. After the decode node successfully allocates a request (`_alloc_requests_with_cache`), proactively call
   `update_cache_blocks` with `need_prefill_tokens` to update the cache block information,
   ensuring the decode node correctly sees the prefix cache hits.
2026-04-14 16:15:43 +08:00
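The two modifications in the commit above can be sketched as follows. The method names `_free_blocks_when_stop`, `_alloc_requests_with_cache`, and `update_cache_blocks` and the `splitwise_role` field come from the commit message; the `CacheManager` class shape, its fields, and the surrounding logic are illustrative assumptions, not the actual FastDeploy implementation:

```python
# Hedged sketch of the PD-disaggregation cache fix described above.
# Only the method names and splitwise_role come from the commit; the
# rest (class layout, cache_hits dict, return values) is hypothetical.

class CacheManager:
    def __init__(self, splitwise_role: str):
        self.splitwise_role = splitwise_role   # "prefill" or "decode"
        self.cache_hits = {}                   # request_id -> hit token count

    def update_cache_blocks(self, request_id: str, num_tokens: int) -> None:
        # Record how many prefix tokens this request has matched in cache.
        self.cache_hits[request_id] = num_tokens

    def _alloc_requests_with_cache(self, request_id: str,
                                   need_prefill_tokens: int) -> bool:
        allocated = True  # block allocation itself is elided in this sketch
        # Fix 2: after a successful allocation on the decode node,
        # proactively update cache block info with need_prefill_tokens
        # so prefix-cache hits become visible to the decode node.
        if allocated and self.splitwise_role == "decode":
            self.update_cache_blocks(request_id, need_prefill_tokens)
        return allocated

    def _free_blocks_when_stop(self, request_id: str) -> None:
        # Fix 1: exclude prefill nodes from the cache-block update here,
        # so a prefill node does not redundantly update shared cache state.
        if self.splitwise_role != "prefill":
            self.update_cache_blocks(request_id, 0)
```

Under these assumptions, a decode node records the prefix-cache hit at allocation time, while a prefill node skips the update entirely on stop.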
xiaoxiaohehe001 abba29b348 [BugFix] fix mm rope (#7274) 2026-04-14 11:36:08 +08:00
zhupengyang 27b00cf385 [XPU] glm-4.5-air (#7071) 2026-04-14 11:31:49 +08:00
Yuanle Liu 0ddb6e461c [Optimization] Remove the num_blocks upper limit (#7241) 2026-04-13 07:07:41 -07:00
周周周 73bd4ab318 [Feature] Add explicit hidden_size parameter support to FusedMoE (#7361)
2026-04-13 20:24:58 +08:00
freeliuzc 31e2a8bbad [Speculative Decoding] Support mtp super ultra overlap in pd-split mode with insert_task overlap (#7323)
* support mtp overlap in pd-split mode with insert_task overlap
2026-04-13 19:41:17 +08:00
AIbin 1fb8194191 [OP][Models][Optimization] Optimize the RoPE CUDA kernel and update the DeepSeek V3 config (#7359)
* dsk del prefill mask

* dsk support 1M+ seq_len rope

* update rope tests

* Replace max_position_embeddings with max_model_len

* 1D grid: gridDim.x has a maximum size of 2^31-1, far exceeding the actual number of tokens.
2026-04-13 19:12:36 +08:00
周周周 a6f0055d51 add ips check (#7352)
* commit

* commit

---------

Co-authored-by: "liuruian" <liuruian@baidu.com>
2026-04-13 15:24:22 +08:00
liuruyan b34708604c [TI-consistent] support quant use pow2scale (#7308)
* support quant use pow2scale

* fix

* fix
2026-04-13 00:01:53 -07:00
AIbin 6213ad5340 [Docs][BugFix] fix mla log (#7243)
* [Docs] Fix Chinese punctuation issues
2026-04-13 12:15:43 +08:00
Nyako Shigure d659099415 [Cleanup] Replace torch proxy alias with public compat API (#7348) 2026-04-13 11:43:26 +08:00
Jiajun Ji cb03958b52 [XPU] Refactor get_padding_offset to single kernel. (#7029)
* [XPU] Refactor get_padding_offset to single kernel.

* add unittest.

* fix codestyle.

* remove cum_offsets_now.

* remove max_len.
2026-04-13 11:04:50 +08:00
Jiang-Jia-Jun 26d6a20c2f [Optim] Remove IPCLock between CacheManager and WorkerProcess (#7299)
* [Optim] Remove IPCLock between CacheManager and WorkerProcess

* Update envs.py

* Update worker_process.py

---------

Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com>
2026-04-12 13:59:34 +08:00
周周周 225fc8d222 use self.hidden_size not use self.fd_config.model_config.hidden_size (#7340) 2026-04-11 22:39:43 +08:00
chen 4982aa000e [RL] moe bf16 ep support paddle batch_gemm (#7337)
* moe bf16 ep support paddle batch_gemm
2026-04-11 21:51:12 +08:00
AIbin ba01d7a823 [Optimization] [OP] [Models] dsk del prefill mask (#7313)
* dsk del prefill mask

* dsk support 1M+ seq_len rope

* update rope tests
2026-04-11 19:32:27 +08:00
JYChen 076ab07528 [RL] change glm rope_emb calculation (#7316)
* change glm rope_emb calculation

* glm without EnforceFmulRN

* fix ci
2026-04-11 18:36:28 +08:00
sunxin 00005c92e0 [BugFix] Fix mtp empty run issue in overlap schedule and EP model (#7300) 2026-04-10 03:29:45 -07:00
zhangbo9674 627f0d9cc8 [RL] change rms norm for glm (#7269)
* change rms norm for glm

* refine code

* refine code

* refine code
2026-04-10 01:02:37 -07:00
K11OntheBoat 870dbac370 Use triton qk_norm both in Prefill and Decode (#7213)
Co-authored-by: "liuruian" <liuruian@baidu.com>
2026-04-10 15:44:01 +08:00
bukejiyu 14d46181b8 [Loader] add multi-thread model loading (#6877)
* multi-thread-loader

* fix ut
2026-04-09 23:40:15 -07:00
GoldPancake c1fb3112f8 [FDConfig] Support CLI args for quantization params and add cudagraph validation (#7281)
* refactor quant cli param
2026-04-10 14:13:42 +08:00
lizexu123 613f92ee8f [Feature] support nvfp4 tbo (#7259) 2026-04-09 17:29:39 +08:00
fxyfxy777 39ff38aba1 [OP]Unify MoE op with moe_permute path for bf16 GLM (#7164) 2026-04-09 16:17:56 +08:00
xiaoxiaohehe001 51efe27d76 [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn (#7210)
* [BugFix] fix_flash_mask_attn_sm90

* [BugFix] fix_flash_mask_attn_sm90

* [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn

* [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn
2026-04-09 11:05:10 +08:00
JYChen 43ace7af25 [RL] support moe-topk use topk_reduce_func (#7218)
* support moe-topk use topk_reduce_func

* fix ep error

* fix ut

* fix ut
2026-04-09 11:01:03 +08:00