FastDeploy

mirror of https://github.com/PaddlePaddle/FastDeploy.git synced 2026-04-23 00:17:25 +08:00

Author	SHA1	Message	Date
Yuanle Liu	0ddb6e461c	[Optimization] Remove the upper limit on num_blocks (#7241 )	2026-04-13 07:07:41 -07:00
周周周	73bd4ab318	[Feature] Add explicit hidden_size parameter support to FusedMoE (#7361 )	2026-04-13 20:24:58 +08:00
freeliuzc	31e2a8bbad	[Speculative Decoding] Support mtp super ultra overlap in pd-split mode with insert_task overlap (#7323 ) * support mtp overlap in pd-split mode with insert_task overlap	2026-04-13 19:41:17 +08:00
AIbin	1fb8194191	[OP][Models][Optimization] Optimize the RoPE CUDA kernel and update the DeepSeek V3 configuration (#7359 ) * dsk del prefill mask * dsk support 1M+ seq_len rope * update rope tests * Replace max_position_embeddings with max_model_len * 1D grid: gridDim.x has a maximum size of 2^31-1, far exceeding the actual number of tokens.	2026-04-13 19:12:36 +08:00
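Note on the 1D-grid remark in 1fb8194191: the arithmetic behind it can be checked with a few lines. This is only a rough sketch, not code from the kernel. The 2^31-1 limit is the one quoted in the commit message; the block size of 256 threads, the one-thread-per-token mapping, and the function names are assumptions made for illustration.

```python
# Limit on gridDim.x as quoted in the commit message.
MAX_GRID_DIM_X = 2**31 - 1


def blocks_needed(num_tokens: int, threads_per_block: int = 256) -> int:
    """Ceiling division: blocks required for a 1D launch covering num_tokens.

    threads_per_block=256 and one thread per token are illustrative
    assumptions, not values taken from the FastDeploy kernel.
    """
    return (num_tokens + threads_per_block - 1) // threads_per_block


def fits_in_1d_grid(num_tokens: int, threads_per_block: int = 256) -> bool:
    """True when the block count stays within the gridDim.x limit."""
    return blocks_needed(num_tokens, threads_per_block) <= MAX_GRID_DIM_X


for n in (4096, 1_000_000, 4_000_000):
    print(n, blocks_needed(n), fits_in_1d_grid(n))
```

Under these assumptions even a multi-million-token batch needs only a few thousand blocks, which is the sense in which the limit far exceeds the actual number of tokens.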
周周周	a6f0055d51	add ips check (#7352 ) * commit * commit --------- Co-authored-by: “liuruian” <liuruian@baidu.com>	2026-04-13 15:24:22 +08:00
liuruyan	b34708604c	[TI-consistent] support quant use pow2scale (#7308 ) * support quant use pow2scale * fix * fix	2026-04-13 00:01:53 -07:00
AIbin	6213ad5340	[Docs][BugFix] fix mla log (#7243 ) * [Docs] Fix Chinese punctuation issues	2026-04-13 12:15:43 +08:00
Nyako Shigure	d659099415	[Cleanup] Replace torch proxy alias with public compat API (#7348 )	2026-04-13 11:43:26 +08:00
Jiajun Ji	cb03958b52	[XPU] Refactor get_padding_offset to single kernel. (#7029 ) * [XPU] Refactor get_padding_offset to single kernel. * add unittest. * fix codestyle. * remove cum_offsets_now. * remove max_len.	2026-04-13 11:04:50 +08:00
Jiang-Jia-Jun	26d6a20c2f	[Optim] Remove IPCLock between CacheManager and WorkerProcess (#7299 ) * [Optim] Remove IPCLock between CacheManager and WorkerProcess * Update envs.py * Update worker_process.py --------- Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com>	2026-04-12 13:59:34 +08:00
周周周	225fc8d222	use self.hidden_size not use self.fd_config.model_config.hidden_size (#7340 )	2026-04-11 22:39:43 +08:00
chen	4982aa000e	[RL]moe bf16 ep support paddle batch_gemm (#7337 ) * moe bf16 ep support paddle batch_gemm	2026-04-11 21:51:12 +08:00
AIbin	ba01d7a823	[Optimization] [OP] [Models] dsk del prefill mask (#7313 ) * dsk del prefill mask * dsk support 1M+ seq_len rope * update rope tests	2026-04-11 19:32:27 +08:00
JYChen	076ab07528	[RL] change glm rope_emb calculation (#7316 ) * change glm rope_emb calculation * glm without EnforceFmulRN * fix ci	2026-04-11 18:36:28 +08:00
sunxin	00005c92e0	[BugFix] Fix mtp empty run issue in overlap schedule and EP model (#7300 )	2026-04-10 03:29:45 -07:00
zhangbo9674	627f0d9cc8	[RL] change rms norm for glm (#7269 ) * change rms norm for glm * refine code * refine code * refine code	2026-04-10 01:02:37 -07:00
K11OntheBoat	870dbac370	Use triton qk_norm both in Prefill and Decode (#7213 ) Co-authored-by: “liuruian” <liuruian@baidu.com>	2026-04-10 15:44:01 +08:00
bukejiyu	14d46181b8	[Loader] add multi-thread model loading (#6877 ) * multi-thread-loader * fix ut	2026-04-09 23:40:15 -07:00
GoldPancake	c1fb3112f8	[FDConfig] Support CLI args for quantization params and add cudagraph validation (#7281 ) * refactor quant cli param	2026-04-10 14:13:42 +08:00
lizexu123	613f92ee8f	[Feature] support nvfp4 tbo (#7259 )	2026-04-09 17:29:39 +08:00
fxyfxy777	39ff38aba1	[OP]Unify MoE op with moe_permute path for bf16 GLM (#7164 )	2026-04-09 16:17:56 +08:00
xiaoxiaohehe001	51efe27d76	[BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn (#7210 ) * [BugFix] fix_flash_mask_attn_sm90 * [BugFix] fix_flash_mask_attn_sm90 * [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn * [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn	2026-04-09 11:05:10 +08:00
JYChen	43ace7af25	[RL] support moe-topk use topk_reduce_func (#7218 ) * support moe-topk use topk_reduce_func * fix ep error * fix ut * fix ut	2026-04-09 11:01:03 +08:00
ShaneGZhu	7005404ce3	[DeepSeekV3.2][Graph Optimization]Remove synchronous operation to avoid capture fail and unnecessary contiguous in DSA Backend (#7253 ) * Delete contiguous ops. * fix scale * Delete unnecessary comments * fix style	2026-04-09 11:00:13 +08:00
AIbin	48d2bbeb74	fix dsa (#7252 )	2026-04-08 20:21:38 +08:00
Longzhi Wang	b262419db1	Revert "[Other] support video_fps args for video bench (#7077 )" (#7254 ) This reverts commit `938e7dd881`. Co-authored-by: TBD1 <798934910@qq.com>	2026-04-08 20:13:57 +08:00
chenjian	427efadaee	[Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1 (#7159 ) * [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1 * [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1 * fix	2026-04-08 19:30:54 +08:00
Jiajun Ji	9b970de029	[XPU] Add TP broadcast after sampling in XPU model runner to ensure consistent results across ranks. (#7096 )	2026-04-08 19:26:53 +08:00
3em0	3749457476	[BugFix] fix multimodal hasher hash collision risk when ndarray shape or dtype differs (#7185 ) numpy tobytes() only serializes raw element bytes without encoding shape or dtype metadata. This means arrays with identical raw bytes but different shapes (e.g. (6,4) vs (4,6)) or different dtypes (e.g. float32 vs uint8 reinterpretation of same memory) produce the same SHA-256 digest, leading to silent cache collisions in ProcessorCacheManager / EncoderCacheManager / PrefixCacheManager. Prepend a "{shape}|{dtype}|" header to the byte payload before hashing so that shape and dtype participate in the digest. Added test cases for shape and dtype sensitivity.	2026-04-08 04:26:02 -07:00
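Note on 3749457476: the collision described in the message is easy to reproduce outside FastDeploy. The sketch below is not the project's hasher; `hash_raw` and `hash_with_header` are hypothetical names, and the exact header formatting is an assumption based only on the "{shape}|{dtype}|" pattern quoted in the commit message.

```python
import hashlib

import numpy as np


def hash_raw(arr: np.ndarray) -> str:
    """Digest of the raw element bytes only; shape and dtype are ignored."""
    return hashlib.sha256(arr.tobytes()).hexdigest()


def hash_with_header(arr: np.ndarray) -> str:
    """Digest of a 'shape|dtype|' header followed by the raw element bytes."""
    header = f"{arr.shape}|{arr.dtype}|".encode("utf-8")
    return hashlib.sha256(header + arr.tobytes()).hexdigest()


a = np.arange(24, dtype=np.float32).reshape(6, 4)
b = a.reshape(4, 6)      # same bytes, different shape
c = a.view(np.uint8)     # same memory reinterpreted as uint8

# Raw-byte hashing cannot tell the three arrays apart.
print(hash_raw(a) == hash_raw(b) == hash_raw(c))

# With the header, shape and dtype take part in the digest.
print(len({hash_with_header(x) for x in (a, b, c)}))
```

The first print shows the three raw digests are equal; the second shows the header-prefixed digests are all distinct.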
RichardWooSJTU	771d42c90b	[TBO] Apply tbo to gpu_model_runner (#7165 ) * apply tbo in gpu_model_runner * fix	2026-04-08 16:55:17 +08:00
guozhuangzhuang	757bafe3bd	[Engine][DataProcessor] fix decode token (#7102 )	2026-04-08 15:41:32 +08:00
GoldPancake	aa23e0f966	remove arctic_inference deps (#7231 )	2026-04-08 15:25:14 +08:00
K11OntheBoat	bb48bcbaa2	Split enable_mm (#7183 ) Co-authored-by: liuruian <liuruian@MacBook-Pro.local>	2026-04-08 11:25:41 +08:00
luukunn	8496ec71a6	[DataProcessor] Move image_processor to unified directory and add MultiModalProcessor (#7109 ) * first commit * step 9~10 * update multimodal * update multimodal * fix load tokenizer * add unit test * fix unit test & AdaptiveImageProcessor * Delete unused code	2026-04-08 10:16:27 +08:00
GoldPancake	9d4fd19c3f	[Speculative Decoding] Auto-scale CUDA graph capture sizes for speculative decoding (#7215 )	2026-04-07 20:22:28 +08:00
lizhenyun01	446b26bbc0	[Feature] support blackwell gemm in ht (#7053 ) * [Feature] support blackwell gemm in ht * [Feature] support ops for convert * fix cuda error 716 * fix cuda error * opt memory * remove unused code	2026-04-07 19:52:51 +08:00
cloudforge1	c529c2ad98	[Optimization]【Hackathon 10th Spring No.49】GPU ngram_match: BlockScan Phase 2 -optimized (#7136 ) * Port ngram_match and hybrid_mtp_ngram kernels to CUDA Replace CPU n-gram matching kernels with GPU CUDA kernels to eliminate CPU↔GPU data transfer overhead in speculative decoding. Key changes: - ngram_match.cc → ngram_match.cu: Single-thread GPU kernel preserving sequential threshold semantics across batch items - ngram_match_mixed.cu: Replace CPU function with __global__ kernel - ngram.py: Remove ~10 .cpu() tensor copies, pass GPU tensors directly - mtp.py: Remove .cpu()/.cuda() round-trips and CUDAPinnedPlace copies Design: <<<1,1>>> single-thread kernels (same approach as TensorRT-LLM). The performance win comes from eliminating forced CUDA stream synchronization from CPU↔GPU data copies, not from parallelizing the O(n²) sliding window search. * Add correctness + latency test for GPU ngram kernels * Fix test data: step_idx semantics and ngram-matchable patterns * fix: add CPU fallback path for ngram_match and hybrid_mtp_ngram ops Restore backward compatibility with existing CPU-only operator tests (test_ngram_match.py, test_hybrid_mtp_ngram.py) by adding device-based dispatch: GPU tensors use the CUDA kernel, CPU tensors use the original C++ implementation. * fix(test): wrap imported ops with staticmethod to prevent self-binding Python descriptor protocol passes 'self' as first arg when a function stored as class attribute is accessed via instance. Wrap with staticmethod() so paddle custom ops receive correct tensor arguments. * fix(test): ensure max_model_len >= input_len to prevent broadcast error in latency test * fix: keep input_ids_len on CPU in __init__, move to GPU in _run_impl Reverts line 39 to match develop (keeps .cpu()) so diff-cover no longer flags it as an uncovered changed line. The tensor is moved to GPU via .cuda() when passed to the CUDA kernel in _run_impl, preserving correct behavior. 
* Extract shared ngram search into __device__ helper (ngram_match_common.cuh) Per upstream requirement: 'The two kernels share fairly similar logic; structure them as extracted shared matching logic plus business-specific logic.' The core ngram sliding-window search + token copy logic is now defined once in ngram_match_common.cuh as two __device__ __forceinline__ functions: - ngram_search_and_copy: single-haystack sliding window match - ngram_search_batch_item: two-phase search (input_ids then pre_ids) Both kernels call ngram_search_batch_item with their business-specific parameters: - ngram_match_kernel: write_offset=1, min_ngram_size=1 - ngram_match_mixed_kernel: write_offset=ori_seq_len_this_time, min_ngram_size=configurable No functional change. CPU fallback paths unchanged. * refactor: parallel CUDA kernels for ngram_match (<<<bsz,256>>> search) Two-phase parallel architecture addressing reviewer feedback: - Phase 1: <<<bsz, 256>>> — parallel sliding-window ngram search using atomicMin64 CAS loop for leftmost-match semantics - Phase 2: <<<1, 1>>> — serial threshold + token copy (inter-batch dependency via running sum of seq_lens_this_time) Phase 1 is O(bsz × seq_len × ngram_size) distributed across bsz × 256 threads. Phase 2 is O(bsz × max_draft_tokens) — negligible. Shared code extracted into ngram_match_common.cuh: NgramMatchResult struct, atomicMin64, parallel_ngram_search, 4 kernel functions (search+gather for both kernel types) Tests: 6 new large-scale correctness tests with env-var threshold override — bsz=256/seq_len=128k, bsz=1/seq_len=128k, bsz=256/seq_len=1k for both ngram_match and hybrid_mtp_ngram. * fix: move __global__ kernel defs from .cuh to .cu files (fix linker multiple-def error) Both ngram_match.cu and ngram_match_mixed.cu include ngram_match_common.cuh. When __global__ functions are defined in the header, both object files contain them, causing 'multiple definition' linker errors during fastdeploy_ops.so link. 
Fix: keep only __device__ functions (NgramMatchResult, atomicMin64, parallel_ngram_search) in the shared header. Move __global__ kernel definitions into each respective .cu file. Net code change: +304/-304 (zero net lines). * fix: align mixed kernel signatures with host function tensors Fix 7 type-mismatch compilation errors in ngram_match_mixed.cu: - Search kernel: replace seq_lens_encoder/decoder with seq_lens_this_time (host function does not have seq_lens_encoder tensor) - Gather kernel: remove seq_lens_encoder param, compute ori_seq_len_this_time per-batch from seq_lens_this_time (matches CPU path logic) - Fix max_draft_tokens computation to match CPU path formula - Fix skip condition to match CPU path: ori_seq_len_this_time==0 || max_draft_tokens<=0 * 【Hackathon 9th No.49】Replace serial Phase 2 with CUB BlockScan parallel threshold Phase 2 gather kernel now launches <<<1, 1024>>> threads with CUB BlockScan prefix-sum for parallel threshold enforcement, replacing the serial <<<1,1>>> loop. Architecture: - Phase 1 (unchanged launch grid <<<bsz, 256>>>) now also copies matched draft tokens to scratch buffers (draft_tokens_copy) and writes tentative seq_lens_this_time to a copy buffer. - Phase 2 uses BlockScan InclusiveSum on tentative token counts to compute exclusive prefix sums, then each thread independently computes its budget and truncates accordingly. Both ngram_match.cu and ngram_match_mixed.cu updated. Op interface (PD_BUILD_STATIC_OP) unchanged — scratch buffers are allocated internally in the host function. 
* fix: resolve Copilot/bot review comments on PR #7136 - Remove dead NgramMatchResult writes from both Phase 1 kernels - Fix encoder-active init: default seq_lens_this_time_copy=0, set 1 for active - Add remaining_active budget deduction to mixed gather kernel (parity) - Add PD_CHECK(max_batch_size <= NGRAM_GATHER_THREADS) to both host functions - Remove unused match_buf/match_results allocation from both host functions - Pass seq_lens_encoder to Phase 2 gather for encoder-active skip - clang-format applied * test: add multi-scale latency benchmark (batch 32→1024) Adds test_latency_scaling that benchmarks GPU kernel vs CPU path at batch sizes 32, 128, 256, 512, 1024 with input_len=512. Shows Phase 2 BlockScan scaling and per-batch-item amortization. * cleanup: remove unused kernel params, dead struct, add benchmark env gate - Remove unused max_draft_tokens_param from ngram_match_search_kernel (draft_token_num[batch_idx] already covers the constraint) - Remove unused seq_lens_decoder from ngram_match_mixed_search_kernel (only used in gather kernel, not search kernel) - Remove dead NgramMatchResult struct from ngram_match_common.cuh - Add BENCHMARK_NGRAM env gate to test_latency and test_latency_scaling (prevents benchmark tests from inflating CI runtime) * revert: remove benchmark env gate — let CI run benchmarks * fix: address Copilot review — GPU mirror for input_ids_len, device fix in mtp, benchmark timing isolation * fix: correct stale comment in mixed gather (at-least-ori → 1-token) * bench: add 5-group benchmark matching NKNaN methodology Groups: seq_len, batch_size, ngram hit pattern, threshold, threshold×batch. Data creation outside timing loop. GPU kernel vs CPU-copy path. 
* fix: rename benchmark for CI discovery, bump to 10k iterations - Renamed benchmark_ngram_kernel.py → test_benchmark_ngram_kernel.py so pytest discovers it (test_*.py pattern) - Bumped NUM_ITERS 10→10000, WARMUP 2→5 for noise-free profiling - Gated benchmark class with RUN_NGRAM_BENCHMARKS=1 (won't bloat CI) fix: correct stale filename in benchmark docstring * fix: move PD_CHECK before Phase 1 launch (fail-fast) * bench: remove env-gate from benchmark groups, cut NUM_ITERS to 1000 Benchmark groups 1-5 now run unconditionally in CI (~9s total). Env-gates moved to separate PR #7170. * fix: address Copilot review — conditional return, defensive guards, GPU placement - ngram_match.cu: add remaining<=0 early return, conditional return only when tokens produced (matches CPU continue behavior), include encoder-active items in Phase 2 threshold-budget scan - ngram_match_mixed.cu: split max_draft_tokens into explicit steps to prevent negative intermediates, conditional return only when tokens produced, add seq_lens_decoder invariant comment - ngram.py: explicit .cuda() on input_ids_len_gpu creation - test_ngram_gpu_kernel.py: use CPUPlace() in latency benchmark to measure actual D2H/H2D roundtrip * fix: clarify CAS comment, fix negative intermediate in CPU fallback - Add CAS non-atomic initial read comment in atomicMin64 (#3031826678) - Split draft_budget into explicit int64_t steps in CPU fallback (#3031240456) * perf: A1 (1024 threads) + A2 (early-exit) + fix B1 UB in ngram_match - NGRAM_BLOCK_THREADS 256→1024: 4× thread parallelism per block - Add early-exit break when position exceeds current best match - Fix __ballot_sync UB: was inside divergent if(match) + loop break, revert to plain atomicMin64 (contention-free since matches are rare) - Update stale '256 threads' comments in both .cu files * perf: template-specialize ngram search + cache scratch buffers + fix benchmark Kernel optimizations: - Template-specialize parallel_ngram_search for ngram_size 1,2,3: 
register-cached ngram tokens, #pragma unroll, __restrict__ hints - Cache Phase 1→2 scratch buffers (grow-only static paddle::Tensor) to eliminate per-call paddle::empty allocation overhead Benchmark fix: - Pre-allocate output tensors once, use fill_() in timing loop instead of creating new paddle.zeros/ones each iteration (removes ~20-40µs measurement noise per iteration) --------- Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com>	2026-04-07 01:36:25 -07:00
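Note on c529c2ad98: the commit replaces a serial Phase 2 loop with a prefix-sum-based threshold. A simplified sketch of why the two give the same result is below. It is not the CUDA kernel: `itertools.accumulate` stands in for CUB BlockScan InclusiveSum, the function names are hypothetical, and the rules the message mentions elsewhere (encoder-active items, the remaining_active deduction, the minimum token per item) are deliberately left out. Token counts are assumed to be non-negative.

```python
from itertools import accumulate


def truncate_serial(tentative, threshold):
    """Reference: walk the batch in order, keeping a running total."""
    out, used = [], 0
    for n in tentative:
        take = max(0, min(n, threshold - used))
        out.append(take)
        used += take
    return out


def truncate_with_prefix_sum(tentative, threshold):
    """Each item derives its own budget from an exclusive prefix sum.

    The per-item work depends only on the scan result, so on a GPU it
    could run one item per thread after a block-wide scan.
    """
    inclusive = list(accumulate(tentative))  # inclusive prefix sums
    out = []
    for n, inc in zip(tentative, inclusive):
        exclusive = inc - n                  # tokens requested before this item
        budget = max(0, threshold - exclusive)
        out.append(min(n, budget))
    return out


counts = [3, 5, 2, 4]
print(truncate_serial(counts, 9))
print(truncate_with_prefix_sum(counts, 9))
```

Both functions return the same list here because, once the tentative running total passes the threshold, every later item sees a budget of zero in either formulation.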
Nana	367d37b523	fix typo (#7147 )	2026-04-07 16:30:32 +08:00
sunxin	ae2f9f4d22	[BugFix] Enable moe_gate_fp32 using FD_ENABLE_RL (#7130 ) * rl gate fp32 * clean	2026-04-06 21:07:38 -07:00
周周周	18f012457d	[OP][Optimization] Remove ENABLE_PREFILL template parameter in multi_query_append_attention_warp1_4_kernel (#7201 )	2026-04-07 11:21:57 +08:00
Bingoo	2068656a85	[Optimization] merge matmul and add (#6986 ) * merge matmul and add * modify format * using paddle.nn.functional.linear * using _C_ops.linear * using paddle.nn.functional.linear * add FLAGS_use_legacy_linear env var in test case * fix format * add assert and remove env * modify format * using matmul for no bias * modify accurate baseline	2026-04-03 18:02:03 +08:00
AIbin	1090f8b123	[Models]support GLM4.7 Flash && Ernie_MLA (#7139 ) * support GLM4.7 Flash && Ernie_MLA	2026-04-03 17:41:33 +08:00
lizexu123	5f612a348d	[BugFix] fix flashinfer-cutedsl moe nvfp4 (#7120 ) * fix nvfp4 * fix * add document * fix nvfp4 * support eb5 * support bka * support eb5 * support xpu * fix * fix * add import cutedsl * fix * fix * fix test * fix H-series GPUs * update document * fix * update document * update document * fix	2026-04-03 15:43:19 +08:00
huicongyao	095a11d932	fix MTP bugs in TP and overlap (#7172 ) * fix MTP bugs in TP and overlap * fix	2026-04-03 14:19:11 +08:00
Yonghua Li	3b8dac3b97	[BugFix] prevent requests from entering running state without a slot (#7141 ) * [fix] prevent requests from entering running state without a slot * [fix] count abort set * [fix] count preempted task in waiting list	2026-04-03 14:07:57 +08:00
jackyYang6	e3aed6de2f	fix oom bug, optimize async weight loading and update read step by yaml (#7171 )	2026-04-03 11:05:24 +08:00
jc	1cc0cf23c2	[BugFix] Set MC_MAX_MR_SIZE to avoid register hang in default (#7161 ) * Set MC_MAX_MR_SIZE to avoid register hang * Set MC_MAX_MR_SIZE to avoid register hang	2026-04-03 10:51:15 +08:00
chenjian	2632e6cf32	[Feature] Support chunk prefill disabled in scheduler v1 (#7152 )	2026-04-03 10:18:14 +08:00
luukunn	562fa31791	[BugFix]fix extract_tool_calls (#7154 ) * fix extract_tool_calls	2026-04-02 21:18:37 +08:00
Yonghua Li	98f3fc9267	[RL] [KVCache] let cache transfer managers update key prefix after weight update and add unit tests (#7083 ) * [test] add a few unit tests * [feat] update key prefix when model weights are updated * [test] try to fix test_worker_process	2026-04-02 19:58:41 +08:00

1997 Commits