FastDeploy

mirror of https://github.com/PaddlePaddle/FastDeploy.git synced 2026-04-23 00:17:25 +08:00

Author	SHA1	Message	Date
RuohengMa	9d3551cfbb	[XPU] add support for rope3d (#7518 ) * [XPU] add support for rope3d * support decoder --------- Co-authored-by: yinwei <yinwei_hust@163.com>	2026-04-21 13:39:00 +08:00
RuohengMa	cf5bc5e510	[XPU] fix bug and temporary fix for rope 3d (#7465 )	2026-04-20 09:51:27 +08:00
freeliuzc	22a4f6019d	[Speculative Decoding][BugFix] Fix apply repeat times penalty kernel and change spec default verify strategy (#7467 ) * fix repeat_time kernel and change default spec verify strategy * fix unit_test	2026-04-18 00:38:01 +08:00
GoldPancake	df3b4e12f4	[Speculative Decoding] Add MTP logprob support for PD disaggregation (#7442 ) * support mtp logprob in pd * fix * fix * fix * fix xpu bugs	2026-04-17 21:37:38 +08:00
ShaneGZhu	2d8338f9e4	[Optimization][DeepSeekV3.2]Reducing slot_mapping compute frequency from twice per layer to a single pre-processing step. (#7367 )	2026-04-16 19:54:12 +08:00
Jiajun Ji	29495b2cf1	[XPU] Unify Spec and non-spec branch.(#6947 ) (#7180 ) * [XPU] cherry-pick PR-6947 * [XPU] use unified_update_model_status. * refactor xpu_model_runner. * refactor sampler. * fix codestyle. * Fix XPU speculative decoding: rename output tensors to cu_seqlens_q_output/batch_id_per_token_output, correct WRAPPER_CHECK_PTR types, and fix dynamic gather shape in verify_draft_tokens path. * fix codestyle. * replace output_padding_offset with is_speculative flag in gather_next_token. * rename hiddden_states. * unify cu_seqlens_q_output and batch_id_per_token_output init. --------- Co-authored-by: cmcamdy <1027740945@qq.com>	2026-04-16 14:58:38 +08:00
RuohengMa	de0c5e68fb	[XPU] Split the block_attn operator into smaller operators (#6798 ) * spliced block_attn * adapt to latest vllm * fix unit tests * delete mtp+cudagraph 4 cards test * fix vl model * fix mtp * fix slot mapping	2026-04-16 14:28:40 +08:00
cmcamdy	13b9fe7299	[XPU] add verify draft tokens (#6947 ) * [XPU] add verify draft tokens * fix test * fix code style * use sync cpy * fix code style * fix kernel check * fix random seed * fix test * fix check * fix eos set * fix verify * fix verify	2026-04-15 10:18:33 +08:00
lonelygsh	e0a1653b26	[Speculate Decoding] Fix bug of reasoning_phase_token_constraint kernel (#7349 ) Co-authored-by: guanshihui] <guanshihui@baidu.com>	2026-04-14 20:57:11 +08:00
Echo-Nie	8819a039c9	[Others] Fix typo (#7280 ) * typo * typo * typo * typo	2026-04-14 17:28:22 +08:00
zhupengyang	27b00cf385	[XPU] glm-4.5-air (#7071 )	2026-04-14 11:31:49 +08:00
chen	26c47c2afc	update attn_mask_q 2 (#7371 )	2026-04-13 23:06:04 +08:00
lonelygsh	e83d45833f	[Speculate Decoding] Fix step_idx semantics in limit_thinking and set_stop_value kernels (#7166 ) - speculate_limit_thinking_content_length: update current_base_step to step_idx+1 (step_idx now records history count before current round); remove incorrect step_idx decrement on accept_num truncation; mark step_idx param as const. - speculate_set_stop_value_multi_seqs: fix can_stop gate to use step_idx_now+accept_num>=min_token_limit; fix skip check and pre_ids_idx formula (remove stale -accept_num offset); use <= condition so accept_idx maps directly to the accepted token that ends the stop sequence; fix accept_tokens index (remove -1). - Update unit tests for speculate_set_stop_value_multi_seqs kernel.	2026-04-13 20:53:42 +08:00
AIbin	1fb8194191	[OP][Models][Optimization] Optimize the RoPE CUDA kernel and update the DeepSeek V3 configuration (#7359 ) * dsk del prefill mask * dsk support 1M+ seq_len rope * update rope tests * Replace max_position_embeddings with max_model_len * 1D grid: gridDim.x has a maximum size of 2^31-1, far exceeding the actual number of tokens.	2026-04-13 19:12:36 +08:00
Jiajun Ji	cb03958b52	[XPU] Refactor get_padding_offset to single kernel. (#7029 ) * [XPU] Refactor get_padding_offset to single kernel. * add unittest. * fix codestyle. * remove cum_offsets_now. * remove max_len.	2026-04-13 11:04:50 +08:00
AIbin	ba01d7a823	[Optimization] [OP] [Models] dsk del prefill mask (#7313 ) * dsk del prefill mask * dsk support 1M+ seq_len rope * update rope tests	2026-04-11 19:32:27 +08:00
JYChen	076ab07528	[RL] change glm rope_emb calculation (#7316 ) * change glm rope_emb calculation * glm without EnforceFmulRN * fix ci	2026-04-11 18:36:28 +08:00
Jiaxin Sui	6e5de2fd6d	[XPU][CI]Update xtdk version in download_dependencies.sh (#7320 )	2026-04-11 00:26:48 +08:00
ming1753	734fbcffde	[BugFix] Fix Async D2H copy bug & flash mask atten cache V out of bound bug (#7221 )	2026-04-10 11:31:51 +08:00
fxyfxy777	39ff38aba1	[OP]Unify MoE op with moe_permute path for bf16 GLM (#7164 )	2026-04-09 16:17:56 +08:00
Jiaxin Sui	80d5d9fd32	[XPU][CI] lock xvllm version for fix bug (#7264 ) * Remove duplicate NICs from environment variables * Update version for xvllm in download_dependencies.sh	2026-04-09 12:44:27 +08:00
Bingoo	3d2326c1b9	[BugFix] detection jinja2 (#7251 ) * detection jinja2 * format	2026-04-09 11:30:16 +08:00
xiaoxiaohehe001	51efe27d76	[BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn (#7210 ) * [BugFix] fix_flash_mask_attn_sm90 * [BugFix] fix_flash_mask_attn_sm90 * [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn * [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn	2026-04-09 11:05:10 +08:00
AIbin	48d2bbeb74	fix dsa (#7252 )	2026-04-08 20:21:38 +08:00
Bingoo	043f2a16e3	support moe for sm103 (#7238 )	2026-04-08 15:52:39 +08:00
MingkunZhang	bb1f977c89	[Metax][Fix] add compilation option (#7209 )	2026-04-07 02:43:43 -07:00
cloudforge1	c529c2ad98	[Optimization]【Hackathon 10th Spring No.49】GPU ngram_match: BlockScan Phase 2 -optimized (#7136 ) * Port ngram_match and hybrid_mtp_ngram kernels to CUDA Replace CPU n-gram matching kernels with GPU CUDA kernels to eliminate CPU↔GPU data transfer overhead in speculative decoding. Key changes: - ngram_match.cc → ngram_match.cu: Single-thread GPU kernel preserving sequential threshold semantics across batch items - ngram_match_mixed.cu: Replace CPU function with __global__ kernel - ngram.py: Remove ~10 .cpu() tensor copies, pass GPU tensors directly - mtp.py: Remove .cpu()/.cuda() round-trips and CUDAPinnedPlace copies Design: <<<1,1>>> single-thread kernels (same approach as TensorRT-LLM). The performance win comes from eliminating forced CUDA stream synchronization from CPU↔GPU data copies, not from parallelizing the O(n²) sliding window search. * Add correctness + latency test for GPU ngram kernels * Fix test data: step_idx semantics and ngram-matchable patterns * fix: add CPU fallback path for ngram_match and hybrid_mtp_ngram ops Restore backward compatibility with existing CPU-only operator tests (test_ngram_match.py, test_hybrid_mtp_ngram.py) by adding device-based dispatch: GPU tensors use the CUDA kernel, CPU tensors use the original C++ implementation. * fix(test): wrap imported ops with staticmethod to prevent self-binding Python descriptor protocol passes 'self' as first arg when a function stored as class attribute is accessed via instance. Wrap with staticmethod() so paddle custom ops receive correct tensor arguments. * fix(test): ensure max_model_len >= input_len to prevent broadcast error in latency test * fix: keep input_ids_len on CPU in __init__, move to GPU in _run_impl Reverts line 39 to match develop (keeps .cpu()) so diff-cover no longer flags it as an uncovered changed line. The tensor is moved to GPU via .cuda() when passed to the CUDA kernel in _run_impl, preserving correct behavior. 
* Extract shared ngram search into __device__ helper (ngram_match_common.cuh) Per upstream requirement: 'The two kernels share fairly similar logic; the kernels should take the form of extracted common matching logic plus business-specific logic' The core ngram sliding-window search + token copy logic is now defined once in ngram_match_common.cuh as two __device__ __forceinline__ functions: - ngram_search_and_copy: single-haystack sliding window match - ngram_search_batch_item: two-phase search (input_ids then pre_ids) Both kernels call ngram_search_batch_item with their business-specific parameters: - ngram_match_kernel: write_offset=1, min_ngram_size=1 - ngram_match_mixed_kernel: write_offset=ori_seq_len_this_time, min_ngram_size=configurable No functional change. CPU fallback paths unchanged. * refactor: parallel CUDA kernels for ngram_match (<<<bsz,256>>> search) Two-phase parallel architecture addressing reviewer feedback: - Phase 1: <<<bsz, 256>>> — parallel sliding-window ngram search using atomicMin64 CAS loop for leftmost-match semantics - Phase 2: <<<1, 1>>> — serial threshold + token copy (inter-batch dependency via running sum of seq_lens_this_time) Phase 1 is O(bsz × seq_len × ngram_size) distributed across bsz × 256 threads. Phase 2 is O(bsz × max_draft_tokens) — negligible. Shared code extracted into ngram_match_common.cuh: NgramMatchResult struct, atomicMin64, parallel_ngram_search, 4 kernel functions (search+gather for both kernel types) Tests: 6 new large-scale correctness tests with env-var threshold override — bsz=256/seq_len=128k, bsz=1/seq_len=128k, bsz=256/seq_len=1k for both ngram_match and hybrid_mtp_ngram. * fix: move __global__ kernel defs from .cuh to .cu files (fix linker multiple-def error) Both ngram_match.cu and ngram_match_mixed.cu include ngram_match_common.cuh. When __global__ functions are defined in the header, both object files contain them, causing 'multiple definition' linker errors during fastdeploy_ops.so link. 
Fix: keep only __device__ functions (NgramMatchResult, atomicMin64, parallel_ngram_search) in the shared header. Move __global__ kernel definitions into each respective .cu file. Net code change: +304/-304 (zero net lines). * fix: align mixed kernel signatures with host function tensors Fix 7 type-mismatch compilation errors in ngram_match_mixed.cu: - Search kernel: replace seq_lens_encoder/decoder with seq_lens_this_time (host function does not have seq_lens_encoder tensor) - Gather kernel: remove seq_lens_encoder param, compute ori_seq_len_this_time per-batch from seq_lens_this_time (matches CPU path logic) - Fix max_draft_tokens computation to match CPU path formula - Fix skip condition to match CPU path: ori_seq_len_this_time==0 || max_draft_tokens<=0 * 【Hackathon 9th No.49】Replace serial Phase 2 with CUB BlockScan parallel threshold Phase 2 gather kernel now launches <<<1, 1024>>> threads with CUB BlockScan prefix-sum for parallel threshold enforcement, replacing the serial <<<1,1>>> loop. Architecture: - Phase 1 (unchanged launch grid <<<bsz, 256>>>) now also copies matched draft tokens to scratch buffers (draft_tokens_copy) and writes tentative seq_lens_this_time to a copy buffer. - Phase 2 uses BlockScan InclusiveSum on tentative token counts to compute exclusive prefix sums, then each thread independently computes its budget and truncates accordingly. Both ngram_match.cu and ngram_match_mixed.cu updated. Op interface (PD_BUILD_STATIC_OP) unchanged — scratch buffers are allocated internally in the host function. 
* fix: resolve Copilot/bot review comments on PR #7136 - Remove dead NgramMatchResult writes from both Phase 1 kernels - Fix encoder-active init: default seq_lens_this_time_copy=0, set 1 for active - Add remaining_active budget deduction to mixed gather kernel (parity) - Add PD_CHECK(max_batch_size <= NGRAM_GATHER_THREADS) to both host functions - Remove unused match_buf/match_results allocation from both host functions - Pass seq_lens_encoder to Phase 2 gather for encoder-active skip - clang-format applied * test: add multi-scale latency benchmark (batch 32→1024) Adds test_latency_scaling that benchmarks GPU kernel vs CPU path at batch sizes 32, 128, 256, 512, 1024 with input_len=512. Shows Phase 2 BlockScan scaling and per-batch-item amortization. * cleanup: remove unused kernel params, dead struct, add benchmark env gate - Remove unused max_draft_tokens_param from ngram_match_search_kernel (draft_token_num[batch_idx] already covers the constraint) - Remove unused seq_lens_decoder from ngram_match_mixed_search_kernel (only used in gather kernel, not search kernel) - Remove dead NgramMatchResult struct from ngram_match_common.cuh - Add BENCHMARK_NGRAM env gate to test_latency and test_latency_scaling (prevents benchmark tests from inflating CI runtime) * revert: remove benchmark env gate — let CI run benchmarks * fix: address Copilot review — GPU mirror for input_ids_len, device fix in mtp, benchmark timing isolation * fix: correct stale comment in mixed gather (at-least-ori → 1-token) * bench: add 5-group benchmark matching NKNaN methodology Groups: seq_len, batch_size, ngram hit pattern, threshold, threshold×batch. Data creation outside timing loop. GPU kernel vs CPU-copy path. 
* fix: rename benchmark for CI discovery, bump to 10k iterations - Renamed benchmark_ngram_kernel.py → test_benchmark_ngram_kernel.py so pytest discovers it (test_*.py pattern) - Bumped NUM_ITERS 10→10000, WARMUP 2→5 for noise-free profiling - Gated benchmark class with RUN_NGRAM_BENCHMARKS=1 (won't bloat CI) fix: correct stale filename in benchmark docstring * fix: move PD_CHECK before Phase 1 launch (fail-fast) * bench: remove env-gate from benchmark groups, cut NUM_ITERS to 1000 Benchmark groups 1-5 now run unconditionally in CI (~9s total). Env-gates moved to separate PR #7170. * fix: address Copilot review — conditional return, defensive guards, GPU placement - ngram_match.cu: add remaining<=0 early return, conditional return only when tokens produced (matches CPU continue behavior), include encoder-active items in Phase 2 threshold-budget scan - ngram_match_mixed.cu: split max_draft_tokens into explicit steps to prevent negative intermediates, conditional return only when tokens produced, add seq_lens_decoder invariant comment - ngram.py: explicit .cuda() on input_ids_len_gpu creation - test_ngram_gpu_kernel.py: use CPUPlace() in latency benchmark to measure actual D2H/H2D roundtrip * fix: clarify CAS comment, fix negative intermediate in CPU fallback - Add CAS non-atomic initial read comment in atomicMin64 (#3031826678) - Split draft_budget into explicit int64_t steps in CPU fallback (#3031240456) * perf: A1 (1024 threads) + A2 (early-exit) + fix B1 UB in ngram_match - NGRAM_BLOCK_THREADS 256→1024: 4× thread parallelism per block - Add early-exit break when position exceeds current best match - Fix __ballot_sync UB: was inside divergent if(match) + loop break, revert to plain atomicMin64 (contention-free since matches are rare) - Update stale '256 threads' comments in both .cu files * perf: template-specialize ngram search + cache scratch buffers + fix benchmark Kernel optimizations: - Template-specialize parallel_ngram_search for ngram_size 1,2,3: 
register-cached ngram tokens, #pragma unroll, __restrict__ hints - Cache Phase 1→2 scratch buffers (grow-only static paddle::Tensor) to eliminate per-call paddle::empty allocation overhead Benchmark fix: - Pre-allocate output tensors once, use fill_() in timing loop instead of creating new paddle.zeros/ones each iteration (removes ~20-40µs measurement noise per iteration) --------- Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com>	2026-04-07 01:36:25 -07:00
周周周	18f012457d	[OP][Optimization] Remove ENABLE_PREFILL template parameter in multi_query_append_attention_warp1_4_kernel (#7201 )	2026-04-07 11:21:57 +08:00
huicongyao	095a11d932	fix MTP bugs in TP and overlap (#7172 ) * fix MTP bugs in TP and overlap * fix	2026-04-03 14:19:11 +08:00
Yuanle Liu	1af7f80811	Revert "[BugFix][Speculative Decoding] Correct index calculation in speculate…" (#7133 ) This reverts commit `ba1aa1edff`.	2026-04-01 06:54:23 -07:00
lonelygsh	ba1aa1edff	[BugFix][Speculative Decoding] Correct index calculation in speculate decoding operators (#7121 ) - Fix accept_idx calculation in spec_set_value_by_stop_seqs - Fix condition check from < to <= for token matching - Fix accept_tokens indexing logic - Remove unnecessary -1 in current_step comparison for max_think_len Co-authored-by: guanshihui] <guanshihui@baidu.com>	2026-04-01 05:36:53 -07:00
cmcamdy	7a2e33098f	[XPU] Refactor pre process (#6993 ) * [XPU] support speculate_pre_process * merge develop * fix codestyle * fix mtp, support cu_seqlens_q_output * fix mtp, support cu_seqlens_q_output * fix test --------- Co-authored-by: lizan1999 <lizan03@baidu.com>	2026-04-01 20:29:55 +08:00
sunxin	c29e86fc9d	[Feature] Support mtp overlap schedule (#7001 )	2026-04-01 14:24:26 +08:00
周周周	fd44bb7cbf	cpmmot (#7105 ) Co-authored-by: “liuruian” <liuruian@baidu.com>	2026-03-31 16:13:44 +08:00
huicongyao	dd2aa10ed4	fix cuda graph capture failure in CI test (#7094 )	2026-03-31 11:05:51 +08:00
yzwu	8789329457	[Iluvatar] Support wi4a16 group_gemm (#7078 )	2026-03-30 19:03:51 +08:00
周周周	76cf5e9496	[append attention] clean code (#7062 )	2026-03-30 15:07:53 +08:00
mpgemm	1a1d048774	[Feature] Support NVFP4 Flashinfer-cutedsl MoE on SM100 (#6963 )	2026-03-30 11:37:04 +08:00
Longzhi Wang	2eea6fa97a	[BugFix] Fix kv cache int8 dynamic quant on flash and flash_mask backend (#7028 ) * [BugFix] Fix kv cache int8 dynamic quant on flash and flash_mask backend * add constexpr and code style clean * add test * fix code style * fix test	2026-03-30 11:17:15 +08:00
cmcamdy	bf8e9bf81d	[XPU] Fix speculate schedule (#7049 ) * [BugFix] xpu fix speculate schedule cache kernel * fix code style	2026-03-27 18:28:17 +08:00
fxyfxy777	8ff8236a6f	[Optimization] optimize fused_swiglu_fp8_quant_kernel (#7007 ) * use sharemem * B card test * fix acc error	2026-03-27 16:10:16 +08:00
huicongyao	25d64efdc4	[Speculative Decoding] Refactor Eagle MTP hidden states copy (#6812 ) * reformat eagle_get_hidden_states & eagle_get_self_hidden_states * readability * fix xpu bug * fix coverage failure * change launch params & parallelize position_map compute * Fix MTP-related bugs in FastDeploy centralized inference * fix * refactor mtp hidden_states process * fix * add unittest & optimize kernel * remove useless code * fix	2026-03-25 22:54:31 -07:00
chen	1502b6f43e	add instantiations for decoder rope enfore_fmul_rn=true (#7009 )	2026-03-25 22:22:10 +08:00
freeliuzc	7a6c28781b	[Speculative Decoding] Optimize attn_mask_offset and fix mtp bug (#7005 ) * optimize attn_mask_offset and optimize mtp usage * delete useless branch * fix kernel format * fix kernel runner	2026-03-25 01:52:06 -07:00
chen	c92e277cf1	[RL] RoPE without fmad opt (#6901 ) * env FD_ENABLE_RL=1 do fmul_rn(a*b) in rope	2026-03-24 21:19:53 +08:00
zhupengyang	5780345646	[XPU] fix speculate_verify (#6985 )	2026-03-24 18:55:09 +08:00
freeliuzc	e87ce4b8cd	[Speculative Decoding] refactor MTP and optimize spec-decoding postprocess (#6973 ) * support new mtp * refactor(speculate_decoding and mtp): optimize mtp structure logic. Update spec-branch status-process * fix cuda-graph for spec-decoding * fix xpu mtp and fix some note * fix unittest and optimize note * fix model status update in eos-branch	2026-03-24 10:19:01 +08:00
Ding	defaffd5fb	【Hackathon 10th Spring No.45】FastDeploy supports compilation on T4/V100 hardware -part (#6488 ) * fix(custom_ops): gate unsupported ops for sm70/sm75 build * fix(custom_ops): gate deepgemm exports to sm75+ only * [BugFix][OP] deduplicate CUDA sources to avoid moe_deepgemm multiple definition * revert two custom_ops files to 352f922f9	2026-03-23 19:16:23 +08:00
AIbin	bf7e2424d0	[Optimization][Feature]Supports multiple batches of DSK-DSA. (#6930 ) * support DSA_MUTI_BATCH * update test topk * update dsk-dsa	2026-03-20 15:59:22 +08:00
lizan1999	148eee84c6	[XPU] use quant2d_per_token for weight quant int8 && fix some XPU Kernel check (#6869 )	2026-03-17 19:44:48 +08:00
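
Note on commit c529c2ad98: its "fix(test): wrap imported ops with staticmethod" entry relies on Python's descriptor protocol, where a plain function stored as a class attribute receives the instance as its first argument when accessed through an instance. The sketch below illustrates only that general Python behavior. `fake_op`, `UnwrappedCase`, and `WrappedCase` are hypothetical names made up for this example and are not taken from the FastDeploy tests or from Paddle.

```python
def fake_op(x, y):
    """Hypothetical stand-in for an imported op taking two positional arguments."""
    return x + y


class UnwrappedCase:
    # A plain function stored on the class is bound to the instance on access,
    # so instance.op(1, 2) really calls fake_op(instance, 1, 2).
    op = fake_op


class WrappedCase:
    # staticmethod() disables the binding, so the arguments pass through unchanged.
    op = staticmethod(fake_op)


def call_unwrapped():
    """Call the unwrapped attribute through an instance and report what happens."""
    try:
        return UnwrappedCase().op(1, 2)
    except TypeError as exc:
        # The extra implicit 'self' makes the call arity wrong.
        return f"TypeError: {exc}"


def call_wrapped():
    """Call the staticmethod-wrapped attribute through an instance."""
    return WrappedCase().op(1, 2)


if __name__ == "__main__":
    print(call_unwrapped())
    print(call_wrapped())
```

Under this assumption the unwrapped call fails on argument count while the wrapped call behaves like a direct call to the function, which matches the symptom the commit message describes.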


449 Commits