FastDeploy

mirror of https://github.com/PaddlePaddle/FastDeploy.git synced 2026-04-24 01:29:57 +08:00

Author	SHA1	Message	Date
cloudforge1	c529c2ad98	[Optimization]【Hackathon 10th Spring No.49】GPU ngram_match: BlockScan Phase 2 -optimized (#7136 ) * Port ngram_match and hybrid_mtp_ngram kernels to CUDA Replace CPU n-gram matching kernels with GPU CUDA kernels to eliminate CPU↔GPU data transfer overhead in speculative decoding. Key changes: - ngram_match.cc → ngram_match.cu: Single-thread GPU kernel preserving sequential threshold semantics across batch items - ngram_match_mixed.cu: Replace CPU function with __global__ kernel - ngram.py: Remove ~10 .cpu() tensor copies, pass GPU tensors directly - mtp.py: Remove .cpu()/.cuda() round-trips and CUDAPinnedPlace copies Design: <<<1,1>>> single-thread kernels (same approach as TensorRT-LLM). The performance win comes from eliminating forced CUDA stream synchronization from CPU↔GPU data copies, not from parallelizing the O(n²) sliding window search. * Add correctness + latency test for GPU ngram kernels * Fix test data: step_idx semantics and ngram-matchable patterns * fix: add CPU fallback path for ngram_match and hybrid_mtp_ngram ops Restore backward compatibility with existing CPU-only operator tests (test_ngram_match.py, test_hybrid_mtp_ngram.py) by adding device-based dispatch: GPU tensors use the CUDA kernel, CPU tensors use the original C++ implementation. * fix(test): wrap imported ops with staticmethod to prevent self-binding Python descriptor protocol passes 'self' as first arg when a function stored as class attribute is accessed via instance. Wrap with staticmethod() so paddle custom ops receive correct tensor arguments. * fix(test): ensure max_model_len >= input_len to prevent broadcast error in latency test * fix: keep input_ids_len on CPU in __init__, move to GPU in _run_impl Reverts line 39 to match develop (keeps .cpu()) so diff-cover no longer flags it as an uncovered changed line. The tensor is moved to GPU via .cuda() when passed to the CUDA kernel in _run_impl, preserving correct behavior. * Extract shared ngram search into __device__ helper (ngram_match_common.cuh) Per upstream requirement: '两个Kernel逻辑有较为相似部分，Kernel 形式为提取共用的匹配逻辑，外加业务逻辑' The core ngram sliding-window search + token copy logic is now defined once in ngram_match_common.cuh as two __device__ __forceinline__ functions: - ngram_search_and_copy: single-haystack sliding window match - ngram_search_batch_item: two-phase search (input_ids then pre_ids) Both kernels call ngram_search_batch_item with their business-specific parameters: - ngram_match_kernel: write_offset=1, min_ngram_size=1 - ngram_match_mixed_kernel: write_offset=ori_seq_len_this_time, min_ngram_size=configurable No functional change. CPU fallback paths unchanged. * refactor: parallel CUDA kernels for ngram_match (<<<bsz,256>>> search) Two-phase parallel architecture addressing reviewer feedback: - Phase 1: <<<bsz, 256>>> — parallel sliding-window ngram search using atomicMin64 CAS loop for leftmost-match semantics - Phase 2: <<<1, 1>>> — serial threshold + token copy (inter-batch dependency via running sum of seq_lens_this_time) Phase 1 is O(bsz × seq_len × ngram_size) distributed across bsz × 256 threads. Phase 2 is O(bsz × max_draft_tokens) — negligible. Shared code extracted into ngram_match_common.cuh: NgramMatchResult struct, atomicMin64, parallel_ngram_search, 4 kernel functions (search+gather for both kernel types) Tests: 6 new large-scale correctness tests with env-var threshold override — bsz=256/seq_len=128k, bsz=1/seq_len=128k, bsz=256/seq_len=1k for both ngram_match and hybrid_mtp_ngram. * fix: move __global__ kernel defs from .cuh to .cu files (fix linker multiple-def error) Both ngram_match.cu and ngram_match_mixed.cu include ngram_match_common.cuh. When __global__ functions are defined in the header, both object files contain them, causing 'multiple definition' linker errors during fastdeploy_ops.so link. Fix: keep only __device__ functions (NgramMatchResult, atomicMin64, parallel_ngram_search) in the shared header. Move __global__ kernel definitions into each respective .cu file. Net code change: +304/-304 (zero net lines). * fix: align mixed kernel signatures with host function tensors Fix 7 type-mismatch compilation errors in ngram_match_mixed.cu: - Search kernel: replace seq_lens_encoder/decoder with seq_lens_this_time (host function does not have seq_lens_encoder tensor) - Gather kernel: remove seq_lens_encoder param, compute ori_seq_len_this_time per-batch from seq_lens_this_time (matches CPU path logic) - Fix max_draft_tokens computation to match CPU path formula - Fix skip condition to match CPU path: ori_seq_len_this_time==0 \|\| max_draft_tokens<=0 * 【Hackathon 9th No.49】Replace serial Phase 2 with CUB BlockScan parallel threshold Phase 2 gather kernel now launches <<<1, 1024>>> threads with CUB BlockScan prefix-sum for parallel threshold enforcement, replacing the serial <<<1,1>>> loop. Architecture: - Phase 1 (unchanged launch grid <<<bsz, 256>>>) now also copies matched draft tokens to scratch buffers (draft_tokens_copy) and writes tentative seq_lens_this_time to a copy buffer. - Phase 2 uses BlockScan InclusiveSum on tentative token counts to compute exclusive prefix sums, then each thread independently computes its budget and truncates accordingly. Both ngram_match.cu and ngram_match_mixed.cu updated. Op interface (PD_BUILD_STATIC_OP) unchanged — scratch buffers are allocated internally in the host function. * fix: resolve Copilot/bot review comments on PR #7136 - Remove dead NgramMatchResult writes from both Phase 1 kernels - Fix encoder-active init: default seq_lens_this_time_copy=0, set 1 for active - Add remaining_active budget deduction to mixed gather kernel (parity) - Add PD_CHECK(max_batch_size <= NGRAM_GATHER_THREADS) to both host functions - Remove unused match_buf/match_results allocation from both host functions - Pass seq_lens_encoder to Phase 2 gather for encoder-active skip - clang-format applied * test: add multi-scale latency benchmark (batch 32→1024) Adds test_latency_scaling that benchmarks GPU kernel vs CPU path at batch sizes 32, 128, 256, 512, 1024 with input_len=512. Shows Phase 2 BlockScan scaling and per-batch-item amortization. * cleanup: remove unused kernel params, dead struct, add benchmark env gate - Remove unused max_draft_tokens_param from ngram_match_search_kernel (draft_token_num[batch_idx] already covers the constraint) - Remove unused seq_lens_decoder from ngram_match_mixed_search_kernel (only used in gather kernel, not search kernel) - Remove dead NgramMatchResult struct from ngram_match_common.cuh - Add BENCHMARK_NGRAM env gate to test_latency and test_latency_scaling (prevents benchmark tests from inflating CI runtime) * revert: remove benchmark env gate — let CI run benchmarks * fix: address Copilot review — GPU mirror for input_ids_len, device fix in mtp, benchmark timing isolation * fix: correct stale comment in mixed gather (at-least-ori → 1-token) * bench: add 5-group benchmark matching NKNaN methodology Groups: seq_len, batch_size, ngram hit pattern, threshold, threshold×batch. Data creation outside timing loop. GPU kernel vs CPU-copy path. * fix: rename benchmark for CI discovery, bump to 10k iterations - Renamed benchmark_ngram_kernel.py → test_benchmark_ngram_kernel.py so pytest discovers it (test_.py pattern) - Bumped NUM_ITERS 10→10000, WARMUP 2→5 for noise-free profiling - Gated benchmark class with RUN_NGRAM_BENCHMARKS=1 (won't bloat CI) fix: correct stale filename in benchmark docstring * fix: move PD_CHECK before Phase 1 launch (fail-fast) * bench: remove env-gate from benchmark groups, cut NUM_ITERS to 1000 Benchmark groups 1-5 now run unconditionally in CI (~9s total). Env-gates moved to separate PR #7170. * fix: address Copilot review — conditional return, defensive guards, GPU placement - ngram_match.cu: add remaining<=0 early return, conditional return only when tokens produced (matches CPU continue behavior), include encoder-active items in Phase 2 threshold-budget scan - ngram_match_mixed.cu: split max_draft_tokens into explicit steps to prevent negative intermediates, conditional return only when tokens produced, add seq_lens_decoder invariant comment - ngram.py: explicit .cuda() on input_ids_len_gpu creation - test_ngram_gpu_kernel.py: use CPUPlace() in latency benchmark to measure actual D2H/H2D roundtrip * fix: clarify CAS comment, fix negative intermediate in CPU fallback - Add CAS non-atomic initial read comment in atomicMin64 (#3031826678) - Split draft_budget into explicit int64_t steps in CPU fallback (#3031240456) * perf: A1 (1024 threads) + A2 (early-exit) + fix B1 UB in ngram_match - NGRAM_BLOCK_THREADS 256→1024: 4× thread parallelism per block - Add early-exit break when position exceeds current best match - Fix __ballot_sync UB: was inside divergent if(match) + loop break, revert to plain atomicMin64 (contention-free since matches are rare) - Update stale '256 threads' comments in both .cu files * perf: template-specialize ngram search + cache scratch buffers + fix benchmark Kernel optimizations: - Template-specialize parallel_ngram_search for ngram_size 1,2,3: register-cached ngram tokens, #pragma unroll, __restrict__ hints - Cache Phase 1→2 scratch buffers (grow-only static paddle::Tensor) to eliminate per-call paddle::empty allocation overhead Benchmark fix: - Pre-allocate output tensors once, use fill_() in timing loop instead of creating new paddle.zeros/ones each iteration (removes ~20-40µs measurement noise per iteration) --------- Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com>	2026-04-07 01:36:25 -07:00
sunxin	c29e86fc9d	[Feature] Support mtp overlap schedule (#7001 )	2026-04-01 14:24:26 +08:00
huicongyao	25d64efdc4	[Speculative Decoding] Refactor Eagle MTP hidden states copy (#6812 ) * reformat eagle_get_hidden_states & eagle_get_self_hidden_states * readibility * fix xpu bug * fix coverage failure * change luanch params & parallelize position_map compute * Fix MTP-related bugs in FastDeploy centralized inference * fix * refactor mtp hidden_states process * fix * add unittest & optimize kernel * remove useless code * fix	2026-03-25 22:54:31 -07:00
freeliuzc	e87ce4b8cd	[Speculative Decoding] refactor MTP and optimize spec-decoding postprocess (#6973 ) * support new mtp * refactor(speculate_decoding and mtp): optimize mtp sturcture logic. Update spec-branch status-process * fix cuda-graph for spec-decoding * fix xpu mtp and fix some note * fix unittest and optmize note * fix model status update in eos-branch	2026-03-24 10:19:01 +08:00
Yonghua Li	7c8c0a3c02	[BugFix] replace ftok with custom_ftok in get_output/save_output ops (#6822 ) * [BugFix] replace ftok with custom_ftok in get_output/save_output ops * [Test] add unit test for custom_ftok * [Chore] create custom_ftok.h * [Chore] reorganize header file * [Fix] fix cache messager msg_queue_id+rank_id conflict	2026-03-16 14:22:18 +08:00
freeliuzc	cf7934a4b2	[Speculative Decoding] Unify Spec and non-spec branch (#6685 ) * optimize spec-inference architecture * delete debug log * optimize spec_method usage && fix unit_test * add claude unit-test skill * fix some ugly bug * enhance robustness and bounds check * unify method & spec_method to method to avoid bug * activate CI * fix unit test * Unify logprobs computation for naive and speculative decoding, fix CUDA kernel * fix logprob bug && optimize verify kernel * fix exist_decode() judge	2026-03-10 23:58:44 -07:00
wangyifei	b57c960837	cuda13.0, implement changes to CCCL (#6751 )	2026-03-10 16:47:02 +08:00
gongweibao	ddb06ff83f	init (#6642 ) Co-authored-by: gongweibao <gognweibao@baidu.com>	2026-03-04 21:55:31 +08:00
周周周	2b4748de4f	[MTP] refactor MTP pre_process (#6358 )	2026-02-09 10:47:15 +08:00
freeliuzc	9018ccf74e	[Speculative Decoding] Fix attn_mask_offset for multi-step MTP in mixed and PD-split modes (#5738 ) * fix attn_mask_offset in mtp with multi-step and pd-split-mode * fix xpu operater register * update pmtp multi-step mtp strategy in d-split -mode * add note * fix xpu register	2025-12-25 01:54:59 -08:00
freeliuzc	f44f4bafd1	support mtp in splitewise and scheduler_v1 mode (#4743 )	2025-11-03 10:07:15 +08:00
co63oc	30a1c1783f	rename eagle_get_base_model_hidden_states.cu (#3753 ) CE Compile Job / ce_job_pre_check (push) Has been cancelled Details CE Compile Job / print_ce_job_pre_check_outputs (push) Has been cancelled Details CE Compile Job / FD-Clone-Linux (push) Has been cancelled Details CE Compile Job / Show Code Archive Output (push) Has been cancelled Details CE Compile Job / BUILD_SM8090 (push) Has been cancelled Details CE Compile Job / BUILD_SM8689 (push) Has been cancelled Details CE Compile Job / CE_UPLOAD (push) Has been cancelled Details Deploy GitHub Pages / deploy (push) Has been cancelled Details	2025-09-07 10:24:58 +08:00
freeliuzc	88d44a2c93	support mtp in v1_scheduler mode (#3695 ) CE Compile Job / ce_job_pre_check (push) Has been cancelled Details CE Compile Job / print_ce_job_pre_check_outputs (push) Has been cancelled Details CE Compile Job / FD-Clone-Linux (push) Has been cancelled Details CE Compile Job / Show Code Archive Output (push) Has been cancelled Details CE Compile Job / BUILD_SM8090 (push) Has been cancelled Details CE Compile Job / BUILD_SM8689 (push) Has been cancelled Details CE Compile Job / CE_UPLOAD (push) Has been cancelled Details Deploy GitHub Pages / deploy (push) Has been cancelled Details Publish Job / publish_pre_check (push) Has been cancelled Details Publish Job / print_publish_pre_check_outputs (push) Has been cancelled Details Publish Job / FD-Clone-Linux (push) Has been cancelled Details Publish Job / Show Code Archive Output (push) Has been cancelled Details Publish Job / BUILD_SM8090 (push) Has been cancelled Details Publish Job / BUILD_SM8689 (push) Has been cancelled Details Publish Job / PADDLE_PYPI_UPLOAD_8090 (push) Has been cancelled Details Publish Job / PADDLE_PYPI_UPLOAD_8689 (push) Has been cancelled Details Publish Job / Run FastDeploy Unit Tests and Coverage (push) Has been cancelled Details Publish Job / Run FastDeploy LogProb Tests (push) Has been cancelled Details Publish Job / Extracted partial CE model tasks to run in CI. (push) Has been cancelled Details Publish Job / Run Base Tests (push) Has been cancelled Details Publish Job / Run Accuracy Tests (push) Has been cancelled Details Publish Job / Run Stable Tests (push) Has been cancelled Details CI Images Build / FD-Clone-Linux (push) Has been cancelled Details CI Images Build / Show Code Archive Output (push) Has been cancelled Details CI Images Build / CI Images Build (push) Has been cancelled Details CI Images Build / BUILD_SM8090 (push) Has been cancelled Details CI Images Build / Run FastDeploy Unit Tests and Coverage (push) Has been cancelled Details CI Images Build / Run FastDeploy LogProb Tests (push) Has been cancelled Details CI Images Build / Extracted partial CE model tasks to run in CI. (push) Has been cancelled Details CI Images Build / Run Base Tests (push) Has been cancelled Details CI Images Build / Run Accuracy Tests (push) Has been cancelled Details CI Images Build / Run Stable Tests (push) Has been cancelled Details CI Images Build / Publish Docker Images Pre Check (push) Has been cancelled Details	2025-09-04 17:39:59 +08:00
co63oc	d6369b4d51	fix typos (#3684 )	2025-09-01 17:50:17 +08:00
freeliuzc	52eda7fdb3	[Feature][MTP]support new speculative decoding method named hybrid mtp with ngram (#3610 )	2025-08-26 14:29:22 +08:00
freeliuzc	a12d0bc549	[Feature][MTP]update multi-draft-token strategy (#3369 ) * update multi-draft-token strategy * fix format --------- Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>	2025-08-18 13:59:56 +08:00
Zero Rains	25698d56d1	polish code with new pre-commit rule (#2923 )	2025-07-19 23:19:27 +08:00
Jiang-Jia-Jun	92c2cfa2e7	Sync v2.0 version of code to github repo	2025-06-29 23:29:37 +00:00
jiangjiajun	684703fd72	[LLM] First commit the llm deployment code	2025-06-09 19:20:15 +08:00

19 Commits