Commit Graph

102 Commits

Author SHA1 Message Date
GoldPancake df3b4e12f4 [Speculative Decoding] Add MTP logprob support for PD disaggregation (#7442)
* support mtp logprob in pd

* fix

* fix

* fix

* fix xpu bugs
2026-04-17 21:37:38 +08:00
RuohengMa de0c5e68fb [XPU] Split the block_attn operator into smaller operators (#6798)
* spliced block_attn

* adapt to latest vllm

* fix unit tests

* delete mtp+cudagraph 4 cards test

* fix vl model

* fix mtp

* fix slot mapping
2026-04-16 14:28:40 +08:00
Echo-Nie 8819a039c9 [Others] Fix typo (#7280)
* typo

* typo

* typo

* typo
2026-04-14 17:28:22 +08:00
freeliuzc 31e2a8bbad [Speculative Decoding] Support mtp super ultra overlap in pd-split mode with insert_task overlap (#7323)
* support mtp overlap in pd-split mode with insert_task overlap
2026-04-13 19:41:17 +08:00
GoldPancake aa23e0f966 remove arctic_inference deps (#7231) 2026-04-08 15:25:14 +08:00
K11OntheBoat bb48bcbaa2 Split enable_mm (#7183)
Co-authored-by: liuruian <liuruian@MacBook-Pro.local>
2026-04-08 11:25:41 +08:00
cloudforge1 c529c2ad98 [Optimization]【Hackathon 10th Spring No.49】GPU ngram_match: BlockScan Phase 2 -optimized (#7136)
* Port ngram_match and hybrid_mtp_ngram kernels to CUDA

Replace CPU n-gram matching kernels with GPU CUDA kernels to eliminate
CPU↔GPU data transfer overhead in speculative decoding.

Key changes:
- ngram_match.cc → ngram_match.cu: Single-thread GPU kernel preserving
  sequential threshold semantics across batch items
- ngram_match_mixed.cu: Replace CPU function with __global__ kernel
- ngram.py: Remove ~10 .cpu() tensor copies, pass GPU tensors directly
- mtp.py: Remove .cpu()/.cuda() round-trips and CUDAPinnedPlace copies

Design: <<<1,1>>> single-thread kernels (same approach as TensorRT-LLM).
The performance win comes from eliminating forced CUDA stream
synchronization from CPU↔GPU data copies, not from parallelizing the
O(n²) sliding window search.

* Add correctness + latency test for GPU ngram kernels

* Fix test data: step_idx semantics and ngram-matchable patterns

* fix: add CPU fallback path for ngram_match and hybrid_mtp_ngram ops

Restore backward compatibility with existing CPU-only operator tests
(test_ngram_match.py, test_hybrid_mtp_ngram.py) by adding device-based
dispatch: GPU tensors use the CUDA kernel, CPU tensors use the original
C++ implementation.

* fix(test): wrap imported ops with staticmethod to prevent self-binding

Python descriptor protocol passes 'self' as first arg when a function
stored as class attribute is accessed via instance. Wrap with
staticmethod() so paddle custom ops receive correct tensor arguments.

* fix(test): ensure max_model_len >= input_len to prevent broadcast error in latency test

* fix: keep input_ids_len on CPU in __init__, move to GPU in _run_impl

Reverts line 39 to match develop (keeps .cpu()) so diff-cover
no longer flags it as an uncovered changed line. The tensor is
moved to GPU via .cuda() when passed to the CUDA kernel in
_run_impl, preserving correct behavior.

* Extract shared ngram search into __device__ helper (ngram_match_common.cuh)

Per upstream requirement: '两个Kernel逻辑有较为相似部分,Kernel
形式为提取共用的匹配逻辑,外加业务逻辑'

The core ngram sliding-window search + token copy logic is now defined
once in ngram_match_common.cuh as two __device__ __forceinline__
functions:
  - ngram_search_and_copy: single-haystack sliding window match
  - ngram_search_batch_item: two-phase search (input_ids then pre_ids)

Both kernels call ngram_search_batch_item with their business-specific
parameters:
  - ngram_match_kernel: write_offset=1, min_ngram_size=1
  - ngram_match_mixed_kernel: write_offset=ori_seq_len_this_time,
    min_ngram_size=configurable

No functional change. CPU fallback paths unchanged.

* refactor: parallel CUDA kernels for ngram_match (<<<bsz,256>>> search)

Two-phase parallel architecture addressing reviewer feedback:
- Phase 1: <<<bsz, 256>>> — parallel sliding-window ngram search
  using atomicMin64 CAS loop for leftmost-match semantics
- Phase 2: <<<1, 1>>> — serial threshold + token copy (inter-batch
  dependency via running sum of seq_lens_this_time)

Phase 1 is O(bsz × seq_len × ngram_size) distributed across bsz × 256
threads.  Phase 2 is O(bsz × max_draft_tokens) — negligible.

Shared code extracted into ngram_match_common.cuh:
  NgramMatchResult struct, atomicMin64, parallel_ngram_search,
  4 kernel functions (search+gather for both kernel types)

Tests: 6 new large-scale correctness tests with env-var threshold
override — bsz=256/seq_len=128k, bsz=1/seq_len=128k, bsz=256/seq_len=1k
for both ngram_match and hybrid_mtp_ngram.

* fix: move __global__ kernel defs from .cuh to .cu files (fix linker multiple-def error)

Both ngram_match.cu and ngram_match_mixed.cu include ngram_match_common.cuh.
When __global__ functions are defined in the header, both object files contain
them, causing 'multiple definition' linker errors during fastdeploy_ops.so link.

Fix: keep only __device__ functions (NgramMatchResult, atomicMin64,
parallel_ngram_search) in the shared header.  Move __global__ kernel
definitions into each respective .cu file.

Net code change: +304/-304 (zero net lines).

* fix: align mixed kernel signatures with host function tensors

Fix 7 type-mismatch compilation errors in ngram_match_mixed.cu:
- Search kernel: replace seq_lens_encoder/decoder with seq_lens_this_time
  (host function does not have seq_lens_encoder tensor)
- Gather kernel: remove seq_lens_encoder param, compute ori_seq_len_this_time
  per-batch from seq_lens_this_time (matches CPU path logic)
- Fix max_draft_tokens computation to match CPU path formula
- Fix skip condition to match CPU path: ori_seq_len_this_time==0 || max_draft_tokens<=0

* 【Hackathon 9th No.49】Replace serial Phase 2 with CUB BlockScan parallel threshold

Phase 2 gather kernel now launches <<<1, 1024>>> threads with CUB
BlockScan prefix-sum for parallel threshold enforcement, replacing
the serial <<<1,1>>> loop.

Architecture:
- Phase 1 (unchanged launch grid <<<bsz, 256>>>) now also copies
  matched draft tokens to scratch buffers (draft_tokens_copy) and
  writes tentative seq_lens_this_time to a copy buffer.
- Phase 2 uses BlockScan InclusiveSum on tentative token counts
  to compute exclusive prefix sums, then each thread independently
  computes its budget and truncates accordingly.

Both ngram_match.cu and ngram_match_mixed.cu updated.
Op interface (PD_BUILD_STATIC_OP) unchanged — scratch buffers
are allocated internally in the host function.

* fix: resolve Copilot/bot review comments on PR #7136

- Remove dead NgramMatchResult writes from both Phase 1 kernels
- Fix encoder-active init: default seq_lens_this_time_copy=0, set 1 for active
- Add remaining_active budget deduction to mixed gather kernel (parity)
- Add PD_CHECK(max_batch_size <= NGRAM_GATHER_THREADS) to both host functions
- Remove unused match_buf/match_results allocation from both host functions
- Pass seq_lens_encoder to Phase 2 gather for encoder-active skip
- clang-format applied

* test: add multi-scale latency benchmark (batch 32→1024)

Adds test_latency_scaling that benchmarks GPU kernel vs CPU path at
batch sizes 32, 128, 256, 512, 1024 with input_len=512.
Shows Phase 2 BlockScan scaling and per-batch-item amortization.

* cleanup: remove unused kernel params, dead struct, add benchmark env gate

- Remove unused max_draft_tokens_param from ngram_match_search_kernel
  (draft_token_num[batch_idx] already covers the constraint)
- Remove unused seq_lens_decoder from ngram_match_mixed_search_kernel
  (only used in gather kernel, not search kernel)
- Remove dead NgramMatchResult struct from ngram_match_common.cuh
- Add BENCHMARK_NGRAM env gate to test_latency and test_latency_scaling
  (prevents benchmark tests from inflating CI runtime)

* revert: remove benchmark env gate — let CI run benchmarks

* fix: address Copilot review — GPU mirror for input_ids_len, device fix in mtp, benchmark timing isolation

* fix: correct stale comment in mixed gather (at-least-ori → 1-token)

* bench: add 5-group benchmark matching NKNaN methodology

Groups: seq_len, batch_size, ngram hit pattern, threshold, threshold×batch.
Data creation outside timing loop. GPU kernel vs CPU-copy path.

* fix: rename benchmark for CI discovery, bump to 10k iterations

- Renamed benchmark_ngram_kernel.py → test_benchmark_ngram_kernel.py
  so pytest discovers it (test_*.py pattern)
- Bumped NUM_ITERS 10→10000, WARMUP 2→5 for noise-free profiling
- Gated benchmark class with RUN_NGRAM_BENCHMARKS=1 (won't bloat CI)

* fix: correct stale filename in benchmark docstring

* fix: move PD_CHECK before Phase 1 launch (fail-fast)

* bench: remove env-gate from benchmark groups, cut NUM_ITERS to 1000

Benchmark groups 1-5 now run unconditionally in CI (~9s total).
Env-gates moved to separate PR #7170.

* fix: address Copilot review — conditional return, defensive guards, GPU placement

- ngram_match.cu: add remaining<=0 early return, conditional return
  only when tokens produced (matches CPU continue behavior), include
  encoder-active items in Phase 2 threshold-budget scan
- ngram_match_mixed.cu: split max_draft_tokens into explicit steps to
  prevent negative intermediates, conditional return only when tokens
  produced, add seq_lens_decoder invariant comment
- ngram.py: explicit .cuda() on input_ids_len_gpu creation
- test_ngram_gpu_kernel.py: use CPUPlace() in latency benchmark to
  measure actual D2H/H2D roundtrip

* fix: clarify CAS comment, fix negative intermediate in CPU fallback

- Add CAS non-atomic initial read comment in atomicMin64 (#3031826678)
- Split draft_budget into explicit int64_t steps in CPU fallback (#3031240456)

* perf: A1 (1024 threads) + A2 (early-exit) + fix B1 UB in ngram_match

- NGRAM_BLOCK_THREADS 256→1024: 4× thread parallelism per block
- Add early-exit break when position exceeds current best match
- Fix __ballot_sync UB: was inside divergent if(match) + loop break,
  revert to plain atomicMin64 (contention-free since matches are rare)
- Update stale '256 threads' comments in both .cu files

* perf: template-specialize ngram search + cache scratch buffers + fix benchmark

Kernel optimizations:
- Template-specialize parallel_ngram_search for ngram_size 1,2,3:
  register-cached ngram tokens, #pragma unroll, __restrict__ hints
- Cache Phase 1→2 scratch buffers (grow-only static paddle::Tensor)
  to eliminate per-call paddle::empty allocation overhead

Benchmark fix:
- Pre-allocate output tensors once, use fill_() in timing loop
  instead of creating new paddle.zeros/ones each iteration
  (removes ~20-40µs measurement noise per iteration)

---------

Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com>
2026-04-07 01:36:25 -07:00
cmcamdy 7a2e33098f [XPU] Refactor pre process (#6993)
* [XPU] support speculate_pre_process

* merge develop

* fix codestype

* fix mtp, support cu_seqlens_q_output

* fix mtp, support cu_seqlens_q_output

* fix test

---------

Co-authored-by: lizan1999 <lizan03@baidu.com>
2026-04-01 20:29:55 +08:00
sunxin c29e86fc9d [Feature] Support mtp overlap schedule (#7001) 2026-04-01 14:24:26 +08:00
huicongyao 25d64efdc4 [Speculative Decoding] Refactor Eagle MTP hidden states copy (#6812)
* reformat eagle_get_hidden_states & eagle_get_self_hidden_states

* readibility

* fix xpu bug

* fix coverage failure

* change luanch params & parallelize position_map compute

* Fix MTP-related bugs in FastDeploy centralized inference

* fix

* refactor mtp hidden_states process

* fix

* add unittest & optimize kernel

* remove useless code

* fix
2026-03-25 22:54:31 -07:00
freeliuzc 4fd877ed43 [Speculative Decoding] Support mtp expert-parallel and support different modality deploy (#7018)
* support mtp ep and support different modality

* fix default arg
2026-03-26 13:52:16 +08:00
freeliuzc 7a6c28781b [Speculative Decoding] Optimize attn_mask_offset and fix mtp bug (#7005)
* optimize attn_mask_offset and optimize mtp usage

* delete useless branch

* fix kernel format

* fix kernel runner
2026-03-25 01:52:06 -07:00
freeliuzc e87ce4b8cd [Speculative Decoding] refactor MTP and optimize spec-decoding postprocess (#6973)
* support new mtp

* refactor(speculate_decoding and mtp): optimize mtp sturcture logic. Update spec-branch status-process

* fix cuda-graph for spec-decoding

* fix xpu mtp and fix some note

* fix unittest and optmize note

* fix model status update in eos-branch
2026-03-24 10:19:01 +08:00
bukejiyu c62f6b4ea5 [Others] Fix PD reorder for MTP (#6792)
* fix pd reorder in mtp

* add ut

* update

* fix mtp
2026-03-23 21:10:22 +08:00
gongweibao a6351dea0b [BugFix][Optimization] Replace silent failures with catchable exceptions and informative error messages (#6533)
* init

* init

* fix format

* add

* add files

* add ut

* fix some

* add ut

* add more

* add

* fix pre-commit

* fix pre-commit

* fix cover

* skip long seq

* add

* add

* fix

* remove not need

* fix set attr

* fix comments

* fix comments

* fix failed tests

---------

Co-authored-by: gongweibao <gognweibao@baidu.com>
2026-03-16 21:32:43 +08:00
cmcamdy 7591e0d6bc fix eb5 mtp(mix) (#6800) 2026-03-13 17:36:57 +08:00
freeliuzc cf7934a4b2 [Speculative Decoding] Unify Spec and non-spec branch (#6685)
* optimize spec-inference architecture

* delete debug log

* optimize spec_method usage  && fix unit_test

* add claude unit-test skill

* fix some ugly bug

* enhance robustness and bounds check

* unify method & spec_method to method to avoid bug

* activate CI

* fix unit test

* Unify logprobs computation for naive and speculative decoding, fix CUDA kernel

* fix logprob bug && optimize verify kernel

* fix exist_decode() judge
2026-03-10 23:58:44 -07:00
Yuanle Liu 326b9755aa [BugFix][MTP] Skip empty_input_forward during dummy run (#6653)
When `is_dummy_run=True`, calling `empty_input_forward` can cause
unexpected behavior. Add `and not is_dummy_run` guard for both
`_propose_cuda` and `_propose_xpu` paths.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-04 23:51:56 -08:00
huicongyao 0f718baaf2 [Speculative Decoding]Reformat input preprocess for spec decode (#6501)
* add speculate_pre_process kernel

* reduce one slice

* make d2h async && fix mtp bug for new pre_process

* fix

* add unitest

* fix: code stype formatting

* fix

* fix: thread race in speculate_preprocess && rename d2h event
2026-03-03 10:22:07 +08:00
ming1753 344db8c8af [BugFix] Fix mtp when token_ids_all is None (#6591)
* [BugFix] Fix mtp when token_ids_all is None

* fix bug
2026-03-02 01:23:44 -08:00
ming1753 97eee75677 [Feature] GPU Memory Optimization and Retirement of V0 Scheduler (#6407)
* Optim GPU Mem Usage

---------

Co-authored-by: huzesen <huzesen@baidu.com>
2026-02-28 15:07:43 +08:00
cmcamdy 13447279aa [XPU] Fix PD + MTP (#6495)
* fix pd + mtp

* fix code style

* fix PD + MTP, D get P's first token

* add anno for gpu(speculate_update)

* update draft insertv1

* fix wapper & kernel

* fix wapper

* fix code stype
2026-02-27 19:07:35 +08:00
GoldPancake 2178f2829b [Speculative Decoding] Support suffix decoding (#6403)
* support suffix decoding
2026-02-26 11:42:05 +08:00
周周周 2b4748de4f [MTP] refactor MTP pre_process (#6358) 2026-02-09 10:47:15 +08:00
bukejiyu 12d4b4cb87 [Feature]Support reorder ids to split prefill and decodes (#5779)
* support reorder ids

* perfect code

* fix

* fix unittest

* delete code

* fix

* add python api

* delete custom op

* update algorithm

* fix swap

* support condense

* support condense

* support mtp

* delete code

* update

* update

* update

* update

* update for other platfrom

* update

* fix

* fix mtp

* fix ut

* update

* fix ut

* update ut

* fix

* fix encoder_cache

* fix ci

* fix

* fix vl

* Fix performance regression

* fix

* fix

* fix mtp

* fix index->req_id mapping

* fix ut

---------

Co-authored-by: root <root@yqlcc01-sys-rpm12rzmwjd.yqlcc01.baidu.com>
Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com”>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-02-03 00:28:02 -08:00
xiaozude 030647521a [Metax] adapt to the latest develop (#6282) 2026-01-29 23:21:20 -08:00
freeliuzc ce06c6dfb3 [BugFix] Fix token_penalty kernel (#6069)
* fix token_penalty kernel

* try to fix xpu

* fix xpu

* fix unit test
2026-01-28 12:03:05 +08:00
sunxin adc69c15d0 [Model Runner] Prepare token count and move FA3 initialization into the graph (#6170)
* prepare for token num and put FA3 init in graph
2026-01-26 12:16:57 +08:00
周周周 0966df78dc [Others] remove stop_nums (#6182) 2026-01-26 12:12:47 +08:00
Yonghua Li 833d00e2d7 [BugFix] move cache creation back to cache transfer process and adapt clear/update (#6144)
* [fix] move cache creation back to cache transfer process

* [fix] fix clear cache

* [chore] change some log level

* [fix] fix clear cache

* [fix] fix clear cache for blockwisefp8 and mtp

* [fix] fix c8

* [fix] fix clear_mtp_cache args

* [chore] update cache_transfer_manager

* [fix] fix update mtp cache
2026-01-24 21:59:13 +08:00
GoldPancake bda38aa519 [Speculative Decoding] Support MTP for GLM-4.5-Air (#6047)
* glm mtp
* add spec neox partial rope
2026-01-16 14:35:24 +08:00
xiaoluomi 62bd92f9ba dev_fix_mtp_forward_meta (#5976) 2026-01-10 00:40:56 +08:00
Yuanle Liu d4a386dfc4 Revert "Revert "[TSP] last_norm allgather move to model.py (#5924)" (#5961)" (#5972)
This reverts commit 8c3513a410.
2026-01-09 15:58:22 +08:00
Yuanle Liu 8c3513a410 Revert "[TSP] last_norm allgather move to model.py (#5924)" (#5961)
This reverts commit 2bb838fed9.
2026-01-09 15:20:40 +08:00
xiaoluomi 2bb838fed9 [TSP] last_norm allgather move to model.py (#5924)
* support_lastnorm_gather_split_dev

* support_lastnorm_gather_split_dev1

* support_lastnorm_gather_split_dev3

* support_lastnorm_gather_split_dev4

* support_lastnorm_gather_split_dev5
2026-01-07 23:36:33 -08:00
Yonghua Li 9fc2400e71 [BugFix] fix mtp cache attaching for pd disaggregation (#5884)
* [fix] fix mtp cache attaching for pd disaggregation

* [fix] fix test_mtp_proposer.py
2026-01-06 14:17:53 +08:00
freeliuzc ca574119e5 support multi-step draft-model with cudagraph (#5886) 2026-01-06 11:16:21 +08:00
cmcamdy 690d4bcdb0 [XPU] Speculative Decoding with PD (#5856)
* [XPU] Speculative Decoding with PD

* fix post process

* share kv cache sender

* support speculate decoding step system cache

* support speculate decoding step system cache

---------

Co-authored-by: root <root@gajl-bbc-onlinec-com-1512108.gajl.baidu.com>
2026-01-05 17:31:03 +08:00
Yonghua Li 5e4e6692a4 [BugFix] fix cache manager not launched in case of mtp or blockwise fp8 (#5840)
* [BugFix] fix cache manager not launched in case of mtp or blockwise fp8

* [fix] fix mtp cache in mtp.py

* [fix] fix gpu ops import

* [fix] fix mtp layer idx

* [fix] fix xpu model runner mtp cache

* [fix] fix mtp import
2026-01-04 04:35:37 -08:00
GoldPancake 4e10ae5d99 [Speculative Decoding] Optimize draft logprob (#5842)
* optimize draft logprob

* fix ut
2025-12-31 13:35:56 +08:00
freeliuzc 9018ccf74e [Speculative Decoding] Fix attn_mask_offset for multi-step MTP in mixed and PD-split modes (#5738)
* fix attn_mask_offset in mtp with multi-step and pd-split-mode

* fix xpu operater register

* update pmtp multi-step mtp strategy in d-split -mode

* add note

* fix xpu register
2025-12-25 01:54:59 -08:00
lizan1999 e1a9b282eb fix bug for EP+MTP (#5605)
Co-authored-by: lizan1999 <lizan03@baidu.com>
2025-12-18 14:34:54 +08:00
Lucas 888c4b992d [XPU] refactor of block_attn param 'pos_emb_type' (#5511) 2025-12-12 14:30:09 +08:00
kevin db936ab3e4 fix mtp prefix_cache dy-c8 bug (#5390) 2025-12-05 19:03:19 +08:00
kevin c9d7f9e7c3 [BugFix] fix async download bug (#5349)
* fix async download bug

* update log

* Revert "update log"

This reverts commit 5816e602f4.

* update code

* fix mtp bug
2025-12-05 18:59:12 +08:00
Longzhi Wang 5cd17fd662 [Models] Add forward_meta to moe models' forward function (#5138)
* [Models] Add forward_meta to moe models' forward function

* fix missing param

* fix

* fix

* fix forward_meta

* fix test and remove chunked MoE releated in config

* fix test

* fix

* fix
2025-12-04 13:26:58 +08:00
cmcamdy 9f4977eb74 [xpu] support mtp for xpu(mix) (#5274)
* [XPU] support kernel for mtp(base)

* [XPU] support kernel for mtp(base)

* format

* format

* format

* fix gather next token

* fix step && add test

* fix

* mv pre/post process

* add adjust batch / gather next token for mtp

* fix code style

* fix mtp kenrel name

* fix mtp kernel test

* mv xpu pre/post process

* mv xpu pre/post process

* [xpu] support mtp

* fix code style
2025-12-01 11:03:14 +08:00
GoldPancake cfc5b0ccf9 [BugFix] fix mtp logprob bugs in chunk prefill (#5244)
* fix mtp logprob bugs in chunk prefill

* fix

* fix
2025-11-27 11:31:29 +08:00
freeliuzc ba915e03e1 [BugFix]Fix attention mask bug in D-Node of PD-split mode (#5245) 2025-11-26 17:56:28 +08:00
freeliuzc 214942e1ae fix kernel output extract (#5208) 2025-11-26 16:48:42 +08:00