5086 Commits

Author SHA1 Message Date
Jiaxin Sui fbc3aa93de [XPU][CI] Remove duplicate NICs from environment variables (#7244) 2026-04-08 19:14:15 +08:00
RichardWooSJTU 771d42c90b [TBO] Apply tbo to gpu_model_runner (#7165)
* apply tbo in gpu_model_runner

* fix
2026-04-08 16:55:17 +08:00
YuBaoku 4cd574cf90 [CI] Reduce execution time for ngram kernel tests (#7242) 2026-04-08 16:54:46 +08:00
Bingoo 043f2a16e3 support moe for sm103 (#7238) 2026-04-08 15:52:39 +08:00
guozhuangzhuang 757bafe3bd [Engine][DataProcessor] fix decode token (#7102) 2026-04-08 15:41:32 +08:00
GoldPancake aa23e0f966 remove arctic_inference deps (#7231) 2026-04-08 15:25:14 +08:00
YuBaoku 49b3d0d326 [CI] increase shm-size to 128G and set nproc/nofile limits in _unit_test_coverage.yml (#7227)
* [CI] increase shm-size to 128G in _unit_test_coverage.yml

* [CI] set nproc/nofile limits in _unit_test_coverage.yml
2026-04-08 11:35:10 +08:00
K11OntheBoat bb48bcbaa2 Split enable_mm (#7183)
Co-authored-by: liuruian <liuruian@MacBook-Pro.local>
2026-04-08 11:25:41 +08:00
luukunn 8496ec71a6 [DataProcessor] Move image_processor to unified directory and add MultiModalProcessor (#7109)
* first commit

* step 9~10

* update multimodal

* update multimodal

* fix load tokenizer

* add unit test

* fix unit test & AdaptiveImageProcessor

* Delete unused code
2026-04-08 10:16:27 +08:00
ChowMingSing d693d4be14 [Feature]distinguish whl version (#7204)
* [Feature]whl version

* [Feature]whl version,set root_is_pure = false

* [Feature]code style
2026-04-07 20:38:27 +08:00
GoldPancake 9d4fd19c3f [Speculative Decoding] Auto-scale CUDA graph capture sizes for speculative decoding (#7215) 2026-04-07 20:22:28 +08:00
lizhenyun01 446b26bbc0 [Feature] support blackwell gemm in ht (#7053)
* [Feature] support blackwell gemm in ht

* [Feature] support ops for convert

* fix cuda error 716

* fix cuda error

* opt memory

* remove unused code
2026-04-07 19:52:51 +08:00
yinwei 334b02c12b [XPU][Docs] Update Release2.5 Note (#7187)
* update docs

* update

* update
2026-04-07 18:45:52 +08:00
MingkunZhang bb1f977c89 [Metax][Fix] add compilation option (#7209) 2026-04-07 02:43:43 -07:00
cloudforge1 c529c2ad98 [Optimization]【Hackathon 10th Spring No.49】GPU ngram_match: BlockScan Phase 2 -optimized (#7136)
* Port ngram_match and hybrid_mtp_ngram kernels to CUDA

Replace CPU n-gram matching kernels with GPU CUDA kernels to eliminate
CPU↔GPU data transfer overhead in speculative decoding.

Key changes:
- ngram_match.cc → ngram_match.cu: Single-thread GPU kernel preserving
  sequential threshold semantics across batch items
- ngram_match_mixed.cu: Replace CPU function with __global__ kernel
- ngram.py: Remove ~10 .cpu() tensor copies, pass GPU tensors directly
- mtp.py: Remove .cpu()/.cuda() round-trips and CUDAPinnedPlace copies

Design: <<<1,1>>> single-thread kernels (same approach as TensorRT-LLM).
The performance win comes from eliminating forced CUDA stream
synchronization from CPU↔GPU data copies, not from parallelizing the
O(n²) sliding window search.

* Add correctness + latency test for GPU ngram kernels

* Fix test data: step_idx semantics and ngram-matchable patterns

* fix: add CPU fallback path for ngram_match and hybrid_mtp_ngram ops

Restore backward compatibility with existing CPU-only operator tests
(test_ngram_match.py, test_hybrid_mtp_ngram.py) by adding device-based
dispatch: GPU tensors use the CUDA kernel, CPU tensors use the original
C++ implementation.

* fix(test): wrap imported ops with staticmethod to prevent self-binding

Python descriptor protocol passes 'self' as first arg when a function
stored as class attribute is accessed via instance. Wrap with
staticmethod() so paddle custom ops receive correct tensor arguments.
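
The binding behaviour described above can be reproduced without Paddle. This is a minimal sketch, assuming a plain Python function stands in for an imported custom op; `fake_op`, `Unwrapped`, `Wrapped` and `unwrapped_fails` are hypothetical names, not code from the test file.

```python
def fake_op(a, b):
    """Hypothetical stand-in for an imported custom op taking two arguments."""
    return a + b


class Unwrapped:
    # A plain function stored as a class attribute becomes a bound method
    # when accessed through an instance, so `self` is passed as the first arg.
    op = fake_op

    def run(self):
        return self.op(1, 2)  # actually calls fake_op(self, 1, 2)


class Wrapped:
    # staticmethod() disables the binding, so the arguments arrive unchanged.
    op = staticmethod(fake_op)

    def run(self):
        return self.op(1, 2)  # calls fake_op(1, 2)


def unwrapped_fails():
    """Return True if the unwrapped variant raises TypeError as described."""
    try:
        Unwrapped().run()
    except TypeError:
        return True
    return False
```

Whether a given op is affected depends on how it is exposed to Python; the sketch only covers the case where it is an ordinary Python function.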

* fix(test): ensure max_model_len >= input_len to prevent broadcast error in latency test

* fix: keep input_ids_len on CPU in __init__, move to GPU in _run_impl

Reverts line 39 to match develop (keeps .cpu()) so diff-cover
no longer flags it as an uncovered changed line. The tensor is
moved to GPU via .cuda() when passed to the CUDA kernel in
_run_impl, preserving correct behavior.

* Extract shared ngram search into __device__ helper (ngram_match_common.cuh)

Per upstream requirement: 'The two kernels share fairly similar logic; the
kernels should take the form of extracted shared matching logic plus business logic.'

The core ngram sliding-window search + token copy logic is now defined
once in ngram_match_common.cuh as two __device__ __forceinline__
functions:
  - ngram_search_and_copy: single-haystack sliding window match
  - ngram_search_batch_item: two-phase search (input_ids then pre_ids)

Both kernels call ngram_search_batch_item with their business-specific
parameters:
  - ngram_match_kernel: write_offset=1, min_ngram_size=1
  - ngram_match_mixed_kernel: write_offset=ori_seq_len_this_time,
    min_ngram_size=configurable

No functional change. CPU fallback paths unchanged.

* refactor: parallel CUDA kernels for ngram_match (<<<bsz,256>>> search)

Two-phase parallel architecture addressing reviewer feedback:
- Phase 1: <<<bsz, 256>>> — parallel sliding-window ngram search
  using atomicMin64 CAS loop for leftmost-match semantics
- Phase 2: <<<1, 1>>> — serial threshold + token copy (inter-batch
  dependency via running sum of seq_lens_this_time)

Phase 1 is O(bsz × seq_len × ngram_size) distributed across bsz × 256
threads.  Phase 2 is O(bsz × max_draft_tokens) — negligible.
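
The leftmost-match idea in Phase 1 can be sketched sequentially in Python. This is an illustration only, not the CUDA kernel: the function name `leftmost_match`, the strided work split, and the `num_threads` parameter are assumptions, and the shared variable `best` stands in for the slot that the kernel updates with its atomic minimum.

```python
def leftmost_match(haystack, needle, num_threads=4):
    """Return the smallest start index where needle occurs in haystack, or -1."""
    n, k = len(haystack), len(needle)
    best = n  # sentinel meaning "no match yet"; models the shared atomic-min slot
    for tid in range(num_threads):  # each iteration mimics one thread
        for pos in range(tid, n - k + 1, num_threads):  # strided positions
            if pos >= best:
                break  # early exit: this position cannot beat the current best
            if haystack[pos:pos + k] == needle:
                best = min(best, pos)  # sequential analogue of an atomic min
    return best if best < n else -1
```

Because every candidate position is reduced with a minimum, the result does not depend on the order in which the simulated threads report their matches.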

Shared code extracted into ngram_match_common.cuh:
  NgramMatchResult struct, atomicMin64, parallel_ngram_search,
  4 kernel functions (search+gather for both kernel types)

Tests: 6 new large-scale correctness tests with env-var threshold
override — bsz=256/seq_len=128k, bsz=1/seq_len=128k, bsz=256/seq_len=1k
for both ngram_match and hybrid_mtp_ngram.

* fix: move __global__ kernel defs from .cuh to .cu files (fix linker multiple-def error)

Both ngram_match.cu and ngram_match_mixed.cu include ngram_match_common.cuh.
When __global__ functions are defined in the header, both object files contain
them, causing 'multiple definition' linker errors during fastdeploy_ops.so link.

Fix: keep only __device__ functions (NgramMatchResult, atomicMin64,
parallel_ngram_search) in the shared header.  Move __global__ kernel
definitions into each respective .cu file.

Net code change: +304/-304 (zero net lines).

* fix: align mixed kernel signatures with host function tensors

Fix 7 type-mismatch compilation errors in ngram_match_mixed.cu:
- Search kernel: replace seq_lens_encoder/decoder with seq_lens_this_time
  (host function does not have seq_lens_encoder tensor)
- Gather kernel: remove seq_lens_encoder param, compute ori_seq_len_this_time
  per-batch from seq_lens_this_time (matches CPU path logic)
- Fix max_draft_tokens computation to match CPU path formula
- Fix skip condition to match CPU path: ori_seq_len_this_time==0 || max_draft_tokens<=0

* 【Hackathon 9th No.49】Replace serial Phase 2 with CUB BlockScan parallel threshold

Phase 2 gather kernel now launches <<<1, 1024>>> threads with CUB
BlockScan prefix-sum for parallel threshold enforcement, replacing
the serial <<<1,1>>> loop.

Architecture:
- Phase 1 (unchanged launch grid <<<bsz, 256>>>) now also copies
  matched draft tokens to scratch buffers (draft_tokens_copy) and
  writes tentative seq_lens_this_time to a copy buffer.
- Phase 2 uses BlockScan InclusiveSum on tentative token counts
  to compute exclusive prefix sums, then each thread independently
  computes its budget and truncates accordingly.

Both ngram_match.cu and ngram_match_mixed.cu updated.
Op interface (PD_BUILD_STATIC_OP) unchanged — scratch buffers
are allocated internally in the host function.
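
The prefix-sum budgeting described for Phase 2 can be sketched in plain Python. This is a simplified model under stated assumptions: `truncate_to_budget`, `tentative_counts` and `threshold` are hypothetical names, counts are non-negative, and the extra rules the real kernels apply (for example encoder-active items) are not modelled.

```python
from itertools import accumulate


def truncate_to_budget(tentative_counts, threshold):
    """Clip per-item token counts so their running total stays within threshold."""
    inclusive = list(accumulate(tentative_counts))  # inclusive prefix sum
    # Exclusive prefix sum: tokens claimed by all earlier batch items.
    exclusive = [inc - c for inc, c in zip(inclusive, tentative_counts)]
    kept = []
    for count, before in zip(tentative_counts, exclusive):  # one "thread" per item
        budget = max(threshold - before, 0)  # tokens still available to this item
        kept.append(min(count, budget))
    return kept
```

Each item needs only its own count and its exclusive prefix, which is why the per-item step can run independently once the scan is done. In this simplified model the result matches a serial running-sum loop.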

* fix: resolve Copilot/bot review comments on PR #7136

- Remove dead NgramMatchResult writes from both Phase 1 kernels
- Fix encoder-active init: default seq_lens_this_time_copy=0, set 1 for active
- Add remaining_active budget deduction to mixed gather kernel (parity)
- Add PD_CHECK(max_batch_size <= NGRAM_GATHER_THREADS) to both host functions
- Remove unused match_buf/match_results allocation from both host functions
- Pass seq_lens_encoder to Phase 2 gather for encoder-active skip
- clang-format applied

* test: add multi-scale latency benchmark (batch 32→1024)

Adds test_latency_scaling that benchmarks GPU kernel vs CPU path at
batch sizes 32, 128, 256, 512, 1024 with input_len=512.
Shows Phase 2 BlockScan scaling and per-batch-item amortization.

* cleanup: remove unused kernel params, dead struct, add benchmark env gate

- Remove unused max_draft_tokens_param from ngram_match_search_kernel
  (draft_token_num[batch_idx] already covers the constraint)
- Remove unused seq_lens_decoder from ngram_match_mixed_search_kernel
  (only used in gather kernel, not search kernel)
- Remove dead NgramMatchResult struct from ngram_match_common.cuh
- Add BENCHMARK_NGRAM env gate to test_latency and test_latency_scaling
  (prevents benchmark tests from inflating CI runtime)

* revert: remove benchmark env gate — let CI run benchmarks

* fix: address Copilot review — GPU mirror for input_ids_len, device fix in mtp, benchmark timing isolation

* fix: correct stale comment in mixed gather (at-least-ori → 1-token)

* bench: add 5-group benchmark matching NKNaN methodology

Groups: seq_len, batch_size, ngram hit pattern, threshold, threshold×batch.
Data creation outside timing loop. GPU kernel vs CPU-copy path.

* fix: rename benchmark for CI discovery, bump to 10k iterations

- Renamed benchmark_ngram_kernel.py → test_benchmark_ngram_kernel.py
  so pytest discovers it (test_*.py pattern)
- Bumped NUM_ITERS 10→10000, WARMUP 2→5 for noise-free profiling
- Gated benchmark class with RUN_NGRAM_BENCHMARKS=1 (won't bloat CI)

* fix: correct stale filename in benchmark docstring

* fix: move PD_CHECK before Phase 1 launch (fail-fast)

* bench: remove env-gate from benchmark groups, cut NUM_ITERS to 1000

Benchmark groups 1-5 now run unconditionally in CI (~9s total).
Env-gates moved to separate PR #7170.

* fix: address Copilot review — conditional return, defensive guards, GPU placement

- ngram_match.cu: add remaining<=0 early return, conditional return
  only when tokens produced (matches CPU continue behavior), include
  encoder-active items in Phase 2 threshold-budget scan
- ngram_match_mixed.cu: split max_draft_tokens into explicit steps to
  prevent negative intermediates, conditional return only when tokens
  produced, add seq_lens_decoder invariant comment
- ngram.py: explicit .cuda() on input_ids_len_gpu creation
- test_ngram_gpu_kernel.py: use CPUPlace() in latency benchmark to
  measure actual D2H/H2D roundtrip

* fix: clarify CAS comment, fix negative intermediate in CPU fallback

- Add CAS non-atomic initial read comment in atomicMin64 (#3031826678)
- Split draft_budget into explicit int64_t steps in CPU fallback (#3031240456)

* perf: A1 (1024 threads) + A2 (early-exit) + fix B1 UB in ngram_match

- NGRAM_BLOCK_THREADS 256→1024: 4× thread parallelism per block
- Add early-exit break when position exceeds current best match
- Fix __ballot_sync UB: was inside divergent if(match) + loop break,
  revert to plain atomicMin64 (contention-free since matches are rare)
- Update stale '256 threads' comments in both .cu files

* perf: template-specialize ngram search + cache scratch buffers + fix benchmark

Kernel optimizations:
- Template-specialize parallel_ngram_search for ngram_size 1,2,3:
  register-cached ngram tokens, #pragma unroll, __restrict__ hints
- Cache Phase 1→2 scratch buffers (grow-only static paddle::Tensor)
  to eliminate per-call paddle::empty allocation overhead

Benchmark fix:
- Pre-allocate output tensors once, use fill_() in timing loop
  instead of creating new paddle.zeros/ones each iteration
  (removes ~20-40µs measurement noise per iteration)

---------

Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com>
2026-04-07 01:36:25 -07:00
Nana 367d37b523 fix typo (#7147) 2026-04-07 16:30:32 +08:00
Zhang Yulong f422f835e8 [benchmark] update tools (#7211) 2026-04-07 16:25:44 +08:00
sunxin ae2f9f4d22 [BugFix] Enable moe_gate_fp32 using FD_ENABLE_RL (#7130)
* rl gate fp32

* clean
2026-04-06 21:07:38 -07:00
周周周 18f012457d [OP][Optimization] Remove ENABLE_PREFILL template parameter in multi_query_append_attention_warp1_4_kernel (#7201) 2026-04-07 11:21:57 +08:00
YuBaoku 8cb417e8fb [CI] Improve Code Prepare stability and cleanup logic (#7198) 2026-04-07 10:31:36 +08:00
YuBaoku da3dfe1c80 [CI] Use GPU-Build-RL runner for _build_linux_rl.yml (#7186) 2026-04-03 20:24:39 +08:00
Bingoo 2068656a85 [Optimization] merge matmul and add (#6986)
* merge matmul and add

* modify format

* using paddle.nn.functional.linear

* using _C_ops.linear

* using paddle.nn.functional.linear

* add FLAGS_use_legacy_linear env var in test case

* fix format

* add assert and remove env

* modify format

* using matmul for no bias

* modify accurate baseline
2026-04-03 18:02:03 +08:00
AIbin 1090f8b123 [Models]support GLM4.7 Flash && Ernie_MLA (#7139)
* support GLM4.7 Flash && Ernie_MLA
2026-04-03 17:41:33 +08:00
lizexu123 5f612a348d [BugFix] fix flashinfer-cutedsl moe nvfp4 (#7120)
* fix nvfp4

* fix

* add document

* fix nvfp4

* support eb5

* support bka

* support eb5

* support xpu

* fix

* fix

* add import cutedsl

* fix

* fix

* fix test

* fix H-series GPUs

* update document

* fix

* update document

* update document

* fix
2026-04-03 15:43:19 +08:00
huicongyao 095a11d932 fix MTP bugs in TP and overlap (#7172)
* fix MTP bugs in TP and overlap

* fix
2026-04-03 14:19:11 +08:00
Yonghua Li 3b8dac3b97 [BugFix] prevent requests from entering running state without a slot (#7141)
* [fix] prevent requests from entering running state without a slot

* [fix] count abort set

* [fix] count preempted task in waiting list
2026-04-03 14:07:57 +08:00
Jiang-Jia-Jun 0ce85190db Update setup.py 2026-04-03 11:28:52 +08:00
jackyYang6 e3aed6de2f fix oom bug, optimize async weight loading and update read step by yaml (#7171) 2026-04-03 11:05:24 +08:00
jc 1cc0cf23c2 [BugFix] Set MC_MAX_MR_SIZE to avoid register hang in default (#7161)
* Set MC_MAX_MR_SIZE to avoid register hang

* Set MC_MAX_MR_SIZE to avoid register hang
2026-04-03 10:51:15 +08:00
chenjian 2632e6cf32 [Feature] Support chunk prefill disabled in scheduler v1 (#7152) 2026-04-03 10:18:14 +08:00
luukunn 562fa31791 [BugFix]fix extract_tool_calls (#7154)
* fix extract_tool_calls
2026-04-02 21:18:37 +08:00
Yonghua Li 98f3fc9267 [RL] [KVCache] let cache transfer managers update key prefix after weight update and add unit tests (#7083)
* [test] add a few unit tests

* [feat] update key prefix when model weights are updated

* [test] try to fix test_worker_process
2026-04-02 19:58:41 +08:00
fxyfxy777 9f3b3ce7f5 [Optimization] merge_allreduce (#7039) 2026-04-02 19:52:13 +08:00
bukejiyu f142b486c9 update (#7101) 2026-04-02 16:07:26 +08:00
Longzhi Wang 938e7dd881 [Other] support video_fps args for video bench (#7077) 2026-04-02 10:40:15 +08:00
YuBaoku 7aa213bba9 [CI] Replace ipc=host with shm-size and sysctl configuration (#7138) 2026-04-02 10:33:55 +08:00
YuBaoku db808f2080 [CI] Optimize log cleanup and isolation in unittest (#7132) 2026-04-01 22:07:55 +08:00
Yuanle Liu 1af7f80811 Revert "[BugFix][Speculative Decoding] Correct index calculation in speculate…" (#7133)
This reverts commit ba1aa1edff.
2026-04-01 06:54:23 -07:00
luukunn fa7a84926d [Optimization]Fix tool parser (#7079)
* fix tool parser
2026-04-01 21:20:34 +08:00
Bingoo 410988d9ec [OP] support DeepGEMM for sm103 (#7073)
* support DeepGEMM for sm103

* add assert

* modify code style

* add assert

* modify sm version condition

* remove assert
2026-04-01 21:01:09 +08:00
lonelygsh ba1aa1edff [BugFix][Speculative Decoding] Correct index calculation in speculate decoding operators (#7121)
- Fix accept_idx calculation in spec_set_value_by_stop_seqs
- Fix condition check from < to <= for token matching
- Fix accept_tokens indexing logic
- Remove unnecessary -1 in current_step comparison for max_think_len

Co-authored-by: guanshihui <guanshihui@baidu.com>
2026-04-01 05:36:53 -07:00
cmcamdy 7a2e33098f [XPU] Refactor pre process (#6993)
* [XPU] support speculate_pre_process

* merge develop

* fix code style

* fix mtp, support cu_seqlens_q_output

* fix mtp, support cu_seqlens_q_output

* fix test

---------

Co-authored-by: lizan1999 <lizan03@baidu.com>
2026-04-01 20:29:55 +08:00
mouxin fba8a51ad1 [Feature] Fix mixed cache-aware (#7129)
* [Feature] Config eviction_duration

* [Feature] Config eviction_duration

* [Feature] Config eviction_duration

* [Feature] Config eviction_duration

* [Feature] Fix mixed cache-aware

---------

Co-authored-by: mouxin <mouxin@baidu.com>
2026-04-01 19:29:29 +08:00
Jingfeng Wu 3b564116d5 [Docs] Add docs for disaggregated deployment (#6700)
* add docs for disaggregated deployment

* pre-commit run for style check

* update docs
2026-04-01 19:27:09 +08:00
yzwu ceaf5df350 [Iluvatar] Fix cuda graph error for tp > 1 in ernie models (#7126) 2026-04-01 19:13:34 +08:00
luukunn fdfc908e2f [Others] reuse unit test (#7127) 2026-04-01 18:36:00 +08:00
mouxin 6cae9b1f50 [Feature] Config eviction_duration (#7125)
* [Feature] Config eviction_duration

* [Feature] Config eviction_duration

* [Feature] Config eviction_duration

* [Feature] Config eviction_duration

---------

Co-authored-by: mouxin <mouxin@baidu.com>
2026-04-01 16:46:21 +08:00
sunxin c29e86fc9d [Feature] Support mtp overlap schedule (#7001) 2026-04-01 14:24:26 +08:00
YuBaoku c6f0c5c3a6 [CI] Optimize test execution with single-GPU parallelism (#7085)
* [CI] Optimize test execution with single-GPU parallelism and log collection

* remove export CUDA_VISIBLE_DEVICES

* fix path error

* fix log_* path and debug

* [CI] Optimize test execution with single-GPU parallelism and log collection
2026-04-01 14:18:40 +08:00
zhouchong 91c832f607 [Feature] Add logging parameters and error output to terminal (#7098) 2026-04-01 13:18:42 +08:00