mirror of
https://github.com/PaddlePaddle/FastDeploy.git
synced 2026-04-23 17:11:21 +08:00
c529c2ad98
* Port ngram_match and hybrid_mtp_ngram kernels to CUDA
Replace CPU n-gram matching kernels with GPU CUDA kernels to eliminate
CPU↔GPU data transfer overhead in speculative decoding.
Key changes:
- ngram_match.cc → ngram_match.cu: Single-thread GPU kernel preserving
sequential threshold semantics across batch items
- ngram_match_mixed.cu: Replace CPU function with __global__ kernel
- ngram.py: Remove ~10 .cpu() tensor copies, pass GPU tensors directly
- mtp.py: Remove .cpu()/.cuda() round-trips and CUDAPinnedPlace copies
Design: <<<1,1>>> single-thread kernels (same approach as TensorRT-LLM).
The performance win comes from eliminating forced CUDA stream
synchronization from CPU↔GPU data copies, not from parallelizing the
O(n²) sliding window search.
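The sliding-window semantics the kernel preserves can be sketched in plain Python. This is an illustrative model only (the helper name and tie-breaking details are made up here, not taken from the kernel): scan left to right for the leftmost occurrence of the trailing n-gram, then propose the tokens that follow it as draft tokens.

```python
def ngram_search(haystack, ngram, max_draft_tokens):
    """Leftmost sliding-window match: find `ngram` in `haystack` and
    return the tokens that follow it as draft candidates."""
    n = len(ngram)
    # Exclude the trailing position so the n-gram cannot match itself.
    for pos in range(len(haystack) - n):
        if haystack[pos : pos + n] == ngram:
            # Copy up to max_draft_tokens tokens following the match.
            return haystack[pos + n : pos + n + max_draft_tokens]
    return []  # no match: no draft tokens proposed

# The trailing 2-gram [7, 8] appeared earlier, followed by 9, 1, 5, ...
tokens = [3, 7, 8, 9, 1, 5, 7, 8]
print(ngram_search(tokens, tokens[-2:], 4))  # [9, 1, 5, 7]
```

The O(n²) character comes from running this scan once per decode step over a haystack that grows with the sequence.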
* Add correctness + latency test for GPU ngram kernels
* Fix test data: step_idx semantics and ngram-matchable patterns
* fix: add CPU fallback path for ngram_match and hybrid_mtp_ngram ops
Restore backward compatibility with existing CPU-only operator tests
(test_ngram_match.py, test_hybrid_mtp_ngram.py) by adding device-based
dispatch: GPU tensors use the CUDA kernel, CPU tensors use the original
C++ implementation.
* fix(test): wrap imported ops with staticmethod to prevent self-binding
Python descriptor protocol passes 'self' as first arg when a function
stored as class attribute is accessed via instance. Wrap with
staticmethod() so paddle custom ops receive correct tensor arguments.
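The self-binding pitfall is easy to reproduce with any plain Python function stored as a class attribute (the names below are illustrative, not from the test suite):

```python
def op(x):
    return x + 1

class Runner:
    raw = op                    # descriptor protocol binds it: self becomes x
    wrapped = staticmethod(op)  # staticmethod bypasses the binding

r = Runner()
print(r.wrapped(41))  # 42 — op receives the intended argument
try:
    r.raw(41)         # op(self, 41): one arg too many
except TypeError as e:
    print("raw fails:", e)
```

Built-in (C-level) functions are not descriptors and would not bind, but paddle custom ops are imported as Python-level callables, so the wrap is required.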
* fix(test): ensure max_model_len >= input_len to prevent broadcast error in latency test
* fix: keep input_ids_len on CPU in __init__, move to GPU in _run_impl
Reverts line 39 to match develop (keeps .cpu()) so diff-cover
no longer flags it as an uncovered changed line. The tensor is
moved to GPU via .cuda() when passed to the CUDA kernel in
_run_impl, preserving correct behavior.
* Extract shared ngram search into __device__ helper (ngram_match_common.cuh)
Per upstream requirement (original in Chinese): 'The two kernels have
largely similar logic; structure them by extracting the shared matching
logic, with per-kernel business logic on top.'

The core ngram sliding-window search + token copy logic is now defined
once in ngram_match_common.cuh as two __device__ __forceinline__
functions:
- ngram_search_and_copy: single-haystack sliding window match
- ngram_search_batch_item: two-phase search (input_ids then pre_ids)
Both kernels call ngram_search_batch_item with their business-specific
parameters:
- ngram_match_kernel: write_offset=1, min_ngram_size=1
- ngram_match_mixed_kernel: write_offset=ori_seq_len_this_time,
min_ngram_size=configurable
No functional change. CPU fallback paths unchanged.
* refactor: parallel CUDA kernels for ngram_match (<<<bsz,256>>> search)
Two-phase parallel architecture addressing reviewer feedback:
- Phase 1: <<<bsz, 256>>> — parallel sliding-window ngram search
using atomicMin64 CAS loop for leftmost-match semantics
- Phase 2: <<<1, 1>>> — serial threshold + token copy (inter-batch
dependency via running sum of seq_lens_this_time)
Phase 1 is O(bsz × seq_len × ngram_size) distributed across bsz × 256
threads. Phase 2 is O(bsz × max_draft_tokens) — negligible.
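The inter-batch dependency that keeps Phase 2 serial can be modeled in a few lines of Python (a simplified sketch with hypothetical names; it ignores details such as the one-token minimum an active item keeps):

```python
def phase2_threshold(tentative_counts, threshold):
    """Serial threshold pass: each batch item keeps only as many draft
    tokens as fit under `threshold`, given the running sum of tokens
    already granted to earlier batch items."""
    granted, running = [], 0
    for count in tentative_counts:
        budget = max(threshold - running, 0)
        keep = min(count, budget)
        granted.append(keep)
        running += keep   # later items see this sum: the dependency
    return granted

# Batch of 4 items, each proposing 5 tokens, under a threshold of 12:
print(phase2_threshold([5, 5, 5, 5], 12))  # [5, 5, 2, 0]
```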
Shared code extracted into ngram_match_common.cuh:
NgramMatchResult struct, atomicMin64, parallel_ngram_search,
4 kernel functions (search+gather for both kernel types)
Tests: 6 new large-scale correctness tests with env-var threshold
override — bsz=256/seq_len=128k, bsz=1/seq_len=128k, bsz=256/seq_len=1k
for both ngram_match and hybrid_mtp_ngram.
* fix: move __global__ kernel defs from .cuh to .cu files (fix linker multiple-def error)
Both ngram_match.cu and ngram_match_mixed.cu include ngram_match_common.cuh.
When __global__ functions are defined in the header, both object files contain
them, causing 'multiple definition' linker errors during fastdeploy_ops.so link.
Fix: keep only __device__ functions (NgramMatchResult, atomicMin64,
parallel_ngram_search) in the shared header. Move __global__ kernel
definitions into each respective .cu file.
Net code change: +304/-304 (zero net lines).
* fix: align mixed kernel signatures with host function tensors
Fix 7 type-mismatch compilation errors in ngram_match_mixed.cu:
- Search kernel: replace seq_lens_encoder/decoder with seq_lens_this_time
(host function does not have seq_lens_encoder tensor)
- Gather kernel: remove seq_lens_encoder param, compute ori_seq_len_this_time
per-batch from seq_lens_this_time (matches CPU path logic)
- Fix max_draft_tokens computation to match CPU path formula
- Fix skip condition to match CPU path: ori_seq_len_this_time==0 || max_draft_tokens<=0
* 【Hackathon 9th No.49】Replace serial Phase 2 with CUB BlockScan parallel threshold
Phase 2 gather kernel now launches <<<1, 1024>>> threads with CUB
BlockScan prefix-sum for parallel threshold enforcement, replacing
the serial <<<1,1>>> loop.
Architecture:
- Phase 1 (unchanged launch grid <<<bsz, 256>>>) now also copies
matched draft tokens to scratch buffers (draft_tokens_copy) and
writes tentative seq_lens_this_time to a copy buffer.
- Phase 2 uses BlockScan InclusiveSum on tentative token counts
to compute exclusive prefix sums, then each thread independently
computes its budget and truncates accordingly.
Both ngram_match.cu and ngram_match_mixed.cu updated.
Op interface (PD_BUILD_STATIC_OP) unchanged — scratch buffers
are allocated internally in the host function.
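Why a prefix sum suffices here: given an exclusive prefix sum of the tentative counts, each item can compute and apply its own budget independently, and because an exhausted budget stays exhausted for all later items, the result matches the serial pass. A plain-Python model of the idea (hypothetical helper; on the GPU this scan is what CUB BlockScan provides):

```python
from itertools import accumulate

def scan_threshold(tentative_counts, threshold):
    """Parallel-style threshold: an exclusive prefix sum tells each item
    how many tokens all earlier items proposed, so every item truncates
    independently, with no serial loop over the batch."""
    # Exclusive prefix sum: [0, c0, c0+c1, ...]
    excl = [0] + list(accumulate(tentative_counts))[:-1]
    return [max(min(c, threshold - before), 0)
            for c, before in zip(tentative_counts, excl)]

print(scan_threshold([5, 5, 5, 5], 12))  # [5, 5, 2, 0]
```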
* fix: resolve Copilot/bot review comments on PR #7136
- Remove dead NgramMatchResult writes from both Phase 1 kernels
- Fix encoder-active init: default seq_lens_this_time_copy=0, set 1 for active
- Add remaining_active budget deduction to mixed gather kernel (parity)
- Add PD_CHECK(max_batch_size <= NGRAM_GATHER_THREADS) to both host functions
- Remove unused match_buf/match_results allocation from both host functions
- Pass seq_lens_encoder to Phase 2 gather for encoder-active skip
- clang-format applied
* test: add multi-scale latency benchmark (batch 32→1024)
Adds test_latency_scaling that benchmarks GPU kernel vs CPU path at
batch sizes 32, 128, 256, 512, 1024 with input_len=512.
Shows Phase 2 BlockScan scaling and per-batch-item amortization.
* cleanup: remove unused kernel params, dead struct, add benchmark env gate
- Remove unused max_draft_tokens_param from ngram_match_search_kernel
(draft_token_num[batch_idx] already covers the constraint)
- Remove unused seq_lens_decoder from ngram_match_mixed_search_kernel
(only used in gather kernel, not search kernel)
- Remove dead NgramMatchResult struct from ngram_match_common.cuh
- Add BENCHMARK_NGRAM env gate to test_latency and test_latency_scaling
(prevents benchmark tests from inflating CI runtime)
* revert: remove benchmark env gate — let CI run benchmarks
* fix: address Copilot review — GPU mirror for input_ids_len, device fix in mtp, benchmark timing isolation
* fix: correct stale comment in mixed gather (at-least-ori → 1-token)
* bench: add 5-group benchmark matching NKNaN methodology
Groups: seq_len, batch_size, ngram hit pattern, threshold, threshold×batch.
Data creation outside timing loop. GPU kernel vs CPU-copy path.
* fix: rename benchmark for CI discovery, bump to 10k iterations
- Renamed benchmark_ngram_kernel.py → test_benchmark_ngram_kernel.py
so pytest discovers it (test_*.py pattern)
- Bumped NUM_ITERS 10→10000, WARMUP 2→5 for noise-free profiling
- Gated benchmark class with RUN_NGRAM_BENCHMARKS=1 (won't bloat CI)
* fix: correct stale filename in benchmark docstring
* fix: move PD_CHECK before Phase 1 launch (fail-fast)
* bench: remove env-gate from benchmark groups, cut NUM_ITERS to 1000
Benchmark groups 1-5 now run unconditionally in CI (~9s total).
Env-gates moved to separate PR #7170.
* fix: address Copilot review — conditional return, defensive guards, GPU placement
- ngram_match.cu: add remaining<=0 early return, conditional return
only when tokens produced (matches CPU continue behavior), include
encoder-active items in Phase 2 threshold-budget scan
- ngram_match_mixed.cu: split max_draft_tokens into explicit steps to
prevent negative intermediates, conditional return only when tokens
produced, add seq_lens_decoder invariant comment
- ngram.py: explicit .cuda() on input_ids_len_gpu creation
- test_ngram_gpu_kernel.py: use CPUPlace() in latency benchmark to
measure actual D2H/H2D roundtrip
* fix: clarify CAS comment, fix negative intermediate in CPU fallback
- Add CAS non-atomic initial read comment in atomicMin64 (#3031826678)
- Split draft_budget into explicit int64_t steps in CPU fallback (#3031240456)
* perf: A1 (1024 threads) + A2 (early-exit) + fix B1 UB in ngram_match
- NGRAM_BLOCK_THREADS 256→1024: 4× thread parallelism per block
- Add early-exit break when position exceeds current best match
- Fix __ballot_sync UB: was inside divergent if(match) + loop break,
revert to plain atomicMin64 (contention-free since matches are rare)
- Update stale '256 threads' comments in both .cu files
* perf: template-specialize ngram search + cache scratch buffers + fix benchmark
Kernel optimizations:
- Template-specialize parallel_ngram_search for ngram_size 1,2,3:
register-cached ngram tokens, #pragma unroll, __restrict__ hints
- Cache Phase 1→2 scratch buffers (grow-only static paddle::Tensor)
to eliminate per-call paddle::empty allocation overhead
Benchmark fix:
- Pre-allocate output tensors once, use fill_() in timing loop
instead of creating new paddle.zeros/ones each iteration
(removes ~20-40µs measurement noise per iteration)
---------
Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com>
359 lines
15 KiB
Python
#!/usr/bin/env python3
# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Multi-dimension benchmark for ngram_match GPU kernel vs CPU copy path.

Matches NKNaN's profiling methodology (5 experiment groups) using
FastDeploy's native ngram_match op interface.

Groups:
1. seq_len — [1024, 4096, 16384, 65536, 131072]
2. batch_size — [1, 8, 32, 128, 512]
3. ngram hit — [high_input, high_pre, low_input, low_pre, none]
4. threshold — [16, 32, 64, 128, 256]
5. threshold × batch (batch=128)

Run:
    cd FastDeploy && python tests/spec_decode/test_benchmark_ngram_kernel.py
"""
import os
import sys
import time
import unittest

import numpy as np
import paddle

sys.path.insert(0, os.path.join(os.path.dirname(__file__), "../.."))

MAX_NGRAM_SIZE = 3
MAX_DRAFT_TOKENS = 10
NUM_ITERS = 1000
WARMUP = 5

def _build_data(batch_size, seq_len, hit_type="low_input", seed=42):
    """
    Build test tensors with controlled ngram hit placement.

    hit_type controls where the ngram match is found:
    - high_input: match near start of input_ids (fast find)
    - high_pre: match near start of token_ids_all gen tokens
    - low_input: match near end of input_ids (worst-case scan)
    - low_pre: match near end of token_ids_all gen tokens
    - none: no planted match (full scan, no hit)
    """
    rng = np.random.RandomState(seed)
    step_idx_val = max(MAX_NGRAM_SIZE + 2, 20)
    pre_len = step_idx_val + 1
    max_model_len = max(seq_len + 64, pre_len + 64)

    input_ids = rng.randint(10, 500, (batch_size, seq_len)).astype(np.int64)
    token_ids_all = rng.randint(10, 500, (batch_size, max_model_len)).astype(np.int64)
    pattern = np.arange(1001, 1001 + MAX_NGRAM_SIZE, dtype=np.int64)

    for b in range(batch_size):
        # Plant pattern in token_ids_all at step_idx alignment (the ngram to search for)
        ng_start = step_idx_val + 1 - MAX_NGRAM_SIZE
        token_ids_all[b, ng_start : step_idx_val + 1] = pattern

        if hit_type == "high_input":
            pos = 5
            if pos + MAX_NGRAM_SIZE + MAX_DRAFT_TOKENS <= seq_len:
                input_ids[b, pos : pos + MAX_NGRAM_SIZE] = pattern
                input_ids[b, pos + MAX_NGRAM_SIZE : pos + MAX_NGRAM_SIZE + MAX_DRAFT_TOKENS] = np.arange(
                    2001, 2001 + MAX_DRAFT_TOKENS, dtype=np.int64
                )

        elif hit_type == "high_pre":
            pos = 5
            if pos + MAX_NGRAM_SIZE + MAX_DRAFT_TOKENS < ng_start:
                token_ids_all[b, pos : pos + MAX_NGRAM_SIZE] = pattern
                token_ids_all[b, pos + MAX_NGRAM_SIZE : pos + MAX_NGRAM_SIZE + MAX_DRAFT_TOKENS] = np.arange(
                    2001, 2001 + MAX_DRAFT_TOKENS, dtype=np.int64
                )

        elif hit_type == "low_input":
            pos = seq_len - MAX_NGRAM_SIZE - MAX_DRAFT_TOKENS - 5
            if pos > 0:
                input_ids[b, pos : pos + MAX_NGRAM_SIZE] = pattern
                input_ids[b, pos + MAX_NGRAM_SIZE : pos + MAX_NGRAM_SIZE + MAX_DRAFT_TOKENS] = np.arange(
                    2001, 2001 + MAX_DRAFT_TOKENS, dtype=np.int64
                )

        elif hit_type == "low_pre":
            pos = step_idx_val - MAX_NGRAM_SIZE - MAX_DRAFT_TOKENS - 5
            if pos > 0 and pos + MAX_NGRAM_SIZE + MAX_DRAFT_TOKENS < ng_start:
                token_ids_all[b, pos : pos + MAX_NGRAM_SIZE] = pattern
                token_ids_all[b, pos + MAX_NGRAM_SIZE : pos + MAX_NGRAM_SIZE + MAX_DRAFT_TOKENS] = np.arange(
                    2001, 2001 + MAX_DRAFT_TOKENS, dtype=np.int64
                )

        elif hit_type == "none":
            pass  # No match planted — random data only

    input_ids_len = np.full((batch_size, 1), seq_len, dtype=np.int64)
    prompt_lens = np.zeros((batch_size, 1), dtype=np.int64)
    step_idx = np.full((batch_size, 1), step_idx_val, dtype=np.int64)
    draft_token_num = np.full((batch_size, 1), MAX_DRAFT_TOKENS, dtype=np.int32)
    draft_tokens = np.zeros((batch_size, MAX_DRAFT_TOKENS + 1), dtype=np.int64)
    seq_lens_this_time = np.ones(batch_size, dtype=np.int32)
    seq_lens_encoder = np.zeros(batch_size, dtype=np.int32)
    seq_lens_decoder = np.ones(batch_size, dtype=np.int32)
    max_dec_len = np.full((batch_size, 1), 1048576, dtype=np.int64)

    return {
        "input_ids": input_ids,
        "input_ids_len": input_ids_len,
        "token_ids_all": token_ids_all,
        "prompt_lens": prompt_lens,
        "step_idx": step_idx,
        "draft_token_num": draft_token_num,
        "draft_tokens": draft_tokens,
        "seq_lens_this_time": seq_lens_this_time,
        "seq_lens_encoder": seq_lens_encoder,
        "seq_lens_decoder": seq_lens_decoder,
        "max_dec_len": max_dec_len,
    }

def _to_gpu(np_dict):
    out = {}
    for k, v in np_dict.items():
        out[k] = paddle.to_tensor(v, place=paddle.CUDAPlace(0))
    return out

def _run_gpu(ngram_match_fn, gpu_data):
    """Run GPU kernel (tensors already on GPU)."""
    ngram_match_fn(
        gpu_data["input_ids"],
        gpu_data["input_ids_len"],
        gpu_data["token_ids_all"],
        gpu_data["prompt_lens"],
        gpu_data["step_idx"],
        gpu_data["draft_token_num"],
        gpu_data["draft_tokens"],
        gpu_data["seq_lens_this_time"],
        gpu_data["seq_lens_encoder"],
        gpu_data["seq_lens_decoder"],
        gpu_data["max_dec_len"],
        MAX_NGRAM_SIZE,
        MAX_DRAFT_TOKENS,
    )

def _time_gpu(ngram_match_fn, batch_size, seq_len, hit_type, n_runs):
    """Time GPU kernel with pre-created tensors (no data creation in loop)."""
    gpu_data = _to_gpu(_build_data(batch_size, seq_len, hit_type))
    # Pre-allocate mutable output buffers once — avoids per-iteration
    # paddle.zeros/ones which add ~20-40µs allocation + fill overhead.
    draft_buf = paddle.zeros([batch_size, MAX_DRAFT_TOKENS + 1], dtype="int64").cuda()
    seqlens_buf = paddle.ones([batch_size], dtype="int32").cuda()
    # Warmup
    for _ in range(WARMUP):
        seqlens_buf.fill_(1)
        gpu_data["draft_tokens"] = draft_buf
        gpu_data["seq_lens_this_time"] = seqlens_buf
        _run_gpu(ngram_match_fn, gpu_data)
    paddle.device.synchronize()

    paddle.device.synchronize()
    t0 = time.perf_counter()
    for _ in range(n_runs):
        seqlens_buf.fill_(1)
        gpu_data["draft_tokens"] = draft_buf
        gpu_data["seq_lens_this_time"] = seqlens_buf
        _run_gpu(ngram_match_fn, gpu_data)
    paddle.device.synchronize()
    return (time.perf_counter() - t0) / n_runs * 1e6  # microseconds

def _time_cpu_copy(batch_size, seq_len, hit_type, n_runs):
    """Time the old CPU-copy path: GPU→CPU transfer + CPU→GPU transfer back."""
    gpu_data = _to_gpu(_build_data(batch_size, seq_len, hit_type))
    # Warmup
    for _ in range(WARMUP):
        _ = {k: v.cpu() for k, v in gpu_data.items()}
    paddle.device.synchronize()

    paddle.device.synchronize()
    t0 = time.perf_counter()
    for _ in range(n_runs):
        cpu_copy = {k: v.cpu() for k, v in gpu_data.items()}
        _ = cpu_copy["draft_tokens"].cuda()
        _ = cpu_copy["seq_lens_this_time"].cuda()
    paddle.device.synchronize()
    return (time.perf_counter() - t0) / n_runs * 1e6  # microseconds

def _print_table(title, header, rows):
    """Print formatted benchmark table."""
    print(f"\n{'=' * 80}")
    print(title)
    print(f"{'─' * 80}")
    print(header)
    print(f"{'─' * 80}")
    for row in rows:
        print(row)
    print(f"{'=' * 80}")

class TestNgramBenchmarkGroups(unittest.TestCase):
    """Multi-dimension benchmark matching NKNaN's 5-group methodology."""

    @classmethod
    def setUpClass(cls):
        if not paddle.is_compiled_with_cuda():
            raise unittest.SkipTest("CUDA not available")
        paddle.set_device("gpu")
        try:
            from fastdeploy.model_executor.ops.gpu import ngram_match

            cls.ngram_match = staticmethod(ngram_match)
        except Exception as e:
            raise unittest.SkipTest(f"Cannot import ngram_match op: {e}")

    def test_group1_seq_len(self):
        """Group 1: Vary seq_len with fixed batch=16, threshold=512, hit=low_input."""
        seq_lens = [1024, 4096, 16384, 65536, 131072]
        batch_size = 16
        hit_type = "low_input"
        old_env = os.environ.get("INFER_WITH_REFERENCE_TOKENUM_THRESHOLD")
        os.environ["INFER_WITH_REFERENCE_TOKENUM_THRESHOLD"] = "512"
        try:
            rows = []
            for sl in seq_lens:
                gpu_us = _time_gpu(self.ngram_match, batch_size, sl, hit_type, NUM_ITERS)
                cpu_us = _time_cpu_copy(batch_size, sl, hit_type, NUM_ITERS)
                speedup = cpu_us / gpu_us if gpu_us > 0 else 0
                rows.append(f"{sl:>8} {gpu_us:>12.1f} {cpu_us:>12.1f} {speedup:>8.2f}x")
            _print_table(
                f"Group 1: seq_len (batch={batch_size}, threshold=512, hit={hit_type}, {NUM_ITERS} runs)",
                f"{'seq_len':>8} {'GPU (µs)':>12} {'CPU copy (µs)':>12} {'Speedup':>8}",
                rows,
            )
        finally:
            if old_env is None:
                os.environ.pop("INFER_WITH_REFERENCE_TOKENUM_THRESHOLD", None)
            else:
                os.environ["INFER_WITH_REFERENCE_TOKENUM_THRESHOLD"] = old_env

    def test_group2_batch_size(self):
        """Group 2: Vary batch_size with fixed seq_len=16384, threshold=8192, hit=low_input."""
        batch_sizes = [1, 8, 32, 128, 512]
        seq_len = 16384
        hit_type = "low_input"
        old_env = os.environ.get("INFER_WITH_REFERENCE_TOKENUM_THRESHOLD")
        os.environ["INFER_WITH_REFERENCE_TOKENUM_THRESHOLD"] = "8192"
        try:
            rows = []
            for bsz in batch_sizes:
                gpu_us = _time_gpu(self.ngram_match, bsz, seq_len, hit_type, NUM_ITERS)
                cpu_us = _time_cpu_copy(bsz, seq_len, hit_type, NUM_ITERS)
                speedup = cpu_us / gpu_us if gpu_us > 0 else 0
                rows.append(f"{bsz:>8} {gpu_us:>12.1f} {cpu_us:>12.1f} {speedup:>8.2f}x")
            _print_table(
                f"Group 2: batch_size (seq_len={seq_len}, threshold=8192, hit={hit_type}, {NUM_ITERS} runs)",
                f"{'batch':>8} {'GPU (µs)':>12} {'CPU copy (µs)':>12} {'Speedup':>8}",
                rows,
            )
        finally:
            if old_env is None:
                os.environ.pop("INFER_WITH_REFERENCE_TOKENUM_THRESHOLD", None)
            else:
                os.environ["INFER_WITH_REFERENCE_TOKENUM_THRESHOLD"] = old_env

    def test_group3_ngram_hit(self):
        """Group 3: Vary hit pattern with fixed batch=16, seq_len=32768, threshold=512."""
        hit_types = ["high_input", "high_pre", "low_input", "low_pre", "none"]
        batch_size = 16
        seq_len = 32768
        old_env = os.environ.get("INFER_WITH_REFERENCE_TOKENUM_THRESHOLD")
        os.environ["INFER_WITH_REFERENCE_TOKENUM_THRESHOLD"] = "512"
        try:
            rows = []
            for ht in hit_types:
                gpu_us = _time_gpu(self.ngram_match, batch_size, seq_len, ht, NUM_ITERS)
                cpu_us = _time_cpu_copy(batch_size, seq_len, ht, NUM_ITERS)
                speedup = cpu_us / gpu_us if gpu_us > 0 else 0
                rows.append(f"{ht:>12} {gpu_us:>12.1f} {cpu_us:>12.1f} {speedup:>8.2f}x")
            _print_table(
                f"Group 3: ngram hit (batch={batch_size}, seq_len={seq_len}, threshold=512, {NUM_ITERS} runs)",
                f"{'hit_type':>12} {'GPU (µs)':>12} {'CPU copy (µs)':>12} {'Speedup':>8}",
                rows,
            )
        finally:
            if old_env is None:
                os.environ.pop("INFER_WITH_REFERENCE_TOKENUM_THRESHOLD", None)
            else:
                os.environ["INFER_WITH_REFERENCE_TOKENUM_THRESHOLD"] = old_env

    def test_group4_threshold(self):
        """Group 4: Vary threshold with fixed batch=8, seq_len=32768, hit=low_input."""
        thresholds = [16, 32, 64, 128, 256]
        batch_size = 8
        seq_len = 32768
        hit_type = "low_input"
        rows = []
        old_env = os.environ.get("INFER_WITH_REFERENCE_TOKENUM_THRESHOLD")
        try:
            for thr in thresholds:
                os.environ["INFER_WITH_REFERENCE_TOKENUM_THRESHOLD"] = str(thr)
                gpu_us = _time_gpu(self.ngram_match, batch_size, seq_len, hit_type, NUM_ITERS)
                cpu_us = _time_cpu_copy(batch_size, seq_len, hit_type, NUM_ITERS)
                speedup = cpu_us / gpu_us if gpu_us > 0 else 0
                rows.append(f"{thr:>8} {gpu_us:>12.1f} {cpu_us:>12.1f} {speedup:>8.2f}x")
            _print_table(
                f"Group 4: threshold (batch={batch_size}, seq_len={seq_len}, hit={hit_type}, {NUM_ITERS} runs)",
                f"{'thresh':>8} {'GPU (µs)':>12} {'CPU copy (µs)':>12} {'Speedup':>8}",
                rows,
            )
        finally:
            if old_env is None:
                os.environ.pop("INFER_WITH_REFERENCE_TOKENUM_THRESHOLD", None)
            else:
                os.environ["INFER_WITH_REFERENCE_TOKENUM_THRESHOLD"] = old_env

    def test_group5_threshold_x_batch(self):
        """Group 5: Vary threshold with large batch=128 to expose truncation effects."""
        thresholds = [16, 32, 64, 128, 256]
        batch_size = 128
        seq_len = 32768
        hit_type = "low_input"
        rows = []
        old_env = os.environ.get("INFER_WITH_REFERENCE_TOKENUM_THRESHOLD")
        try:
            for thr in thresholds:
                os.environ["INFER_WITH_REFERENCE_TOKENUM_THRESHOLD"] = str(thr)
                gpu_us = _time_gpu(self.ngram_match, batch_size, seq_len, hit_type, NUM_ITERS)
                cpu_us = _time_cpu_copy(batch_size, seq_len, hit_type, NUM_ITERS)
                speedup = cpu_us / gpu_us if gpu_us > 0 else 0
                rows.append(f"{thr:>8} {gpu_us:>12.1f} {cpu_us:>12.1f} {speedup:>8.2f}x")
            _print_table(
                f"Group 5: threshold×batch (batch={batch_size}, seq_len={seq_len}, hit={hit_type}, {NUM_ITERS} runs)",
                f"{'thresh':>8} {'GPU (µs)':>12} {'CPU copy (µs)':>12} {'Speedup':>8}",
                rows,
            )
        finally:
            if old_env is None:
                os.environ.pop("INFER_WITH_REFERENCE_TOKENUM_THRESHOLD", None)
            else:
                os.environ["INFER_WITH_REFERENCE_TOKENUM_THRESHOLD"] = old_env

if __name__ == "__main__":
    unittest.main()