[Speculative Decoding] Unify Spec and non-spec branch (#6685)

* optimize spec-inference architecture

* delete debug log

* optimize spec_method usage  && fix unit_test

* add claude unit-test skill

* fix some ugly bug

* enhance robustness and bounds check

* unify method & spec_method to method to avoid bug

* activate CI

* fix unit test

* Unify logprobs computation for naive and speculative decoding, fix CUDA kernel

* fix logprob bug && optimize verify kernel

* fix exist_decode() judge
This commit is contained in:
freeliuzc
2026-03-11 14:58:44 +08:00
committed by GitHub
parent b6190de557
commit cf7934a4b2
41 changed files with 3428 additions and 392 deletions
@@ -10,7 +10,9 @@ This project implements an efficient **Speculative Decoding** inference framewor
### Supported
- **Naive**: Normal decoding mode that uses the speculative decoding code path without generating draft tokens, useful for testing the speculative decoding framework
- **Ngram**: N-gram matching based speculative decoding
- **Suffix Decoding**
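As a rough illustration of the Ngram method above, n-gram speculation proposes draft tokens by matching the tail of the generated sequence against earlier context and copying what followed the last match (a minimal sketch, not FastDeploy's actual kernel; `propose_ngram_draft` is a hypothetical name):

```python
from typing import List

def propose_ngram_draft(tokens: List[int], n: int = 3, num_draft: int = 2) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram against earlier context."""
    if len(tokens) <= n:
        return []
    pattern = tokens[-n:]
    # Search backwards, excluding the trailing occurrence itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == pattern:
            follow = tokens[start + n:start + n + num_draft]
            if follow:
                return follow
    return []

# "1 2 3" appeared earlier followed by "4 5", so those become the draft.
print(propose_ngram_draft([1, 2, 3, 4, 5, 1, 2, 3]))  # [4, 5]
```

The draft is then verified against the target model in a single forward pass, so a wrong guess costs no extra quality, only the discarded tokens.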
@@ -54,12 +56,41 @@ This project implements an efficient **Speculative Decoding** inference framewor
## 🔧 Configuration Parameters
### Basic Parameters
- `method`: The speculative decoding strategy, supports `["mtp", "ngram", "naive", "suffix"]`.
- `naive`: Normal decoding mode using speculative decoding code path without generating draft tokens
- `ngram`: N-gram matching based speculative decoding
- `mtp`: Multi-Token Prediction
- `suffix`: Suffix decoding based speculative decoding
- `num_speculative_tokens`: Number of speculative tokens to generate per round; the maximum is 5 (MTP currently supports only 1).
- `num_model_steps`: MTP model steps, must satisfy `num_speculative_tokens >= num_model_steps`
- `model`: Path to the MTP draft model when using the `"mtp"` method.
- `quantization`: Quantization method of the MTP model (e.g., WINT4).
- Max `batch_size`: 256
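The parameter constraints above can be sketched as a small validation helper (hypothetical; `validate_speculative_config` is not FastDeploy's real API, but the checks mirror the documented rules):

```python
def validate_speculative_config(cfg: dict) -> dict:
    """Illustrative check of the documented speculative-config constraints."""
    cfg = dict(cfg)
    method = cfg.get("method")
    assert method in ("mtp", "ngram", "naive", "suffix"), f"unknown method: {method}"
    n = cfg.get("num_speculative_tokens", 1)
    assert 0 <= n <= 5, "num_speculative_tokens is capped at 5"
    if method == "mtp":
        assert n == 1, "MTP currently supports only 1 speculative token"
        assert cfg.get("model"), "mtp requires a draft model path"
        assert cfg.get("num_model_steps", 1) <= n, "need num_speculative_tokens >= num_model_steps"
    if method == "naive":
        cfg["num_speculative_tokens"] = 0  # naive mode forces this to 0
    return cfg
```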
### Verification Strategy (verify_strategy)
Controls how draft tokens are verified:
- `topp` (default): Top-P sampling verification; the draft token must be in the top-p candidate set
- `greedy`: Greedy verification; the draft token must equal the target model's argmax output
- `target_match`: Target match verification; the draft token must equal the target model's sampled output
```bash
--speculative-config '{"method": "mtp", "verify_strategy": "greedy", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'
```
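As a rough illustration, the three strategies above can be sketched per draft token (simplified Python; the real check runs in a fused CUDA kernel, and `accept_draft` is a hypothetical name):

```python
import random

def accept_draft(draft_token: int, target_probs: list,
                 strategy: str = "topp", top_p: float = 0.8) -> bool:
    """Decide whether one draft token survives verification."""
    if strategy == "greedy":
        # Draft must equal the target model's argmax.
        return draft_token == max(range(len(target_probs)), key=target_probs.__getitem__)
    if strategy == "target_match":
        # Draft must equal a token sampled from the target distribution.
        sampled = random.choices(range(len(target_probs)), weights=target_probs)[0]
        return draft_token == sampled
    if strategy == "topp":
        # Draft must lie inside the target's top-p nucleus.
        order = sorted(range(len(target_probs)), key=lambda i: -target_probs[i])
        nucleus, mass = set(), 0.0
        for i in order:
            nucleus.add(i)
            mass += target_probs[i]
            if mass >= top_p:
                break
        return draft_token in nucleus
    raise ValueError(strategy)
```

`greedy` is the strictest (exact argmax match), while `topp` accepts any token the target model would plausibly have sampled.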
### Accept Policy (accept_policy)
Controls draft token acceptance behavior:
- `normal` (default): Normal verification flow
- `accept_all`: Accept all draft tokens (for debugging)
- `reject_all`: Reject all draft tokens (for debugging)
```bash
--speculative-config '{"method": "mtp", "accept_policy": "accept_all", "num_speculative_tokens": 1}'
```
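A minimal sketch of how `accept_policy` overrides the verification result (hypothetical wrapper; the names follow the options listed above):

```python
def apply_accept_policy(verified: bool, accept_policy: str = "normal") -> bool:
    """Override the per-token verification verdict according to accept_policy."""
    if accept_policy == "accept_all":   # debugging: keep every draft token
        return True
    if accept_policy == "reject_all":   # debugging: discard every draft token
        return False
    return verified                     # "normal": trust the verify_strategy result
```

`accept_all` and `reject_all` are useful for isolating bugs: they pin the acceptance rate at 100% or 0% so framework issues can be separated from verification issues.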
---
## 🚀 Using Multi-Token Prediction (MTP)
@@ -161,7 +192,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--tensor-parallel-size 4 \
--config ${path_to_FastDeploy}benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml \
--speculative-config '{"method": "ngram", "num_speculative_tokens": 1}'
```
@@ -196,3 +227,17 @@ self.suffix_decoding_max_spec_factor: float = 1.0
# The probability threshold for speculated tokens.
self.suffix_decoding_min_token_prob: float = 0.1
```
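One plausible reading of how the suffix-decoding knobs above interact (illustrative only, under the assumption that `max_spec_factor` scales the draft budget with the matched suffix length; the real implementation builds a suffix tree on the CPU):

```python
def draft_budget(match_len: int, max_spec_factor: float, num_speculative_tokens: int) -> int:
    # Longer suffix matches earn a proportionally larger draft budget,
    # capped by num_speculative_tokens.
    return min(int(max_spec_factor * match_len), num_speculative_tokens)

def filter_draft(candidates, min_token_prob: float):
    # Each candidate is (token, estimated_prob); stop at the first token
    # whose estimated probability drops below the threshold.
    draft = []
    for token, prob in candidates:
        if prob < min_token_prob:
            break
        draft.append(token)
    return draft
```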
---
## 📝 Using Naive Mode (Normal Decoding)
Naive mode uses the speculative decoding code path without generating draft tokens, useful for testing the correctness of the speculative decoding framework or establishing performance baselines.
```bash
python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--tensor-parallel-size 4 \
--speculative-config '{"method": "naive", "num_speculative_tokens": 1}'
```
**Note**: In Naive mode, `num_speculative_tokens` will be forced to 0.
@@ -6,7 +6,9 @@
## ✅ Supported Speculative Decoding Methods
### ✅ Supported
- **Naive**: Normal decoding mode that follows the speculative decoding code path without generating draft tokens, useful for verifying the correctness of the speculative decoding framework
- **Ngram**: N-gram matching based speculative decoding
- **Suffix Decoding**
@@ -38,12 +40,39 @@
- **Efficient DraftModel/MTP framework**: Multiple fused CUDA kernels handle the model's pre- and post-processing in one place, making it faster and easier to maintain than conventional loop-and-slice implementations
## 🔧 Configuration Parameters
### Basic Parameters
- `method`: The decoding strategy, one of `"mtp"`, `"ngram"`, `"naive"`, `"suffix"`
- `naive`: Normal decoding mode that follows the speculative decoding code path without generating draft tokens
- `ngram`: N-gram matching based speculative decoding
- `mtp`: Multi-Token Prediction
- `suffix`: Suffix decoding based speculative decoding
- `num_speculative_tokens`: Number of tokens predicted per round; the maximum is 5 (MTP currently supports only 1)
- `num_model_steps`: Number of MTP model steps; must satisfy `num_speculative_tokens >= num_model_steps`
- `model`: Path to the MTP draft model, required when using the `"mtp"` method
- `quantization`: Quantization method of the model; `wint8` is recommended
- `batch_size`: The current maximum supported value is 256
### Verification Strategy (verify_strategy)
Controls how draft tokens are verified:
- `topp` (default): Top-P sampling verification; the draft token must be in the top-p candidate set
- `greedy`: Greedy verification; the draft token must equal the target model's argmax output
- `target_match`: Target match verification; the draft token must equal the target model's sampled output
```bash
--speculative-config '{"method": "mtp", "verify_strategy": "greedy", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'
```
### Accept Policy (accept_policy)
Controls how draft tokens are accepted:
- `normal` (default): Normal verification flow
- `accept_all`: Accept all draft tokens (for debugging)
- `reject_all`: Reject all draft tokens (for debugging)
```bash
--speculative-config '{"method": "mtp", "accept_policy": "accept_all", "num_speculative_tokens": 1}'
```
## 🚀 Using Multi-Token Prediction (MTP)
See the paper: [DeepSeek-V3](https://arxiv.org/pdf/2412.19437)
### TP Parallel Deployment
@@ -133,9 +162,21 @@ python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--tensor-parallel-size 4 \
--config ${path_to_FastDeploy}benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml \
--speculative-config '{"method": "ngram", "num_speculative_tokens": 1}'
```
## 📝 Using Naive Mode (Normal Decoding)
Naive mode follows the speculative decoding code path without generating draft tokens; it is useful for testing the correctness of the speculative decoding framework or establishing a performance baseline.
```bash
python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--tensor-parallel-size 4 \
--speculative-config '{"method": "naive"}'
```
**Note**: In Naive mode, `num_speculative_tokens` is forced to 0.
## 🌲 Using Suffix Decoding
Suffix decoding is a model-free speculation method that predicts draft tokens quickly using an efficient suffix tree on the CPU, accelerating repetitive inference tasks (such as agentic workflows and coding) while adding no GPU overhead for drafting.
@@ -149,7 +190,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--tensor-parallel-size 4 \
--config ${path_to_FastDeploy}benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml \
--speculative-config '{"method": "suffix", "num_speculative_tokens": 4, "suffix_decoding_max_tree_depth": 64, "suffix_decoding_max_cached_requests": 10000, "suffix_decoding_max_spec_factor": 1.0, "suffix_decoding_min_token_prob": 0.1}'
```
Parameter descriptions