[Speculative Decoding] Unify Spec and non-spec branch (#6685)

* optimize spec-inference architecture

* delete debug log

* optimize spec_method usage  && fix unit_test

* add claude unit-test skill

* fix some ugly bug

* enhance robustness and bounds check

* unify method & spec_method to method to avoid bug

* activate CI

* fix unit test

* Unify logprobs computation for naive and speculative decoding, fix CUDA kernel

* fix logprob bug && optimize verify kernel

* fix exist_decode() judge
This commit is contained in:
freeliuzc
2026-03-11 14:58:44 +08:00
committed by GitHub
parent b6190de557
commit cf7934a4b2
41 changed files with 3428 additions and 392 deletions
@@ -10,7 +10,9 @@ This project implements an efficient **Speculative Decoding** inference framewor
### Supported
- **Naive**: Normal decoding mode that uses the speculative decoding code path without generating draft tokens, useful for testing the speculative decoding framework
- **Ngram**: N-gram matching based speculative decoding
- **Suffix Decoding**
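As a rough illustration of the Ngram method above, n-gram speculation proposes draft tokens by matching the tail of the generated sequence against earlier context and copying what followed the last match (a minimal sketch, not FastDeploy's actual kernel; `propose_ngram_draft` is a hypothetical name):

```python
from typing import List

def propose_ngram_draft(tokens: List[int], n: int = 3, num_draft: int = 2) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram against earlier context."""
    if len(tokens) <= n:
        return []
    pattern = tokens[-n:]
    # Search backwards, excluding the trailing occurrence itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == pattern:
            follow = tokens[start + n:start + n + num_draft]
            if follow:
                return follow
    return []

# "1 2 3" appeared earlier followed by "4 5", so those become the draft.
print(propose_ngram_draft([1, 2, 3, 4, 5, 1, 2, 3]))  # [4, 5]
```

The draft is then verified against the target model in a single forward pass, so a wrong guess costs no extra quality, only the discarded tokens.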
@@ -54,12 +56,41 @@ This project implements an efficient **Speculative Decoding** inference framewor
## 🔧 Configuration Parameters
### Basic Parameters
- `method`: The speculative decoding strategy, supports `["mtp", "ngram", "naive", "suffix"]`.
- `naive`: Normal decoding mode using speculative decoding code path without generating draft tokens
- `ngram`: N-gram matching based speculative decoding
- `mtp`: Multi-Token Prediction
- `suffix`: Suffix decoding based speculative decoding
- `num_speculative_tokens`: Number of speculative tokens to generate per round; the maximum is 5 (MTP currently supports only 1).
- `num_model_steps`: MTP model steps, must satisfy `num_speculative_tokens >= num_model_steps`
- `model`: Path to the MTP draft model when using the `"mtp"` method.
- `quantization`: Quantization method of the MTP model (e.g., WINT4).
- Max `batch_size`: 256
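The parameter constraints above can be sketched as a small validation helper (hypothetical; `validate_speculative_config` is not FastDeploy's real API, but the checks mirror the documented rules):

```python
def validate_speculative_config(cfg: dict) -> dict:
    """Illustrative check of the documented speculative-config constraints."""
    cfg = dict(cfg)
    method = cfg.get("method")
    assert method in ("mtp", "ngram", "naive", "suffix"), f"unknown method: {method}"
    n = cfg.get("num_speculative_tokens", 1)
    assert 0 <= n <= 5, "num_speculative_tokens is capped at 5"
    if method == "mtp":
        assert n == 1, "MTP currently supports only 1 speculative token"
        assert cfg.get("model"), "mtp requires a draft model path"
        assert cfg.get("num_model_steps", 1) <= n, "need num_speculative_tokens >= num_model_steps"
    if method == "naive":
        cfg["num_speculative_tokens"] = 0  # naive mode forces this to 0
    return cfg
```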
### Verification Strategy (verify_strategy)
Controls how draft tokens are verified:
- `topp` (default): Top-P sampling verification; the draft token must be in the top-p candidate set
- `greedy`: Greedy verification; the draft token must equal the target model's argmax output
- `target_match`: Target match verification; the draft token must equal the target model's sampled output
```bash
--speculative-config '{"method": "mtp", "verify_strategy": "greedy", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'
```
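As a rough illustration, the three strategies above can be sketched per draft token (simplified Python; the real check runs in a fused CUDA kernel, and `accept_draft` is a hypothetical name):

```python
import random

def accept_draft(draft_token: int, target_probs: list,
                 strategy: str = "topp", top_p: float = 0.8) -> bool:
    """Decide whether one draft token survives verification."""
    if strategy == "greedy":
        # Draft must equal the target model's argmax.
        return draft_token == max(range(len(target_probs)), key=target_probs.__getitem__)
    if strategy == "target_match":
        # Draft must equal a token sampled from the target distribution.
        sampled = random.choices(range(len(target_probs)), weights=target_probs)[0]
        return draft_token == sampled
    if strategy == "topp":
        # Draft must lie inside the target's top-p nucleus.
        order = sorted(range(len(target_probs)), key=lambda i: -target_probs[i])
        nucleus, mass = set(), 0.0
        for i in order:
            nucleus.add(i)
            mass += target_probs[i]
            if mass >= top_p:
                break
        return draft_token in nucleus
    raise ValueError(strategy)
```

`greedy` is the strictest (exact argmax match), while `topp` accepts any token the target model would plausibly have sampled.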
### Accept Policy (accept_policy)
Controls draft token acceptance behavior:
- `normal` (default): Normal verification flow
- `accept_all`: Accept all draft tokens (for debugging)
- `reject_all`: Reject all draft tokens (for debugging)
```bash
--speculative-config '{"method": "mtp", "accept_policy": "accept_all", "num_speculative_tokens": 1}'
```
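A minimal sketch of how `accept_policy` overrides the verification result (hypothetical wrapper; the names follow the options listed above):

```python
def apply_accept_policy(verified: bool, accept_policy: str = "normal") -> bool:
    """Override the per-token verification verdict according to accept_policy."""
    if accept_policy == "accept_all":   # debugging: keep every draft token
        return True
    if accept_policy == "reject_all":   # debugging: discard every draft token
        return False
    return verified                     # "normal": trust the verify_strategy result
```

`accept_all` and `reject_all` are useful for isolating bugs: they pin the acceptance rate at 100% or 0% so framework issues can be separated from verification issues.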
---
## 🚀 Using Multi-Token Prediction (MTP)
@@ -161,7 +192,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--tensor-parallel-size 4 \
--config ${path_to_FastDeploy}benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml \
--speculative-config '{"method": "ngram", "num_speculative_tokens": 1}'
```
@@ -196,3 +227,17 @@ self.suffix_decoding_max_spec_factor: float = 1.0
# The probability threshold for speculated tokens.
self.suffix_decoding_min_token_prob: float = 0.1
```
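One plausible reading of how the suffix-decoding knobs above interact (illustrative only, under the assumption that `max_spec_factor` scales the draft budget with the matched suffix length; the real implementation builds a suffix tree on the CPU):

```python
def draft_budget(match_len: int, max_spec_factor: float, num_speculative_tokens: int) -> int:
    # Longer suffix matches earn a proportionally larger draft budget,
    # capped by num_speculative_tokens.
    return min(int(max_spec_factor * match_len), num_speculative_tokens)

def filter_draft(candidates, min_token_prob: float):
    # Each candidate is (token, estimated_prob); stop at the first token
    # whose estimated probability drops below the threshold.
    draft = []
    for token, prob in candidates:
        if prob < min_token_prob:
            break
        draft.append(token)
    return draft
```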
---
## 📝 Using Naive Mode (Normal Decoding)
Naive mode uses the speculative decoding code path without generating draft tokens, useful for testing the correctness of the speculative decoding framework or establishing performance baselines.
```bash
python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--tensor-parallel-size 4 \
--speculative-config '{"method": "naive", "num_speculative_tokens": 1}'
```
**Note**: In Naive mode, `num_speculative_tokens` will be forced to 0.
@@ -6,7 +6,9 @@
## ✅ Supported Speculative Decoding Methods
### ✅ Supported
- **Naive**: Normal decoding mode that follows the speculative decoding code path without generating draft tokens, useful for verifying the correctness of the speculative decoding framework
- **Ngram**: N-gram matching based speculative decoding
- **Suffix Decoding**
@@ -38,12 +40,39 @@
- **Efficient DraftModel/MTP framework**: Multiple fused CUDA kernels handle the model's pre- and post-processing in one place, making it faster and easier to maintain than conventional loop-and-slice implementations
## 🔧 Configuration Parameters
### Basic Parameters
- `method`: The decoding strategy, one of `"mtp"`, `"ngram"`, `"naive"`, `"suffix"`
- `naive`: Normal decoding mode that follows the speculative decoding code path without generating draft tokens
- `ngram`: N-gram matching based speculative decoding
- `mtp`: Multi-Token Prediction
- `suffix`: Suffix decoding based speculative decoding
- `num_speculative_tokens`: Number of tokens predicted per round; the maximum is 5 (MTP currently supports only 1)
- `num_model_steps`: Number of MTP model steps; must satisfy `num_speculative_tokens >= num_model_steps`
- `model`: Path to the MTP draft model, required when using the `"mtp"` method
- `quantization`: Quantization method of the model; `wint8` is recommended
- `batch_size`: The current maximum supported value is 256
### Verification Strategy (verify_strategy)
Controls how draft tokens are verified:
- `topp` (default): Top-P sampling verification; the draft token must be in the top-p candidate set
- `greedy`: Greedy verification; the draft token must equal the target model's argmax output
- `target_match`: Target match verification; the draft token must equal the target model's sampled output
```bash
--speculative-config '{"method": "mtp", "verify_strategy": "greedy", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'
```
### Accept Policy (accept_policy)
Controls how draft tokens are accepted:
- `normal` (default): Normal verification flow
- `accept_all`: Accept all draft tokens (for debugging)
- `reject_all`: Reject all draft tokens (for debugging)
```bash
--speculative-config '{"method": "mtp", "accept_policy": "accept_all", "num_speculative_tokens": 1}'
```
## 🚀 Using Multi-Token Prediction (MTP)
See the paper: [DeepSeek-V3](https://arxiv.org/pdf/2412.19437)
### TP Parallel Deployment
@@ -133,9 +162,21 @@ python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--tensor-parallel-size 4 \
--config ${path_to_FastDeploy}benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml \
--speculative-config '{"method": "ngram", "num_speculative_tokens": 1}'
```
## 📝 Using Naive Mode (Normal Decoding)
Naive mode follows the speculative decoding code path without generating draft tokens; it is useful for testing the correctness of the speculative decoding framework or establishing a performance baseline.
```bash
python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--tensor-parallel-size 4 \
--speculative-config '{"method": "naive"}'
```
**Note**: In Naive mode, `num_speculative_tokens` is forced to 0.
## 🌲 Using Suffix Decoding
Suffix decoding is a model-free speculation method that predicts draft tokens quickly using an efficient suffix tree on the CPU, accelerating repetitive inference tasks (such as agentic workflows and coding) while adding no GPU overhead for drafting.
@@ -149,7 +190,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--tensor-parallel-size 4 \
--config ${path_to_FastDeploy}benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml \
--speculative-config '{"method": "suffix", "num_speculative_tokens": 4, "suffix_decoding_max_tree_depth": 64, "suffix_decoding_max_cached_requests": 10000, "suffix_decoding_max_spec_factor": 1.0, "suffix_decoding_min_token_prob": 0.1}'
```
Parameter descriptions