mirror of
https://github.com/PaddlePaddle/FastDeploy.git
synced 2026-04-23 00:17:25 +08:00
[Speculative Decoding] Unify Spec and non-spec branch (#6685)
* optimize spec-inference architecture
* delete debug log
* optimize spec_method usage && fix unit_test
* add claude unit-test skill
* fix some ugly bug
* enhance robustness and bounds check
* unify method & spec_method to method to avoid bug
* activate CI
* fix unit test
* Unify logprobs computation for naive and speculative decoding, fix CUDA kernel
* fix logprob bug && optimize verify kernel
* fix exist_decode() judge
@@ -10,7 +10,9 @@ This project implements an efficient **Speculative Decoding** inference framewor

### Supported

- **Ngram**
- **Naive**: Normal decoding mode that uses the speculative decoding code path without generating draft tokens, useful for testing the speculative decoding framework
- **Ngram**: N-gram matching based speculative decoding
- **Suffix Decoding**
@@ -54,12 +56,41 @@ This project implements an efficient **Speculative Decoding** inference framewor

## 🔧 Configuration Parameters

- `method`: The speculative decoding strategy, currently supports `["mtp", "ngram", "suffix"]`.

### Basic Parameters

- `method`: The speculative decoding strategy, supports `["mtp", "ngram", "naive", "suffix"]`.
  - `naive`: Normal decoding mode using the speculative decoding code path without generating draft tokens
  - `ngram`: N-gram matching based speculative decoding
  - `mtp`: Multi-Token Prediction
  - `suffix`: Suffix decoding based speculative decoding
- `num_speculative_tokens`: Number of speculative tokens to generate; max is 5, currently MTP supports only 1.
- `num_model_steps`: MTP model steps, must satisfy `num_speculative_tokens >= num_model_steps`
- `model`: Path to the MTP draft model when using the `"mtp"` method.
- `quantization`: Quantization method of the MTP model (e.g., WINT4).
- Max `batch_size`: 256
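The constraints above can be summarized as a small validator. This is an illustrative sketch only, not FastDeploy's actual validation code; the function name and error messages are hypothetical.

```python
# Hypothetical validator mirroring the documented constraints.
# FastDeploy performs its own checks internally; names here are illustrative.

VALID_METHODS = {"mtp", "ngram", "naive", "suffix"}
MAX_SPECULATIVE_TOKENS = 5

def validate_speculative_config(cfg: dict) -> None:
    method = cfg.get("method")
    if method not in VALID_METHODS:
        raise ValueError(f"method must be one of {sorted(VALID_METHODS)}")
    tokens = cfg.get("num_speculative_tokens", 1)
    if not 0 <= tokens <= MAX_SPECULATIVE_TOKENS:
        raise ValueError("num_speculative_tokens must be in [0, 5]")
    if method == "mtp":
        # MTP needs a draft model path, and currently supports one token only
        if not cfg.get("model"):
            raise ValueError("the mtp method requires a draft model path in 'model'")
        if tokens > 1:
            raise ValueError("MTP currently supports only 1 speculative token")
        if tokens < cfg.get("num_model_steps", 1):
            raise ValueError("num_speculative_tokens must be >= num_model_steps")

validate_speculative_config({"method": "ngram", "num_speculative_tokens": 3})  # passes silently
```

The JSON passed via `--speculative-config` would satisfy these checks after `json.loads`.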
### Verification Strategy (verify_strategy)

Controls how draft tokens are verified:

- `topp` (default): Top-P sampling verification, draft token must be in top-p candidate set
- `greedy`: Greedy verification, draft token must equal target model's argmax output
- `target_match`: Target match verification, draft token must equal target model's sampled output

```bash
--speculative-config '{"method": "mtp", "verify_strategy": "greedy", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'
```
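As a rough illustration of how the three strategies differ, here is a toy per-token check. The function name and signature are invented for illustration; the actual verification runs in fused CUDA kernels.

```python
def verify_draft(draft_token, probs, sampled_token, strategy="topp", top_p=0.8):
    """Toy per-token check illustrating the three verify strategies.
    probs: the target model's probability for each vocab id at this position."""
    if strategy == "greedy":
        # accept only if the draft equals the target model's argmax token
        return draft_token == max(range(len(probs)), key=probs.__getitem__)
    if strategy == "target_match":
        # accept only if the draft equals the token the target model sampled
        return draft_token == sampled_token
    if strategy == "topp":
        # accept if the draft is in the smallest set of tokens whose
        # cumulative probability reaches top_p
        ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
        cum, top_set = 0.0, set()
        for tok in ranked:
            top_set.add(tok)
            cum += probs[tok]
            if cum >= top_p:
                break
        return draft_token in top_set
    raise ValueError(f"unknown strategy: {strategy}")
```

Note that `topp` is the most permissive of the three: any token inside the top-p set passes, not just the argmax or the sampled token.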
### Accept Policy (accept_policy)

Controls draft token acceptance behavior:

- `normal` (default): Normal verification flow
- `accept_all`: Accept all draft tokens (for debugging)
- `reject_all`: Reject all draft tokens (for debugging)

```bash
--speculative-config '{"method": "mtp", "accept_policy": "accept_all", "num_speculative_tokens": 1}'
```
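The interplay between per-token verification and the accept policy can be sketched as below. Standard speculative decoding keeps the longest verified prefix of the draft, which is what `normal` is assumed to mean here; the function name is illustrative, not FastDeploy's API.

```python
def count_accepted(verify_results, policy="normal"):
    """Number of draft tokens kept, given per-token verification outcomes."""
    if policy == "accept_all":   # debugging: keep every draft token
        return len(verify_results)
    if policy == "reject_all":   # debugging: discard every draft token
        return 0
    # "normal": keep the prefix of drafts up to the first verification failure
    accepted = 0
    for ok in verify_results:
        if not ok:
            break
        accepted += 1
    return accepted
```

`accept_all` and `reject_all` are useful for isolating bugs: they pin the acceptance rate at 100% or 0% regardless of what the verifier reports.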
---

## 🚀 Using Multi-Token Prediction (MTP)

@@ -161,7 +192,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--tensor-parallel-size 4 \
--config ${path_to_FastDeploy}benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}'
--speculative-config '{"method": "ngram", "num_speculative_tokens": 1}'
```
@@ -196,3 +227,17 @@ self.suffix_decoding_max_spec_factor: float = 1.0
# The probability threshold for speculated tokens.
self.suffix_decoding_min_token_prob: float = 0.1
```

---
## 📝 Using Naive Mode (Normal Decoding)

Naive mode uses the speculative decoding code path without generating draft tokens, useful for testing the correctness of the speculative decoding framework or establishing performance baselines.

```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model ${path_to_main_model} \
    --tensor-parallel-size 4 \
    --speculative-config '{"method": "naive", "num_speculative_tokens": 1}'
```

**Note**: In Naive mode, `num_speculative_tokens` will be forced to 0.
@@ -6,7 +6,9 @@

## ✅ Supported Speculative Decoding Methods

### ✅ Supported List

- **Ngram**
- **Naive**: Normal decoding mode that follows the speculative decoding code path without generating draft tokens, used to test the correctness of the speculative decoding framework
- **Ngram**: N-gram matching based speculative decoding
- **Suffix Decoding**
@@ -38,12 +40,39 @@

- **Efficient DraftModel/MTP framework**: multiple fused CUDA kernels handle the pre- and post-processing of the model methods in a unified way; compared with conventional loop-and-slice implementations, this is faster and easier to maintain

## 🔧 Parameters

- `method`: Decoding strategy, one of `"mtp"`, `"ngram"`, or `"suffix"`

### Basic Parameters

- `method`: Decoding strategy, one of `"mtp"`, `"ngram"`, `"naive"`, or `"suffix"`
  - `naive`: Normal decoding mode that follows the speculative decoding code path without generating draft tokens
  - `ngram`: N-gram matching based speculative decoding
  - `mtp`: Multi-Token Prediction
  - `suffix`: Suffix decoding based speculative decoding
- `num_speculative_tokens`: Number of tokens predicted per round; at most 5 (MTP currently supports only 1)
- `num_model_steps`: Number of MTP model steps; must satisfy `num_speculative_tokens >= num_model_steps`
- `model`: Path to the MTP model, required when the MTP method is selected
- `quantization`: Model quantization method; `wint8` is recommended
- `batch_size`: Maximum currently supported is 256
### Verification Strategy (verify_strategy)

Controls how draft tokens are verified:

- `topp` (default): Top-P sampling verification; the draft token must be in the Top-P candidate set
- `greedy`: Greedy verification; the draft token must equal the target model's argmax output
- `target_match`: Target match verification; the draft token must equal the target model's sampled output

```bash
--speculative-config '{"method": "mtp", "verify_strategy": "greedy", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'
```
### Accept Policy (accept_policy)

Controls how draft tokens are accepted:

- `normal` (default): Normal verification flow
- `accept_all`: Accept all draft tokens (for debugging)
- `reject_all`: Reject all draft tokens (for debugging)

```bash
--speculative-config '{"method": "mtp", "accept_policy": "accept_all", "num_speculative_tokens": 1}'
```
## 🚀 Using Multi-Token-Prediction (MTP) Decoding

See the paper for details: [DeepSeek-V3](https://arxiv.org/pdf/2412.19437)

### TP Parallel Deployment

@@ -133,9 +162,21 @@ python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--tensor-parallel-size 4 \
--config ${path_to_FastDeploy}benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml \
--speculative-config '{"method": "ngram", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}'
--speculative-config '{"method": "ngram", "num_speculative_tokens": 1}'
```
## 📝 Using Naive Mode (Normal Decoding)

Naive mode follows the speculative decoding code path without generating draft tokens, used to test the correctness of the speculative decoding framework or to establish a performance baseline.

```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model ${path_to_main_model} \
    --tensor-parallel-size 4 \
    --speculative-config '{"method": "naive"}'
```

**Note**: In Naive mode, `num_speculative_tokens` is forced to 0.
## 🌲 Using Suffix Decoding

Suffix decoding is a model-free speculation framework that accelerates repetitive inference tasks (such as agentic workflows and coding) by using an efficient suffix tree on the CPU for fast draft token prediction, eliminating GPU overhead.
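A minimal sketch of the idea, with a linear scan standing in for the real suffix tree (names and the exact matching rule are illustrative): find the longest suffix of the current output that also occurs in previously seen text, and propose the tokens that followed it there.

```python
def propose_suffix_draft(history, generated, max_depth=64, max_draft=4):
    """Toy suffix-match speculation: locate the longest suffix of `generated`
    (up to max_depth tokens) inside `history`, then propose up to max_draft
    tokens that followed that match. A real implementation keeps a suffix
    tree on the CPU instead of this O(n*m) scan."""
    longest = min(max_depth, len(generated))
    for depth in range(longest, 0, -1):          # try longer suffixes first
        pattern = generated[-depth:]
        for i in range(len(history) - depth + 1):
            if history[i:i + depth] == pattern:
                follow = history[i + depth:i + depth + max_draft]
                if follow:                       # propose what followed the match
                    return follow
    return []  # no match: fall back to normal decoding for this step
```

`max_depth` and `max_draft` loosely correspond to `suffix_decoding_max_tree_depth` and `num_speculative_tokens` in the configuration below.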
@@ -149,7 +190,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--tensor-parallel-size 4 \
--config ${path_to_FastDeploy}benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 4, "suffix_decoding_max_tree_depth": 64, "suffix_decoding_max_cached_requests": 10000, "suffix_decoding_max_spec_factor": 1.0, "suffix_decoding_min_token_prob": 0.1}'
--speculative-config '{"method": "suffix", "num_speculative_tokens": 4, "suffix_decoding_max_tree_depth": 64, "suffix_decoding_max_cached_requests": 10000, "suffix_decoding_max_spec_factor": 1.0, "suffix_decoding_min_token_prob": 0.1}'
```

Parameter description