[简体中文](../zh/features/speculative_decoding.md)

# 🔮 Speculative Decoding

This project implements an efficient **Speculative Decoding** inference framework based on PaddlePaddle. It supports **Multi-Token Proposing (MTP)** to accelerate large language model (LLM) generation, significantly reducing latency and improving throughput.

---

## ✅ Supported Speculative Decoding Methods

### Supported

- **Naive**: Normal decoding mode that uses the speculative decoding code path without generating draft tokens; useful for testing the speculative decoding framework.
- **Ngram**: N-gram matching based speculative decoding.
- **Suffix Decoding**
- **MTP (Multi-Token Prediction)**
  - ✅ Supported: TP Sharding
  - ✅ Supported: Shared Prefix
  - ✅ Supported: TP Sharding + PD Separation
  - ⏳ Coming Soon: EP + DP + PD Separation
  - ⏳ Coming Soon: Chunked Prefill support
  - ⏳ Coming Soon: Multi-layer MTP
- **Decoding with Hybrid MTP and Ngram Methods (Hybrid-MTP-with-Ngram)**
  - Overview: A hybrid method combining MTP and Ngram. First, MTP generates N draft tokens, then Ngram matching is used to supplement additional draft tokens.
  - Use Cases: Suitable when higher draft-token coverage is required, leveraging both MTP’s generation capability and the efficiency of Ngram matching.

---

### Coming Soon

- Draft Model
- Eagle
- Hydra
- Medusa
- ...

---

## ⚙️ Efficient Speculative Decoding Architecture

- **Attention Mechanism**: We employ [Cascade Append Attention](https://flashinfer.ai/2024/02/02/cascade-inference.html), which allows unified processing of queries with varying token lengths, enabling efficient verification. All draft tokens can be verified in a single forward pass. We deeply customized the underlying kernels to fully leverage Tensor Cores and maintain high throughput even under heavy concurrency.
- **Virtual Padding Mechanism**: A virtual padding strategy is used to locate output token batch IDs, eliminating the overhead of data copying and slicing operations (see the sketch after this list).
- **Parallel Sampling and Verification**: We developed multiple fused CUDA kernels for concurrent sampling and verification. These kernels allow parallel processing for each sample in a batch, avoiding explicit loop execution on the host side.
- **Efficient Draft Model/MTP Framework**: Multiple fused CUDA kernels are used to handle pre- and post-processing within the model class, replacing traditional loop-based and slicing-based methods with a more performant and maintainable structure.
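
Below is a minimal, self-contained sketch of the virtual-padding idea. It is illustrative only: the buffer layout and names such as `accept_tokens` / `accept_num` are hypothetical and are not FastDeploy's actual data structures or kernels. The point is that each request owns a fixed-stride slot in the output buffer, so a request's accepted tokens are located purely by index arithmetic, with no gather, slice, or host-side copy.

```python
import numpy as np

# Hypothetical example values (not FastDeploy internals).
num_speculative_tokens = 3
batch_size = 4

# Fixed-stride output buffer: one padded row per request.
accept_tokens = np.full((batch_size, num_speculative_tokens + 1), -1, dtype=np.int64)
accept_num = np.zeros(batch_size, dtype=np.int64)

# Pretend verification accepted a different number of tokens per request.
for b, toks in enumerate([[11, 12], [21], [31, 32, 33, 34], [41, 42, 43]]):
    accept_tokens[b, : len(toks)] = toks
    accept_num[b] = len(toks)

# Locating request b's output needs only (row index, count); the padded slots
# beyond accept_num[b] are simply ignored -- no concatenation or copy required.
for b in range(batch_size):
    print(b, accept_tokens[b, : accept_num[b]].tolist())
```
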
---

## 🔧 Configuration Parameters

### Basic Parameters

- `method`: The speculative decoding strategy; supported values are `["mtp", "ngram", "naive", "suffix"]`.
  - `naive`: Normal decoding mode using the speculative decoding code path without generating draft tokens
  - `ngram`: N-gram matching based speculative decoding
  - `mtp`: Multi-Token Prediction
  - `suffix`: Suffix decoding based speculative decoding
- `num_speculative_tokens`: Number of speculative tokens to generate; the maximum is 5, and MTP currently supports only 1.
- `num_model_steps`: Number of MTP model steps; must satisfy `num_speculative_tokens >= num_model_steps`.
- `model`: Path to the MTP draft model when using the `"mtp"` method.
- `quantization`: Quantization method of the MTP model (e.g., WINT4).
- Maximum supported `batch_size`: 256

### Verification Strategy (`verify_strategy`)

Controls how draft tokens are verified:

- `topp` (default): Top-P sampling verification; the draft token must fall within the target model's top-p candidate set
- `greedy`: Greedy verification; the draft token must equal the target model's argmax output
- `target_match`: Target-match verification; the draft token must equal the target model's sampled output

```bash
--speculative-config '{"method": "mtp", "verify_strategy": "greedy", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'
```
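
For intuition, here is a rough conceptual sketch of how the three strategies differ when deciding whether to accept a single draft token. This is illustrative Python only, not the fused CUDA verification kernels; `accepts`, `target_logits`, and `draft_token` are hypothetical names. In practice, draft tokens are checked position by position and the first rejection truncates the accepted prefix.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def accepts(draft_token, target_logits, strategy, top_p=0.8, rng=np.random.default_rng(0)):
    """Return True if verification accepts the draft token under the given strategy."""
    probs = softmax(target_logits)
    if strategy == "greedy":
        # Draft token must equal the target model's argmax output.
        return draft_token == int(np.argmax(probs))
    if strategy == "target_match":
        # Draft token must equal the token the target model actually samples.
        return draft_token == int(rng.choice(len(probs), p=probs))
    if strategy == "topp":
        # Draft token must fall inside the target model's top-p candidate set.
        order = np.argsort(probs)[::-1]
        cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
        return draft_token in set(order[:cutoff].tolist())
    raise ValueError(f"unknown strategy: {strategy}")

logits = np.array([2.0, 1.0, 0.5, -1.0])
for s in ("topp", "greedy", "target_match"):
    print(s, accepts(draft_token=1, target_logits=logits, strategy=s))
```
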
### Accept Policy (`accept_policy`)

Controls draft token acceptance behavior:

- `normal` (default): Normal verification flow
- `accept_all`: Accept all draft tokens (for debugging)
- `reject_all`: Reject all draft tokens (for debugging)

```bash
--speculative-config '{"method": "mtp", "accept_policy": "accept_all", "num_speculative_tokens": 1}'
```

---

## 🚀 Using Multi-Token Prediction (MTP)

For detailed theory, refer to:
📄 [DeepSeek-V3 Paper](https://arxiv.org/pdf/2412.19437)

### TP Sharding Mode

Launch the service on 4 × H100 GPUs using WINT4 quantization (Dense: WINT8, MoE: WINT4):

> Config file: `benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml`

```bash
python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--tensor-parallel-size 4 \
--config ${path_to_FastDeploy}/benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'
```

### PD-Separated Deployment (1P1D Mode)

Deploy 1P1D on H100 with both Prefill (P) and Decode (D) nodes, using TP4 + WINT4 quantization.
This deployment only requires changing the config file and adding the `--speculative-config` option.
For details, refer to the [PD Separation](./disaggregated.md) documentation.

- P Node (Prefill)

> Config file: `benchmarks/yaml/eb45t-32k-wint4-mtp-tp4-prefill.yaml`

```bash
export FD_LOG_DIR="log_prefill"
rm -rf ${FD_LOG_DIR}
export CUDA_VISIBLE_DEVICES=0,1,2,3

python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--port 8180 \
--metrics-port 8181 \
--engine-worker-queue-port 8182 \
--cache-queue-port 8183 \
--workers 2 \
--tensor-parallel-size 4 \
--quantization wint4 \
--splitwise-role "prefill" \
--scheduler-name "splitwise" \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-ttl 9000 \
--scheduler-topic mtp \
--config ${path_to_FastDeploy}/benchmarks/yaml/eb45t-32k-wint4-mtp-tp4-prefill.yaml \
--scheduler-password "scheduler_mtp" \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}' &
```

- D Node (Decode)

> Config file: `benchmarks/yaml/eb45t-32k-wint4-mtp-tp4-decode.yaml`

```bash
export FD_LOG_DIR="log_decode"
rm -rf ${FD_LOG_DIR}
export CUDA_VISIBLE_DEVICES=0,1,2,3

python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--port 8190 \
--metrics-port 8191 \
--engine-worker-queue-port 8192 \
--cache-queue-port 8193 \
--workers 2 \
--tensor-parallel-size 4 \
--quantization wint4 \
--splitwise-role "decode" \
--scheduler-name "splitwise" \
--scheduler-host "127.0.0.1" \
--scheduler-port 6379 \
--scheduler-ttl 9000 \
--scheduler-topic mtp \
--config ${path_to_FastDeploy}/benchmarks/yaml/eb45t-32k-wint4-mtp-tp4-decode.yaml \
--scheduler-password "scheduler_mtp" \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}' &
```

## Decoding with Hybrid MTP and Ngram Methods

When starting the service, you only need to modify the `--speculative-config` option.
For example, use MTP to generate two draft tokens, then append three additional draft tokens from Ngram matching:

```bash
--speculative-config '{"method": "mtp", "num_model_steps": 2, "mtp_strategy": "with_ngram", "num_speculative_tokens": 5, "model": "'$model_path'/mtp"}'
```

## 🧠 Using Ngram-Based Decoding

This method uses an n-gram sliding window to match the prompt and previously generated tokens and predict draft tokens (see the conceptual sketch below). It is particularly effective in scenarios with high input-output overlap (e.g., code completion, document search).
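
The sketch below illustrates the core idea in plain Python. It is conceptual only; `propose_ngram_drafts` is a hypothetical helper, not a FastDeploy API. The last few tokens of the context serve as an n-gram key, its most recent earlier occurrence is located, and the tokens that followed it are proposed as drafts.

```python
def propose_ngram_drafts(context_tokens, ngram_size=3, num_speculative_tokens=4):
    """Conceptual sketch: find the most recent earlier occurrence of the last
    `ngram_size` tokens and propose the tokens that followed it as drafts."""
    if len(context_tokens) < ngram_size:
        return []
    key = tuple(context_tokens[-ngram_size:])
    # Scan backwards through the prompt + generated tokens for a matching n-gram.
    for start in range(len(context_tokens) - ngram_size - 1, -1, -1):
        if tuple(context_tokens[start : start + ngram_size]) == key:
            follow = context_tokens[start + ngram_size : start + ngram_size + num_speculative_tokens]
            return list(follow)
    return []

# Example: the window [5, 6, 7] appeared earlier, so [8, 9, 2, 3] is proposed.
print(propose_ngram_drafts([1, 2, 3, 5, 6, 7, 8, 9, 2, 3, 5, 6, 7]))
```
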
Run on 4 × H100 GPUs with WINT4 quantization:

> Config file: `benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml`

```bash
python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--tensor-parallel-size 4 \
--config ${path_to_FastDeploy}/benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml \
--speculative-config '{"method": "ngram", "num_speculative_tokens": 1}'
```

## 🌲 Using Suffix Decoding

Suffix Decoding is a model-free speculative decoding method that accelerates repetitive inference tasks (e.g., agent workflows, coding) by using efficient CPU-based suffix trees for rapid draft-token prediction, eliminating GPU overhead.

Run on 4 × H100 GPUs with WINT4 quantization:

> Config file: `benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml`

```bash
python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--tensor-parallel-size 4 \
--config ${path_to_FastDeploy}/benchmarks/yaml/eb45t-32k-wint4-mtp-h100-tp4.yaml \
--speculative-config '{"method": "suffix", "num_speculative_tokens": 4, "suffix_decoding_max_tree_depth": 64, "suffix_decoding_max_cached_requests": 10000, "suffix_decoding_max_spec_factor": 1.0, "suffix_decoding_min_token_prob": 0.1}'
```

**Parameter Descriptions**

```python
# The maximum length of token sequences cached in suffix trees.
self.suffix_decoding_max_tree_depth: int = 64

# The maximum number of requests that can be stored in the cache.
self.suffix_decoding_max_cached_requests: int = -1

# The factor applied to the matched length: num_draft_tokens = suffix_decoding_max_spec_factor * matched_length
self.suffix_decoding_max_spec_factor: float = 1.0

# The probability threshold for speculated tokens.
self.suffix_decoding_min_token_prob: float = 0.1
```
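
As a worked example of the formula above: with `suffix_decoding_max_spec_factor = 1.0` and a matched suffix of length 8, up to `1.0 × 8 = 8` draft tokens may be proposed, and candidates whose estimated probability falls below `suffix_decoding_min_token_prob` are filtered out.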

---

## 📝 Using Naive Mode (Normal Decoding)

Naive mode uses the speculative decoding code path without generating draft tokens; it is useful for testing the correctness of the speculative decoding framework or establishing performance baselines.

```bash
python -m fastdeploy.entrypoints.openai.api_server \
--model ${path_to_main_model} \
--tensor-parallel-size 4 \
--speculative-config '{"method": "naive", "num_speculative_tokens": 1}'
```

**Note**: In Naive mode, `num_speculative_tokens` will be forced to 0.