[Bugfix] Align thinking_budget behavior with ERNIE reasoning flow (#6934)

* [Bugfix] Align thinking_budget behavior with ERNIE reasoning flow

* [Docs] Fix thinking_budget markdown formatting

* [Test] Align ernie thinking budget test with process_request_dict
jackyYang6 committed 2026-03-23 14:15:55 +08:00 (committed by GitHub)
parent 7a78001be2
commit 634d23a38a
10 changed files with 663 additions and 285 deletions
+28 -12
@@ -2,7 +2,9 @@
 ## Overview
-`ThinkingBudgetLogitsProcessor` limits the number of tokens generated inside the `<think> ... </think>` segment. When the budget is reached, it forces a line break token and then the `</think>` token to terminate the thinking section.
+`ThinkingBudgetLogitsProcessor` limits the number of tokens generated inside the `<think> ... </think>`
+segment. When the budget is reached, it terminates thinking by forcing `</think>`. If
+`think_stop_sentence` is configured, it forces the custom sentence first and then `</think>`.
 ## When to Use
@@ -11,19 +13,22 @@
 ## How It Works
-1. **CPU precompute (DataProcessor)**: when a request includes `thinking_budget`, the prompt token ids are scanned to determine whether thinking has started, whether it already ended, and how many tokens are already inside the thinking section.
+1. **Request-side precompute (DataProcessor)**: when a request includes `thinking_budget`, the prompt token ids are scanned to determine whether thinking has started, whether it already ended, and how many tokens are already inside the thinking section.
 2. **Per-step update**: during decoding, the processor tracks `last_token_id` and `tokens_after_start`.
-3. **Budget enforcement**: once the budget is reached, it forces a line break and then the thinking end token.
+3. **Budget enforcement**: once the budget is reached, it forces `</think>` directly. If `think_stop_sentence`
+   is configured, it forces that sentence first and then `</think>`.
 ## Requirements
-- The model must provide valid token ids for `think_start_id`, `think_end_id`, and `line_break_id` (via `ModelConfig`).
-- If any of these ids are invalid, the processor is disabled and `thinking_budget` will not take effect.
+- The model must provide valid token ids for `think_start_id` and `think_end_id` (via `ModelConfig`).
+- If either of these ids is invalid, the processor is disabled and `thinking_budget` will not take effect.
 ## Request Parameters
-- `thinking_budget` (int, required to enable): maximum number of tokens after `<think>` before forced termination.
-- `think_stop_sentence` (string, optional): a stop sentence that will be tokenized on the CPU side and enforced near the budget boundary.
+- `thinking_budget` (int, required to enable): maximum number of decode-time tokens after `<think>` before forced
+  termination.
+- `think_stop_sentence` (string, optional): a literal custom sentence that will be tokenized on the request side
+  and enforced near the budget boundary.
 ## Operator-Level vs LogitsProcessor
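The per-step enforcement described in the steps above can be sketched as follows. This is a minimal illustration, not the FastDeploy source: the class name, the token ids, and the `forced_token` helper are hypothetical stand-ins for the real processor's `update_state`/`apply` pair.

```python
# Hypothetical token ids standing in for ModelConfig's think_start_id /
# think_end_id; real values come from the model's tokenizer.
THINK_START_ID, THINK_END_ID = 100, 101

class ThinkingBudgetSketch:
    """Toy model of the budget logic: sample freely inside <think> until the
    budget is spent, then force the stop sentence (if any), then </think>."""

    def __init__(self, budget, stop_sentence_ids=None):
        self.budget = budget
        self.queue = list(stop_sentence_ids or [])  # tokens forced before </think>
        self.in_think = False
        self.tokens_after_start = 0
        self.done = False

    def update_state(self, last_token_id):
        # Mirrors the "per-step update": track position relative to <think>.
        if last_token_id == THINK_START_ID:
            self.in_think, self.tokens_after_start = True, 0
        elif last_token_id == THINK_END_ID:
            self.in_think, self.done = False, True
        elif self.in_think:
            self.tokens_after_start += 1

    def forced_token(self):
        # Return the token id to force at this step, or None to sample freely.
        if self.done or not self.in_think or self.tokens_after_start < self.budget:
            return None
        if self.queue:               # custom sentence first, one token per step
            return self.queue.pop(0)
        return THINK_END_ID          # then terminate the thinking section
```

With `budget=2` and a two-token stop sentence, once two free tokens have been generated after `<think>`, the forced tail is the two stop-sentence tokens followed by `</think>`.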
@@ -41,16 +46,25 @@ FastDeploy has two ways to limit thinking length:
 In short:
 - If you only need a hard cap on thinking length, prefer `reasoning_max_tokens`.
-- If you need custom behavior (for example, injecting custom sentence tokens), use `ThinkingBudgetLogitsProcessor`.
+- If you need custom behavior (for example, inserting a custom sentence before `</think>`), use
+  `ThinkingBudgetLogitsProcessor`.
 ## Practical guidance
 `reasoning_max_tokens` and `thinking_budget` are not mutually exclusive in the current implementation.
 If both are configured for the same request, both constraints can take effect, and whichever triggers first will end the thinking phase.
-- To use **operator-level-only** behavior: this is request-level config only. Set `enable_thinking=true` and `reasoning_max_tokens` in request, and do not set `thinking_budget`.
-- To use **logits-processor-only** behavior (especially with `think_stop_sentence`): this requires service-level + request-level config. Start service with `--logits-processors ThinkingBudgetLogitsProcessor`, and set `thinking_budget` (and optional `think_stop_sentence`) in `logits_processors_args`; leave `reasoning_max_tokens` unset.
-- Avoid enabling both for strict custom sentence insertion requirements, because operator-level termination may cut the custom sentence path earlier.
+- To use **operator-level-only** behavior: this is request-level config only. Set
+  `enable_thinking=true` and `reasoning_max_tokens` in the request, and do not set `thinking_budget`.
+- To use **logits-processor-only** behavior (especially with `think_stop_sentence`): this requires
+  service-level + request-level config. Start the service with `--logits-processors ThinkingBudgetLogitsProcessor`,
+  and set `thinking_budget` (and the optional `think_stop_sentence`) in `logits_processors_args`; leave
+  `reasoning_max_tokens` unset.
+- `thinking_budget` itself does not require `enable_thinking=true`.
+- If an ERNIE chat template already appends `<think>` in the prompt, `thinking_budget` should still take effect; it
+  does not require the model to emit another `<think>` during decoding.
+- Avoid enabling both when the custom sentence must be inserted in full, because operator-level
+  termination may cut the custom sentence path short.
 ## Online Usage
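As a concrete illustration of the logits-processor-only setup described above, the per-request fields might look like this sketch. The model name and message are placeholders, and the exact nesting can differ by client; only `logits_processors_args`, `thinking_budget`, and `think_stop_sentence` come from this document.

```python
import json

# Sketch of a request body for the logits-processor-only setup. The server is
# assumed to have been started with:
#   --logits-processors ThinkingBudgetLogitsProcessor
# Model name and message content are placeholders.
request_body = {
    "model": "my-ernie-model",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "logits_processors_args": {
        "thinking_budget": 256,                    # cap on tokens after <think>
        "think_stop_sentence": "Time to answer.",  # forced before </think>
    },
    # reasoning_max_tokens is deliberately left unset, per the guidance above.
}
print(json.dumps(request_body, indent=2))
```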
@@ -120,4 +134,6 @@ print(outputs[0].outputs.text)
 ## Performance Note
-This processor runs `update_state` and `apply` on every decode step. If you only need a hard thinking-length cap and care most about throughput, consider the operator-level reasoning-length controls instead of per-step logits processing.
+This processor runs `update_state` and `apply` on every decode step. If you only need a hard
+thinking-length cap and care most about throughput, consider the operator-level reasoning-length
+controls instead of per-step logits processing.
+20 -11
@@ -2,7 +2,9 @@
 ## Overview
-`ThinkingBudgetLogitsProcessor` limits the generation length of the `<think> ... </think>` span. When the budget threshold is reached, it forces a line-break token and then `</think>` to end the thinking section.
+`ThinkingBudgetLogitsProcessor` limits the generation length of the `<think> ... </think>` span. When the budget
+threshold is reached, it directly forces `</think>` to end the thinking section; if `think_stop_sentence` is
+configured, it first forces that custom sentence and then emits `</think>`.
 ## When to Use
@@ -11,19 +13,20 @@
 ## How It Works
-1. **CPU-side precompute (DataProcessor)**: when a request includes `thinking_budget`, it computes from the prompt token ids whether the thinking section has started, whether it has already ended, and how long the existing thinking is.
+1. **Request-side precompute (DataProcessor)**: when a request includes `thinking_budget`, it computes from the prompt token ids whether the thinking section has started, whether it has already ended, and how long the existing thinking is.
 2. **Per-step update**: during decoding, the processor tracks `last_token_id` and `tokens_after_start`.
-3. **Budget enforcement**: once the budget is reached, it forces a line break and then the thinking end token, in that order.
+3. **Budget enforcement**: once the budget is reached, it forces `</think>` directly by default; if `think_stop_sentence`
+   is configured, it first forces that sentence token by token and then emits `</think>`.
 ## Prerequisites
-- The model must provide valid token ids for `think_start_id`, `think_end_id`, and `line_break_id` (from `ModelConfig`).
-- If any of these ids is invalid, the processor is disabled and `thinking_budget` does not take effect.
+- The model must provide valid token ids for `think_start_id` and `think_end_id` (from `ModelConfig`).
+- If either of these ids is invalid, the processor is disabled and `thinking_budget` does not take effect.
 ## Request Parameters
-- `thinking_budget` (int, required to enable): the maximum number of tokens allowed after `<think>`.
-- `think_stop_sentence` (string, optional): encoded into token ids on the CPU side and forced near the budget boundary.
+- `thinking_budget` (int, required to enable): the maximum number of decode-phase tokens allowed after `<think>`.
+- `think_stop_sentence` (string, optional): a custom termination sentence, encoded literally and forced near the budget boundary.
 ## Operator-Level Limit vs LogitsProcessor
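Step 1 above (the request-side precompute) can be sketched as a small scan over the prompt token ids. This is an illustrative helper, not the DataProcessor source; the function name and return shape are assumptions.

```python
def scan_prompt(prompt_ids, think_start_id, think_end_id):
    """Return (started, ended, tokens_inside): whether thinking has started,
    whether it has already ended, and how many tokens the prompt already
    holds inside the thinking section."""
    started = ended = False
    tokens_inside = 0
    for tok in prompt_ids:
        if tok == think_start_id:
            started, tokens_inside = True, 0
        elif tok == think_end_id and started:
            ended = True
        elif started and not ended:
            tokens_inside += 1
    return started, ended, tokens_inside
```

For a chat template that already appended `<think>` to the prompt, such a scan yields `started=True, ended=False`, so the budget continues counting from the thinking tokens already present.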
@@ -35,21 +38,27 @@ FastDeploy currently has two ways to control thinking length:
   - Suitable for simple scenarios that only cap the thinking length.
 - **`ThinkingBudgetLogitsProcessor`** (`logits_processors_args.thinking_budget`):
   - Implemented by per-step Python-side logits processing.
-  - Supports more flexible behavior, such as `think_stop_sentence` (inserting a custom sentence before the end).
+  - Supports more flexible behavior, such as `think_stop_sentence`.
   - Usually has higher overhead under high concurrency than the operator-level limit.
 Choose by these rules:
 - Only need to cap thinking length: prefer `reasoning_max_tokens`.
-- Need more flexible control (such as inserting a custom sentence): use `ThinkingBudgetLogitsProcessor`.
+- Need more flexible control (such as inserting a custom sentence before `</think>`): use `ThinkingBudgetLogitsProcessor`.
 ## Suggested Practice
 In the current implementation, `reasoning_max_tokens` and `thinking_budget` are not mutually exclusive.
 If both are configured for the same request, both constraints can take effect, and whichever triggers first ends the thinking section.
-- **Operator-level limit only**: request-level config only. Set `enable_thinking=true` + `reasoning_max_tokens` in the request, and do not pass `thinking_budget`.
-- **LogitsProcessor only** (especially when `think_stop_sentence` is needed): this is a two-level "service startup + request parameter" config. The service must be started with `--logits-processors ThinkingBudgetLogitsProcessor`, and the request must pass `thinking_budget` (and the optional `think_stop_sentence`) via `logits_processors_args`; do not set `reasoning_max_tokens`.
+- **Operator-level limit only**: request-level config only. Set `enable_thinking=true` + `reasoning_max_tokens` in the request,
+  and do not pass `thinking_budget`.
+- **LogitsProcessor only** (especially when `think_stop_sentence` is needed): this is a two-level "service startup + request parameter" config.
+  The service must be started with `--logits-processors ThinkingBudgetLogitsProcessor`, and the request must pass
+  `thinking_budget` (and the optional `think_stop_sentence`) via `logits_processors_args`; do not set
+  `reasoning_max_tokens`.
+- `thinking_budget` itself does not depend on `enable_thinking=true`.
+- If the ERNIE chat template has already appended `<think>` to the prompt, `thinking_budget` should still take effect; the model is not required to emit another `<think>` during decoding.
+- If the use case requires the custom sentence to be inserted in full, do not enable both at once, otherwise the operator-level limit may truncate the custom sentence early.
 ## Online Usage