mirror of
https://github.com/PaddlePaddle/FastDeploy.git
synced 2026-04-23 00:17:25 +08:00
[OP][Feature] 统一 limit_thinking_content_length CUDA 算子,支持回复长度限制与注入序列 (#6493)
* Initial plan * Migrate PRs #6311, #6129, #6305 to develop and merge unit tests Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com> * fix * update * fix * fix ci * fix ci * Initial plan * test: add test_chat_with_response_max_tokens to test_EB_VL_Lite_serving.py Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com> * test: add disable-thinking case to test_chat_with_response_max_tokens Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com> * test: add both reasoning_max_tokens and response_max_tokens case Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com> * fix ci * fix ci * fix ci --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>
This commit is contained in:
@@ -1243,8 +1243,6 @@ class ResourceManagerV1(ResourceManager):
|
||||
If can not, return False
|
||||
"""
|
||||
assert self.config.scheduler_config.splitwise_role == "decode", "Only D instance can call this method"
|
||||
if request.reasoning_max_tokens is not None:
|
||||
request.reasoning_max_tokens -= 1
|
||||
request.need_prefill_tokens = len(request.prompt_token_ids)
|
||||
need_prealloc_prefill_blocks = (
|
||||
request.need_prefill_tokens + self.config.cache_config.block_size - 1
|
||||
|
||||
Reference in New Issue
Block a user