[OP][Feature] Unify the limit_thinking_content_length CUDA operator, supporting response length limits and injected sequences (#6493)

* Initial plan

* Migrate PRs #6311, #6129, #6305 to develop and merge unit tests

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* fix

* update

* fix

* fix ci

* fix ci

* Initial plan

* test: add test_chat_with_response_max_tokens to test_EB_VL_Lite_serving.py

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* test: add disable-thinking case to test_chat_with_response_max_tokens

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* test: add both reasoning_max_tokens and response_max_tokens case

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* fix ci

* fix ci

* fix ci

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>
Author: Yuanle Liu
Date: 2026-02-25 21:36:50 +08:00
Committed by: GitHub
Parent: e18397134a
Commit: 6d3fede240
38 changed files with 771 additions and 1690 deletions
@@ -268,10 +268,11 @@ class Qwen3VLProcessor(TextProcessor):
         ] # Leave space for at least 1 new token
         # Set default max_tokens if not specified
-        if request.sampling_params.max_tokens is None:
-            request.sampling_params.max_tokens = max(
-                1, max_model_len - len(request.prompt_token_ids)
-            ) # Ensure at least 1 token
+        max_tokens = max_model_len - len(request.prompt_token_ids)
+        if getattr(request.sampling_params, "max_tokens", None) is None:
+            request.sampling_params.max_tokens = max(1, max_tokens)
+        else:
+            request.sampling_params.max_tokens = min(max_tokens, request.sampling_params.max_tokens)
         data_processor_logger.info(f"Processed request {request}")
         return request