Sync v2.0 version of code to github repo

2026-04-23 08:21:53 +08:00 · 2025-06-29 23:29:37 +00:00
parent d151496038
commit 92c2cfa2e7
597 changed files with 78776 additions and 22905 deletions
@@ -0,0 +1,67 @@
+# Chain-of-Thought Content
+
+The reasoning model returns a `reasoning_content` field in the output, representing the chain-of-thought content—the reasoning steps that lead to the final conclusion.
+
+## Currently Supported Chain-of-Thought Models
+| Model Name     | Parser Name    | Chain-of-Thought Enabled by Default |
+|----------------|----------------|-------------------------------------|
+| ernie-45-vl    | ernie-45-vl    | ✓                                   |
+| ernie-lite-vl  | ernie-45-vl    | ✓                                   |
+
+The reasoning model requires a specified parser to interpret the reasoning content. The reasoning mode can be disabled by setting the `enable_thinking=False` parameter.
+
+Interfaces that support toggling the reasoning mode:
+1. `/v1/chat/completions` request in OpenAI services.
+2. `/v1/chat/completions` request in the OpenAI Python client.
+3. `llm.chat` request in Offline interfaces.
+
+For reasoning models, the length of the reasoning content can be controlled via `reasoning_max_tokens`. Add `metadata={"reasoning_max_tokens": 1024}` to the request.
+
+### Quick Start
+When launching the model service, specify the parser name using the `--reasoning-parser` argument.  
+This parser will process the model's output and extract the `reasoning_content` field.
+```bash
+python -m fastdeploy.entrypoints.openai.api_server --model /root/merge_llm_model  --enable-mm --tensor-parallel-size=8 --port 8192 --quantization wint4 --reasoning-parser=ernie-45-vl
+```
+
+Next, send a `chat completion` request to the model:
+```bash
+curl -X POST "http://0.0.0.0:8192/v1/chat/completions" \
+-H "Content-Type: application/json" \
+-d '{
+  "messages": [
+    {"role": "user", "content": [
+      {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
+      {"type": "text", "text": "Which era does the cultural relic in the picture belong to"}
+    ]}
+  ],
+  "metadata": {"enable_thinking": true}
+}'
+```
+The `reasoning_content` field contains the reasoning steps to reach the final conclusion, while the `content` field holds the conclusion itself.
+
+### Streaming Sessions
+In streaming sessions, the `reasoning_content` field can be retrieved from the `delta` in `chat completion response chunks`.
+```python
+from openai import OpenAI
+# Set OpenAI's API key and API base to use vLLM's API server.
+openai_api_key = "EMPTY"
+openai_api_base = "http://localhost:8192/v1"
+client = OpenAI(
+    api_key=openai_api_key,
+    base_url=openai_api_base,
+)
+chat_response = client.chat.completions.create(
+    messages=[
+        {"role": "user", "content": [ {"type": "image_url", "image_url": {"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example2.jpg"}},
+        {"type": "text", "text": "Which era does the cultural relic in the picture belong to"}]}
+    ],
+    model="vl",
+    stream=True,
+    metadata={"enable_thinking": True}
+)
+for chunk in chat_response:
+    if chunk.choices[0].delta is not None:
+        print(chunk.choices[0].delta, end='')
+        print("\n")
+```