mirror of
https://github.com/PaddlePaddle/FastDeploy.git
synced 2026-04-23 00:17:25 +08:00
[Feature] support reward model (#5301)
* Your commit message here * add test * update develop * support reward * support enable_chunk_prefill * support bingfa * support convert is reward * update test * delete print * fix enable_thinking * add document * fix place * fix test * fix * support enable_prefix_caching * add no-enable_prefix-caching test * fix * support enable_prefix_caching * delete print * fix document * fix * fix test * fix document and delete chinese * udpate * enable_thinking * fix test
@@ -0,0 +1,175 @@
[简体中文](../zh/features/pooling_models.md)

# Pooling Models

FastDeploy also supports pooling models, such as embedding models.

In FastDeploy, pooling models implement the `FdModelForPooling` interface.
These models use a `Pooler` to extract the final hidden states of the input
before returning them.

## Configuration

### Model Runner

Run a model in pooling mode via the option `--runner pooling`.

!!! tip
    There is no need to set this option in the vast majority of cases, as FastDeploy can automatically
    detect the appropriate model runner via `--runner auto` (the default).

### Model Conversion

FastDeploy can adapt models for various pooling tasks via the option `--convert <type>`.

If `--runner pooling` has been set (manually or automatically) but the model does not implement the
`FdModelForPooling` interface,
FastDeploy will attempt to automatically convert the model according to the architecture names
shown in the table below.

| Architecture | `--convert` | Supported pooling tasks |
|-------------------------------------------------|-------------|---------------------------------------|
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model`, `*ForProcessRewardModel` | `embed` | `embed` |

!!! tip
    You can explicitly set `--convert <type>` to specify how to convert the model.

### Pooler Configuration

#### Predefined models

If the `Pooler` defined by the model accepts `pooler_config`,
you can override some of its attributes via the `--pooler-config` option.

#### Converted models

If the model has been converted via `--convert` (see above),
the pooler assigned to each task has the following attributes by default:

| Task | Pooling Type | Normalization | Softmax |
|------------|--------------|---------------|---------|
| `embed` | `LAST` | ✅︎ | ❌ |

When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
the Sentence Transformers configuration file (`modules.json`) takes priority over the model's defaults.
The default pooling type can also be specified during model construction via the `@default_pooling_type("LAST")` decorator.

#### Pooling Type

1. `LastPool` (`PoolingType.LAST`)

   Purpose: extracts the hidden state of the last token in each sequence.

2. `AllPool` (`PoolingType.ALL`)

   Purpose: returns the hidden states of all tokens in each sequence.

3. `CLSPool` (`PoolingType.CLS`)

   Purpose: returns the hidden state of the first token in each sequence (the CLS token).

4. `MeanPool` (`PoolingType.MEAN`)

   Purpose: computes the average of all token hidden states in each sequence.

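The four pooling types above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not FastDeploy's internal implementation; the token count and hidden size are made-up values for demonstration.

```python
# Illustrative sketch of the four pooling types (not FastDeploy's internals).
# hidden_states holds one row of hidden values per token: 4 tokens, hidden size 3.
import math

hidden_states = [
    [0.0, 1.0, 2.0],
    [3.0, 4.0, 5.0],
    [6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0],
]

last_pool = hidden_states[-1]   # PoolingType.LAST: final token's hidden state
all_pool = hidden_states        # PoolingType.ALL: every token's hidden state
cls_pool = hidden_states[0]     # PoolingType.CLS: first (CLS) token's hidden state
mean_pool = [sum(col) / len(col) for col in zip(*hidden_states)]  # PoolingType.MEAN

# With normalization enabled (the default for converted `embed` models, per the
# table above), the pooled vector is L2-normalized before being returned:
norm = math.sqrt(sum(x * x for x in last_pool))
embedding = [x / norm for x in last_pool]
```
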
## Online Serving

FastDeploy's OpenAI-compatible server provides standard API endpoints as well as a custom reward endpoint:

- Embeddings API: supports text and multi-modal inputs
- Reward API: scores specified content

### Embedding Model

```bash
model_path=Qwen/Qwen3-Embedding-0.6B

python -m fastdeploy.entrypoints.openai.api_server --model ${model_path} \
  --max-num-seqs 256 --max-model-len 32768 \
  --port 9412 --engine-worker-queue-port 7142 \
  --metrics-port 7211 --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --runner pooling
```

Request Methods:

A. EmbeddingCompletionRequest Example (Standard Text Input)

```bash
curl -X POST 'YOUR_SERVICE_URL/v1/embeddings' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "text-embedding-chat-model",
    "input": [
      "This is a sentence for pooling embedding.",
      "Another input text."
    ],
    "user": "test_client"
  }'
```

B. EmbeddingChatRequest Example (Message Sequence Input)

```bash
curl -X POST 'YOUR_SERVICE_URL/v1/embeddings' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "text-embedding-chat-model",
    "messages": [
      {"role": "user", "content": "Generate embedding for user query."}
    ]
  }'
```

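The same EmbeddingCompletionRequest can be issued from Python with only the standard library. This is a sketch under assumptions: `SERVICE_URL`, `build_embedding_request`, and `embed` are illustrative names introduced here, and the port matches the launch command above.

```python
# Sketch: issuing an EmbeddingCompletionRequest from Python (stdlib only).
# SERVICE_URL and the helper names are illustrative, not part of FastDeploy.
import json
import urllib.request

SERVICE_URL = "http://localhost:9412"  # matches --port in the launch command above

def build_embedding_request(texts, model="text-embedding-chat-model", user="test_client"):
    """Build the JSON body for POST /v1/embeddings, mirroring curl example A."""
    return {"model": model, "input": list(texts), "user": user}

def embed(texts):
    """POST the request and return the decoded JSON response."""
    body = json.dumps(build_embedding_request(texts)).encode("utf-8")
    req = urllib.request.Request(
        f"{SERVICE_URL}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (requires a running server):
# embed(["This is a sentence for pooling embedding."])
```
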
### Pooling Model and Reward Scoring

```bash
model_path=RM_v1008
python -m fastdeploy.entrypoints.openai.api_server \
  --model ${model_path} \
  --max-num-seqs 256 \
  --max-model-len 8192 \
  --port 13351 \
  --engine-worker-queue-port 7562 \
  --metrics-port 7531 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.9 \
  --runner pooling \
  --convert embed
```

Request Method: ChatRewardRequest

```bash
curl --location 'http://xxxx/v1/chat/reward' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://xxx/a.png"
            }
          }
        ]
      },
      {
        "role": "assistant",
        "content": [
          {
            "type": "text",
            "text": "How many people are in the picture?"
          }
        ]
      }
    ],
    "user": "user-123",
    "chat_template": null,
    "chat_template_kwargs": {
      "custom_var": "value"
    },
    "mm_processor_kwargs": {
      "image_size": 224
    }
  }'
```

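The ChatRewardRequest body can likewise be built from Python. This is a sketch under assumptions: `SERVICE_URL`, `build_reward_request`, and `score` are illustrative names, the fields mirror the curl example above, and nothing is assumed about the response format beyond it being JSON.

```python
# Sketch: building and sending a ChatRewardRequest (stdlib only).
# SERVICE_URL and the helper names are illustrative, not part of FastDeploy.
import json
import urllib.request

SERVICE_URL = "http://localhost:13351"  # matches --port in the launch command above

def build_reward_request(messages, user="user-123"):
    """Build the JSON body for POST /v1/chat/reward, mirroring the curl example."""
    return {
        "model": "",
        "messages": messages,
        "user": user,
        "chat_template": None,
        "chat_template_kwargs": {"custom_var": "value"},
        "mm_processor_kwargs": {"image_size": 224},
    }

def score(messages):
    """POST the request and return the decoded JSON response."""
    body = json.dumps(build_reward_request(messages)).encode("utf-8")
    req = urllib.request.Request(
        f"{SERVICE_URL}/v1/chat/reward",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (requires a running server):
# score([{"role": "user", "content": "How many people are in the picture?"}])
```
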
@@ -0,0 +1,168 @@
[English](../../features/pooling_models.md)

# Pooling Models

FastDeploy also supports pooling models, such as embedding models.

In FastDeploy, pooling models implement the `FdModelForPooling` interface. These models use a `Pooler` to extract the final hidden states of the input before returning them.

## Configuration

### Model Runner

Run a model in pooling mode via the option `--runner pooling`.

!!! tip
    In the vast majority of cases there is no need to set this option manually, because FastDeploy can automatically detect the appropriate runner via `--runner auto` (the default).

### Model Conversion

If a model does not implement the `FdModelForPooling` interface but you want to run it in pooling mode, FastDeploy can convert it automatically via `--convert <type>`.

When `--runner pooling` has been set (manually or automatically) but the model does not conform to the interface, FastDeploy converts it automatically based on the model's architecture name:

| Architecture | `--convert` | Supported pooling tasks |
|-------------------------------------------------|-------------|---------------------------------------|
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model`, `*ForProcessRewardModel` | `embed` | `embed` |

!!! tip
    You can explicitly set `--convert <type>` to specify how the model is converted.

### Pooler Configuration

#### Predefined models

If the `Pooler` defined by the model accepts `pooler_config`, you can override some of its attributes via the `--pooler-config` option.

#### Converted models

If the model has been converted via `--convert`, the default pooling configuration for each task is as follows:

| Task | Pooling Type | Normalization | Softmax |
|------------|--------------|---------------|---------|
| `embed` | `LAST` | ✅︎ | ❌ |

When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models, the `modules.json` configuration takes priority over the model's defaults. The default pooling type can also be specified during model construction via the `@default_pooling_type("LAST")` decorator.

#### Pooling Type

1. `LastPool` (`PoolingType.LAST`)

   Purpose: extracts the hidden state of the last token in each sequence.

2. `AllPool` (`PoolingType.ALL`)

   Purpose: returns the hidden states of all tokens in each sequence.

3. `CLSPool` (`PoolingType.CLS`)

   Purpose: returns the hidden state of the first token in each sequence (the CLS token).

4. `MeanPool` (`PoolingType.MEAN`)

   Purpose: computes the average of all token hidden states in each sequence.

## Online Serving

FastDeploy's OpenAI-compatible server provides standard API endpoints as well as a custom reward endpoint:

- Embeddings API: supports text and multi-modal inputs
- Reward API: scores specified content

### Embedding Model

```bash
model_path=Qwen/Qwen3-Embedding-0.6B

python -m fastdeploy.entrypoints.openai.api_server --model ${model_path} \
  --max-num-seqs 256 --max-model-len 32768 \
  --port 9412 --engine-worker-queue-port 7142 \
  --metrics-port 7211 --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --runner pooling
```

Request Methods:

A. EmbeddingCompletionRequest Example (Standard Text Input)

```bash
curl -X POST 'YOUR_SERVICE_URL/v1/embeddings' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "text-embedding-chat-model",
    "input": [
      "This is a sentence for pooling embedding.",
      "Another input text."
    ],
    "user": "test_client"
  }'
```

B. EmbeddingChatRequest Example (Message Sequence Input)

```bash
curl -X POST 'YOUR_SERVICE_URL/v1/embeddings' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "text-embedding-chat-model",
    "messages": [
      {"role": "user", "content": "Generate embedding for user query."}
    ]
  }'
```

### Pooling Model and Reward Scoring

```bash
model_path=RM_v1008
python -m fastdeploy.entrypoints.openai.api_server \
  --model ${model_path} \
  --max-num-seqs 256 \
  --max-model-len 8192 \
  --port 13351 \
  --engine-worker-queue-port 7562 \
  --metrics-port 7531 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.9 \
  --runner pooling \
  --convert embed
```

Request Method: ChatRewardRequest

```bash
curl --location 'http://xxxx/v1/chat/reward' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://xxx/a.png"
            }
          }
        ]
      },
      {
        "role": "assistant",
        "content": [
          {
            "type": "text",
            "text": "How many people are in the picture?"
          }
        ]
      }
    ],
    "user": "user-123",
    "chat_template": null,
    "chat_template_kwargs": {
      "custom_var": "value"
    },
    "mm_processor_kwargs": {
      "image_size": 224
    }
  }'
```