mirror of
https://github.com/PaddlePaddle/FastDeploy.git
synced 2026-04-23 00:17:25 +08:00
[Feature] support reward model (#5301)
* Your commit message here * add test * update develop * support reward * support enable_chunk_prefill * support bingfa * support convert is reward * update test * delete print * fix enable_thinking * add document * fix place * fix test * fix * support enable_prefix_caching * add no-enable_prefix-caching test * fix * support enable_prefix_caching * delete print * fix document * fix * fix test * fix document and delete chinese * udpate * enable_thinking * fix test
@@ -0,0 +1,175 @@
[简体中文](../zh/features/pooling_models.md)

# Pooling Models

FastDeploy also supports pooling models, such as embedding models.

In FastDeploy, pooling models implement the `FdModelForPooling` interface.
These models use a `Pooler` to extract the final hidden states of the input
before returning them.

## Configuration

### Model Runner

Run a model in pooling mode via the option `--runner pooling`.

!!! tip
    There is no need to set this option in the vast majority of cases, as FastDeploy can automatically
    detect the appropriate model runner via `--runner auto` (the default).

### Model Conversion

FastDeploy can adapt models for various pooling tasks via the option `--convert <type>`.

If `--runner pooling` has been set (manually or automatically) but the model does not implement the
`FdModelForPooling` interface,
FastDeploy will attempt to automatically convert the model according to the architecture names
shown in the table below.

| Architecture | `--convert` | Supported pooling tasks |
|-------------------------------------------------|-------------|---------------------------------------|
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model`, `*ForProcessRewardModel` | `embed` | `embed` |

!!! tip
    You can explicitly set `--convert <type>` to specify how to convert the model.

### Pooler Configuration

#### Predefined models

If the `Pooler` defined by the model accepts `pooler_config`,
you can override some of its attributes via the `--pooler-config` option.

#### Converted models

If the model has been converted via `--convert` (see above),
the pooler assigned to each task has the following attributes by default:

| Task | Pooling Type | Normalization | Softmax |
|------------|--------------|---------------|---------|
| `embed` | `LAST` | ✅︎ | ❌ |

When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
the Sentence Transformers configuration file (`modules.json`) takes priority over the model's defaults.
The default pooling type can also be specified during model construction via the `@default_pooling_type("LAST")` decorator.

#### Pooling Type

1. `LastPool` (`PoolingType.LAST`)

   Purpose: extracts the hidden state of the last token in each sequence.

2. `AllPool` (`PoolingType.ALL`)

   Purpose: returns the hidden states of all tokens in each sequence.

3. `CLSPool` (`PoolingType.CLS`)

   Purpose: returns the hidden state of the first token in each sequence (the CLS token).

4. `MeanPool` (`PoolingType.MEAN`)

   Purpose: computes the average of all token hidden states in each sequence.

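The four pooling types above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not FastDeploy's internal implementation; the token count and hidden size are made-up values for demonstration.

```python
# Illustrative sketch of the four pooling types (not FastDeploy's internals).
# hidden_states holds one row of hidden values per token: 4 tokens, hidden size 3.
import math

hidden_states = [
    [0.0, 1.0, 2.0],
    [3.0, 4.0, 5.0],
    [6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0],
]

last_pool = hidden_states[-1]   # PoolingType.LAST: final token's hidden state
all_pool = hidden_states        # PoolingType.ALL: every token's hidden state
cls_pool = hidden_states[0]     # PoolingType.CLS: first (CLS) token's hidden state
mean_pool = [sum(col) / len(col) for col in zip(*hidden_states)]  # PoolingType.MEAN

# With normalization enabled (the default for converted `embed` models, per the
# table above), the pooled vector is L2-normalized before being returned:
norm = math.sqrt(sum(x * x for x in last_pool))
embedding = [x / norm for x in last_pool]
```
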
## Online Serving

FastDeploy's OpenAI-compatible server provides standard API endpoints as well as a custom reward endpoint:

- Embeddings API: supports text and multi-modal inputs
- Reward API: scores specified content

### Embedding Model

```bash
model_path=Qwen/Qwen3-Embedding-0.6B

python -m fastdeploy.entrypoints.openai.api_server --model ${model_path} \
  --max-num-seqs 256 --max-model-len 32768 \
  --port 9412 --engine-worker-queue-port 7142 \
  --metrics-port 7211 --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --runner pooling
```

Request Methods:

A. EmbeddingCompletionRequest Example (Standard Text Input)

```bash
curl -X POST 'YOUR_SERVICE_URL/v1/embeddings' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "text-embedding-chat-model",
    "input": [
      "This is a sentence for pooling embedding.",
      "Another input text."
    ],
    "user": "test_client"
  }'
```

B. EmbeddingChatRequest Example (Message Sequence Input)

```bash
curl -X POST 'YOUR_SERVICE_URL/v1/embeddings' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "text-embedding-chat-model",
    "messages": [
      {"role": "user", "content": "Generate embedding for user query."}
    ]
  }'
```

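The same EmbeddingCompletionRequest can be issued from Python with only the standard library. This is a sketch under assumptions: `SERVICE_URL`, `build_embedding_request`, and `embed` are illustrative names introduced here, and the port matches the launch command above.

```python
# Sketch: issuing an EmbeddingCompletionRequest from Python (stdlib only).
# SERVICE_URL and the helper names are illustrative, not part of FastDeploy.
import json
import urllib.request

SERVICE_URL = "http://localhost:9412"  # matches --port in the launch command above

def build_embedding_request(texts, model="text-embedding-chat-model", user="test_client"):
    """Build the JSON body for POST /v1/embeddings, mirroring curl example A."""
    return {"model": model, "input": list(texts), "user": user}

def embed(texts):
    """POST the request and return the decoded JSON response."""
    body = json.dumps(build_embedding_request(texts)).encode("utf-8")
    req = urllib.request.Request(
        f"{SERVICE_URL}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (requires a running server):
# embed(["This is a sentence for pooling embedding."])
```
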
### Pooling Model and Reward Scoring

```bash
model_path=RM_v1008
python -m fastdeploy.entrypoints.openai.api_server \
  --model ${model_path} \
  --max-num-seqs 256 \
  --max-model-len 8192 \
  --port 13351 \
  --engine-worker-queue-port 7562 \
  --metrics-port 7531 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.9 \
  --runner pooling \
  --convert embed
```

Request Method: ChatRewardRequest

```bash
curl --location 'http://xxxx/v1/chat/reward' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://xxx/a.png"
            }
          }
        ]
      },
      {
        "role": "assistant",
        "content": [
          {
            "type": "text",
            "text": "How many people are in the picture?"
          }
        ]
      }
    ],
    "user": "user-123",
    "chat_template": null,
    "chat_template_kwargs": {
      "custom_var": "value"
    },
    "mm_processor_kwargs": {
      "image_size": 224
    }
  }'
```

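The ChatRewardRequest body can likewise be built from Python. This is a sketch under assumptions: `SERVICE_URL`, `build_reward_request`, and `score` are illustrative names, the fields mirror the curl example above, and nothing is assumed about the response format beyond it being JSON.

```python
# Sketch: building and sending a ChatRewardRequest (stdlib only).
# SERVICE_URL and the helper names are illustrative, not part of FastDeploy.
import json
import urllib.request

SERVICE_URL = "http://localhost:13351"  # matches --port in the launch command above

def build_reward_request(messages, user="user-123"):
    """Build the JSON body for POST /v1/chat/reward, mirroring the curl example."""
    return {
        "model": "",
        "messages": messages,
        "user": user,
        "chat_template": None,
        "chat_template_kwargs": {"custom_var": "value"},
        "mm_processor_kwargs": {"image_size": 224},
    }

def score(messages):
    """POST the request and return the decoded JSON response."""
    body = json.dumps(build_reward_request(messages)).encode("utf-8")
    req = urllib.request.Request(
        f"{SERVICE_URL}/v1/chat/reward",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (requires a running server):
# score([{"role": "user", "content": "How many people are in the picture?"}])
```
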
@@ -0,0 +1,168 @@
[English](../../features/pooling_models.md)

# Pooling Models

FastDeploy also supports pooling models, such as embedding models.

In FastDeploy, pooling models implement the `FdModelForPooling` interface. These models use a `Pooler` to extract the final hidden states of the input before returning them.

## Configuration

### Model Runner

Run a model in pooling mode via the option `--runner pooling`.

!!! tip
    In the vast majority of cases there is no need to set this option manually, because FastDeploy can automatically detect the appropriate runner via `--runner auto` (the default).

### Model Conversion

If a model does not implement the `FdModelForPooling` interface but you want to run it in pooling mode, FastDeploy can convert it automatically via `--convert <type>`.

When `--runner pooling` has been set (manually or automatically) but the model does not conform to the interface, FastDeploy converts it automatically based on the model's architecture name:

| Architecture | `--convert` | Supported pooling tasks |
|-------------------------------------------------|-------------|---------------------------------------|
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model`, `*ForProcessRewardModel` | `embed` | `embed` |

!!! tip
    You can explicitly set `--convert <type>` to specify how the model is converted.

### Pooler Configuration

#### Predefined models

If the `Pooler` defined by the model accepts `pooler_config`, you can override some of its attributes via the `--pooler-config` option.

#### Converted models

If the model has been converted via `--convert`, the default pooling configuration for each task is as follows:

| Task | Pooling Type | Normalization | Softmax |
|------------|--------------|---------------|---------|
| `embed` | `LAST` | ✅︎ | ❌ |

When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models, the `modules.json` configuration takes priority over the model's defaults. The default pooling type can also be specified during model construction via the `@default_pooling_type("LAST")` decorator.

#### Pooling Type

1. `LastPool` (`PoolingType.LAST`)

   Purpose: extracts the hidden state of the last token in each sequence.

2. `AllPool` (`PoolingType.ALL`)

   Purpose: returns the hidden states of all tokens in each sequence.

3. `CLSPool` (`PoolingType.CLS`)

   Purpose: returns the hidden state of the first token in each sequence (the CLS token).

4. `MeanPool` (`PoolingType.MEAN`)

   Purpose: computes the average of all token hidden states in each sequence.

## Online Serving

FastDeploy's OpenAI-compatible server provides standard API endpoints as well as a custom reward endpoint:

- Embeddings API: supports text and multi-modal inputs
- Reward API: scores specified content

### Embedding Model

```bash
model_path=Qwen/Qwen3-Embedding-0.6B

python -m fastdeploy.entrypoints.openai.api_server --model ${model_path} \
  --max-num-seqs 256 --max-model-len 32768 \
  --port 9412 --engine-worker-queue-port 7142 \
  --metrics-port 7211 --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --runner pooling
```

Request Methods:

A. EmbeddingCompletionRequest Example (Standard Text Input)

```bash
curl -X POST 'YOUR_SERVICE_URL/v1/embeddings' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "text-embedding-chat-model",
    "input": [
      "This is a sentence for pooling embedding.",
      "Another input text."
    ],
    "user": "test_client"
  }'
```

B. EmbeddingChatRequest Example (Message Sequence Input)

```bash
curl -X POST 'YOUR_SERVICE_URL/v1/embeddings' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "text-embedding-chat-model",
    "messages": [
      {"role": "user", "content": "Generate embedding for user query."}
    ]
  }'
```

### Pooling Model and Reward Scoring

```bash
model_path=RM_v1008
python -m fastdeploy.entrypoints.openai.api_server \
  --model ${model_path} \
  --max-num-seqs 256 \
  --max-model-len 8192 \
  --port 13351 \
  --engine-worker-queue-port 7562 \
  --metrics-port 7531 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.9 \
  --runner pooling \
  --convert embed
```

Request Method: ChatRewardRequest

```bash
curl --location 'http://xxxx/v1/chat/reward' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://xxx/a.png"
            }
          }
        ]
      },
      {
        "role": "assistant",
        "content": [
          {
            "type": "text",
            "text": "How many people are in the picture?"
          }
        ]
      }
    ],
    "user": "user-123",
    "chat_template": null,
    "chat_template_kwargs": {
      "custom_var": "value"
    },
    "mm_processor_kwargs": {
      "image_size": 224
    }
  }'
```