Mirror of https://github.com/PaddlePaddle/FastDeploy.git, synced 2026-04-22 16:07:51 +08:00
abort requests (#6992)
@@ -577,3 +577,4 @@ DeltaFunctionCall:
- `/v1/pause` - Pause generation. While paused, the service rejects new inference requests; in-flight requests are aborted and the cache is reset.
- `/v1/resume` - Resume generation.
- `/v1/is_paused` - Check whether generation is paused.
- `/v1/abort_requests` - Abort inference requests to release GPU memory (KV Cache blocks) and compute resources. Accepts `req_ids` (a list of request IDs) or `abort_all=true` (abort all requests). Returns the list of aborted requests with their generated token counts.
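A minimal client-side sketch of calling `/v1/abort_requests`, based only on the description above. Several things here are assumptions rather than documented behavior: that the endpoint is called with POST and a JSON object body whose fields are named `req_ids` and `abort_all`, that sending both at once should be rejected on the client side, and that the response is JSON (its exact shape is not assumed). The helper names `build_abort_payload` and `abort_requests` are hypothetical.

```python
import json
import urllib.request
from typing import Any, Optional, Sequence


def build_abort_payload(req_ids: Optional[Sequence[str]] = None,
                        abort_all: bool = False) -> dict:
    """Build a request body for /v1/abort_requests.

    Assumption: the fields travel as a JSON object using the names
    `req_ids` and `abort_all` from the endpoint description.
    """
    # Client-side choice (not documented behavior): refuse ambiguous input.
    if abort_all and req_ids:
        raise ValueError("pass either req_ids or abort_all=True, not both")
    if abort_all:
        return {"abort_all": True}
    if not req_ids:
        raise ValueError("req_ids must be non-empty unless abort_all=True")
    return {"req_ids": list(req_ids)}


def abort_requests(base_url: str, payload: dict, timeout: float = 10.0) -> Any:
    """POST the payload and return the decoded JSON response as-is."""
    request = urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/abort_requests",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # assumption for the API server
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server listening at a placeholder address, a call might look like `abort_requests("http://localhost:8000", build_abort_payload(req_ids=["req-1"]))`. Keeping payload construction separate from the HTTP call makes the validation easy to check without a running service.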
@@ -151,6 +151,7 @@ The Router exposes a set of HTTP services to provide unified request scheduling,
|----------|------|------|
| POST | `/v1/chat/completions` | Provide scheduling services for inference requests based on the Chat Completions API |
| POST | `/v1/completions` | Provide scheduling services for general text completion inference requests |
| POST | `/v1/abort_requests` | Abort inference requests to release GPU memory and compute resources. Accepts `req_ids` or `abort_all=true`. Returns aborted requests with their generated token counts |
| POST | `/register` | Allow inference instances to register their metadata with the Router for scheduling |
| GET | `/registered` | Query the list of currently registered inference instances |
| GET | `/registered_number` | Query the number of currently registered inference instances |
@@ -563,3 +563,4 @@ DeltaFunctionCall:
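/v1/pause - Pause inference generation (the service will reject inference requests). In-flight requests are aborted and the cache is reset.
/v1/resume - Resume inference generation.
/v1/is_paused - Check whether inference generation is paused.
/v1/abort_requests - Abort inference requests to release GPU memory (KV Cache blocks) and compute resources. Accepts `req_ids` (a list of request IDs) or `abort_all=true` (abort all requests). Returns the list of aborted requests and the number of tokens each had generated.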
@@ -152,6 +152,7 @@ The Router exposes unified scheduling services over HTTP and also supports run…
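|----------|------|------|
| POST | `/v1/chat/completions` | Provide scheduling services for inference requests based on the Chat API |
| POST | `/v1/completions` | Provide scheduling services for general text completion requests |
| POST | `/v1/abort_requests` | Abort inference requests to release GPU memory and compute resources. Accepts `req_ids` or `abort_all=true`. Returns the list of aborted requests and the number of tokens each had generated |
| POST | `/register` | Inference instances register their own information with the Router to take part in scheduling |
| GET | `/registered` | Query the list of currently registered inference instances |
| GET | `/registered_number` | Query the number of currently registered inference instances |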