mirror of https://github.com/PaddlePaddle/FastDeploy.git
synced 2026-04-23 00:17:25 +08:00

[Feature][Docs] Adjust prefill release & expose load metrics (#6884)
@@ -190,7 +190,7 @@ server:
  splitwise: true # true enables PD disaggregation; false disables it

scheduler:
-  policy: "power_of_two" # Scheduling policy (optional): random, power_of_two, round_robin, process_tokens, request_num, cache_aware, fd_metrics_score
+  policy: "power_of_two" # Scheduling policy (optional): random, power_of_two, round_robin, process_tokens, request_num, cache_aware, remote_cache_aware, fd_metrics_score, fd_remote_metrics_score
  prefill-policy: "cache_aware" # Prefill scheduling policy in PD mode
  decode-policy: "fd_metrics_score" # Decode scheduling policy in PD mode
  eviction-interval-secs: 60 # Cache eviction interval for CacheAware scheduling

@@ -199,9 +199,13 @@ scheduler:
  hit-ratio-weight: 1.0 # Cache hit ratio weight
  load-balance-weight: 0.05 # Load balancing weight
  cache-block-size: 4 # Cache block size
-  tokenizer-url: "http://0.0.0.0:8098" # Tokenizer service endpoint (optional)
-  tokenizer-timeout-secs: 2 # Tokenizer service timeout
+  # tokenizer-url: "http://0.0.0.0:8098" # Tokenizer service endpoint (optional); cache_aware uses character-level tokenization when not configured.
+  # Note: Enabling this option causes a synchronous remote tokenizer call on every scheduling decision,
+  # introducing additional network latency. Only enable it when precise token-level tokenization
+  # is needed to improve cache hit rate.
+  # tokenizer-timeout-secs: 2 # Tokenizer service timeout; default: 2
  waiting-weight: 10 # Waiting weight for CacheAware scheduling
+  stats-interval-secs: 5 # Stats logging interval in seconds, includes load and cache hit rate statistics; default: 5

manager:
  health-failure-threshold: 3 # Number of failed health checks before marking unhealthy

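The `hit-ratio-weight` and `load-balance-weight` knobs above trade cache affinity against load balance for the cache-aware strategy. The exact formula the Router uses is not documented here, so the following Python sketch only illustrates how such weights could combine hit ratio and load into a per-worker score; the formula, names, and data layout are all assumptions.

```python
# Hypothetical combination of cache-aware weights; NOT the Router's actual
# formula (the docs do not spell it out). Higher score is better: reward
# prefix-cache hits, penalize current load.

def cache_aware_score(hit_ratio: float, load_ratio: float,
                      hit_ratio_weight: float = 1.0,
                      load_balance_weight: float = 0.05) -> float:
    return hit_ratio * hit_ratio_weight - load_ratio * load_balance_weight

def pick_worker(workers: dict) -> str:
    # workers maps url -> (hit_ratio, load_ratio), the two quantities that
    # appear in the debug log `cache-aware score: worker={url} hit={hit}
    # loadRatio={ratio} score={score}`
    return max(workers, key=lambda url: cache_aware_score(*workers[url]))

print(pick_worker({
    "http://10.0.0.1:8000": (0.8, 0.9),  # warm cache, busy
    "http://10.0.0.2:8000": (0.1, 0.2),  # cold cache, idle
}))
```

With the default weights, a strong prefix hit dominates a moderate load difference, which matches the intent of a small `load-balance-weight` relative to `hit-ratio-weight`.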
@@ -254,6 +258,24 @@ Instance Registration Parameters:

Among these, `role`, `host_ip`, and `port` are required; all other parameters are optional.

## Scheduling Strategies

The Router supports the following scheduling strategies, configurable via `policy` (mixed mode), `prefill-policy`, and `decode-policy` (PD disaggregated mode) fields in the configuration file.

**Default strategies**: When not configured, prefill nodes default to `process_tokens`, mixed and decode nodes default to `request_num`.

| Strategy | Applicable Scenario | Implementation |
|----------|---------------------|----------------|
| `random` | General | Randomly selects one available instance, stateless, suitable for lightweight scenarios. |
| `round_robin` | General | Uses atomic counter to cycle through instance list, distributing requests evenly in order. |
| `power_of_two` | General | Randomly picks two instances, compares their concurrent request counts, selects the one with lower load. |
| `process_tokens` | **prefill (default)** | Iterates all instances, selects the one with the fewest tokens currently being processed (in-memory counting), suitable for prefill long-request load balancing. |
| `request_num` | **mixed / decode (default)** | Iterates all instances, selects the one with the fewest concurrent requests (in-memory counting), suitable for decode and mixed scenarios. |
| `fd_metrics_score` | mixed / decode | Uses in-memory counting to get running/waiting request counts, scores by `running + waiting × waitingWeight`, selects the instance with the lowest score. |
| `fd_remote_metrics_score` | mixed / decode | Fetches running/waiting request counts from each instance's remote `/metrics` endpoint in real-time, scores by `running + waiting × waitingWeight`, selects the instance with the lowest score. Requires `metrics_port` in instance registration. **Note: A synchronous remote HTTP request is issued on every scheduling decision. With a large number of instances or poor network conditions, this can significantly increase scheduling latency. Evaluate your deployment conditions carefully before enabling this strategy.** |
| `cache_aware` | prefill | Maintains KV Cache prefix hit information per instance via Radix Tree, selects instances by combining hit ratio and load scores (in-memory counting); automatically falls back to `process_tokens` when load is severely imbalanced. |
| `remote_cache_aware` | prefill | Same cache-aware strategy as `cache_aware`, but uses remote `/metrics` endpoint for instance load data. Requires `metrics_port` in instance registration. **Note: A synchronous remote HTTP request is issued on every scheduling decision. With a large number of instances or poor network conditions, this can significantly increase scheduling latency. Evaluate your deployment conditions carefully before enabling this strategy.** |

## Troubleshooting

If you encounter issues while using the Router, please refer to the [Router Troubleshooting Guide](router_faq.md), which covers common log analysis, response output interpretation, and troubleshooting methods.

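The table above gives the `fd_metrics_score` scoring rule explicitly: `running + waiting × waitingWeight`, lowest score wins. A minimal Python sketch of that selection rule (the data layout and function names here are illustrative, not the Router's actual types):

```python
# Sketch of the fd_metrics_score selection rule from the strategy table:
# score = running + waiting * waitingWeight, lowest score wins.

def fd_metrics_score(running: int, waiting: int, waiting_weight: float = 10.0) -> float:
    return running + waiting * waiting_weight

def select_instance(instances: dict, waiting_weight: float = 10.0) -> str:
    # instances maps url -> (running, waiting) request counts
    return min(instances,
               key=lambda url: fd_metrics_score(*instances[url], waiting_weight))

workers = {
    "http://10.0.0.1:8000": (12, 0),  # 12 running, empty queue -> score 12
    "http://10.0.0.2:8000": (4, 2),   # short queue dominates   -> score 24
}
print(select_instance(workers))
```

Because `waiting-weight` defaults to 10, even a short queue outweighs a moderate number of in-flight requests, steering traffic away from instances that have started to backlog. `fd_remote_metrics_score` applies the same formula, only sourcing the counts from each instance's `/metrics` endpoint instead of in-memory counters.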
@@ -23,6 +23,12 @@ For basic Router usage, please refer to [Load-Balancing Scheduling Router](route
| `Failed to register instance from index {index}: {error}` | Instance at index {index} in config file failed to register | That instance was not registered | Health status, registration parameters |
| `failed to send request to {url} with error: {error}` | Health check request failed to send | The instance may be marked as unhealthy | Network connectivity, proxy settings |
| `scanner error: {error}` | Error occurred while reading backend streaming response | The current request may fail | Backend instance status |
| `[prefill] scanner error: {error}, message={message}` | Error occurred while reading Prefill backend streaming response | The current Prefill request may fail | Backend instance status |
| `[prefill] copy error: {error}, message={message}` | Error occurred while copying Prefill response data | The current Prefill request may fail | Backend instance status |
| `Panic recovered: {error}` | A panic occurred during request processing and was recovered | The current request fails, but the service continues running | Backend instance status, request content |
| `empty baseURL provided` | Health check received an empty base URL | Health check cannot be performed | Registration parameters |
| `failed to create request: {error}` | Failed to create health check request | The instance may be marked as unhealthy | Network environment |
| `failed to read response body: {error}` | Failed to read health check response body | The instance may be marked as unhealthy | Backend instance status |

### Warn-Level Logs

@@ -30,7 +36,9 @@ For basic Router usage, please refer to [Load-Balancing Scheduling Router](route
| :--- | :--- | :--- | :--- |
| `Server {url} is not healthy` | The instance at this URL failed health check | Router cannot register the instance, or will remove it from the registered list | Health status |
| `Instance {url} role is unknown` | Instance role cannot be recognized | The instance will not be added to the scheduling list | Registration parameters |
-| `cache-aware prefill: tokenizer failed, fallback to process_tokens: {error}` | Tokenizer service call failed, automatically falling back to process_tokens strategy | Prefill scheduling temporarily does not use cache_aware strategy; normal request processing is not affected | Tokenizer service status |
+| `cache-aware prefill: tokenizer failed, fallback to char tokens: {error}` | Tokenizer service call failed, automatically falling back to character-based tokenization | cache_aware strategy remains active, using character-based tokenization for cache matching instead of the Tokenizer; normal request processing is not affected | Tokenizer service status |
+| `cache-aware prefill: tokenize failed, fallback to process_tokens: {error}` | Tokenization completely failed (e.g., empty input), falling back to process_tokens strategy | Prefill scheduling temporarily does not use cache_aware strategy; normal request processing is not affected | Request content, Tokenizer service status |
+| `cache-aware prefill: final strategy: process_tokens, reason: tokenize failed: {error}. ts_ms={ts}` | Tokenization failed (new format), falling back to process_tokens strategy | Prefill scheduling temporarily does not use cache_aware strategy; normal request processing is not affected | Request content, Tokenizer service status |

### Info-Level Logs

@@ -42,6 +50,42 @@ For basic Router usage, please refer to [Load-Balancing Scheduling Router](route
| `No instances found in config file {path}` | No instances found in the registration config file | Check whether register.yaml is empty |
| `Request completed successfully.` | Request processing completed | Normal operation log |
| `Request failed, retrying...` | Request failed, retrying | Router will retry up to 3 times |
| `select worker (prefill): {url}, tokens: {tokens}` | Prefill scheduler selected a worker, showing current token processing count | Normal operation log |
| `select worker ({type}): {url}, count: {count}` | Decode/Mixed scheduler selected a worker, showing current request concurrency | Normal operation log |
| `release worker: {url}, count: {count}` | Request ended, worker counter released | Normal operation log |
| `release prefill tokens: {url}, tokens: {tokens}` | Prefill request ended, token load released | Normal operation log |
| `cleanup unhealthy worker counter: {url}` | Cleaned up counter for unhealthy worker | Normal operation log |
| `removed counters for {count} unhealthy workers: {urls}` | Batch cleanup of counters for unhealthy workers | Normal operation log |
| `[stats] total_running={n}, workers: [{loads}], cache_hit_rate={rate}% (hits={hits}/total={total})` | Periodic stats: total requests, worker loads, cache hit rate | Normal operation log, useful for monitoring and tuning |
| `Parsing completed; starting worker selection.` | Request parsing completed, starting worker selection | Normal operation log |
| `Request completed with an error.` | Request processing completed with an error | Check backend instance status |
| `[SelectWorkerPair] decode selection failed, releasing prefill counter url={url}` | Decode selection failed in PD disaggregated mode, releasing Prefill counter | Error handling log |
| `[prefill] first chunk received, release counter url={url}` | Prefill streaming response received first chunk, counter released | Normal operation log |
| `[prefill] non-stream prefill response done, release counter url={url}` | Prefill non-streaming response completed, counter released | Normal operation log |
| `[prefill] backendResp is nil or backendResp.Body is nil, url={url}` | Prefill backend response is nil | May indicate backend connection issue |
| `[prefill] release in defer (fallback) url={url}, isStream={bool}` | Fallback resource release when Prefill request exits abnormally | Error handling log |
| `[prefill] release in CommonCompletions defer (error path) url={url}` | Prefill resource release on error path | Error handling log |
| `cache-aware prefill: final strategy: process_tokens, reason: strategy not initialized` | cache_aware strategy not initialized, falling back to process_tokens | Check cache_aware configuration |
| `cache-aware prefill: final strategy: process_tokens, reason: load imbalanced, loads={loads}. ts_ms={ts}` | Load imbalanced across instances, falling back to process_tokens strategy | Normal operation log, automatic load balancing switch |
| `cache-aware prefill: final strategy: cache_aware_scoring, selected={url}, loads={loads}, hitRatios={ratios}. ts_ms={ts}` | cache_aware scoring strategy selected a worker | Normal operation log, showing loads and hit ratios |
| `[{method}] {path} {proto} {status} {latency} {clientIP}` | HTTP request access log | Normal operation log, records basic info for each request |
| `before SelectWorker prefill. ts_ms={ts}` | Starting Prefill worker selection in PD disaggregated mode | Normal operation log, for performance tracing |
| `before SelectWorker decode, after prefill. ts_ms={ts}` | Starting Decode worker selection after Prefill selection | Normal operation log, for performance tracing |
| `after SelectWorker decode, before return. ts_ms={ts}` | Decode worker selection completed | Normal operation log, for performance tracing |

### Debug-Level Logs

> Debug-level logs are only output when the log level is set to `debug`, typically used for development debugging.

| Log Message | Meaning | Description |
| :--- | :--- | :--- |
| `Healthy instances: prefill={urls}, decode={urls}, mixed={urls}` | Lists healthy instances for each role | Useful for verifying instance discovery |
| `cache-aware prefill: hashes={n} workers={n} load={loads} hit={hits}` | Hash count, worker count, and load info for cache_aware strategy | Useful for debugging cache hits |
| `cache-aware prefill: tokenizer tokens={tokens}` | Tokenizer tokenization result | Useful for debugging tokenization results |
| `cache-aware score: worker={url} hit={hit} loadRatio={ratio} score={score}` | Scoring details for cache_aware strategy | Useful for debugging scheduling decisions |
| `radix match: hashes={n} matched_len={n} node_children={n}` | Radix tree match details | Useful for debugging cache matching |
| `radix record: worker={url} hashes={n} node_depth={n}` | Radix tree record details | Useful for debugging cache recording |
| `radix eviction: removed={n} nodeCount={n}` | Radix tree eviction details | Useful for debugging cache eviction |

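The `radix match` / `radix record` entries above refer to the cache-aware strategy's prefix index: token IDs are grouped into blocks of `cache-block-size`, each block is hashed, and a tree over block hashes records which workers have already seen which prefixes. The Python sketch below illustrates that idea; the data structures, hashing, and names are assumptions, not the Router's actual implementation.

```python
# Illustrative block-hash prefix index, in the spirit of the radix-tree
# debug logs above. All structure here is assumed for illustration.
import hashlib

BLOCK_SIZE = 4  # mirrors `cache-block-size: 4`

def block_hashes(tokens):
    # Group token IDs into fixed-size blocks and hash each full block;
    # a trailing partial block is ignored.
    blocks = [tokens[i:i + BLOCK_SIZE] for i in range(0, len(tokens), BLOCK_SIZE)]
    return [hashlib.sha256(repr(b).encode()).hexdigest()
            for b in blocks if len(b) == BLOCK_SIZE]

class RadixIndex:
    def __init__(self):
        self.root = {}  # hash -> {"workers": set, "children": {...}}

    def record(self, worker, hashes):
        # Walk/extend the tree, tagging each node with the worker that
        # processed this prefix.
        children = self.root
        for h in hashes:
            node = children.setdefault(h, {"workers": set(), "children": {}})
            node["workers"].add(worker)
            children = node["children"]

    def matched_len(self, worker, hashes):
        # Number of leading blocks this worker already holds.
        children, matched = self.root, 0
        for h in hashes:
            node = children.get(h)
            if node is None or worker not in node["workers"]:
                break
            matched += 1
            children = node["children"]
        return matched

idx = RadixIndex()
prompt = list(range(12))  # 12 tokens -> 3 blocks
idx.record("http://10.0.0.1:8000", block_hashes(prompt))
# A new request sharing the first 8 tokens matches its first two blocks.
hit = idx.matched_len("http://10.0.0.1:8000",
                      block_hashes(prompt[:8] + [99, 98, 97, 96]))
print(hit)
```

The per-worker matched length divided by the total block count would give a hit ratio of the kind the `cache-aware score` log reports, and dropping stale subtrees periodically corresponds to the `radix eviction` entries.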
## Common Response Output Analysis

@@ -189,9 +233,10 @@ If `Failed to start server` appears in startup logs, check:

### Tokenizer Service (cache_aware Strategy)

-When using the `cache_aware` scheduling strategy, the Router calls a Tokenizer service to tokenize requests for cache hit ratio computation. When the Tokenizer service is unavailable, the Router will log a Warn-level message: `tokenizer failed, fallback to process_tokens`.
+When using the `cache_aware` scheduling strategy, the Router calls a Tokenizer service to tokenize requests for cache hit ratio computation. When the Tokenizer service is unavailable, the Router has a two-level degradation mechanism:

-**This does not affect normal request processing** — the Router has a built-in degradation mechanism that automatically falls back to the `process_tokens` strategy for continued scheduling. The only impact is the temporary loss of cache-aware optimization.
+1. **Fallback to character-based tokenization** (common case): The log will show `tokenizer failed, fallback to char tokens`. The cache_aware strategy remains active, using character-based tokenization for cache matching instead of the Tokenizer. Cache hit accuracy may decrease, but normal request processing is not affected.
+2. **Fallback to process_tokens strategy** (extreme case): When tokenization completely fails (e.g., empty request content), the log will show `tokenize failed, fallback to process_tokens`. The cache_aware strategy temporarily becomes inactive, and scheduling falls back to token processing volume. Normal request processing is not affected.

To restore full cache_aware functionality:

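The two-level degradation described above can be sketched as a simple fallback chain. This Python sketch is illustrative only; the function names and control flow are assumptions, not the Router's implementation.

```python
# Sketch of the two-level tokenizer degradation: try the remote tokenizer,
# fall back to character-level tokens, and only abandon cache_aware
# (falling back to process_tokens) when tokenization yields nothing at all.

def remote_tokenize(text):
    # Stand-in for the remote Tokenizer service call; here it always
    # fails to simulate an outage.
    raise ConnectionError("tokenizer service unavailable")

def tokenize_with_fallback(text):
    try:
        return "cache_aware", remote_tokenize(text)
    except Exception:
        # Level 1 (common case): char-level tokens, cache_aware stays active.
        chars = list(text)
        if chars:
            return "cache_aware", chars
        # Level 2 (extreme case): nothing to match on, switch strategy.
        return "process_tokens", []

print(tokenize_with_fallback("hello"))  # char-level fallback, cache_aware kept
print(tokenize_with_fallback(""))       # empty input, process_tokens fallback
```

Either branch leaves request processing untouched; only the scheduling signal degrades, which is exactly what the Warn-level logs in the tables above report.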
@@ -190,7 +190,7 @@ server:
  splitwise: true # true enables PD disaggregation; false disables it

scheduler:
-  policy: "power_of_two" # Scheduling policy (optional): random, power_of_two, round_robin, process_tokens, request_num, cache_aware, fd_metrics_score; default: request_num
+  policy: "power_of_two" # Scheduling policy (optional): random, power_of_two, round_robin, process_tokens, request_num, cache_aware, remote_cache_aware, fd_metrics_score, fd_remote_metrics_score; default: request_num
  prefill-policy: "cache_aware" # Prefill scheduling policy in PD mode; default: process_tokens
  decode-policy: "fd_metrics_score" # Decode scheduling policy in PD mode; default: request_num
  eviction-interval-secs: 60 # Interval at which the cache-aware strategy evicts expired cache entries

@@ -199,9 +199,12 @@ scheduler:
  hit-ratio-weight: 1.0 # Hit-ratio weight for the cache-aware strategy
  load-balance-weight: 0.05 # Load-balancing weight for the cache-aware strategy
  cache-block-size: 4 # Cache block size for the cache-aware strategy
-  tokenizer-url: "http://0.0.0.0:8098" # Tokenizer service endpoint (optional)
-  tokenizer-timeout-secs: 2 # Tokenizer service timeout
+  # tokenizer-url: "http://0.0.0.0:8098" # Tokenizer service endpoint (optional); when not configured, the cache_aware strategy automatically uses character-level tokenization.
+  # Note: configuring this option issues a synchronous call to the remote tokenizer service on every scheduling decision, introducing extra network latency;
+  # only consider enabling it when precise token-level tokenization is needed to improve the cache hit rate.
+  # tokenizer-timeout-secs: 2 # Tokenizer service timeout; default: 2
  waiting-weight: 10 # Waiting weight for the cache-aware strategy
+  stats-interval-secs: 5 # Stats logging interval in seconds, includes load and cache hit rate statistics; default: 5

manager:
  health-failure-threshold: 3 # Number of health check failures after which a node is considered unhealthy

@@ -265,10 +268,12 @@ The Router supports the following scheduling strategies, configurable via `policy` (mixed
| `random` | General | Randomly selects one available instance, stateless, suitable for lightweight scenarios. |
| `round_robin` | General | Uses an atomic counter to cycle through the instance list, distributing requests evenly in order. |
| `power_of_two` | General | Randomly picks two instances, compares their concurrent request counts, and selects the one with the lower load. |
-| `process_tokens` | **prefill (default)** | Iterates all instances and selects the one with the fewest tokens currently being processed, suitable for prefill long-request load balancing. |
-| `request_num` | **mixed / decode (default)** | Iterates all instances and selects the one with the fewest concurrent requests, suitable for decode and mixed scenarios. |
-| `fd_metrics_score` | mixed / decode | Fetches running/waiting request counts from each instance's metrics endpoint in real time, scores by `running + waiting × waitingWeight`, and selects the instance with the lowest score. |
-| `cache_aware` | prefill | Maintains KV Cache prefix hit information per instance via a Radix Tree and selects instances by combining hit-ratio and load scores; automatically falls back to `process_tokens` when load is severely imbalanced. |
+| `process_tokens` | **prefill (default)** | Iterates all instances and selects the one with the fewest tokens currently being processed (in-memory counting), suitable for prefill long-request load balancing. |
+| `request_num` | **mixed / decode (default)** | Iterates all instances and selects the one with the fewest concurrent requests (in-memory counting), suitable for decode and mixed scenarios. |
+| `fd_metrics_score` | mixed / decode | Uses in-memory counting to obtain running/waiting request counts, scores by `running + waiting × waitingWeight`, and selects the instance with the lowest score. |
+| `fd_remote_metrics_score` | mixed / decode | Fetches running/waiting request counts from each instance's remote `/metrics` endpoint in real time, scores by `running + waiting × waitingWeight`, and selects the instance with the lowest score. Requires `metrics_port` in instance registration. **Note: a synchronous remote HTTP request is issued on every scheduling decision; with many instances or poor network conditions this can significantly increase scheduling latency. Evaluate your deployment before enabling this strategy.** |
+| `cache_aware` | prefill | Maintains KV Cache prefix hit information per instance via a Radix Tree and selects instances by combining hit-ratio and load scores (in-memory counting); automatically falls back to `process_tokens` when load is severely imbalanced. |
+| `remote_cache_aware` | prefill | Same cache-aware strategy as `cache_aware`, but obtains instance load data from the remote `/metrics` endpoint. Requires `metrics_port` in instance registration. **Note: a synchronous remote HTTP request is issued on every scheduling decision; with many instances or poor network conditions this can significantly increase scheduling latency. Evaluate your deployment before enabling this strategy.** |

## Troubleshooting

@@ -23,6 +23,12 @@ For basic Router usage, please refer to [Load-Balancing Scheduling Router](router.md).
| `Failed to register instance from index {index}: {error}` | Instance at index {index} in the config file failed to register | That instance was not registered | Health status, registration parameters |
| `failed to send request to {url} with error: {error}` | Health check request failed to send | The instance may be marked as unhealthy | Network connectivity, proxy settings |
| `scanner error: {error}` | Error occurred while reading backend streaming response | The current request may fail | Backend instance status |
| `[prefill] scanner error: {error}, message={message}` | Error occurred while reading Prefill backend streaming response | The current Prefill request may fail | Backend instance status |
| `[prefill] copy error: {error}, message={message}` | Error occurred while copying Prefill response data | The current Prefill request may fail | Backend instance status |
| `Panic recovered: {error}` | A panic occurred during request processing and was recovered | The current request fails, but the service continues running | Backend instance status, request content |
| `empty baseURL provided` | Health check received an empty base URL | Health check cannot be performed | Registration parameters |
| `failed to create request: {error}` | Failed to create health check request | The instance may be marked as unhealthy | Network environment |
| `failed to read response body: {error}` | Failed to read health check response body | The instance may be marked as unhealthy | Backend instance status |

### Warn-Level Logs

@@ -30,7 +36,9 @@ For basic Router usage, please refer to [Load-Balancing Scheduling Router](router.md).
| :--- | :--- | :--- | :--- |
| `Server {url} is not healthy` | The instance at this URL failed health check | Router cannot register the instance, or will remove it from the registered list | Health status |
| `Instance {url} role is unknown` | Instance role cannot be recognized | The instance will not be added to the scheduling list | Registration parameters |
-| `cache-aware prefill: tokenizer failed, fallback to process_tokens: {error}` | Tokenizer service call failed, automatically falling back to process_tokens strategy | Prefill scheduling temporarily does not use cache_aware strategy; normal request processing is not affected | Tokenizer service status |
+| `cache-aware prefill: tokenizer failed, fallback to char tokens: {error}` | Tokenizer service call failed, automatically falling back to character-level tokenization | cache_aware strategy remains active, using character-level tokenization for cache matching instead of the Tokenizer; normal request processing is not affected | Tokenizer service status |
+| `cache-aware prefill: tokenize failed, fallback to process_tokens: {error}` | Tokenization completely failed (e.g., empty input), falling back to process_tokens strategy | Prefill scheduling temporarily does not use cache_aware strategy; normal request processing is not affected | Request content, Tokenizer service status |
+| `cache-aware prefill: final strategy: process_tokens, reason: tokenize failed: {error}. ts_ms={ts}` | Tokenization failed (new format), falling back to process_tokens strategy | Prefill scheduling temporarily does not use cache_aware strategy; normal request processing is not affected | Request content, Tokenizer service status |

### Info-Level Logs

@@ -42,6 +50,42 @@ For basic Router usage, please refer to [Load-Balancing Scheduling Router](router.md).
| `No instances found in config file {path}` | No instances found in the registration config file | Check whether register.yaml is empty |
| `Request completed successfully.` | Request processing completed | Normal operation log |
| `Request failed, retrying...` | Request failed, retrying | Router will retry up to 3 times |
| `select worker (prefill): {url}, tokens: {tokens}` | Prefill scheduler selected a worker, showing current token processing count | Normal operation log |
| `select worker ({type}): {url}, count: {count}` | Decode/Mixed scheduler selected a worker, showing current request concurrency | Normal operation log |
| `release worker: {url}, count: {count}` | Request ended, worker counter released | Normal operation log |
| `release prefill tokens: {url}, tokens: {tokens}` | Prefill request ended, token load released | Normal operation log |
| `cleanup unhealthy worker counter: {url}` | Cleaned up counter for unhealthy worker | Normal operation log |
| `removed counters for {count} unhealthy workers: {urls}` | Batch cleanup of counters for unhealthy workers | Normal operation log |
| `[stats] total_running={n}, workers: [{loads}], cache_hit_rate={rate}% (hits={hits}/total={total})` | Periodic stats: total requests, worker loads, cache hit rate | Normal operation log, useful for monitoring and tuning |
| `Parsing completed; starting worker selection.` | Request parsing completed, starting worker selection | Normal operation log |
| `Request completed with an error.` | Request processing completed with an error | Check backend instance status |
| `[SelectWorkerPair] decode selection failed, releasing prefill counter url={url}` | Decode selection failed in PD disaggregated mode, releasing Prefill counter | Error handling log |
| `[prefill] first chunk received, release counter url={url}` | Prefill streaming response received first chunk, counter released | Normal operation log |
| `[prefill] non-stream prefill response done, release counter url={url}` | Prefill non-streaming response completed, counter released | Normal operation log |
| `[prefill] backendResp is nil or backendResp.Body is nil, url={url}` | Prefill backend response is nil | May indicate backend connection issue |
| `[prefill] release in defer (fallback) url={url}, isStream={bool}` | Fallback resource release when Prefill request exits abnormally | Error handling log |
| `[prefill] release in CommonCompletions defer (error path) url={url}` | Prefill resource release on error path | Error handling log |
| `cache-aware prefill: final strategy: process_tokens, reason: strategy not initialized` | cache_aware strategy not initialized, falling back to process_tokens | Check cache_aware configuration |
| `cache-aware prefill: final strategy: process_tokens, reason: load imbalanced, loads={loads}. ts_ms={ts}` | Load imbalanced across instances, falling back to process_tokens strategy | Normal operation log, automatic load balancing switch |
| `cache-aware prefill: final strategy: cache_aware_scoring, selected={url}, loads={loads}, hitRatios={ratios}. ts_ms={ts}` | cache_aware scoring strategy selected a worker | Normal operation log, showing loads and hit ratios |
| `[{method}] {path} {proto} {status} {latency} {clientIP}` | HTTP request access log | Normal operation log, records basic info for each request |
| `before SelectWorker prefill. ts_ms={ts}` | Starting Prefill worker selection in PD disaggregated mode | Normal operation log, for performance tracing |
| `before SelectWorker decode, after prefill. ts_ms={ts}` | Starting Decode worker selection after Prefill selection | Normal operation log, for performance tracing |
| `after SelectWorker decode, before return. ts_ms={ts}` | Decode worker selection completed | Normal operation log, for performance tracing |

### Debug-Level Logs

> Debug-level logs are only output when the log level is set to `debug`, typically used for development debugging.

| Log Message | Meaning | Description |
| :--- | :--- | :--- |
| `Healthy instances: prefill={urls}, decode={urls}, mixed={urls}` | Lists the current healthy instances for each role | Useful for verifying instance discovery |
| `cache-aware prefill: hashes={n} workers={n} load={loads} hit={hits}` | Hash count, worker count, and load info for the cache_aware strategy | Useful for debugging cache hits |
| `cache-aware prefill: tokenizer tokens={tokens}` | Tokenizer tokenization result | Useful for debugging tokenization results |
| `cache-aware score: worker={url} hit={hit} loadRatio={ratio} score={score}` | Scoring details for the cache_aware strategy | Useful for debugging scheduling decisions |
| `radix match: hashes={n} matched_len={n} node_children={n}` | Radix tree match details | Useful for debugging cache matching |
| `radix record: worker={url} hashes={n} node_depth={n}` | Radix tree record details | Useful for debugging cache recording |
| `radix eviction: removed={n} nodeCount={n}` | Radix tree eviction details | Useful for debugging cache eviction |

## Common Response Output Analysis

@@ -189,9 +233,10 @@ In PD disaggregated mode, it is recommended to configure all of the following parameters to ensure KV Cache transfer works correctly

### Tokenizer Service (cache_aware Strategy)

-When using the `cache_aware` scheduling strategy, the Router calls a Tokenizer service to tokenize requests for cache hit ratio computation. When the Tokenizer service is unavailable, the logs will show a Warn-level message: `tokenizer failed, fallback to process_tokens`.
+When using the `cache_aware` scheduling strategy, the Router calls a Tokenizer service to tokenize requests for cache hit ratio computation. When the Tokenizer service is unavailable, the Router has a built-in two-level degradation mechanism:

-**This does not affect normal request processing**: the Router has a built-in degradation mechanism that automatically falls back to the `process_tokens` strategy for continued scheduling. The only impact is the temporary loss of cache-aware optimization.
+1. **Fallback to character-level tokenization** (common case): The log shows `tokenizer failed, fallback to char tokens`. The cache_aware strategy remains active, using character-level tokenization for cache matching instead of the Tokenizer. Cache hit accuracy may decrease, but normal request processing is not affected.
+2. **Fallback to process_tokens strategy** (extreme case): When tokenization completely fails (e.g., empty request content), the log shows `tokenize failed, fallback to process_tokens`. The cache_aware strategy is temporarily inactive and scheduling falls back to token processing volume; normal request processing is likewise not affected.

To restore full cache_aware functionality:
