[Metrics] Update time_to_first_token to include tokenization & queue time, and remove redundant metrics (#4993)

* [update] update time_to_first_token to include queue time, and remove first_token_latency and infer_latency

* [doc] update docs

* [ci] fix test

* [chore] delete redundant code
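
Note on the metric change: the sketch below illustrates what "include tokenization & queue time" means for time_to_first_token. It is a minimal illustration, not the code from this commit. The names RequestTimestamps, arrival_time, inference_start_time, first_token_time, and inference_only_latency are hypothetical, and all timestamps are assumed to be seconds read from the same clock.

```python
from dataclasses import dataclass


@dataclass
class RequestTimestamps:
    """Hypothetical per-request timestamps, in seconds on one clock."""
    arrival_time: float          # request received, before tokenization
    inference_start_time: float  # request left the queue and was scheduled
    first_token_time: float      # first output token produced


def time_to_first_token(ts: RequestTimestamps) -> float:
    """TTFT measured from arrival, so tokenization and queue time are included."""
    return ts.first_token_time - ts.arrival_time


def inference_only_latency(ts: RequestTimestamps) -> float:
    """Narrower measurement that starts at scheduling; shown for contrast only."""
    return ts.first_token_time - ts.inference_start_time


ts = RequestTimestamps(arrival_time=10.0, inference_start_time=10.5, first_token_time=11.0)
print(time_to_first_token(ts))     # 1.0
print(inference_only_latency(ts))  # 0.5
```

Measuring from arrival means a single metric covers everything the client waits for before the first token, which is presumably why the narrower latency metrics could be dropped as redundant.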

---------

Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
Yonghua Li
2025-11-26 14:42:17 +08:00
committed by GitHub
parent 287751f19d
commit cead6b26fa
9 changed files with 92 additions and 139 deletions
@@ -1820,6 +1820,7 @@ class PrefixCacheManager:
        # reset metrics
        self.metrics.reset_metrics()
        main_process_metrics.free_gpu_block_num.set(len(self.gpu_free_block_list))
        main_process_metrics.available_gpu_block_num.set(len(self.gpu_free_block_list))
        main_process_metrics.available_gpu_resource.set(self.available_gpu_resource)

    def clear_prefix_cache(self):