[Feature] Fix counter release logic & update go-router download URL (#6280)

* [Doc] Update prerequisites in the documentation * [Feature] Enhance Router with /v1/completions, docs, scripts, and version info * [Feature] Enhance Router with /v1/completions, docs, scripts, and version info * [Feature] Enhance Router with /v1/completions, docs, scripts, and version info * [Feature] Fix counter release logic * [Feature] Update go-router download URL * [Feature] Update go-router download URL * [Feature] Update go-router download URL * [Feature] Update go-router download URL * [Feature] Update token counter logic and docs * [Feature] Update token counter logic and docs --------- Co-authored-by: mouxin <mouxin@baidu.com>
2026-04-22 16:07:51 +08:00 · 2026-02-04 15:02:38 +08:00
parent 36547cfdb3
commit 6e96bd0bd2
16 changed files with 115 additions and 51 deletions
@@ -83,6 +83,7 @@ FastDeploy 支持在**英伟达（NVIDIA）GPU**、**昆仑芯（Kunlunxin）XPU
 - [投机解码](./docs/zh/features/speculative_decoding.md)
 - [前缀缓存](./docs/zh/features/prefix_caching.md)
 - [分块预填充](./docs/zh/features/chunked_prefill.md)
+- [负载均衡调度Router](./docs/zh/online_serving/router.md)

 ## 致谢

@@ -81,6 +81,7 @@ Learn how to download models, enable using the torch format, and more:
 - [Speculative Decoding](./docs/features/speculative_decoding.md)
 - [Prefix Caching](./docs/features/prefix_caching.md)
 - [Chunked Prefill](./docs/features/chunked_prefill.md)
+- [Load-Balancing Scheduling Router](./docs/online_serving/router.md)

 ## Acknowledgement

@@ -70,11 +70,11 @@ The usage and request scheduling workflow is as follows:

 ### Launching the Router

-Start the Router service. Logs are written to `log_router/router.log`.
+Start the Router service. Logs are written to `log_router/router.log`. `fd-router` installation instructions can be found in the [Router documentation](../online_serving/router.md).

 ```shell
 export FD_LOG_DIR="log_router"
-python -m fastdeploy.router.launch \
+/usr/local/bin/fd-router \
    --host 0.0.0.0 \
    --port 30000
 ```
@@ -67,11 +67,11 @@ bash build.sh

 **Quick Start**

-Start the Router service. The `--splitwise` parameter specifies the scheduling mode as disaggregated deployment. Log information is output to `log_router/router.log`.
+Start the Router service. The `--splitwise` parameter specifies the scheduling mode as disaggregated deployment. Log information is output to `log_router/router.log`. `fd-router` installation instructions can be found in the [Router documentation](../online_serving/router.md).

 ```bash
 export FD_LOG_DIR="log_router"
-python -m fastdeploy.router.launch \
+/usr/local/bin/fd-router \
    --host 0.0.0.0 \
    --port 30000 \
    --splitwise
@@ -4,12 +4,20 @@

 FastDeploy provides a Golang-based [Router](https://github.com/PaddlePaddle/FastDeploy/tree/develop/fastdeploy/golang_router) for request scheduling. The Router supports both centralized deployment and Prefill/Decode (PD) disaggregated deployment.。

+![go-router](images/go-router-workflow.png)
+
 ## Installation

 ### 1. Prebuilt Binaries

 Starting from FastDeploy v2.5.0, the official Docker images include the Go language environment required to build the Golang Router and also provide a precompiled Router binary. The Router binary is located by default in the `/usr/local/bin` directory and can be used directly without additional compilation. For installation details, please refer to the [FastDeploy Installation Guide](../get_started/installation/nvidia_gpu.md)

+If you need to download the Golang-based router binary separately, it can be installed using the following steps:
+```
+wget https://paddle-qa.bj.bcebos.com/paddle-pipeline/FastDeploy_ActionCE/develop/latest/fd-router
+mv fd-router /usr/local/bin/fd-router
+```
+
 ### 2. Build from Source

 You need to build the Router from source in the following scenarios:
@@ -33,7 +41,7 @@ bash build.sh

 Start the Router service. The `--port` parameter specifies the scheduling port for centralized deployment.
 ```
-./fd-router --port 30000
+/usr/local/bin/fd-router --port 30000
 ```

 Start a mixed inference instance. Compared to standalone deployment, specify the Router endpoint via `--router`. Other parameters remain unchanged.
@@ -50,7 +58,7 @@ python -m fastdeploy.entrypoints.openai.api_server \

 Start the Router service with PD disaggregation enabled using the `--splitwise` flag.
 ```
-./fd-router \
+/usr/local/bin/fd-router \
  --port 30000 \
  --splitwise
 ```
@@ -105,7 +113,7 @@ popd

 Launch the Router with the custom configuration specified via `--config_path`:
 ```
-./fd-router \
+/usr/local/bin/fd-router \
  --port 30000 \
  --splitwise \
  --config_path examples/run_with_config/config/config.yaml
@@ -55,10 +55,10 @@ FastDeploy提供[Router](https://github.com/PaddlePaddle/FastDeploy/tree/develop
 - Router接收实例的生成结果，返回给用户

 上手示例：
- 启动Router服务，日志信息输出在`log_router/router.log`。
+- 启动Router服务，日志信息输出在`log_router/router.log`。`fd-router`的安装方法参考[Router说明文档](../online_serving/router.md)。
 ```
 export FD_LOG_DIR="log_router"
-python -m fastdeploy.router.launch \
+/usr/local/bin/fd-router \
    --host 0.0.0.0 \
    --port 30000 \
 ```
@@ -65,10 +65,10 @@ bash build.sh

 **快速上手**

-启动Router服务，其中`--splitwise`参数指定为分离式部署的调度方式，日志信息输出在`log_router/router.log`。
+启动Router服务，其中`--splitwise`参数指定为分离式部署的调度方式，日志信息输出在`log_router/router.log`。`fd-router`的安装方法参考[Router说明文档](../online_serving/router.md)。
 ```
 export FD_LOG_DIR="log_router"
-python -m fastdeploy.router.launch \
+/usr/local/bin/fd-router \
    --host 0.0.0.0 \
    --port 30000 \
    --splitwise
@@ -4,11 +4,19 @@

 FastDeploy提供Golang版本[Router](https://github.com/PaddlePaddle/FastDeploy/tree/develop/fastdeploy/golang_router)，用于实现请求的调度。Router支持集中式部署和PD分离式部署。

+![go-router](images/go-router-workflow.png)
+
 ## 安装

 ### 1. 预编译库下载

-在 FastDeploy v2.5.0 及之后版本中，官方 Docker 镜像将内置 Golang Router 编译所需的 Go 语言环境，并提供已编译完成的 Router 二进制文件。该二进制文件默认位于 `/usr/local/bin` 目录下，可直接使用。相关安装方式可参考 [FastDeploy 安装文档](../get_started/installation/nvidia_gpu.md)
+在 FastDeploy v2.5.0 及之后版本中，官方 Docker 镜像将内置 Golang Router 编译所需的 Go 语言环境，并提供已编译完成的 Router 二进制文件。该二进制文件默认位于 `/usr/local/bin` 目录下，可直接使用。相关安装方式可参考 [FastDeploy 安装文档](../get_started/installation/nvidia_gpu.md)。
+
+若需单独下载 Golang router 二进制文件，可通过以下方式：
+```
+wget https://paddle-qa.bj.bcebos.com/paddle-pipeline/FastDeploy_ActionCE/develop/latest/fd-router
+mv fd-router /usr/local/bin/fd-router
+```

 ### 2. 编译安装

@@ -27,13 +35,14 @@ FastDeploy提供Golang版本[Router](https://github.com/PaddlePaddle/FastDeploy/
 git clone https://github.com/PaddlePaddle/FastDeploy.git
 cd FastDeploy/fastdeploy/golang_router
 bash build.sh
+cp
 ```

 ## 集中式部署

 启动Router服务，其中`--port`参数指定集中式部署的调度端口.
 ```
-./fd-router \
+/usr/local/bin/fd-router \
  --port 30000
 ```

@@ -51,7 +60,7 @@ python -m fastdeploy.entrypoints.openai.api_server \

 启动Router服务，其中`--splitwise`参数指定为分离式部署的调度方式.
 ```
-./fd-router \
+/usr/local/bin/fd-router \
  --port 30000 \
  --splitwise
 ```
@@ -106,7 +115,7 @@ popd

 在Router启动Router服务，其中`--config_path`参数指定配置路径.
 ```
-./fd-router \
+/usr/local/bin/fd-router \
  --port 30000 \
  --splitwise \
  --config_path examples/run_with_config/config/config.yaml
@@ -17,8 +17,7 @@ ROUTER_PORT=52600

 FD_BIN_DIR="/usr/local/bin"
 FD_ROUTER_BIN="${FD_BIN_DIR}/fd-router"
-FD_ROUTER_URL="https://paddle-qa.bj.bcebos.com/FastDeploy/fd-router"
-FD_ROUTER_SHA256="67640aaeebdd886826d3534930b2154cd2c1441a26bc3f38c3af5f0aadba7c2d"
+FD_ROUTER_URL="https://paddle-qa.bj.bcebos.com/paddle-pipeline/FastDeploy_ActionCE/develop/latest/fd-router"

 ports=(
    $S1_PORT $((S1_PORT + 1)) $((S1_PORT + 2)) $((S1_PORT + 3))
@@ -37,20 +36,44 @@ if [ ! -x "${FD_ROUTER_BIN}" ]; then
    mkdir -p "${FD_BIN_DIR}"
    TMP_BIN="${FD_ROUTER_BIN}.tmp"

-    wget -q --no-proxy "${FD_ROUTER_URL}" -O "${TMP_BIN}" || exit 1
-
-    echo "${FD_ROUTER_SHA256}  ${TMP_BIN}" | sha256sum -c - || {
-        echo "❌ Integrity check failed"
+    wget -q --no-proxy "${FD_ROUTER_URL}" -O "${TMP_BIN}" || {
+        echo "❌ Download fd-router failed"
        rm -f "${TMP_BIN}"
        exit 1
    }

+    # ------- sanity checks (no fixed hash) -------
+
+    # 1. must be ELF binary
+    file "${TMP_BIN}" || grep -q "ELF" || {
+        echo "❌ fd-router is not an ELF binary"
+        rm -f "${TMP_BIN}"
+        exit 1
+    }
+
+    # 2. must be x86_64 architecture
+    file "${TMP_BIN}" | grep -q "x86-64" || {
+        echo "❌ fd-router architecture mismatch"
+        rm -f "${TMP_BIN}"
+        exit 1
+    }
+
+    # 3. size check (avoid HTML / empty / error pages)
+    SIZE=$(stat -c%s "${TMP_BIN}")
+    if [ "$SIZE" -lt 1000000 ]; then
+        echo "❌ fd-router size is too small ($SIZE bytes), suspicious"
+        rm -f "${TMP_BIN}"
+        exit 1
+    fi
+
+    # -------------------------------------
+
    mv "${TMP_BIN}" "${FD_ROUTER_BIN}"
    chmod +x "${FD_ROUTER_BIN}"

-    echo "fd-router installed and verified"
+    echo "✅ fd-router installed with sanity checks"
 else
-    echo "fd-router already exists"
+    echo "✅ fd-router already exists"
 fi

 # start router
@@ -24,8 +24,7 @@ LOG_DATE=$(date +%Y%m%d_%H%M%S)

 FD_BIN_DIR="/usr/local/bin"
 FD_ROUTER_BIN="${FD_BIN_DIR}/fd-router"
-FD_ROUTER_URL="https://paddle-qa.bj.bcebos.com/FastDeploy/fd-router"
-FD_ROUTER_SHA256="67640aaeebdd886826d3534930b2154cd2c1441a26bc3f38c3af5f0aadba7c2d"
+FD_ROUTER_URL="https://paddle-qa.bj.bcebos.com/paddle-pipeline/FastDeploy_ActionCE/develop/latest/fd-router"

 ports=($P_PORT $D_PORT $ROUTER_PORT)
 check_ports "${ports[@]}" || {
@@ -40,20 +39,44 @@ if [ ! -x "${FD_ROUTER_BIN}" ]; then
    mkdir -p "${FD_BIN_DIR}"
    TMP_BIN="${FD_ROUTER_BIN}.tmp"

-    wget -q --no-proxy "${FD_ROUTER_URL}" -O "${TMP_BIN}" || exit 1
-
-    echo "${FD_ROUTER_SHA256}  ${TMP_BIN}" | sha256sum -c - || {
-        echo "❌ Integrity check failed"
+    wget -q --no-proxy "${FD_ROUTER_URL}" -O "${TMP_BIN}" || {
+        echo "❌ Download fd-router failed"
        rm -f "${TMP_BIN}"
        exit 1
    }

+    # ------- sanity checks (no fixed hash) -------
+
+    # 1. must be ELF binary
+    file "${TMP_BIN}" || grep -q "ELF" || {
+        echo "❌ fd-router is not an ELF binary"
+        rm -f "${TMP_BIN}"
+        exit 1
+    }
+
+    # 2. must be x86_64 architecture
+    file "${TMP_BIN}" | grep -q "x86-64" || {
+        echo "❌ fd-router architecture mismatch"
+        rm -f "${TMP_BIN}"
+        exit 1
+    }
+
+    # 3. size check (avoid HTML / empty / error pages)
+    SIZE=$(stat -c%s "${TMP_BIN}")
+    if [ "$SIZE" -lt 1000000 ]; then
+        echo "❌ fd-router size is too small ($SIZE bytes), suspicious"
+        rm -f "${TMP_BIN}"
+        exit 1
+    fi
+
+    # -------------------------------------
+
    mv "${TMP_BIN}" "${FD_ROUTER_BIN}"
    chmod +x "${FD_ROUTER_BIN}"

-    echo "fd-router installed and verified"
+    echo "✅ fd-router installed with sanity checks"
 else
-    echo "fd-router already exists"
+    echo "✅ fd-router already exists"
 fi

 # start router
@@ -19,5 +19,5 @@ if [ -n "$PID" ]; then
 fi

 echo "Starting new fd-router process..."
-nohup ./fd-router --config_path ./config/config.yaml --splitwise > fd-router.log 2>&1 &
+nohup /usr/local/bin/fd-router --config_path ./config/config.yaml --splitwise > fd-router.log 2>&1 &
 echo "fd-router started with PID: $!"
@@ -14,5 +14,5 @@ if [ -n "$PID" ]; then
 fi

 echo "Starting new fd-router process..."
-nohup ./fd-router --config_path ./config/config.yaml --splitwise > fd-router.log 2>&1 &
+nohup /usr/local/bin/fd-router --config_path ./config/config.yaml --splitwise > fd-router.log 2>&1 &
 echo "fd-router started with PID: $!"
@@ -142,7 +142,7 @@ func extractPromptFromCompletionsRequest(rawReq map[string]any) string {
 }

 // PostToPD sends requests to both Prefill and Decode instances, only returns Decode node response
-func PostToPD(c *gin.Context, decodeURL, prefillURL string, reqBody []byte, isStream bool, completionEndpoint string) (*http.Response, error) {
+func PostToPD(c *gin.Context, decodeURL, prefillURL string, reqBody []byte, isStream bool, message string, completionEndpoint string) (*http.Response, error) {
 	ctx := c.Request.Context()

 	decodeEndpoint := fmt.Sprintf("%s/v1/%s", decodeURL, completionEndpoint)
@@ -206,13 +206,13 @@ func PostToPD(c *gin.Context, decodeURL, prefillURL string, reqBody []byte, isSt
 	}

 	if prefillRes.resp != nil {
-		go readPrefillRecv(ctx, prefillURL, isStream, prefillRes.resp)
+		go readPrefillRecv(ctx, prefillURL, isStream, message, prefillRes.resp)
 	}

 	return decodeRes.resp, nil
 }

-func readPrefillRecv(ctx context.Context, url string, isStream bool, backendResp *http.Response) {
+func readPrefillRecv(ctx context.Context, url string, isStream bool, message string, backendResp *http.Response) {
 	if backendResp == nil || backendResp.Body == nil {
 		return
 	}
@@ -231,21 +231,22 @@ func readPrefillRecv(ctx context.Context, url string, isStream bool, backendResp
 			// Fallback to ensure release
 			if !released {
 				scheduler_handler.Release(ctx, url)
+				scheduler_handler.ReleasePrefillTokens(ctx, url, message)
 				logger.Debug("[prefill] release in defer (fallback) url=%s", url)
 			}
 		}()

 		for scanner.Scan() {
-			line := scanner.Text()
+			_ = scanner.Text()

 			// First read that returns data
 			if !released {
 				scheduler_handler.Release(ctx, url)
+				scheduler_handler.ReleasePrefillTokens(ctx, url, message)
 				released = true

 				logger.Debug("[prefill] first chunk received, release scheduler url=%s", url)
 			}
-			logger.Debug("[prefill] recv result: %s", line)
 		}

 		if err := scanner.Err(); err != nil {
@@ -295,11 +296,12 @@ func CommonCompletions(c *gin.Context, extractor PromptExtractor, completionEndp
 		requestBodyData []byte
 		prefillURL      string
 		decodeURL       string
+		message         string
 	)

 	if isSplitwise {
 		// PD mode: select instances for Prefill/Decode separately
-		message := extractor(rawReq)
+		message = extractor(rawReq)

 		prefillURL, decodeURL, err = manager.SelectWorkerPair(ctx, message)
 		if err != nil {
@@ -313,9 +315,6 @@ func CommonCompletions(c *gin.Context, extractor PromptExtractor, completionEndp
 			return
 		}

-		// Prefill node token count was added in SelectWorker, release when request ends
-		defer scheduler_handler.ReleasePrefillTokens(ctx, prefillURL, message)
-
 		// Construct disaggregate_info to ensure selected P/D work in pairs within FastDeploy
 		disagg, err := manager.BuildDisaggregateInfo(ctx, prefillURL, decodeURL)
 		if err != nil {
@@ -340,7 +339,7 @@ func CommonCompletions(c *gin.Context, extractor PromptExtractor, completionEndp
 		}

 		destURL = decodeURL
-		releaseTargets = []string{prefillURL, decodeURL}
+		releaseTargets = []string{decodeURL}

 		// Expose scheduling results to caller for debugging/validating scheduling strategy
 		c.Writer.Header().Set("X-Router-Prefill-URL", prefillURL)
@@ -376,7 +375,7 @@ func CommonCompletions(c *gin.Context, extractor PromptExtractor, completionEndp
 	// Send request
 	var backendResp *http.Response
 	if isSplitwise {
-		backendResp, err = PostToPD(c, decodeURL, prefillURL, requestBodyData, isStream, completionEndpoint)
+		backendResp, err = PostToPD(c, decodeURL, prefillURL, requestBodyData, isStream, message, completionEndpoint)
 	} else {
 		backendResp, err = GetClientWithRetry(c, requestBodyData, destURL)
 	}
@@ -389,7 +388,7 @@ func CommonCompletions(c *gin.Context, extractor PromptExtractor, completionEndp
 	defer backendResp.Body.Close()

 	if isSplitwise {
-		metrics.InferenceRequests.WithLabelValues("", releaseTargets[0], destURL, strconv.Itoa(backendResp.StatusCode)).Inc()
+		metrics.InferenceRequests.WithLabelValues("", prefillURL, decodeURL, strconv.Itoa(backendResp.StatusCode)).Inc()
 	} else {
 		metrics.InferenceRequests.WithLabelValues(destURL, "", "", strconv.Itoa(backendResp.StatusCode)).Inc()
 	}
@@ -203,7 +203,7 @@ func TestPostToPD(t *testing.T) {
 		}))
 		defer decodeServer.Close()

-		resp, err := PostToPD(c, decodeServer.URL, prefillServer.URL, reqBody, false, "chat/completions")
+		resp, err := PostToPD(c, decodeServer.URL, prefillServer.URL, reqBody, false, "test message", "chat/completions")
 		assert.NoError(t, err)
 		assert.Equal(t, http.StatusOK, resp.StatusCode)
 		assert.NotNil(t, resp)
@@ -217,7 +217,7 @@ func TestPostToPD(t *testing.T) {
 		defer prefillServer.Close()

 		// Use invalid URL to simulate connection error
-		resp, err := PostToPD(c, "http://invalid-server:9999", prefillServer.URL, reqBody, false, "chat/completions")
+		resp, err := PostToPD(c, "http://invalid-server:9999", prefillServer.URL, reqBody, false, "test message", "chat/completions")
 		assert.Error(t, err)
 		assert.Nil(t, resp)
 	})
@@ -229,7 +229,7 @@ func TestPostToPD(t *testing.T) {
 		defer decodeServer.Close()

 		// Use invalid URL to simulate connection error
-		resp, err := PostToPD(c, decodeServer.URL, "http://invalid-server:9999", reqBody, false, "chat/completions")
+		resp, err := PostToPD(c, decodeServer.URL, "http://invalid-server:9999", reqBody, false, "test message", "chat/completions")
 		assert.Error(t, err)
 		assert.Nil(t, resp)
 	})
@@ -336,7 +336,7 @@ func TestReadPrefillRecv(t *testing.T) {
 	t.Run("nil response handling", func(t *testing.T) {
 		ctx := context.Background()
 		// Should handle nil response gracefully without panic
-		readPrefillRecv(ctx, "test-url", false, nil)
+		readPrefillRecv(ctx, "test-url", false, "test message", nil)
 	})

 	t.Run("nil response body handling", func(t *testing.T) {
@@ -347,7 +347,7 @@ func TestReadPrefillRecv(t *testing.T) {
 			Body:       nil,
 		}
 		// Should handle nil body gracefully without panic
-		readPrefillRecv(ctx, "test-url", false, resp)
+		readPrefillRecv(ctx, "test-url", false, "test message", resp)
 	})

 	t.Run("mock response without scheduler dependency", func(t *testing.T) {
@@ -361,7 +361,7 @@ func TestReadPrefillRecv(t *testing.T) {

 		// This test verifies basic error handling and response body consumption
 		// without triggering scheduler initialization requirements
-		readPrefillRecv(ctx, "test-url", false, resp)
+		readPrefillRecv(ctx, "test-url", false, "test message", resp)
 	})
 }