[FDConfig] Reduce FD_CUSTOM_AR_MAX_SIZE_MB default from 64 to 8 (#6997)

Most single-GPU and small-model deployments do not need 64MB custom
all-reduce buffers. Lowering the default to 8MB reduces unnecessary
shared memory allocation. Tests that require larger buffers now
explicitly set the value.

Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
gongweibao
2026-03-25 17:40:01 +08:00
committed by GitHub
parent 7a6c28781b
commit 48cfb608aa
2 changed files with 3 additions and 3 deletions
+2 -2
View File
@@ -225,10 +225,10 @@ environment_variables: dict[str, Callable[[], Any]] = {
"FD_WORKER_ALIVE_TIMEOUT": lambda: int(os.getenv("FD_WORKER_ALIVE_TIMEOUT", "30")),
# File path for file storage backend
"FILE_BACKEND_STORAGE_DIR": lambda: str(os.getenv("FILE_BACKEND_STORAGE_DIR", "/tmp/fastdeploy")),
# Custom all-reduce max buffer size in MB (default 64MB).
# Custom all-reduce max buffer size in MB (default 8MB).
# Increase this to avoid NCCL fallback for large tensors in deterministic mode.
# E.g. FD_CUSTOM_AR_MAX_SIZE_MB=128 for 128MB.
"FD_CUSTOM_AR_MAX_SIZE_MB": lambda: int(os.getenv("FD_CUSTOM_AR_MAX_SIZE_MB", "64")),
"FD_CUSTOM_AR_MAX_SIZE_MB": lambda: int(os.getenv("FD_CUSTOM_AR_MAX_SIZE_MB", "8")),
# Enable deterministic inference mode for chunked prefill alignment
"FD_DETERMINISTIC_MODE": lambda: bool(int(os.getenv("FD_DETERMINISTIC_MODE", "0"))),
# Split KV block size for deterministic alignment (must be power of 2 and > 0, default 16)