Most single-GPU and small-model deployments do not need 64MB custom
all-reduce buffers. Lowering the default to 8MB reduces unnecessary
shared memory allocation. Tests that require larger buffers now
explicitly set the value.
Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>