[Optimization] enable trtllm_all_reduce fusion kernel in glm model (#6660)

* enable trtllm_all_reduce fusion kernel in glm model

* fix conflict

* format update

* fix a bug

* modify test

* modify test

* support empty tensor and modify test

* fix test_linear config issues

* modify test name

* add edge test case

* modify format

* fix conflict

* modify default max token num in trtllm_allreduce_fusion

* add max token num branch for trtllm_allreduce_fusion

* fix format

* fix rmsnorm config issue

* modify 2025 to 2026

* using compat guard

* Lazily import flashinfer.comm and fix test config issue

* fix test issues

* add flashinfer cache dir clean mechanism

* fix some issues
This commit is contained in:
Bingoo
2026-04-16 14:10:19 +08:00
committed by GitHub
parent e53f5184ac
commit 6b891da02b
17 changed files with 871 additions and 11 deletions
@@ -39,6 +39,7 @@ def _make_cfg(**ov):
     pc.use_internode_ll_two_stage = pc.disable_sequence_parallel_moe = False
     pc.shutdown_comm_group_if_worker_idle = False
     pc.ep_prefill_use_worst_num_tokens = False
+    pc.enable_flashinfer_allreduce_fusion = False
     sc = ns(max_num_seqs=256, max_num_batched_tokens=4096, splitwise_role="mixed", name="local")
     sc.enable_overlap_schedule = False
     cc = ns(num_gpu_blocks_override=None, gpu_memory_utilization=0.9, block_size=16, enc_dec_block_num=0)
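
The flag added in this hunk, together with the "max token num branch" commit above, gates whether the fused kernel is dispatched. A hedged sketch of that dispatch logic, with illustrative attribute and path names (only `enable_flashinfer_allreduce_fusion` and the kernel name `trtllm_allreduce_fusion` come from this change; the cap attribute and return labels are assumptions):

```python
from types import SimpleNamespace


def select_allreduce_path(num_tokens, cfg, flashinfer_comm=None):
    """Choose between the fused and default all-reduce paths.

    Hypothetical dispatch mirroring the commit's branches: use the
    trtllm_allreduce_fusion kernel only when the config flag is on,
    flashinfer.comm imported successfully, and the batch fits under
    the max-token-num cap; otherwise fall back.
    """
    if (cfg.enable_flashinfer_allreduce_fusion
            and flashinfer_comm is not None
            and num_tokens <= cfg.allreduce_fusion_max_token_num):
        return "trtllm_allreduce_fusion"
    return "default_allreduce"


# Example config; the cap value here is arbitrary, not from the PR.
cfg = SimpleNamespace(enable_flashinfer_allreduce_fusion=True,
                      allreduce_fusion_max_token_num=4096)
```

With the flag left at its default `False` (as in the test config above), every call takes the default path, so existing tests are unaffected.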