[Optimization] enable trtllm_all_reduce fusion kernel in glm model (#6660)

* enable trtllm_all_reduce fusion kernel in glm model

* fix conflict

* format update

* fix a bug

* modify test

* modify test

* support empty tensor and modify test

* fix test_linear config issues

* modify test name

* add edge test case

* modify format

* fix conflict

* modify default max token num in trtllm_allreduce_fusion

* add max token num branch for trtllm_allreduce_fusion

* fix format

* fix rmsnorm config issue

* modify 2025 to 2026

* using compat guard

* Lazily import flashinfer.comm and fix test config issue

* fix test issues

* add flashinfer cache dir clean mechanism

* fix some issues
This commit is contained in:
Bingoo
2026-04-16 14:10:19 +08:00
committed by GitHub
parent e53f5184ac
commit 6b891da02b
17 changed files with 871 additions and 11 deletions
@@ -39,6 +39,7 @@ def _make_cfg(**ov):
     pc.use_internode_ll_two_stage = pc.disable_sequence_parallel_moe = False
     pc.shutdown_comm_group_if_worker_idle = False
     pc.ep_prefill_use_worst_num_tokens = False
+    pc.enable_flashinfer_allreduce_fusion = False
     sc = ns(max_num_seqs=256, max_num_batched_tokens=4096, splitwise_role="mixed", name="local")
     sc.enable_overlap_schedule = False
     cc = ns(num_gpu_blocks_override=None, gpu_memory_utilization=0.9, block_size=16, enc_dec_block_num=0)
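
The flag added in this hunk, together with the "max token num branch" commit above, gates whether the fused kernel is dispatched. A hedged sketch of that dispatch logic, with illustrative attribute and path names (only `enable_flashinfer_allreduce_fusion` and the kernel name `trtllm_allreduce_fusion` come from this change; the cap attribute and return labels are assumptions):

```python
from types import SimpleNamespace


def select_allreduce_path(num_tokens, cfg, flashinfer_comm=None):
    """Choose between the fused and default all-reduce paths.

    Hypothetical dispatch mirroring the commit's branches: use the
    trtllm_allreduce_fusion kernel only when the config flag is on,
    flashinfer.comm imported successfully, and the batch fits under
    the max-token-num cap; otherwise fall back.
    """
    if (cfg.enable_flashinfer_allreduce_fusion
            and flashinfer_comm is not None
            and num_tokens <= cfg.allreduce_fusion_max_token_num):
        return "trtllm_allreduce_fusion"
    return "default_allreduce"


# Example config; the cap value here is arbitrary, not from the PR.
cfg = SimpleNamespace(enable_flashinfer_allreduce_fusion=True,
                      allreduce_fusion_max_token_num=4096)
```

With the flag left at its default `False` (as in the test config above), every call takes the default path, so existing tests are unaffected.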