Mirror of https://github.com/PaddlePaddle/FastDeploy.git, synced 2026-04-23 00:17:25 +08:00
[Optimization] enable trtllm_all_reduce fusion kernel in glm model (#6660)
* enable trtllm_all_reduce fusion kernel in glm model
* fix conflict
* format update
* fix a bug
* modify test
* modify test
* support empty tensor and modify test
* fix test_linear config issues
* modify test name
* add edge test case
* modify format
* fix conflict
* modify default max token num in trtllm_allreduce_fusion
* add max token num branch for trtllm_allreduce_fusion
* fix format
* fix rmsnorm config issue
* modify 2025 to 2026
* use compat guard
* lazily import flashinfer.comm and fix test config issue
* fix test issues
* add flashinfer cache dir cleanup mechanism
* fix some issues
@@ -39,6 +39,7 @@ def _make_cfg(**ov):
     pc.use_internode_ll_two_stage = pc.disable_sequence_parallel_moe = False
     pc.shutdown_comm_group_if_worker_idle = False
     pc.ep_prefill_use_worst_num_tokens = False
+    pc.enable_flashinfer_allreduce_fusion = False
     sc = ns(max_num_seqs=256, max_num_batched_tokens=4096, splitwise_role="mixed", name="local")
     sc.enable_overlap_schedule = False
     cc = ns(num_gpu_blocks_override=None, gpu_memory_utilization=0.9, block_size=16, enc_dec_block_num=0)