mpgemm
1a1d048774
[Feature] Support NVFP4 Flashinfer-cutedsl MoE on SM100 ( #6963 )
2026-03-30 11:37:04 +08:00
RichardWooSJTU
4ed483d20b
[BugFix] Fix ep compatibility issues & Optimize permute operator ( #6821 )
...
* fix ep compatibility issues & optimize permute operator
* fix ut
* fix ut
2026-03-17 10:32:11 +08:00
RichardWooSJTU
9f0778f991
[Feature] Support EP prefill with num_worst_tokens ( #6574 )
...
* support num worst tokens
* support num worst tokens
* fix build error
* support num worst tokens: fix errors
* support num worst tokens: fix field
* support num worst tokens: delete requirements
* replace permute and depermute op by pure cuda
* replace permute and depermute op by pure cuda
* fix ci
* fix op
* fix nan
* fix code style
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2026-03-11 17:09:07 +08:00
wangyifei
b57c960837
CUDA 13.0: implement changes to CCCL ( #6751 )
2026-03-10 16:47:02 +08:00
gongweibao
ddb06ff83f
init ( #6642 )
...
Co-authored-by: gongweibao <gognweibao@baidu.com >
2026-03-04 21:55:31 +08:00
JYChen
c745a22420
[Feature] Support Ernie FP8 on sm100 (the fixed version) ( #6304 )
2026-02-03 17:47:38 +08:00
JYChen
6c685c9474
Revert "[Feature] Support Ernie FP8 on sm100 ( #5593 )" ( #6275 )
...
This reverts commit eb80724b71 .
2026-01-30 11:22:01 +08:00
JYChen
eb80724b71
[Feature] Support Ernie FP8 on sm100 ( #5593 )
...
* DeepGEMM provisionally working version
* dense part: e8m0 ok
* version where the EB model runs end-to-end with E8M0
* code check
* support 21b-tp2, dev_paddle
* single-node 4.5T ep working version
* restore mistakenly deleted code; single-node 4.5T ep (non-cudagraph)
* eb tp
* Support SM100 block-wise FP8 inference
* refine codes, support deepgemm on sm100
* add thirdparty PFCC/DeepGEMM
* fix ep decode
* use deepep ue8m0 to resolve accuracy issues
* fix FP8 TP accuracy
* upgrade DeepGEMM to adapt the Hopper logic
* add ue8m0 kernel
* add ue8m0 kernel
* fix custom_ops/gpu_ops/cpp_extensions.cc
* eb output is normal
* eb5 text is right
* accuracy visually consistent
* self-tested accuracy aligned
* replace masked_per_token_quant; ep accuracy OK
* performance improved by ~30%
* ep runs for now but still has issues
* self-test consistent
* rm test fun
* fix ep event
* update DeepGEMM in graph-optimization operators
* fix build
* temporarily work around the DeepGEMM CI build issue
* select the DeepGEMM version by SM architecture
* remove useless code
---------
Co-authored-by: ckl117 <ckl117@163.com >
Co-authored-by: K11OntheBoat <ruianmaidanglao@163.com >
Co-authored-by: fxyfxy777 <fxyfxy777@163.com >
2026-01-29 13:49:54 +08:00
lizexu123
6619298b50
【Optim】Optimize grid dimensions using max_tokens_per_expert for MoE models ( #6007 )
...
* update w4afp8
* build.sh ok
* support cuda_graph
* fix
* add test
* fix max_tokens_per_expert
* >=70
* fix
* compute_max_tokens_from_prefix_sum in w4afp8
* compute_max_tokens use cub
2026-01-15 19:18:42 +08:00
xiaoxiaohehe001
00a01ae024
[Feature] Support redundant expert for eplb ( #5918 )
...
* [BugFix] support redundant expert for eplb
* support redundant expert for eplb
* support redundant expert for eplb
* update
* fix ci eplb
2026-01-09 17:13:24 +08:00
周周周
f15df1ec89
Revert cuda check ( #5915 )
...
* commit
* commit
2026-01-07 14:40:18 +08:00
Yuanle Liu
5e729bc2ba
[OPs] ep_moe_expert_dispatch.cu dispatch num_experts_per_rank 5 ( #5890 )
2026-01-06 10:39:35 +08:00
周周周
ab553b3b8b
revert cuda_check ( #5883 )
2026-01-05 20:51:31 +08:00
lizexu123
1d3ae7c024
[BugFix] fix w4afp8 tp=8 ( #5868 )
...
* fix w4afp8 tp=8
* fix
2026-01-05 18:59:02 +08:00
周周周
e3957a5ebc
[Others] remove template NUM_EXPERTS_PER_RANK in permute_x_fp8_kernel ( #5620 )
2026-01-04 11:21:15 +08:00
Sunny-bot1
598d292a69
w4afp8 fix quant ( #5830 )
2025-12-30 21:16:13 +08:00
Longzhi Wang
11329ee35e
[Model] support mode config for expert_dispatch ( #5748 )
2025-12-29 13:37:20 +08:00
Ryan
724045c426
add some op infershape&dtype ( #5762 )
2025-12-26 16:17:39 +08:00
周周周
a36d60aa18
[FIX BUG] fix bug in TP in permute_x_fp8_kernel ( #5350 )
...
* commit
* commit
* commit
* commit
* commit
* commit
2025-12-03 05:17:37 -08:00
Sunny-bot1
d5a9b75b4e
fix cutlass ep ( #5337 )
2025-12-03 14:06:01 +08:00
Sunny-bot1
3629db4129
[Quantization] Support w4afp8 MoE dynamic quantization ( #5282 )
...
* support dynamic activation quant for w4afp8
* support dynamic w4afp8
* add test
* fix
* fix
---------
Co-authored-by: zhoutianzi666 <17801055074@163.com >
2025-12-02 18:56:16 +08:00
周周周
fb7f951612
[UNITEST] add test ( #5305 )
2025-12-02 17:59:01 +08:00
chen
aa35ce449d
[Optimization] EP empty_input_forward Remove Communication ( #5254 )
2025-12-01 21:10:40 +08:00
周周周
95243f012c
[Others] add PADDLE_ENFORCE ( #5288 )
2025-11-28 14:23:35 +08:00
Jundong Liu
147b2e5eb0
[BugFix] Fix zero workspace returned by CUB size query under CUDA Graph in MoE dispatch ( #5087 )
...
* fix bug about CubKeyValueSorter::run
* pre-commit and add comment
* pre-commit
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* fix precommit
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2025-11-20 20:00:29 +08:00
yangjianfengo1
3afb717995
【Fix】fix deepep dispatch ( #5036 )
...
* fix dispatch
* fix dispatch
---------
Co-authored-by: yuanxiaolan <yuanxiaolan01@baidu.com >
2025-11-17 10:34:01 +08:00
yangjianfengo1
ae7bee8122
【New Feature】W4afp8 supports per group quantization ( #4987 )
...
* w4afp8 supports per-group quantization
* code style
* fix transpose
* revert fast hadamard
---------
Co-authored-by: yuanxiaolan <yuanxiaolan01@baidu.com >
Co-authored-by: plusNew001 <95567040+plusNew001@users.noreply.github.com >
2025-11-13 19:17:27 +08:00
YuBaoku
819b2dbbae
Revert "【New Feature】W4afp8 supports per group quantization ( #4272 )" ( #4854 )
...
This reverts commit 93fcf7e4ec .
2025-11-06 17:48:28 +08:00
yangjianfengo1
93fcf7e4ec
【New Feature】W4afp8 supports per group quantization ( #4272 )
...
* w4afp8 supports per-group quantization
* code style
* accuracy work complete
* revert append attn utils
* ffn1 dynamic quantization
* ffn2 supports dynamic quantization
* code style
* code style
* update unit tests
* update unit tests
* fix bug
* Implement conditional parameter creation for layers
Add parameter creation for up_gate_proj_in_scale when ep_size > 1.
* code style
* fix conflict
* code style
* code style
* fix w4aint8 accuracy
* fix ci
---------
Co-authored-by: yuanxiaolan <yuanxiaolan01@baidu.com >
2025-11-05 21:00:23 +08:00
Zhenghai Zhang
1712e1351b
【Hackathon 9th No.86】autogen MoeFastHardamardImplWrapper template_instantiation ( #4592 )
...
* autogen MoeFastHardamardImplWrapper template_instantiation
* fix codestyle
* fix codestyle
* add impl cu files
2025-10-30 10:28:36 +08:00
Haonan Luo
1b9f351d21
Support GPT-OSS-BF16 ( #4240 )
...
* [Feature] AppendAtten support sinks & HEAD_DIM=64
* fix bug
* fix bug
* fix bug
* fix bug
* [Feature] support gpt-oss
* fix bug
* add mask
* support-gpt-oss
* support-gpt-oss
* fix long seq
* support wint8
* support wint8
* support wint8
* update test
* change sliding windows init pos
---------
Co-authored-by: ming1753 <ideaminghp@163.com >
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
Co-authored-by: ming1753 <61511741+ming1753@users.noreply.github.com >
2025-10-20 14:44:58 +08:00
gaoziyuan
896e3bb606
[NewFeature] add ep rollout model init and update/clear ep buffer ( #4039 )
...
* fix gid
* merge
* fix test
* fix bug
* fix
* fix ci
2025-09-17 20:24:53 +08:00
Sunny-bot1
442543cd6b
fix ep wint8 ( #4102 )
2025-09-16 11:05:33 +08:00
Ayakouji
453487d5b0
[Feat] ernie4_5_vl_moe support CudaGraph ( #3226 )
...
* delete dynamic control flow for decode
* coda-style
* fix scatter/gather typos and use input stream instead default stream
* support 0-Size Tensor
* update runner and model
* using static mem address as input
* fix mem leak
* refine code
* update mm_buffer
* fix typo
* fix buffersize
* fix unk token
* refine code
* refine
* support other arch
* open cudagraph in vlci
* fix
* update
* update
* update
* fix cmd
* update
---------
Co-authored-by: aquagull <hongyuh@qq.com >
Co-authored-by: Yuanle Liu <yuanlehome@163.com >
2025-09-10 13:11:57 +08:00
周周周
dbab579299
clean code ( #4020 )
2025-09-10 10:56:15 +08:00
co63oc
2033450391
rename ep_moe_prefill_func to ep_moe_expert_dispatch ( #3938 )
2025-09-08 15:19:28 +08:00
Yuan Xiaolan
2cf55168ca
load hadamard_block_size from config ( #3797 )
2025-09-05 17:07:58 +08:00
co63oc
5441538173
rename fused_get_rope.cu ( #3752 )
...
* rename fused_get_rope.cu
* fix
* fix typos
* fix
* fix
2025-09-03 10:54:34 +08:00
co63oc
d6369b4d51
fix typos ( #3684 )
2025-09-01 17:50:17 +08:00
yangjianfengo1
e81046fdad
【New Feature】Centralized deployment support for w4afp8 ( #3644 )
...
* support tp w4afp8
* code style
2025-08-28 10:53:24 +08:00
gaoziyuan
82e64b13e1
[NewFeature]Support dp multi api server && Fix some bug in mixed ep && merge develop ( #3598 )
...
* [Feature] update ep
* fix ci
* fix ci
* fix ci
* fix ci
* fix ci
* fix ci
* fix ci
* fix queue ports idx
* fix ci
* fix ci
* fix ci
* fix ci
* fix ci
* fix ci
* fix ci
* fix ci
* Update engine.py
* fix ci
* fix some bug in mixed ep
* add server fix and op fix
* rm some log
* fix code style
* ltd fix
* fix
* fix
* fix some bug
* fix bug
* fix bug
* fix style
* Update config.py
* Update splitwise_connector.py
* Update cache_messager.py
* Update __init__.py
* merge and fix
* Update engine.py
* Update common_engine.py
* Update run_ci_xpu.sh
* Update ernie_processor.py
* Update ernie_processor.py
---------
Co-authored-by: ltd0924 <ltd0924@sina.com >
Co-authored-by: ltd0924 <32387785+ltd0924@users.noreply.github.com >
2025-08-26 19:59:02 +08:00
lzy
d339df2e90
Supports DP+TP+EP hybrid parallel deployment strategy ( #3489 )
...
* Support DP+TP+EP hybrid parallel deployment strategy
* Support DP+TP+EP hybrid parallel deployment strategy
* fix conflict
* add moe_tp_ep function split_allgather_out
* del tp_group in moe_cutlass_backend
* for ci
* fix parallel_config for ci
* del log
2025-08-26 00:04:01 -07:00
Yuan Xiaolan
9205c88da1
support w4afp8 EP inference ( #3044 )
2025-08-25 11:27:45 +08:00
Sunny-bot1
6c1f3ff897
topk_gating_softmax support bias ( #3405 )
2025-08-15 11:57:45 +08:00
Sunny-bot1
2e7831185f
[Optimize]Add norm_weights feature for topk_gating_softmax ( #3372 )
2025-08-14 15:05:23 +08:00
Sunny-bot1
8224b21525
Refactor moe_topk_select op to use apply_norm_weight as a template parameter ( #3345 )
...
* Refactor moe_topk_select op to use apply_norm_weight as a template parameter
* update test
2025-08-13 08:44:16 +08:00
Yuan Xiaolan
7ce00e597c
support qk norm ( #3145 )
2025-08-05 16:46:14 +08:00
AIbin
22fe695f1c
【Inference Optimize】Support automatic generation of marlin kernel ( #3149 )
...
* Support automatic generation of marlin kernel
2025-08-01 22:43:18 +08:00
chen
a2f5cc54f8
moe preprocess op supports 160 experts; add K to fused_moe triton kernel name ( #3121 )
2025-08-01 10:46:20 +08:00
Yuan Xiaolan
7d87aaace8
optimize w4a8 decoding ( #3050 )
2025-07-28 22:20:13 +08:00