Commit Graph

214 Commits

Author SHA1 Message Date
AIbin 6ce4854714 [Feature] Support MOE Cutlass backend for latent MOE (#7428)
* support moe cutlass backend  latent moe
2026-04-16 22:11:49 +08:00
周周周 5e54770b2e [Feature] 添加 MoE 层 latent mode 支持 (#7382) 2026-04-15 13:57:07 +08:00
周周周 73bd4ab318 [Feature] 为 FusedMoE 添加 hidden_size 显式参数支持 (#7361)
[Feature] 为 FusedMoE 添加 hidden_size 显式参数支持
2026-04-13 20:24:58 +08:00
liuruyan b34708604c [TI-consistent] support quant use pow2scale (#7308)
* support quant use pow2scale

* fix

* fix
2026-04-13 00:01:53 -07:00
Nyako Shigure d659099415 [Cleanup] Replace torch proxy alias with public compat API (#7348) 2026-04-13 11:43:26 +08:00
周周周 225fc8d222 use self.hidden_size not use self.fd_config.model_config.hidden_size (#7340) 2026-04-11 22:39:43 +08:00
chen 4982aa000e [RL]moe bf16 ep support paddle batch_gemm (#7337)
* moe bf16 ep support paddle batch_gemm
2026-04-11 21:51:12 +08:00
fxyfxy777 39ff38aba1 [OP]Unify MoE op with moe_permute path for bf16 GLM (#7164) 2026-04-09 16:17:56 +08:00
JYChen 43ace7af25 [RL] support moe-topk use topk_reduce_func (#7218)
* support moe-topk use topk_reduce_func

* fix ep error

* fix ut

* fix ut
2026-04-09 11:01:03 +08:00
lizhenyun01 446b26bbc0 [Feature] support blackwell gemm in ht (#7053)
* [Feature] support blackwell gemm in ht

* [Feature] support ops for convert

* fix cuda error 716

* fix cuda error

* opt memory

* remove unused code
2026-04-07 19:52:51 +08:00
lizexu123 5f612a348d [BugFix] fix flashinfer-cutedsl moe nvfp4 (#7120)
* fix nvfp4

* fix

* add document

* fix nvfp4

* support eb5

* support bka

* support eb5

* support xpu

* fix

* fix

* add import cutedsl

* fix

* fix

* fix test

* fix H卡

* update document

* fix

* update document

* update document

* fix
2026-04-03 15:43:19 +08:00
zhangbo9674 5c60e2fc6f fix bug in cudagraph (#7069) 2026-03-30 14:24:23 +08:00
mpgemm 1a1d048774 [Feature] Support NVFP4 Flashinfer-cutedsl MoE on SM100 (#6963) 2026-03-30 11:37:04 +08:00
SUN Dong 6cff780fdb [RL] Support moe_topk_select using Paddle native operators and Add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization and swiglu-fp8-quant op for DeepGemmFusedMoE for training alignment (#6850)
* [RL] Add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization

* update

* update

* update

* support custom topk inDeepGemmFusedMoeMethod  apply_tp

* apply_ep_prefill support moe_topk_select

* update

* add ut

* add ut

* add ut

* modity doc

* fix env and docs

* add ut

---------

Co-authored-by: zhanghonggeng <zhanghonggeng@baidu.com>
2026-03-24 11:12:39 +08:00
Nyakku Shigure 8b6bbb3504 [Optimization] Use a separate driver when using Triton with Paddle (#6897)
---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-03-24 10:56:00 +08:00
sunxin d77edf8fc9 opt wfp8afp8 triton moe (#6938) 2026-03-20 11:07:25 +08:00
周周周 b1c800b64b remove load_up_proj_weight_first (#6932) 2026-03-19 17:21:34 +08:00
周周周 c184a7cb69 remove source in weight_loader in moe.py (#6892) 2026-03-19 13:31:43 +08:00
yzwu 8b890c0d72 [Iluvatar] refactor attn and moe code (#6887) 2026-03-18 10:31:00 +08:00
Longzhi Wang daaf498213 [Feature] support compute shared experts before combine for better overlap (#6697)
* [Feature] support compute shared experts before combine for better overlap

* fix test

* fix xpu

* fix
2026-03-17 15:18:51 +08:00
周周周 ea998dd26f clean clean code in _load_per_tensor_weight_scale (#6868)
Co-authored-by: “liuruian” <liuruian@baidu.com>
2026-03-17 14:06:57 +08:00
RichardWooSJTU 4ed483d20b [BugFix] Fix ep compatibility issues & Optimize permute operator (#6821)
* fix ep compatibility issues & optimize permute operator

* fix ut

* fix ut
2026-03-17 10:32:11 +08:00
fxyfxy777 4d39232553 [BugFix] add ut for fused_moe_degemm (#6840)
* add ut

* add skip
2026-03-16 12:22:18 +08:00
liufengwei0103 62110045f3 [RL] add stream guard (#6814)
* add stream guard

* format
2026-03-13 11:22:26 +08:00
fxyfxy777 250ce40b40 [Feature] use phi permute/unpermute & rm swiglu (#6361)
* tp文字输出正常

* B eb5 mini文字输出正常

* eb5mini ep B卡 文字输出正常

* default use phi moe op

* stash

* tp H卡正常

* ep ok

* rm debug

* rm debug tool

* rm del ffn_out

* rm swiglu

* add envs to swiglu

* merge dev

* fix ci baseline

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix ci baseline 2

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 02:01:57 -07:00
RAM cdaf6dd400 [RL][Cherry-Pick] Support Fully Async and PrefixCache (#6599)
* cherry-pick  Support Fully Async and PrefixCache step 1

* copy routing_indices_cache.py from 2.4

* cherry-pick [RL] R3 Fix the bug for determining the end of a request (#6388)

* cherry-pick [RL] Clear Requests status of R3 (#6569)

* delete code

* fix rename bug

* fix status shape bug

* fix ci
2026-03-12 01:13:30 -07:00
RichardWooSJTU 9f0778f991 [Feature] Support EP prefill with num_worst_tokens (#6574)
* support num worst tokens

* support num worst tokens

* fix build error

* support num worst tokens: fix errors

* support num worst tokens: fix feild

* support num worst tokens: delete requiements

* replace permute and depermute op by pure cuda

* replace permute and depermute op by pure cuda

* fix ci

* fix op

* fix nan

* fix code style

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-03-11 17:09:07 +08:00
bukejiyu 598cce8545 [RL] Support SM100 FP8 quantization in RL (#6601)
* RL SM100 Fix

* update
2026-03-04 04:55:04 -08:00
RichardWooSJTU 61789febb9 [Quantization] Support to load static quant ue8m0 scale of DeepGEMM via v0_loader (#6433)
* support to load static quant ue8m0 scale of deepgemm via v0_loader

* [Fix] Fix ue8m0 scale pack dimension calculation and block size validation

1. Fix pack dimension calculation in fused_moe_triton_backend.py:
   - Changed from `ceil_div(...) // 4` to `(num_scales + 3) // 4` for correct ceiling division
   - This ensures sufficient pack allocation when num_scales is not a multiple of 4

2. Fix block size hardcoding in block_wise_fp8.py:
   - Use `self.quant_config.weight_block_size` instead of hardcoded `[128, 128]`
   - Add assertion to ensure weight_block_size is `[128, 128]` for ue8m0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 11:32:35 +08:00
RichardWooSJTU 7bd86f99a5 [BugFix] Fix tbo nan (#6439) 2026-03-02 14:28:48 +08:00
yzwu 6674131b0b [Iluvatar] Support CudaGraph and optimize flash_attn_unpadded and fused_neox_rope_embedding (#6553) 2026-03-02 14:07:17 +08:00
RichardWooSJTU 7cfb0ffba0 fix pfcc deep ep in low latency mode (#6440) 2026-03-02 10:35:51 +08:00
Weiguo Zhu 8fb24122b8 fix reshard error (#6536) 2026-02-27 22:22:37 +08:00
sunxin 53aaac69da [Optimization] Enable BF16 gate computation for GLM and Qwen (#6457)
* gate bf16

* add gate-fp32

* fix

* update baseline

* update

* update

* fix
2026-02-26 21:08:46 -08:00
AIbin 0eb87467f8 [BugFix]fix RL bug about blockwisefp8 (#6466)
* fix RL bug about blockwisefp8

* fix  moe same bug

* fix RL FP8 bug
2026-02-12 09:15:29 +08:00
bukejiyu dc5917289d [loader]supoort wint2 backend (#6139)
* support wint2

* update
2026-02-08 22:42:36 -08:00
Mattheliu c776d483e4 [BugFix]fix handle 4 return values from noaux_tc_redundant op (#6384)
* fix: handle 4 return values from noaux_tc_redundant op

The noaux_tc_redundant CUDA op is defined with 4 outputs in PD_BUILD_STATIC_OP:
- output_tensor (scores)
- topk_values
- topk_indices
- tokens_per_expert_stats_list_out (inplace updated)

The Python code was only unpacking 3 values, causing:
  ValueError: too many values to unpack (expected 3)

This fix correctly unpacks all 4 return values, ignoring the inplace
updated tensor which is the same as the input tokens_per_expert_stats_list.

Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com>

* fix: make noaux_tc_redundant return 4 values to match OP definition

The PD_BUILD_STATIC_OP defines 4 outputs but the function only returned 3,
causing inconsistent behavior across different Paddle framework versions.

This fix explicitly returns 4 values:
- scores (inplace modified)
- topk_values
- topk_indices
- tokens_per_expert_stats_list (inplace modified via atomicAdd)

Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com>

---------

Co-authored-by: Claude (Claude Opus 4.5) <noreply@anthropic.com>
2026-02-09 13:17:47 +08:00
JYChen 9bcd863902 [Others] support import deepgemm/deepep from fleet ops (#6351)
* update paddleformers to v1.0

* only change import fleetpath
2026-02-09 11:53:13 +08:00
fxyfxy777 36547cfdb3 [Feature] FD_USE_PHI_FP8_QUANT (#6320)
* add ut

* add use_fd_quant env

* rm mask_per_token_quant

* add make ops list

* USE_FD_FP8_QUANT -> FD_USE_PHI_FP8_QUANT 默认是true

* modify comments

* use bool type

* Add function declaration
2026-02-03 22:33:03 -08:00
RAM 5b22e5dfe7 [RL] R3 Support Fused Put the Routing of All Layers (#6099)
* fused put routing

* fix bug

* [draft commit]dynamic dtype

* fix async put & numpy bug

* fix unit8 test case
2026-02-03 04:13:16 -08:00
JYChen c745a22420 [Feature] Support Ernie FP8 on sm100 ( the fixed version) (#6304) 2026-02-03 17:47:38 +08:00
周周周 cbdb2462ea cp 1131 tbo to develop (#6281) 2026-02-03 15:23:23 +08:00
fxyfxy777 f3413c4caa [BugFix] fix fused_mask_swiglu_fp8_quant bug (#6316)
* optimize mask_quant op speed up 1.5

* fix calculate sequence

* add fused

* rm log

* push kernel code

* add ut

* accuracy ok

* add ue8m0

* add ut

* add merge develop

* rm ut of mask_per_token_quant

* Revert "[Optimize] optimize mask_quant & swiglu (#6222)"

This reverts commit 2ada119a38.

* add block_size

* pre-commit
2026-02-03 13:54:12 +08:00
fxyfxy777 2ada119a38 [Optimize] optimize mask_quant & swiglu (#6222)
* optimize mask_quant op speed up 1.5

* fix calculate sequence

* add fused

* rm log

* push kernel code

* add ut

* accuracy ok

* add ue8m0

* add ut

* add merge develop

* rm ut of mask_per_token_quant
2026-02-02 13:52:38 +08:00
JYChen 6c685c9474 Revert "[Feature] Support Ernie FP8 on sm100 (#5593)" (#6275)
This reverts commit eb80724b71.
2026-01-30 11:22:01 +08:00
yuxuan 44b52701f6 [Feature] Support NVFP4 MoE on SM100 (#6003)
* fp4 dense

* [WIP] support nvfp4, dense part

* [wip] developing loading qwen model

* loading

* update

* dense fp4 OK, cudagraph error

* [WIP] moe forward part

* with flashinfer-backend

* qwen3_moe_fp4

* update

* support flashinfer-cutlass moe, qwen3-moe-fp4 OK

* support ernie4.5-fp4

* fix load error

* add some ut

* add docs

* fix CLA, test

* fix the apply() in ModelOptNvFp4FusedMoE

* fix CodeStyle

* del the PADDLE_COMPATIBLE_API

* fix broken url: nvidia_gpu.md

* fix docs

* fix token_ids

* fix CI in Hopper

* move flashinfer imports inside the function

* fix model_runner

Removed the logic for generating random padding IDs.

* Remove skip condition for CUDA version in nvfp4 test

* add test for nvfp4

* fix according to review

* Add Chinese translation link to NVFP4 documentation

* del flashinfer.py

* fix unittest

---------

Co-authored-by: zoooo0820 <zoooo0820@qq.com>
Co-authored-by: bukejiyu <395822456@qq.com>
2026-01-29 14:16:07 +08:00
JYChen eb80724b71 [Feature] Support Ernie FP8 on sm100 (#5593)
* Deepgemm暂时可用版本

* dense部分 e8m0 ok

* EB模型E8M0跑通的版本

* code check

* support 21b-tp2, dev_paddle

* 单机4.5T ep OK的版本

* 修复删除的代码,单机4.5T ep(非cudagraph)

* eb tp

* Support SM100 block-wise FP8 inference

* refine codes, support deepgemm on sm100

* add thirdparty PFCC/DeepGEMM

* fix ep decode

* 使用deepep ue8m0, 解决精度问题

* 修复FP8 TP精度

* Deepgemm升级适配Hopper逻辑

* add ue8m0 kernel

* add ue8m0 kernel

* fix custom_ops/gpu_ops/cpp_extensions.cc

* eb 输出正常

* eb5 text is right

* 目测精度一致

* 自测精度对齐

* 替换masked_per_token_quant, ep精度OK

* 性能提升约30%

* 暂时跑通ep但是有问题

* 自测一致

* rm test fun

* fix ep event

* 图优化算子更新Deepgemm

* fix build

* 暂时绕过deepgemm CI编译问题

* 根据SM区分deepgemm版本

* remove useless code

---------

Co-authored-by: ckl117 <ckl117@163.com>
Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com”>
Co-authored-by: fxyfxy777 <fxyfxy777@163.com>
2026-01-29 13:49:54 +08:00
Yuanle Liu 8b05774fad [Others] enhance deep_ep import and support mixed mode flash_mask_attn (#6238)
* support flashmaskattn mixed and enhance deepep import

* update

* fix
2026-01-28 00:02:02 +08:00
Yuanle Liu 253c5cc16c Improve deep_ep import handling with logging (#6207)
* Improve deep_ep import handling with logging

Refactor deep_ep import logic to handle PaddleFleet and PFCCLab imports with error logging.

* Add traceback import to ep.py
2026-01-24 22:41:42 -08:00
GoldPancake 646aced1eb [UT] Add GLM E2E tests for non-MTP and MTP (#6163)
* add glm ut
2026-01-23 10:34:29 +08:00