Commit Graph

584 Commits

Author SHA1 Message Date
GoldPancake 183b8d325a [RL] Support GLM MTP RL Model (#6267) 2026-02-04 20:14:35 +08:00
fxyfxy777 36547cfdb3 [Feature] FD_USE_PHI_FP8_QUANT (#6320)
* add ut

* add use_fd_quant env

* rm mask_per_token_quant

* add make ops list

* USE_FD_FP8_QUANT -> FD_USE_PHI_FP8_QUANT 默认是true

* modify comments

* use bool type

* Add function declaration
2026-02-03 22:33:03 -08:00
sunxin 9b0a82cfa9 [Model Runner] Support overlap schedule (#6259) 2026-02-04 10:49:44 +08:00
RAM 5b22e5dfe7 [RL] R3 Support Fused Put the Routing of All Layers (#6099)
* fused put routing

* fix bug

* [draft commit]dynamic dtype

* fix async put & numpy bug

* fix unit8 test case
2026-02-03 04:13:16 -08:00
JYChen c745a22420 [Feature] Support Ernie FP8 on sm100 ( the fixed version) (#6304) 2026-02-03 17:47:38 +08:00
bukejiyu 12d4b4cb87 [Feature]Support reorder ids to split prefill and decodes (#5779)
* support reorder ids

* perfect code

* fix

* fix unittest

* delete code

* fix

* add python api

* delete custom op

* update algorithm

* fix swap

* support condense

* support condense

* support mtp

* delete code

* update

* update

* update

* update

* update for other platfrom

* update

* fix

* fix mtp

* fix ut

* update

* fix ut

* update ut

* fix

* fix encoder_cache

* fix ci

* fix

* fix vl

* Fix performance regression

* fix

* fix

* fix mtp

* fix index->req_id mapping

* fix ut

---------

Co-authored-by: root <root@yqlcc01-sys-rpm12rzmwjd.yqlcc01.baidu.com>
Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com”>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-02-03 00:28:02 -08:00
周周周 cbdb2462ea cp 1131 tbo to develop (#6281) 2026-02-03 15:23:23 +08:00
周周周 8277b95fa6 remove speculate_get_padding_offset op (#6308) 2026-02-03 15:18:12 +08:00
fxyfxy777 f3413c4caa [BugFix] fix fused_mask_swiglu_fp8_quant bug (#6316)
* optimize mask_quant op speed up 1.5

* fix calculate sequence

* add fused

* rm log

* push kernel code

* add ut

* accuracy ok

* add ue8m0

* add ut

* add merge develop

* rm ut of mask_per_token_quant

* Revert "[Optimize] optimize mask_quant & swiglu (#6222)"

This reverts commit 2ada119a38.

* add block_size

* pre-commit
2026-02-03 13:54:12 +08:00
GoldPancake fb374238e1 Revert "[RL] Support GLM MTP RL Model (#6223)" (#6301)
This reverts commit af6c84d48d.
2026-02-02 14:08:13 +08:00
fxyfxy777 2ada119a38 [Optimize] optimize mask_quant & swiglu (#6222)
* optimize mask_quant op speed up 1.5

* fix calculate sequence

* add fused

* rm log

* push kernel code

* add ut

* accuracy ok

* add ue8m0

* add ut

* add merge develop

* rm ut of mask_per_token_quant
2026-02-02 13:52:38 +08:00
xiaozude 030647521a [Metax] adapt to the latest develop (#6282) 2026-01-29 23:21:20 -08:00
JYChen 6c685c9474 Revert "[Feature] Support Ernie FP8 on sm100 (#5593)" (#6275)
This reverts commit eb80724b71.
2026-01-30 11:22:01 +08:00
MingkunZhang c4abb01f9c [Metax][Fix] fix 'get_token_penalty_multi_scores' input error based (PaddlePaddle#6069) (#6266) 2026-01-29 19:24:36 +08:00
Ryan 5e78c1ac87 [Graph Optimization] Support CUDAGraph for P/PD mixed Batch using SOT subgraph spliting mode (#6196)
* refine comment && refine variable name

* replace comment
2026-01-29 16:29:54 +08:00
yuxuan 44b52701f6 [Feature] Support NVFP4 MoE on SM100 (#6003)
* fp4 dense

* [WIP] support nvfp4, dense part

* [wip] developing loading qwen model

* loading

* update

* dense fp4 OK, cudagraph error

* [WIP] moe forward part

* with flashinfer-backend

* qwen3_moe_fp4

* update

* support flashinfer-cutlass moe, qwen3-moe-fp4 OK

* support ernie4.5-fp4

* fix load error

* add some ut

* add docs

* fix CLA, test

* fix the apply() in ModelOptNvFp4FusedMoE

* fix CodeStyle

* del the PADDLE_COMPATIBLE_API

* fix broken url: nvidia_gpu.md

* fix docs

* fix token_ids

* fix CI in Hopper

* move flashinfer imports inside the function

* fix model_runner

Removed the logic for generating random padding IDs.

* Remove skip condition for CUDA version in nvfp4 test

* add test for nvfp4

* fix according to review

* Add Chinese translation link to NVFP4 documentation

* del flashinfer.py

* fix unittest

---------

Co-authored-by: zoooo0820 <zoooo0820@qq.com>
Co-authored-by: bukejiyu <395822456@qq.com>
2026-01-29 14:16:07 +08:00
JYChen eb80724b71 [Feature] Support Ernie FP8 on sm100 (#5593)
* Deepgemm暂时可用版本

* dense部分 e8m0 ok

* EB模型E8M0跑通的版本

* code check

* support 21b-tp2, dev_paddle

* 单机4.5T ep OK的版本

* 修复删除的代码,单机4.5T ep(非cudagraph)

* eb tp

* Support SM100 block-wise FP8 inference

* refine codes, support deepgemm on sm100

* add thirdparty PFCC/DeepGEMM

* fix ep decode

* 使用deepep ue8m0, 解决精度问题

* 修复FP8 TP精度

* Deepgemm升级适配Hopper逻辑

* add ue8m0 kernel

* add ue8m0 kernel

* fix custom_ops/gpu_ops/cpp_extensions.cc

* eb 输出正常

* eb5 text is right

* 目测精度一致

* 自测精度对齐

* 替换masked_per_token_quant, ep精度OK

* 性能提升约30%

* 暂时跑通ep但是有问题

* 自测一致

* rm test fun

* fix ep event

* 图优化算子更新Deepgemm

* fix build

* 暂时绕过deepgemm CI编译问题

* 根据SM区分deepgemm版本

* remove useless code

---------

Co-authored-by: ckl117 <ckl117@163.com>
Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com”>
Co-authored-by: fxyfxy777 <fxyfxy777@163.com>
2026-01-29 13:49:54 +08:00
GoldPancake af6c84d48d [RL] Support GLM MTP RL Model (#6223)
* support glm mtp rl model

* fix

* fix

* fix ut

* update baseline
2026-01-28 08:28:03 -08:00
ddchenhao66 6d33d5e370 [Models][BugFix] shared experts and dense mlp layer do not require TP split (#6180)
Co-authored-by: ddchenhao66 <dhaochen163.com>
2026-01-28 18:58:19 +08:00
GoldPancake 7d6c87c29e [Others] Support constrained decoding when enable_thinking is false (#6248)
* support constrained decoding when enable_thinking is false

* fix

* fix

* fix
2026-01-28 00:05:17 -08:00
sunxin 27f8799f04 [Model Runner] Refactor execute_model for GPU async scheduling (#6176) 2026-01-28 14:19:33 +08:00
freeliuzc ce06c6dfb3 [BugFix] Fix token_penalty kernel (#6069)
* fix token_penalty kernel

* try to fix xpu

* fix xpu

* fix unit test
2026-01-28 12:03:05 +08:00
Yuanle Liu 8b05774fad [Others] enhance deep_ep import and support mixed mode flash_mask_attn (#6238)
* support flashmaskattn mixed and enhance deepep import

* update

* fix
2026-01-28 00:02:02 +08:00
周周周 aa57864c5b remove unneeded para from flash_mask_attention (#6218) 2026-01-27 14:04:27 +08:00
xiaoxiaohehe001 7ffa88bb01 [BugFix] fix mask_attn (#6214)
* [BugFix] fix mask attn

* [BugFix] fix mask attn
2026-01-26 07:46:51 -08:00
CSWYF3634076 08c411518f [Loader] support dummy load weight (#6169)
* [Loader] support dummy load weight

* [Loader] support dummy load weight v2

* [Loader] support dummy load weight unittest

* [Loader] support dummy load weight unittest v2

* [Loader] support dummy load weight v3 docs and fp8
2026-01-26 13:58:53 +08:00
sunxin adc69c15d0 [Model Runner] Prepare token count and move FA3 initialization into the graph (#6170)
* prepare for token num and put FA3 init in graph
2026-01-26 12:16:57 +08:00
周周周 0966df78dc [Others] remove stop_nums (#6182) 2026-01-26 12:12:47 +08:00
wangyifei 84a1780814 [build] support build sm 80,86,89,90 to one whl package (#6173)
* support build sm 80,86,89,90 to one whl package

* create tmp dir before build custom ops in FD_UNIFY_BUILD mode

* typo fix

* ignore exceptions in xpu ..
2026-01-26 11:30:02 +08:00
Yuanle Liu 253c5cc16c Improve deep_ep import handling with logging (#6207)
* Improve deep_ep import handling with logging

Refactor deep_ep import logic to handle PaddleFleet and PFCCLab imports with error logging.

* Add traceback import to ep.py
2026-01-24 22:41:42 -08:00
fxyfxy777 79f42209bf add scale_wrapper for per_block_cast_to_fp8 (#6183) 2026-01-23 00:37:20 -08:00
GoldPancake 646aced1eb [UT] Add GLM E2E tests for non-MTP and MTP (#6163)
* add glm ut
2026-01-23 10:34:29 +08:00
yinwei 1e3c35496c [XPU][Graph Optimization] XPU Support CUDAGraph (#6152)
* support cuda graph
2026-01-22 14:41:56 +08:00
Haonan Luo 82057cb71f Support MXFP4 for GPT-OSS (#5435)
* support mxfp4 in gpt-oss

* support mxfp4 in gpt-oss

* add scope for flashinfer

* remove torch code

* update envs.FD_MXFP4_BACKEND

* update process_weights_after_loading

* update env name

* support tp in gpt-oss, add e2e test

* add flashinfer-python-paddle in requirements

* fix import error

* add test

* add test

* add test

* add test
2026-01-22 14:21:01 +08:00
fxyfxy777 9c4db0ac3f [BugFix] fix weight quant op (#6137)
* fix weight quant

* fix weight quant

* bit equal

* code style
2026-01-22 09:50:57 +08:00
zccjjj 14a64e9b3b [XPU] change XPU EP interface from xDeepEP to paddle (#5706)
* add ENV VAR to controll low lantency buffer
2026-01-21 18:23:45 +08:00
K11OntheBoat 490a6551dc rename params of normalization layer (#6133)
Co-authored-by: “liuruian” <liuruian@baidu.com>
2026-01-21 17:18:35 +08:00
yzwu 837ddca273 [Iluvartar][CI] Fix the error max_tokens_per_expert referenced before assignment (#6083) 2026-01-21 16:01:29 +08:00
lizexu123 f4902fe42d [BugFix] fix wint2 (#6109)
* fix

* fix

* fix
2026-01-20 21:46:21 +08:00
yinwei 51a8a2ed57 [XPU] Support CudaGraph(add block attn cuda_graph support) (#6116)
* add block attn cuda_graph support
2026-01-20 19:33:11 +08:00
Ryan dda27e50f5 [Graph Optimization] remove static_op_get_block_shape_and_split_kv_block from cudagraph (#6081)
* rm static_op_get_block_shape_and_split_kv_block from cudagraph

* update max_capture_shape

* fallback: zeros -> empty to avoid coverage check

* check graph_opt_config exists

* add max_capture_shape_dy2st && full_cuda_graph: false -> true in 28B vl test

* add use_cudagraph flag to control step_use_cudagraph
2026-01-20 14:05:18 +08:00
zhupengyang 45ebb2efb4 [XPU] support plugin model (#6092) 2026-01-20 13:00:09 +08:00
jackyYang6 988e0bc338 [Feature] Add PaddleFormers fallback backend (#5999)
* feat(paddleformers): add dense text model fallback backend

* docs(paddleformers): add user guide and fix code review issues

* add fallback unit test

* precommit format

* fix pre-commit

* fix: address code review feedback

* docs: add PaddleFormers backend documentation (EN) and simplify installation

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-19 21:50:50 +08:00
GoldPancake 879e45f6b3 fix compute logits problem (#6093) 2026-01-19 20:12:14 +08:00
sunxin 9dc1c74d36 fix opt qknorm (#6080) 2026-01-19 12:07:20 +08:00
GoldPancake bda38aa519 [Speculative Decoding] Support MTP for GLM-4.5-Air (#6047)
* glm mtp
* add spec neox partial rope
2026-01-16 14:35:24 +08:00
fxyfxy777 4c92035f2d [Feature] Unify fp8 block_wise quant ops (#5991)
* quant stash

* blockwise_quant

* precommit

* rm tensor.cut

* tp ok

* add swiglu

* rm outdate code

* fix activate ut

* change baseline

* fix baseline error
2026-01-15 05:50:37 -08:00
freeliuzc 49617d9832 [Feature]Support tag phase token enforce generation (#6034)
* support tag phase token enforce generation

* optimize note and some feature

* fix sampler unit test

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-15 03:59:55 -08:00
cmcamdy 59d8ae0a25 [XPU] Speculate Decoding + PD, benchmark fix (#6036)
* fix mtp pd

* fix kernel

* fix code style

* fix kernel

* fix test / clear debug code

* fix test / clear debug code

* fix codestyle

* fix codestyle

* fix codestyle
2026-01-15 19:19:03 +08:00
lizexu123 6619298b50 【Optim】Optimize grid dimensions using max_tokens_per_expert for MoE models (#6007)
* update w4afp8

* build.sh ok

* support cuda_graph

* fix

* add test

* fix max_tokens_per_expert

* >=70

* fix

* compute_max_tokens_from_prefix_sum in w4afp8

* compute_max_tokens use cub
2026-01-15 19:18:42 +08:00