398 Commits

Author SHA1 Message Date
CSWYF3634076 08c411518f [Loader] support dummy load weight (#6169)
* [Loader] support dummy load weight

* [Loader] support dummy load weight v2

* [Loader] support dummy load weight unittest

* [Loader] support dummy load weight unittest v2

* [Loader] support dummy load weight v3 docs and fp8
2026-01-26 13:58:53 +08:00
sunxin adc69c15d0 [Model Runner] Prepare token count and move FA3 initialization into the graph (#6170)
* prepare for token num and put FA3 init in graph
2026-01-26 12:16:57 +08:00
Yuanle Liu 253c5cc16c Improve deep_ep import handling with logging (#6207)
* Improve deep_ep import handling with logging

Refactor deep_ep import logic to handle PaddleFleet and PFCCLab imports with error logging.

* Add traceback import to ep.py
2026-01-24 22:41:42 -08:00
fxyfxy777 79f42209bf add scale_wrapper for per_block_cast_to_fp8 (#6183) 2026-01-23 00:37:20 -08:00
GoldPancake 646aced1eb [UT] Add GLM E2E tests for non-MTP and MTP (#6163)
* add glm ut
2026-01-23 10:34:29 +08:00
Haonan Luo 82057cb71f Support MXFP4 for GPT-OSS (#5435)
* support mxfp4 in gpt-oss

* support mxfp4 in gpt-oss

* add scope for flashinfer

* remove torch code

* update envs.FD_MXFP4_BACKEND

* update process_weights_after_loading

* update env name

* support tp in gpt-oss, add e2e test

* add flashinfer-python-paddle in requirements

* fix import error

* add test

* add test

* add test

* add test
2026-01-22 14:21:01 +08:00
fxyfxy777 9c4db0ac3f [BugFix] fix weight quant op (#6137)
* fix weight quant

* fix weight quant

* bit equal

* code style
2026-01-22 09:50:57 +08:00
zccjjj 14a64e9b3b [XPU] change XPU EP interface from xDeepEP to paddle (#5706)
* add ENV VAR to control low latency buffer
2026-01-21 18:23:45 +08:00
K11OntheBoat 490a6551dc rename params of normalization layer (#6133)
Co-authored-by: “liuruian” <liuruian@baidu.com>
2026-01-21 17:18:35 +08:00
yzwu 837ddca273 [Iluvatar][CI] Fix the error max_tokens_per_expert referenced before assignment (#6083) 2026-01-21 16:01:29 +08:00
lizexu123 f4902fe42d [BugFix] fix wint2 (#6109)
* fix

* fix

* fix
2026-01-20 21:46:21 +08:00
yinwei 51a8a2ed57 [XPU] Support CudaGraph(add block attn cuda_graph support) (#6116)
* add block attn cuda_graph support
2026-01-20 19:33:11 +08:00
Ryan dda27e50f5 [Graph Optimization] remove static_op_get_block_shape_and_split_kv_block from cudagraph (#6081)
* rm static_op_get_block_shape_and_split_kv_block from cudagraph

* update max_capture_shape

* fallback: zeros -> empty to avoid coverage check

* check graph_opt_config exists

* add max_capture_shape_dy2st && full_cuda_graph: false -> true in 28B vl test

* add use_cudagraph flag to control step_use_cudagraph
2026-01-20 14:05:18 +08:00
zhupengyang 45ebb2efb4 [XPU] support plugin model (#6092) 2026-01-20 13:00:09 +08:00
jackyYang6 988e0bc338 [Feature] Add PaddleFormers fallback backend (#5999)
* feat(paddleformers): add dense text model fallback backend

* docs(paddleformers): add user guide and fix code review issues

* add fallback unit test

* precommit format

* fix pre-commit

* fix: address code review feedback

* docs: add PaddleFormers backend documentation (EN) and simplify installation

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-19 21:50:50 +08:00
sunxin 9dc1c74d36 fix opt qknorm (#6080) 2026-01-19 12:07:20 +08:00
fxyfxy777 4c92035f2d [Feature] Unify fp8 block_wise quant ops (#5991)
* quant stash

* blockwise_quant

* precommit

* rm tensor.cut

* tp ok

* add swiglu

* rm outdated code

* fix activate ut

* change baseline

* fix baseline error
2026-01-15 05:50:37 -08:00
freeliuzc 49617d9832 [Feature] Support tag phase token enforce generation (#6034)
* support tag phase token enforce generation

* optimize note and some feature

* fix sampler unit test

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-15 03:59:55 -08:00
lizexu123 6619298b50 [Optim] Optimize grid dimensions using max_tokens_per_expert for MoE models (#6007)
* update w4afp8

* build.sh ok

* support cuda_graph

* fix

* add test

* fix max_tokens_per_expert

* >=70

* fix

* compute_max_tokens_from_prefix_sum in w4afp8

* compute_max_tokens use cub
2026-01-15 19:18:42 +08:00
Cheng Yanfei fbcccaa750 [Intel HPU] enable MoE EP for hpu (#5855)
* enable HPU MoE EP

* MoE intermediate_scale stack

* enable loader_v1 esp for tensor_wise_fp8 TP or EP

* modify activation_scale name
2026-01-15 13:08:00 +08:00
zhupengyang 24ffa7c991 [XPU] fix moe num_expert (#6014) 2026-01-15 10:49:36 +08:00
RAM b3f59fd9b5 [RL][CI] Support Async R3 And Add Accuracy Test (#5937)
* add bs1 r3 test case

* async put

* r3 test case 1.0

* success run eb5

* refine test case

* pre-commit

* add eb45 & glm testcase

* format code

* add p2pstore requirements

* support only last turn

* R3 use worker log

* refine code & fix ci bug

* refine error message

* fix empty input bug

* Success set acc ci of eb45 and glm45

* refine code

* fix bug
2026-01-14 04:25:06 -08:00
Ryan 0d1a5e70bc [Graph Optimization] Add full_cuda_graph to control subgraph split (#6027) 2026-01-14 11:43:59 +08:00
sunxin 2533836dbb [Optimization] Accelerate Qwen3 QK RMSNorm via Fused Triton Kernel (#5880)
* qk rmsnorm fused

* inplace

* glm

* fix

* add qknorm layer

* fix

* update

* fix qwen3 vl

* update rl baseline

* fix qwen3 vl moe

* test

* fix qwen vl moe rl

* fix
2026-01-12 05:10:21 -08:00
lzy 223b2f5d86 Support setting communication groups in custom_allreduce and the all-to-all/transpose fused operator during the decoding phase. (#5917) 2026-01-12 14:09:39 +08:00
周周周 b8d9daa785 MLA clean code (#5979) 2026-01-10 21:05:00 +08:00
xiaoxiaohehe001 00a01ae024 [Feature] Support redundant expert for eplb (#5918)
* [BugFix] support redundant expert for eplb

* support redundant expert for eplb

* support redundant expert for eplb

* update

* fix ci eplb
2026-01-09 17:13:24 +08:00
zccjjj 20de04e249 [XPU] move xpu_attn_backend.py to FastDeploy/fastdeploy/model_executor/layers/backends/xpu (#5878) 2026-01-09 16:34:57 +08:00
Yuanle Liu d4a386dfc4 Revert "Revert "[TSP] last_norm allgather move to model.py (#5924)" (#5961)" (#5972)
This reverts commit 8c3513a410.
2026-01-09 15:58:22 +08:00
Yuanle Liu 8c3513a410 Revert "[TSP] last_norm allgather move to model.py (#5924)" (#5961)
This reverts commit 2bb838fed9.
2026-01-09 15:20:40 +08:00
xiaoluomi 2bb838fed9 [TSP] last_norm allgather move to model.py (#5924)
* support_lastnorm_gather_split_dev

* support_lastnorm_gather_split_dev1

* support_lastnorm_gather_split_dev3

* support_lastnorm_gather_split_dev4

* support_lastnorm_gather_split_dev5
2026-01-07 23:36:33 -08:00
GoldPancake a1fc4e249e [Bugfix] Fix mtp logprob hang problem when include stop_seq (#5927)
* fix mtp logprob hang when include stop_seq
2026-01-08 14:21:24 +08:00
lizhenyun01 2be8656c29 [BugFix] fix mtp split kv attention (#5920)
* [BugFix] fix mtp split kv attention

* clean code

* clean code
2026-01-07 04:07:31 -08:00
Ryan 3e74bacc5e add m_grouped_gemm_fp8_fp8_bf16_nt_contiguous_custom_python_op (#5847) 2026-01-07 16:17:55 +08:00
fmiao2372 1ee285c2d6 [Intel HPU] enable chunked prefill (#5903)
* [Intel HPU] enable chunked prefill

* fix bug by copilot comments
2026-01-06 21:01:50 +08:00
lizexu123 acdf0cd1d9 fix hadamard_block_size (#5888) 2026-01-06 14:12:14 +08:00
Neil Zhu 272a371635 [Metax] optimize flash attention backend (#5876) 2026-01-06 09:52:09 +08:00
lizexu123 1d3ae7c024 [BugFix] fix w4afp8 tp=8 (#5868)
* fix w4afp8 tp=8

* fix
2026-01-05 18:59:02 +08:00
ming1753 f50e1bcc16 [Others] enable use PFCC deep_ep (#5822)
* upstream deep_ep

* fix bug

* fix bug

* modify env name
2026-01-05 02:07:01 -08:00
周周周 dc13344ab8 [Optimization] add del to decrease peak memory in MoE prefill (#5863) 2026-01-05 14:01:48 +08:00
chen 193886e745 only cuda run triton op (#5846) 2025-12-31 14:17:31 +08:00
GoldPancake 4e10ae5d99 [Speculative Decoding] Optimize draft logprob (#5842)
* optimize draft logprob

* fix ut
2025-12-31 13:35:56 +08:00
chen 0bcf924e10 [Optimization] Optimization for gather_logprob by 10GB (#5817)
* opt logprobs gather_logprob, reduce device memory usage by 10GB when token_num=8k
2025-12-30 15:33:34 +08:00
lizexu123 44a13e4557 [Feature] support w4afp8 v1_loader and v0_loader(tp>1) (#5757)
* support

* fix

* support w4afp8 v1_loader and v0_loader

* fix

* fix test

* fix test

* fix test

* fix moe.py

* add test_ernie_4_5_w4afp8

* add test

* delete tensor

* fix test

* fix

* add

* fix test
2025-12-30 14:11:52 +08:00
CSWYF3634076 9286403570 [Models] Add Qwen3-VL Model Support (#5763)
* support v1 loader

* remove useless code

* remove useless

* [Model] support Qwen3VL images success

* [Model] support Qwen3VL rope_3d

* [Model] support Qwen3VL remove log

* [Model] support Qwen3VL RL

* [Model] support Qwen3VL tp

* [Model] support Qwen3VL video

* [Model] support Qwen3VL fix ernievl

* [Model] support Qwen3VL fix get_image_boundaries.cc array out of bounds

* [Model] support Qwen3VL fix multi card

* [Model] support Qwen3VL file close

* [Model] support Qwen3VL fix ce

* [Model] support Qwen3VL fix unittest

* [Model] support Qwen3VL add unittest

---------

Co-authored-by: Ayakouji <yuhongh@qq.com>
2025-12-29 17:39:33 +08:00
Ryan eb782a0225 [BugFix] Fix return value inconsistency for ep_moe_expert_combine op (#5812) 2025-12-29 16:44:00 +08:00
周周周 03363cab4c make flash_mask attention pybind (#5783) 2025-12-26 14:31:35 +08:00
Nyakku Shigure 11227e00bb [GraphOptimization] Wrap deep gemm and triton as python op (#5673)
* [GraphOptimization] Wrap deep gemm and triton as python op

* add unittest to _base_test && compatibility

* paddle.static.MetaTensor -> "paddle.static.MetaTensor"

* mv register_custom_python_op

* rename yaml

---------

Co-authored-by: DrRyanHuang <zihaohuang@aliyun.com>
2025-12-24 15:23:46 +08:00
GoldPancake 23d488c488 [Feature] Entropy calculation support (#5692)
* support entropy

* fix bug

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2025-12-23 21:19:47 +08:00
bukejiyu d1c6e57341 [Others] upgrade paddleformer to 0.4.0 (#5599) 2025-12-23 05:08:01 -08:00