yinwei
51a8a2ed57
[XPU] Support CudaGraph(add block attn cuda_graph support) ( #6116 )
...
* add block attn cuda_graph support
2026-01-20 19:33:11 +08:00
Ryan
dda27e50f5
[Graph Optimization] remove static_op_get_block_shape_and_split_kv_block from cudagraph ( #6081 )
...
* rm static_op_get_block_shape_and_split_kv_block from cudagraph
* update max_capture_shape
* fallback: zeros -> empty to avoid coverage check
* check graph_opt_config exists
* add max_capture_shape_dy2st && full_cuda_graph: false -> true in 28B vl test
* add use_cudagraph flag to control step_use_cudagraph
2026-01-20 14:05:18 +08:00
zhupengyang
45ebb2efb4
[XPU] support plugin model ( #6092 )
2026-01-20 13:00:09 +08:00
jackyYang6
988e0bc338
[Feature] Add PaddleFormers fallback backend ( #5999 )
...
* feat(paddleformers): add dense text model fallback backend
* docs(paddleformers): add user guide and fix code review issues
* add fallback unit test
* precommit format
* fix pre-commit
* fix: address code review feedback
* docs: add PaddleFormers backend documentation (EN) and simplify installation
---------
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-19 21:50:50 +08:00
sunxin
9dc1c74d36
fix opt qknorm ( #6080 )
2026-01-19 12:07:20 +08:00
fxyfxy777
4c92035f2d
[Feature] Unify fp8 block_wise quant ops ( #5991 )
...
* quant stash
* blockwise_quant
* precommit
* rm tensor.cut
* tp ok
* add swiglu
* rm outdated code
* fix activate ut
* change baseline
* fix baseline error
2026-01-15 05:50:37 -08:00
freeliuzc
49617d9832
[Feature]Support tag phase token enforce generation ( #6034 )
...
* support tag phase token enforce generation
* optimize note and some feature
* fix sampler unit test
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-15 03:59:55 -08:00
lizexu123
6619298b50
【Optim】Optimize grid dimensions using max_tokens_per_expert for MoE models ( #6007 )
...
* update w4afp8
* build.sh ok
* support cuda_graph
* fix
* add test
* fix max_tokens_per_expert
* >=70
* fix
* compute_max_tokens_from_prefix_sum in w4afp8
* compute_max_tokens use cub
2026-01-15 19:18:42 +08:00
Cheng Yanfei
fbcccaa750
[Intel HPU] enable MoE EP for hpu ( #5855 )
...
* enable HPU MoE EP
* MoE intermediate_scale stack
* enable loader_v1 esp for tensor_wise_fp8 TP or EP
* modify activation_scale name
2026-01-15 13:08:00 +08:00
zhupengyang
24ffa7c991
[XPU] fix moe num_expert ( #6014 )
2026-01-15 10:49:36 +08:00
RAM
b3f59fd9b5
[RL][CI] Support Async R3 And Add Accuracy Test ( #5937 )
...
* add bs1 r3 test case
* async put
* r3 test case 1.0
* success run eb5
* refine test case
* pre-commit
* add eb45 & glm testcase
* format code
* add p2pstore requirements
* support only last turn
* R3 use worker log
* refine code &fix ci bug
* refine error message
* fix empty input bug
* Success set acc ci of eb45 and glm45
* refine code
* fix bug
2026-01-14 04:25:06 -08:00
Ryan
0d1a5e70bc
[Graph Optimization] Add full_cuda_graph to control subgraph split ( #6027 )
2026-01-14 11:43:59 +08:00
sunxin
2533836dbb
[Optimization] Accelerate Qwen3 QK RMSNorm via Fused Triton Kernel ( #5880 )
...
* qk rmsnorm fused
* inplace
* glm
* fix
* add qknorm layer
* fix
* update
* fix qwen3 vl
* update rl baseline
* fix qwen3 vl moe
* test
* fix qwen vl moe rl
* fix
2026-01-12 05:10:21 -08:00
lzy
223b2f5d86
Support setting communication groups in custom_allreduce and the all-to-all\transpose fused operator during the decoding phase. ( #5917 )
2026-01-12 14:09:39 +08:00
周周周
b8d9daa785
MLA clean code ( #5979 )
2026-01-10 21:05:00 +08:00
xiaoxiaohehe001
00a01ae024
[Feature] Support redundant expert for eplb ( #5918 )
...
* [BugFix] support redundant expert for eplb
* support redundant expert for eplb
* support redundant expert for eplb
* update
* fix ci eplb
2026-01-09 17:13:24 +08:00
zccjjj
20de04e249
[XPU] move xpu_attn_backend.py to FastDeploy/fastdeploy/model_executor/layers/backends/xpu ( #5878 )
2026-01-09 16:34:57 +08:00
Yuanle Liu
d4a386dfc4
Revert "Revert "[TSP] last_norm allgather move to model.py ( #5924 )" ( #5961 )" ( #5972 )
...
This reverts commit 8c3513a410.
2026-01-09 15:58:22 +08:00
Yuanle Liu
8c3513a410
Revert "[TSP] last_norm allgather move to model.py ( #5924 )" ( #5961 )
...
This reverts commit 2bb838fed9.
2026-01-09 15:20:40 +08:00
xiaoluomi
2bb838fed9
[TSP] last_norm allgather move to model.py ( #5924 )
...
* support_lastnorm_gather_split_dev
* support_lastnorm_gather_split_dev1
* support_lastnorm_gather_split_dev3
* support_lastnorm_gather_split_dev4
* support_lastnorm_gather_split_dev5
2026-01-07 23:36:33 -08:00
GoldPancake
a1fc4e249e
[Bugfix] Fix mtp logprob hang problem when include stop_seq ( #5927 )
...
* fix mtp logprob hang when include stop_seq
2026-01-08 14:21:24 +08:00
lizhenyun01
2be8656c29
[BugFix] fix mtp split kv attention ( #5920 )
...
* [BugFix] fix mtp split kv attention
* clean code
* clean code
2026-01-07 04:07:31 -08:00
Ryan
3e74bacc5e
add m_grouped_gemm_fp8_fp8_bf16_nt_contiguous_custom_python_op ( #5847 )
2026-01-07 16:17:55 +08:00
fmiao2372
1ee285c2d6
[Intel HPU] enable chunked prefill ( #5903 )
...
* [Intel HPU] enable chunked prefill
* fix bug by copilot comments
2026-01-06 21:01:50 +08:00
lizexu123
acdf0cd1d9
fix hadamard_block_size ( #5888 )
2026-01-06 14:12:14 +08:00
Neil Zhu
272a371635
[Metax] optimize flash attention backend ( #5876 )
2026-01-06 09:52:09 +08:00
lizexu123
1d3ae7c024
[BugFix] fix w4afp8 tp=8 ( #5868 )
...
* fix w4afp8 tp=8
* fix
2026-01-05 18:59:02 +08:00
ming1753
f50e1bcc16
[Others] enable use PFCC deep_ep ( #5822 )
...
* upstream deep_ep
* fix bug
* fix bug
* modify env name
2026-01-05 02:07:01 -08:00
周周周
dc13344ab8
[Optimization] add del to decrease peak memory in MoE prefill ( #5863 )
2026-01-05 14:01:48 +08:00
chen
193886e745
only cuda run triton op ( #5846 )
2025-12-31 14:17:31 +08:00
GoldPancake
4e10ae5d99
[Speculative Decoding] Optimize draft logprob ( #5842 )
...
* optimize draft logprob
* fix ut
2025-12-31 13:35:56 +08:00
chen
0bcf924e10
[Optimization] Optimization for gather_logprob by 10GB ( #5817 )
...
* opt logprobs gather_logprob,reduce device memory usage by 10GB when token_num=8k
2025-12-30 15:33:34 +08:00
lizexu123
44a13e4557
[Feature] support w4afp8 v1_loader and v0_loader(tp>1) ( #5757 )
...
* support
* fix
* support w4afp8 v1_loader and v0_loader
* fix
* fix test
* fix test
* fix test
* fix moe.py
* add test_ernie_4_5_w4afp8
* add test
* delete tensor
* fix test
* fix
* add
* fix test
2025-12-30 14:11:52 +08:00
CSWYF3634076
9286403570
[Models] Add Qwen3-VL Model Support ( #5763 )
...
* support v1 loader
* remove useless code
* remove useless
* [Model] support Qwen3VL images success
* [Model] support Qwen3VL rope_3d
* [Model] support Qwen3VL remove log
* [Model] support Qwen3VL RL
* [Model] support Qwen3VL tp
* [Model] support Qwen3VL video
* [Model] support Qwen3VL fix ernievl
* [Model] support Qwen3VL fix get_image_boundaries.cc array out of bounds
* [Model] support Qwen3VL fix multi card
* [Model] support Qwen3VL file close
* [Model] support Qwen3VL fix ce
* [Model] support Qwen3VL fix unittest
* [Model] support Qwen3VL add unittest
---------
Co-authored-by: Ayakouji <yuhongh@qq.com>
2025-12-29 17:39:33 +08:00
Ryan
eb782a0225
[BugFix] Fix return value inconsistency for ep_moe_expert_combine op ( #5812 )
2025-12-29 16:44:00 +08:00
周周周
03363cab4c
make flash_mask attention pybind ( #5783 )
2025-12-26 14:31:35 +08:00
Nyakku Shigure
11227e00bb
[GraphOptimization] Wrap deep gemm and triton as python op ( #5673 )
...
* [GraphOptimization] Wrap deep gemm and triton as python op
* add unittest to _base_test && compatibility
* paddle.static.MetaTensor -> "paddle.static.MetaTensor"
* mv register_custom_python_op
* rename yaml
---------
Co-authored-by: DrRyanHuang <zihaohuang@aliyun.com>
2025-12-24 15:23:46 +08:00
GoldPancake
23d488c488
[Feature] Entropy calculation support ( #5692 )
...
* support entropy
* fix bug
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2025-12-23 21:19:47 +08:00
bukejiyu
d1c6e57341
[Others] upgrade paddleformer to 0.4.0 ( #5599 )
2025-12-23 05:08:01 -08:00
RuohengMa
2c3c983b96
[XPU] modify speculate_verify ( #5522 )
2025-12-23 14:50:30 +08:00
Sunny-bot1
04035e4ebf
support w4afp8 two stage ( #5608 )
2025-12-22 15:13:05 +08:00
Sunny-bot1
40f3897a4e
support w4afp8 moe offline permute & load ( #5613 )
2025-12-22 15:12:57 +08:00
yzwu
ac013803f3
[Iluvatar] Support V1_KVCACHE_SCHEDULER and paddleocr-vl rope mode ( #5555 )
2025-12-18 02:14:25 -08:00
Longzhi Wang
d8587e987e
[Model] tp+ep support v1_loader ( #5465 )
...
* [Model] tp+ep support v1_loader
* fix
* fix mtp_linear
* fix mtp_linear
* fix
* fix
* fix v0 loader
* fix
* Add get_tensor for ep
* fix linear weight_loader
* fix typo
* fix
2025-12-18 14:31:54 +08:00
zhupengyang
8735cb5045
[XPU] refactor moe ffn ( #5501 )
...
- remove BKCL_DISPATCH_ALL_GATHER
- support sparse mode
- support moe quant_method
2025-12-18 14:14:05 +08:00
fmiao2372
404cf0ece4
[Intel HPU] enable tensor_wise_fp8 ( #5324 )
...
* [Intel HPU] enable tensor_wise_fp8
* update code based on comments
* fix code style issue
* fix bug about RP 5138
* mv kv_cache modifications to HPU backend
* fix FP8 Precision Issues
* fix FP8 Precision Issues
* Add quantization UT
---------
Co-authored-by: yanfeich <yanfei.cheng@intel.com>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2025-12-17 16:45:03 +08:00
freeliuzc
15f5112ecb
[Speculative Decoding]Support different inferseed in speculate decoding ( #5568 )
...
* fix mtp entropy drop in RL
* optimize usage and fix unit test
* optimize padding_sampling_params speed(vectorized)
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2025-12-17 16:14:29 +08:00
RAM
6fc5eccf83
[RL] R3 Support RDMA Store ( #5467 )
...
* [RL] R3 support rdma store
* refine notes
* refine code
* disable prefix cache
* support preempted task and put cpu tensor
2025-12-16 16:50:13 +08:00
Yuanle Liu
b8e4828373
[BugFix] fix dynamic c8 in v1 loader ( #5562 )
2025-12-15 04:07:54 -08:00
zhang-chenyi
77f8ba06e7
[Metax] fix release2.4 and support cudagraph ( #5547 )
...
Co-authored-by: xiaozude <xiaozude@outlook.com>
2025-12-15 14:23:33 +08:00