AIbin
c3aceb6bdc
[Models][OP][Optimization] Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM ( #6689 )
...
* Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM
2026-03-10 15:05:14 +08:00
sunxin
28f7727a3d
[Feature] Set overlap schedule as default ( #6668 )
...
* overlap default
2026-03-09 22:34:54 +08:00
周周周
3897a0b4fc
clean nvfp4 code ( #6671 )
2026-03-09 18:00:34 +08:00
gongweibao
30f9f33f34
[Feature][BugFix][OP] Enhance Deterministic Inference Mode with Kernel-level Fixes and Batch-invariant BMM ( #6610 )
...
* add fa deter
* add ut
* add long sentence
* fix basic
* fix bugs
* fix adn
* fix first
* fix single
* fix single
* fix single test
* refine
* add more test
* refine comments
* add comments of bmm
* fix ci
* remove probe
* add
* remove not need
* refine tests
* fix comments and refine code
* refine code
* refine test
* refine test
* mv 4cards tests
* fix tests
* add
* fix comments
* fix cover
* fix cover
---------
Co-authored-by: gongweibao <gognweibao@baidu.com >
2026-03-09 10:27:53 +08:00
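A batch-invariant BMM means each sample's result is bitwise identical whether it is computed inside a large batch or on its own, which is the property the kernel-level fixes above target. A minimal sketch of that check, assuming `paddle.bmm` as the kernel under test (the actual fixes live in the custom ops):

```python
import paddle

# Batch-invariance: row i of a batched BMM must match the result of
# running that same row through the kernel as a batch of one.
a = paddle.rand([8, 64, 128], dtype="float32")
b = paddle.rand([8, 128, 32], dtype="float32")

full = paddle.bmm(a, b)  # one launch over the whole batch
for i in range(a.shape[0]):
    single = paddle.bmm(a[i : i + 1], b[i : i + 1])  # batch of one
    # Bitwise equality, not allclose: determinism demands identical bits.
    assert bool(paddle.all(full[i] == single[0])), f"sample {i} differs"
```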
周周周
cebe6f7dae
clean nvfp4 related code ( #6644 )
2026-03-05 15:48:33 +08:00
ming1753
81e04bf5d1
[BugFix] fix flash attn mtp rope emb bug ( #6649 )
2026-03-04 21:19:12 +08:00
bukejiyu
598cce8545
[RL] Support SM100 FP8 quantization in RL ( #6601 )
...
* RL SM100 Fix
* update
2026-03-04 04:55:04 -08:00
zhupengyang
1256fd3806
[XPU] weight-only quant method supports QKVGate_proj ( #6641 )
2026-03-04 18:25:03 +08:00
yzwu
3345641f4e
[Iluvatar][CI] fix the dim error of seq_lens_encoder and seq_lens_decoder ( #6637 )
2026-03-04 14:00:40 +08:00
ming1753
02d32eea3b
Revert "[Bug Fix] Fix MM mtp incorrect rope emb ( #6581 )" ( #6631 )
...
This reverts commit c5eb6b65e7 .
2026-03-04 11:23:28 +08:00
ming1753
c5eb6b65e7
[Bug Fix] Fix MM mtp incorrect rope emb ( #6581 )
...
* [Bug Fix] Fix MM mtp incorrect rope emb
2026-03-03 19:28:59 +08:00
RichardWooSJTU
61789febb9
[Quantization] Support loading static quant ue8m0 scale of DeepGEMM via v0_loader ( #6433 )
...
* support loading static quant ue8m0 scale of deepgemm via v0_loader
* [Fix] Fix ue8m0 scale pack dimension calculation and block size validation
1. Fix pack dimension calculation in fused_moe_triton_backend.py:
- Changed from `ceil_div(...) // 4` to `(num_scales + 3) // 4` for correct ceiling division
- This ensures sufficient pack allocation when num_scales is not a multiple of 4
2. Fix block size hardcoding in block_wise_fp8.py:
- Use `self.quant_config.weight_block_size` instead of hardcoded `[128, 128]`
- Add assertion to ensure weight_block_size is `[128, 128]` for ue8m0
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com >
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-03 11:32:35 +08:00
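The pack-dimension fix above is plain ceiling division: ue8m0 scales are packed four to a word, so the allocated pack dimension must round up when `num_scales` is not a multiple of 4. A minimal sketch of the arithmetic (the surrounding loader code is elided):

```python
def packed_dim(num_scales: int, pack: int = 4) -> int:
    # Ceiling division: allocate enough packed words even when
    # num_scales is not a multiple of the pack width.
    return (num_scales + pack - 1) // pack

# Equivalent to the fix's (num_scales + 3) // 4 for pack == 4.
assert packed_dim(5) == 2  # 5 scales need 2 packed words
assert 5 // 4 == 1         # a trailing plain `// 4` floors and under-allocates
```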
chen
1cae7a0d53
weight-only quant method supports QKVGate_proj ( #6612 )
2026-03-03 11:19:32 +08:00
周周周
3cc09418f1
support dsv3 with flashmla ( #6593 )
2026-03-03 11:09:43 +08:00
ming1753
33d6d2403c
[BugFix] fix bug when seq_lens_this_time is 2D ( #6613 )
2026-03-02 23:52:03 +08:00
MingkunZhang
3cf7c6c281
[Metax][Fix] fix CI error based on PR #6535 ( #6600 )
2026-03-02 18:50:16 +08:00
ming1753
344db8c8af
[BugFix] Fix mtp when token_ids_all is None ( #6591 )
...
* [BugFix] Fix mtp when token_ids_all is None
* fix bug
2026-03-02 01:23:44 -08:00
RichardWooSJTU
7bd86f99a5
[BugFix] Fix tbo nan ( #6439 )
2026-03-02 14:28:48 +08:00
yzwu
6674131b0b
[Iluvatar] Support CudaGraph and optimize flash_attn_unpadded and fused_neox_rope_embedding ( #6553 )
2026-03-02 14:07:17 +08:00
周周周
d957ccd46d
seq_lens related tensor shape -> [max_num_seqs] ( #6535 )
2026-03-02 11:18:30 +08:00
chen
5382fb2c60
[BugFix] lazy enable_torch_proxy for cutlass ( #6523 )
...
* lazy enable_torch_proxy for cutlass
* test init_flash_attn_version
2026-03-02 10:43:58 +08:00
RichardWooSJTU
7cfb0ffba0
fix pfcc deep ep in low latency mode ( #6440 )
2026-03-02 10:35:51 +08:00
AIbin
59b578c337
[Feature] Supports SWA based on appendattn ( #6547 )
2026-03-01 19:02:08 +08:00
zccjjj
a2072fe20c
[XPU] support warmup with ep & remove apply_tp_fused_op ( #6289 )
2026-02-28 15:40:36 +08:00
ming1753
97eee75677
[Feature] GPU Memory Optimization and Retirement of V0 Scheduler ( #6407 )
...
* Optim GPU Mem Usage
---------
Co-authored-by: huzesen <huzesen@baidu.com >
2026-02-28 15:07:43 +08:00
YuBaoku
54f7d9f621
[CI] Sync mm_batch_invariant with paddle.mm update ( #6557 )
2026-02-28 14:56:42 +08:00
Weiguo Zhu
8fb24122b8
fix reshard error ( #6536 )
2026-02-27 22:22:37 +08:00
JYChen
c6d8fbe526
[BugFix] fix log with paddlefleet.ops ( #6528 )
2026-02-27 14:34:29 +08:00
sunxin
53aaac69da
[Optimization] Enable BF16 gate computation for GLM and Qwen ( #6457 )
...
* gate bf16
* add gate-fp32
* fix
* update baseline
* update
* update
* fix
2026-02-26 21:08:46 -08:00
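A sketch of the pattern this optimization implies, assuming the MoE gate is a plain linear projection: run the gate GEMM in BF16, then keep the numerically sensitive softmax and top-k in FP32 (the `gate-fp32` bullet above presumably preserves a full-FP32 fallback):

```python
import paddle
import paddle.nn.functional as F

def gate_scores(hidden, gate_weight, top_k=8):
    # Gate projection in BF16 (the fast path); softmax and top-k
    # are done in FP32 to keep expert routing numerically stable.
    logits = paddle.matmul(hidden.astype("bfloat16"),
                           gate_weight.astype("bfloat16"))
    probs = F.softmax(logits.astype("float32"), axis=-1)
    return paddle.topk(probs, k=top_k, axis=-1)  # (values, expert indices)
```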
gongweibao
edd31e8849
[Feature] Add Deterministic Inference Support ( #6476 )
...
* add
* [tests] Add Paddle attention determinism tests and refactor resource manager
Add comprehensive determinism tests for Paddle attention layer and refactor
resource manager for deterministic mode support.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com >
* add
* add
* add
* add
* add more
* add more
* fixsome
* fixsome
* fix bugs
* fix bugs
* only in gpu
* add docs
* fix comments
* fix some
* fix some
* fix comments
* add more
* fix potential problem
* remove not need
* remove not need
* remove no need
* fix bug
* fix bugs
* fix comments
* fix comments
* Update tests/ce/deterministic/test_determinism_verification.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/inter_communicator/test_ipc_signal.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/layers/test_paddle_attention_determinism.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/engine/test_sampling_params_determinism.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/layers/test_paddle_attention_determinism.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/layers/test_paddle_attention_determinism_standalone.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* fix comments
* fix import error
* fix a bug
* fix bugs
* fix bugs
* fix coverage
* refine codes
* refine code
* fix comments
* fix comments
* fix comments
* rm not need
* fix allreduce large tensor bug
* mv log files
* mv log files
* add files
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2026-02-26 19:31:51 -08:00
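The determinism tests added here reduce to one contract: identical requests yield identical token streams across runs. A sketch of such a verification, with `llm.generate` and `.token_ids` as hypothetical stand-ins for the real entry points:

```python
def assert_deterministic(llm, prompt, n_runs=3, **sampling_kwargs):
    # Deterministic-mode contract: repeated runs of the same request
    # must produce exactly the same token ids, not merely similar text.
    # `llm.generate` / `.token_ids` are illustrative names only.
    outputs = [llm.generate(prompt, **sampling_kwargs) for _ in range(n_runs)]
    first = outputs[0].token_ids
    for run, out in enumerate(outputs[1:], start=2):
        assert out.token_ids == first, f"run {run} diverged from run 1"
```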
zccjjj
c34cb2a8c2
[XPU] [bugfix] fix moe_ffn_quant_type_map bugs about data type and tensor shape ( #6337 )
2026-02-27 09:55:41 +08:00
chen
2d1531f3cb
dev opensource model support fa4/flashmaskV2/V3 ( #6518 )
2026-02-26 17:46:05 +08:00
zhupengyang
a303eacf62
[XPU] support norm before rope ( #6475 )
2026-02-25 18:43:44 +08:00
Longzhi Wang
22566168c3
[Feature] support qkv&gate linear fusion ( #6455 )
...
* [Feature] support qkv&gate linear fusion
* add test
2026-02-24 15:20:29 +08:00
AIbin
0eb87467f8
[BugFix] fix RL bug about blockwisefp8 ( #6466 )
...
* fix RL bug about blockwisefp8
* fix moe same bug
* fix RL FP8 bug
2026-02-12 09:15:29 +08:00
JYChen
40c952e7b5
fix deepgemm import ( #6451 )
...
Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com >
2026-02-11 20:10:01 +08:00
zhupengyang
4a8c54926b
[XPU] topk_method=noaux_tc ( #6355 )
2026-02-11 16:12:20 +08:00
yzwu
60e75ea8e8
[Iluvatar][CI] Fix cannot import get_stop ( #6165 )
2026-02-10 16:57:23 +08:00
chen
d937d6ebfd
check ( #6424 )
2026-02-10 15:55:17 +08:00
chen
a8ffcaa068
fix fa4 test ( #6408 )
2026-02-10 10:57:21 +08:00
bukejiyu
5bfc0938e2
[BugFix] PD reorder fix and add ut ( #6375 )
2026-02-09 04:42:48 -08:00
sunxin
783d56e28a
[Optimization] Support logprob async copy ( #6362 )
...
* support logprob async copy
* fix prompt logprob
* fix xpu
2026-02-09 17:32:12 +08:00
bukejiyu
dc5917289d
[loader] support wint2 backend ( #6139 )
...
* support wint2
* update
2026-02-08 22:42:36 -08:00
Mattheliu
c776d483e4
[BugFix] fix handling of 4 return values from noaux_tc_redundant op ( #6384 )
...
* fix: handle 4 return values from noaux_tc_redundant op
The noaux_tc_redundant CUDA op is defined with 4 outputs in PD_BUILD_STATIC_OP:
- output_tensor (scores)
- topk_values
- topk_indices
- tokens_per_expert_stats_list_out (inplace updated)
The Python code was only unpacking 3 values, causing:
ValueError: too many values to unpack (expected 3)
This fix correctly unpacks all 4 return values, ignoring the inplace
updated tensor which is the same as the input tokens_per_expert_stats_list.
Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com >
* fix: make noaux_tc_redundant return 4 values to match OP definition
The PD_BUILD_STATIC_OP defines 4 outputs but the function only returned 3,
causing inconsistent behavior across different Paddle framework versions.
This fix explicitly returns 4 values:
- scores (inplace modified)
- topk_values
- topk_indices
- tokens_per_expert_stats_list (inplace modified via atomicAdd)
Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com >
---------
Co-authored-by: Claude (Claude Opus 4.5) <noreply@anthropic.com >
2026-02-09 13:17:47 +08:00
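A sketch of the caller-side fix described above, assuming the op is exposed as a Python callable (its arguments are elided here): all four declared outputs must be unpacked, and the in-place-updated stats tensor can be discarded since it aliases the input:

```python
# Before the fix this raised:
#   ValueError: too many values to unpack (expected 3)
# because PD_BUILD_STATIC_OP declares four outputs for the op.

# After: unpack all four. The last output is tokens_per_expert_stats_list,
# updated in place via atomicAdd, so it aliases the tensor passed in and
# the caller can safely ignore it.
scores, topk_values, topk_indices, _ = noaux_tc_redundant(...)
```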
JYChen
9bcd863902
[Others] support import deepgemm/deepep from fleet ops ( #6351 )
...
* update paddleformers to v1.0
* only change import fleetpath
2026-02-09 11:53:13 +08:00
周周周
2b4748de4f
[MTP] refactor MTP pre_process ( #6358 )
2026-02-09 10:47:15 +08:00
K11OntheBoat
116e2aea7a
Support Norm before Rope ( #6332 )
...
Co-authored-by: K11OntheBoat <ruianmaidanglao@163.com >
2026-02-05 15:28:52 +08:00
chen
29a313a402
[Optimization] Support FA2/FA3/FA4 with attn_mask_q ( #6354 )
...
* support FA4 sm100
* flash attn backend support mask
* flash attn backend run flashmask correct
* add test for flash_attn_backend and flash_attn_func
* check
* add test for fa4
* requirements.txt add fa4 whl
* check test on sm100
* fix CI conflict
* add enable_torch_proxy for flash_mask
* lazy import fa4
* check
* fix tests import
* check test_load_mpt import
2026-02-05 14:39:00 +08:00
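The `lazy import fa4` step follows the usual deferred-import pattern: the FA4 wheel is imported only when that backend is actually used, so environments without it can still load the module. A minimal sketch, with `flash_attn_4` as a placeholder module name:

```python
_fa4 = None

def get_fa4():
    # Deferred import: the ImportError for a missing FA4 wheel surfaces
    # only when FA4 is actually requested, not at module import time.
    global _fa4
    if _fa4 is None:
        import flash_attn_4  # placeholder name for the FA4 wheel
        _fa4 = flash_attn_4
    return _fa4
```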
GoldPancake
183b8d325a
[RL] Support GLM MTP RL Model ( #6267 )
2026-02-04 20:14:35 +08:00
fxyfxy777
36547cfdb3
[Feature] FD_USE_PHI_FP8_QUANT ( #6320 )
...
* add ut
* add use_fd_quant env
* rm mask_per_token_quant
* add make ops list
* USE_FD_FP8_QUANT -> FD_USE_PHI_FP8_QUANT (default is true)
* modify comments
* use bool type
* Add function declaration
2026-02-03 22:33:03 -08:00