yzwu
3b9d6c60d3
[Iluvatar] fix CI error and update readme ( #7453 )
2026-04-17 20:42:56 +08:00
AIbin
6ce4854714
[Feature] Support MOE Cutlass backend for latent MOE ( #7428 )
...
* support moe cutlass backend latent moe
2026-04-16 22:11:49 +08:00
ShaneGZhu
2d8338f9e4
[Optimization][DeepSeekV3.2] Reduce slot_mapping compute frequency from twice per layer to a single pre-processing step. ( #7367 )
2026-04-16 19:54:12 +08:00
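The slot_mapping change above hoists a per-layer computation into a one-time pre-processing step: each token's physical KV-cache slot depends only on the block table and the token's position, so it can be computed once per step and shared by every layer. A minimal Python sketch of the idea (all names here are hypothetical, not FastDeploy's actual API):

```python
# Hypothetical sketch of slot-mapping precomputation for a paged KV cache.
# Per the commit, this used to run twice per layer; computing it once in
# pre-processing trades 2*N recomputations for one.

def compute_slot_mapping(block_tables, positions, block_size):
    """Map each (sequence, position) pair to a physical KV-cache slot."""
    slots = []
    for seq_id, pos in positions:
        block = block_tables[seq_id][pos // block_size]
        slots.append(block * block_size + pos % block_size)
    return slots

block_tables = {0: [3, 7], 1: [5]}      # logical block -> physical block
positions = [(0, 0), (0, 9), (1, 2)]    # (sequence id, token position)
slot_mapping = compute_slot_mapping(block_tables, positions, block_size=8)
print(slot_mapping)  # → [24, 57, 42]
```

Every attention layer then receives the precomputed `slot_mapping` instead of rederiving it from the block tables.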
Jiajun Ji
29495b2cf1
[XPU] Unify spec and non-spec branch. ( #6947 ) ( #7180 )
...
* [XPU] cherry-pick PR-6947
* [XPU] use unified_update_model_status.
* refactor xpu_model_runner.
* refactor sampler.
* fix codestyle.
* Fix XPU speculative decoding: rename output tensors to cu_seqlens_q_output/batch_id_per_token_output, correct WRAPPER_CHECK_PTR types, and fix dynamic gather shape in verify_draft_tokens path.
* fix codestyle.
* replace output_padding_offset with is_speculative flag in gather_next_token.
* rename hidden_states.
* unify cu_seqlens_q_output and batch_id_per_token_output init.
---------
Co-authored-by: cmcamdy <1027740945@qq.com>
2026-04-16 14:58:38 +08:00
RuohengMa
de0c5e68fb
[XPU] Split the block_attn operator into smaller operators ( #6798 )
...
* split block_attn
* adapt to latest vllm
* fix unit tests
* delete mtp+cudagraph 4 cards test
* fix vl model
* fix mtp
* fix slot mapping
2026-04-16 14:28:40 +08:00
Bingoo
6b891da02b
[Optimization] enable trtllm_all_reduce fusion kernel in glm model ( #6660 )
...
* enable trtllm_all_reduce fusion kernel in glm model
* fix conflict
* format update
* fix a bug
* modify test
* modify test
* support empty tensor and modify test
* fix test_linear config issues
* modify test name
* add edge test case
* modify format
* fix conflict
* modify default max token num in trtllm_allreduce_fusion
* add max token num branch for trtllm_allreduce_fusion
* fix format
* fix rmsnorm config issue
* modify 2025 to 2026
* using compat guard
* Lazily import flashinfer.comm and fix test config issue
* fix test issues
* add flashinfer cache dir clean machine
* fix some issues
2026-04-16 14:10:19 +08:00
周周周
5e54770b2e
[Feature] Add latent mode support for the MoE layer ( #7382 )
2026-04-15 13:57:07 +08:00
lonelygsh
f7a2418ce2
[Speculate Decoding] Fix reasoning_phase_token_constraint call args in SpeculativeSampler ( #7402 )
2026-04-15 12:45:23 +08:00
chen
616b29ce08
check init_flash_attn_version log ( #7399 )
2026-04-15 11:05:10 +08:00
xiaoxiaohehe001
abba29b348
[BugFix] fix mm rope ( #7274 )
2026-04-14 11:36:08 +08:00
zhupengyang
27b00cf385
[XPU] glm-4.5-air ( #7071 )
2026-04-14 11:31:49 +08:00
周周周
73bd4ab318
[Feature] Add explicit hidden_size parameter support for FusedMoE ( #7361 )
...
[Feature] Add explicit hidden_size parameter support for FusedMoE
2026-04-13 20:24:58 +08:00
liuruyan
b34708604c
[TI-consistent] support quant using pow2scale ( #7308 )
...
* support quant use pow2scale
* fix
* fix
2026-04-13 00:01:53 -07:00
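A power-of-two quantization scale, as in the commit above, keeps scales bit-exact across implementations and turns dequantization multiplies into cheap exponent adjustments, which helps training/inference consistency. Whether the rounding goes up or to nearest is not stated in the log; rounding up is an assumption in this sketch, and all names are illustrative:

```python
import math

def pow2_scale(raw_scale):
    """Round a quantization scale UP to the nearest power of two.

    Assumption: round-up keeps the representable range at least as wide
    as the raw max-abs scale would, so no values clip harder than before.
    """
    return 2.0 ** math.ceil(math.log2(raw_scale))

# Ordinary max-abs int8 scale, then snapped to a power of two.
raw = max(abs(v) for v in [0.9, -1.4, 0.3]) / 127.0
print(pow2_scale(raw))  # → 0.015625  (i.e. 2**-6)
```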
AIbin
6213ad5340
[Docs][BugFix] fix mla log ( #7243 )
...
* [Docs] Fix Chinese punctuation issues
2026-04-13 12:15:43 +08:00
Nyakku Shigure
d659099415
[Cleanup] Replace torch proxy alias with public compat API ( #7348 )
2026-04-13 11:43:26 +08:00
周周周
225fc8d222
use self.hidden_size not use self.fd_config.model_config.hidden_size ( #7340 )
2026-04-11 22:39:43 +08:00
chen
4982aa000e
[RL] moe bf16 ep support paddle batch_gemm ( #7337 )
...
* moe bf16 ep support paddle batch_gemm
2026-04-11 21:51:12 +08:00
JYChen
076ab07528
[RL] change glm rope_emb calculation ( #7316 )
...
* change glm rope_emb calculation
* glm without EnforceFmulRN
* fix ci
2026-04-11 18:36:28 +08:00
K11OntheBoat
870dbac370
Use triton qk_norm both in Prefill and Decode ( #7213 )
...
Co-authored-by: liuruian <liuruian@baidu.com>
2026-04-10 15:44:01 +08:00
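QK-norm applies RMSNorm to the query and key head vectors before attention; the commit above routes both prefill and decode through the same Triton kernel. A plain-Python reference of the underlying math (illustrative only, not the Triton code):

```python
import math

def rms_norm(vec, weight, eps=1e-6):
    """Reference RMSNorm: x / sqrt(mean(x^2) + eps), then elementwise weight."""
    mean_sq = sum(v * v for v in vec) / len(vec)
    inv = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv * w for v, w in zip(vec, weight)]

# The same normalization is applied to q and k head vectors, whether the
# step is a multi-token prefill or a single-token decode.
q = [1.0, 2.0, 3.0, 4.0]
k = [4.0, 3.0, 2.0, 1.0]
gamma = [1.0] * 4
q_n, k_n = rms_norm(q, gamma), rms_norm(k, gamma)
```

After normalization the mean square of each vector is ~1, which stabilizes the q·k dot products regardless of batch shape.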
GoldPancake
c1fb3112f8
[FDConfig] Support CLI args for quantization params and add cudagraph validation ( #7281 )
...
* refactor quant cli param
2026-04-10 14:13:42 +08:00
lizexu123
613f92ee8f
[Feature] support nvfp4 tbo ( #7259 )
2026-04-09 17:29:39 +08:00
fxyfxy777
39ff38aba1
[OP]Unify MoE op with moe_permute path for bf16 GLM ( #7164 )
2026-04-09 16:17:56 +08:00
xiaoxiaohehe001
51efe27d76
[BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn ( #7210 )
...
* [BugFix] fix_flash_mask_attn_sm90
* [BugFix] fix_flash_mask_attn_sm90
* [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn
* [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn
2026-04-09 11:05:10 +08:00
JYChen
43ace7af25
[RL] support moe-topk use topk_reduce_func ( #7218 )
...
* support moe-topk use topk_reduce_func
* fix ep error
* fix ut
* fix ut
2026-04-09 11:01:03 +08:00
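topk_reduce_func in the commit above is a fused routing primitive; functionally, MoE top-k selection amounts to the following sketch (softmax-then-renormalize gating is assumed here, and the names are illustrative, not FastDeploy's API):

```python
import math

def moe_topk_select(logits, top_k, renormalize=True):
    """Pick top_k experts per token; return (expert ids, gate weights)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    ids = ranked[:top_k]
    weights = [probs[i] for i in ids]
    if renormalize:                             # weights over kept experts sum to 1
        s = sum(weights)
        weights = [w / s for w in weights]
    return ids, weights

ids, weights = moe_topk_select([0.1, 2.0, 0.3, 1.5], top_k=2)
print(ids)  # → [1, 3]  (the two largest router logits)
```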
ShaneGZhu
7005404ce3
[DeepSeekV3.2][Graph Optimization] Remove synchronous operation to avoid capture failure and unnecessary contiguous in DSA Backend ( #7253 )
...
* Delete contiguous ops.
* fix scale
* Delete unnecessary comments
* fix style
2026-04-09 11:00:13 +08:00
K11OntheBoat
bb48bcbaa2
Split enable_mm ( #7183 )
...
Co-authored-by: liuruian <liuruian@MacBook-Pro.local>
2026-04-08 11:25:41 +08:00
lizhenyun01
446b26bbc0
[Feature] support blackwell gemm in ht ( #7053 )
...
* [Feature] support blackwell gemm in ht
* [Feature] support ops for convert
* fix cuda error 716
* fix cuda error
* opt memory
* remove unused code
2026-04-07 19:52:51 +08:00
周周周
18f012457d
[OP][Optimization] Remove ENABLE_PREFILL template parameter in multi_query_append_attention_warp1_4_kernel ( #7201 )
2026-04-07 11:21:57 +08:00
Bingoo
2068656a85
[Optimization] merge matmul and add ( #6986 )
...
* merge matmul and add
* modify format
* using paddle.nn.functional.linear
* using _C_ops.linear
* using paddle.nn.functional.linear
* add FLAGS_use_legacy_linear env var in test case
* fix format
* add assert and remove env
* modify format
* using matmul for no bias
* modify accurate baseline
2026-04-03 18:02:03 +08:00
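The bullets in the commit above trace the evolution from a separate matmul plus bias add to a single linear call (and back to a plain matmul when there is no bias). A dependency-free sketch of the equivalence being relied on, with toy helpers standing in for paddle.nn.functional.linear:

```python
def matmul(a, b):
    """Naive matrix multiply for illustration."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def linear(x, w, bias=None):
    """One call covering both paths the PR distinguishes:
    matmul + broadcast bias add when bias is given, plain matmul otherwise."""
    out = matmul(x, w)
    if bias is None:
        return out
    return [[v + b for v, b in zip(row, bias)] for row in out]

x = [[1.0, 2.0], [3.0, 4.0]]
w = [[5.0, 6.0], [7.0, 8.0]]
b = [0.5, -0.5]

# Unfused reference: separate matmul followed by a bias add.
ref = [[v + bb for v, bb in zip(row, b)] for row in matmul(x, w)]
assert linear(x, w, b) == ref
assert linear(x, w) == matmul(x, w)   # no-bias path stays a plain matmul
```

The real win is launching one fused kernel instead of two; the sketch only shows that the outputs are identical, which is what lets the fusion be a drop-in change.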
AIbin
1090f8b123
[Models] support GLM4.7 Flash && Ernie_MLA ( #7139 )
...
* support GLM4.7 Flash && Ernie_MLA
2026-04-03 17:41:33 +08:00
lizexu123
5f612a348d
[BugFix] fix flashinfer-cutedsl moe nvfp4 ( #7120 )
...
* fix nvfp4
* fix
* add document
* fix nvfp4
* support eb5
* support bka
* support eb5
* support xpu
* fix
* fix
* add import cutedsl
* fix
* fix
* fix test
* fix H-series GPUs
* update document
* fix
* update document
* update document
* fix
2026-04-03 15:43:19 +08:00
Bingoo
410988d9ec
[OP] support deepgemm for sm103 ( #7073 )
...
* support deepgemm for sm103
* add assert
* modify code style
* add assert
* modify sm version condition
* remove assert
2026-04-01 21:01:09 +08:00
cmcamdy
7a2e33098f
[XPU] Refactor pre process ( #6993 )
...
* [XPU] support speculate_pre_process
* merge develop
* fix codestyle
* fix mtp, support cu_seqlens_q_output
* fix mtp, support cu_seqlens_q_output
* fix test
---------
Co-authored-by: lizan1999 <lizan03@baidu.com>
2026-04-01 20:29:55 +08:00
sunxin
c29e86fc9d
[Feature] Support mtp overlap schedule ( #7001 )
2026-04-01 14:24:26 +08:00
yzwu
8789329457
[Iluvatar] Support wi4a16 group_gemm ( #7078 )
2026-03-30 19:03:51 +08:00
zhangbo9674
5c60e2fc6f
fix bug in cudagraph ( #7069 )
2026-03-30 14:24:23 +08:00
mpgemm
1a1d048774
[Feature] Support NVFP4 Flashinfer-cutedsl MoE on SM100 ( #6963 )
2026-03-30 11:37:04 +08:00
Longzhi Wang
2eea6fa97a
[BugFix] Fix kv cache int8 dynamic quant on flash and flash_mask backend ( #7028 )
...
* [BugFix] Fix kv cache int8 dynamic quant on flash and flash_mask backend
* add constexpr and code style clean
* add test
* fix code style
* fix test
2026-03-30 11:17:15 +08:00
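Dynamic int8 KV-cache quantization, as in the fix above, derives the scale from each tensor's runtime max rather than from a calibrated constant. A minimal per-vector sketch (illustrative; the real kernels work per-head/per-token inside the attention backends):

```python
def quant_int8_dynamic(values):
    """Per-vector dynamic int8 quantization: scale derived at runtime."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0   # avoid div-by-zero on all-zeros
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequant(q, scale):
    return [v * scale for v in q]

kv = [0.5, -1.27, 0.02, 1.0]
q, scale = quant_int8_dynamic(kv)
print(q)  # → [50, -127, 2, 100]
restored = dequant(q, scale)
```

Because the scale tracks the runtime max, the round-trip error stays within half a quantization step of each original value.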
mpgemm
7a20eaebe8
[Feature] Support cute cpp Encoder FA4 ( #7016 )
...
* add cute cpp fa4
* remove comments
* fix merge error
* move sm_version inside the function
* fix CI error
2026-03-30 10:54:56 +08:00
huicongyao
25d64efdc4
[Speculative Decoding] Refactor Eagle MTP hidden states copy ( #6812 )
...
* reformat eagle_get_hidden_states & eagle_get_self_hidden_states
* readability
* fix xpu bug
* fix coverage failure
* change launch params & parallelize position_map compute
* Fix MTP-related bugs in FastDeploy centralized inference
* fix
* refactor mtp hidden_states process
* fix
* add unittest & optimize kernel
* remove useless code
* fix
2026-03-25 22:54:31 -07:00
SUN Dong
6cff780fdb
[RL] Support moe_topk_select using Paddle native operators; add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization and a swiglu-fp8-quant op for DeepGemmFusedMoE training alignment ( #6850 )
...
* [RL] Add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization
* update
* update
* update
* support custom topk in DeepGemmFusedMoeMethod apply_tp
* apply_ep_prefill support moe_topk_select
* update
* add ut
* add ut
* add ut
* modify doc
* fix env and docs
* add ut
---------
Co-authored-by: zhanghonggeng <zhanghonggeng@baidu.com>
2026-03-24 11:12:39 +08:00
Nyakku Shigure
8b6bbb3504
[Optimization] Use a separate driver when using Triton with Paddle ( #6897 )
...
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-03-24 10:56:00 +08:00
freeliuzc
e87ce4b8cd
[Speculative Decoding] refactor MTP and optimize spec-decoding postprocess ( #6973 )
...
* support new mtp
* refactor(speculate_decoding and mtp): optimize mtp structure logic. Update spec-branch status-process
* fix cuda-graph for spec-decoding
* fix xpu mtp and fix some note
* fix unittest and optimize note
* fix model status update in eos-branch
2026-03-24 10:19:01 +08:00
周周周
5416da8c6e
remove assert ( #6970 )
...
Co-authored-by: liuruian <liuruian@baidu.com>
2026-03-23 14:22:03 +08:00
周周周
1c38da2118
Make seq_lens_this_time/decoder/encoder equal shape ( #6942 )
2026-03-20 15:31:52 +08:00
sunxin
d77edf8fc9
opt wfp8afp8 triton moe ( #6938 )
2026-03-20 11:07:25 +08:00
周周周
b1c800b64b
remove load_up_proj_weight_first ( #6932 )
2026-03-19 17:21:34 +08:00
sunxin
33e01f22a8
[Feature][Sampling] Extend top-k/top-p sampling to all backends and unify greedy decoding with top_k=1 ( #6894 )
...
* update sampling
* fix
* fix
* fix mtp
* fix test
2026-03-19 01:43:10 -07:00
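Unifying greedy decoding with top_k=1, as in the commit above, works because restricting sampling to the single highest-probability token is exactly argmax. A small sketch of top-k/top-p sampling showing the degenerate case (a hypothetical helper, not the FastDeploy kernel):

```python
import random

def top_k_top_p_sample(probs, top_k, top_p, rng=random):
    """Sample from probs restricted to the top_k tokens and nucleus mass top_p.

    With top_k=1 only the argmax token survives the cutoff, so the draw is
    deterministic and the greedy path needs no separate implementation.
    """
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, mass = [], 0.0
    for i in ranked[:top_k]:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:        # nucleus cutoff: stop once top_p mass is kept
            break
    r = rng.random() * mass       # draw within the kept probability mass
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]

# top_k=1 always returns the argmax, regardless of the random draw:
probs = [0.1, 0.6, 0.3]
assert all(top_k_top_p_sample(probs, top_k=1, top_p=1.0) == 1 for _ in range(20))
```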
JYChen
f95d8ca7df
[RL] support qkrmsnorm use proxy-norm ( #6862 )
...
* support qkrmsnorm use paddle.nn.functional.rms_norm
* remove flags in fd
2026-03-18 23:27:26 -07:00
周周周
1a05744c4e
nvfp4.py support ep ( #6920 )
2026-03-19 14:07:46 +08:00