Echo-Nie
8819a039c9
[Others] Fix typo ( #7280 )
...
* typo
* typo
* typo
* typo
2026-04-14 17:28:22 +08:00
xiaoxiaohehe001
abba29b348
[BugFix] fix mm rope ( #7274 )
2026-04-14 11:36:08 +08:00
zhupengyang
27b00cf385
[XPU] glm-4.5-air ( #7071 )
2026-04-14 11:31:49 +08:00
周周周
73bd4ab318
[Feature] Add explicit hidden_size parameter support to FusedMoE ( #7361 )
...
[Feature] Add explicit hidden_size parameter support to FusedMoE
2026-04-13 20:24:58 +08:00
freeliuzc
31e2a8bbad
[Speculative Decoding] Support mtp super ultra overlap in pd-split mode with insert_task overlap ( #7323 )
...
* support mtp overlap in pd-split mode with insert_task overlap
2026-04-13 19:41:17 +08:00
AIbin
1fb8194191
[OP][Models][Optimization] Optimize RoPE CUDA kernel and update DeepSeek V3 config ( #7359 )
...
* dsk del prefill mask
* dsk support 1M+ seq_len rope
* update rope tests
* Replace max_position_embeddings with max_model_len
* 1D grid: gridDim.x has a maximum size of 2^31-1, far exceeding the actual number of tokens.
2026-04-13 19:12:36 +08:00
liuruyan
b34708604c
[TI-consistent] support quant use pow2scale ( #7308 )
...
* support quant use pow2scale
* fix
* fix
2026-04-13 00:01:53 -07:00
AIbin
6213ad5340
[Docs][BugFix] fix mla log ( #7243 )
...
* [Docs] Fix Chinese punctuation issues
2026-04-13 12:15:43 +08:00
Nyako Shigure
d659099415
[Cleanup] Replace torch proxy alias with public compat API ( #7348 )
2026-04-13 11:43:26 +08:00
Jiajun Ji
cb03958b52
[XPU] Refactor get_padding_offset to single kernel. ( #7029 )
...
* [XPU] Refactor get_padding_offset to single kernel.
* add unittest.
* fix codestyle.
* remove cum_offsets_now.
* remove max_len.
2026-04-13 11:04:50 +08:00
周周周
225fc8d222
use self.hidden_size instead of self.fd_config.model_config.hidden_size ( #7340 )
2026-04-11 22:39:43 +08:00
chen
4982aa000e
[RL]moe bf16 ep support paddle batch_gemm ( #7337 )
...
* moe bf16 ep support paddle batch_gemm
2026-04-11 21:51:12 +08:00
AIbin
ba01d7a823
[Optimization] [OP] [Models] dsk del prefill mask ( #7313 )
...
* dsk del prefill mask
* dsk support 1M+ seq_len rope
* update rope tests
2026-04-11 19:32:27 +08:00
JYChen
076ab07528
[RL] change glm rope_emb calculation ( #7316 )
...
* change glm rope_emb calculation
* glm without EnforceFmulRN
* fix ci
2026-04-11 18:36:28 +08:00
zhangbo9674
627f0d9cc8
[RL] change rms norm for glm ( #7269 )
...
* change rms norm for glm
* refine code
* refine code
* refine code
2026-04-10 01:02:37 -07:00
K11OntheBoat
870dbac370
Use triton qk_norm both in Prefill and Decode ( #7213 )
...
Co-authored-by: “liuruian” <liuruian@baidu.com >
2026-04-10 15:44:01 +08:00
bukejiyu
14d46181b8
[Loader] add multi-thread model loading ( #6877 )
...
* multi-thread-loader
* fix ut
2026-04-09 23:40:15 -07:00
GoldPancake
c1fb3112f8
[FDConfig] Support CLI args for quantization params and add cudagraph validation ( #7281 )
...
* refactor quant cli param
2026-04-10 14:13:42 +08:00
lizexu123
613f92ee8f
[Feature] support nvfp4 tbo ( #7259 )
2026-04-09 17:29:39 +08:00
fxyfxy777
39ff38aba1
[OP]Unify MoE op with moe_permute path for bf16 GLM ( #7164 )
2026-04-09 16:17:56 +08:00
xiaoxiaohehe001
51efe27d76
[BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn ( #7210 )
...
* [BugFix] fix_flash_mask_attn_sm90
* [BugFix] fix_flash_mask_attn_sm90
* [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn
* [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn
2026-04-09 11:05:10 +08:00
JYChen
43ace7af25
[RL] support moe-topk use topk_reduce_func ( #7218 )
...
* support moe-topk use topk_reduce_func
* fix ep error
* fix ut
* fix ut
2026-04-09 11:01:03 +08:00
ShaneGZhu
7005404ce3
[DeepSeekV3.2][Graph Optimization]Remove synchronous operation to avoid capture fail and unnecessary contiguous in DSA Backend ( #7253 )
...
* Delete contiguous ops.
* fix scale
* Delete unnecessary comments
* fix style
2026-04-09 11:00:13 +08:00
AIbin
48d2bbeb74
fix dsa ( #7252 )
2026-04-08 20:21:38 +08:00
K11OntheBoat
bb48bcbaa2
Split enable_mm ( #7183 )
...
Co-authored-by: liuruian <liuruian@MacBook-Pro.local >
2026-04-08 11:25:41 +08:00
lizhenyun01
446b26bbc0
[Feature] support blackwell gemm in ht ( #7053 )
...
* [Feature] support blackwell gemm in ht
* [Feature] support ops for convert
* fix cuda error 716
* fix cuda error
* opt memory
* remove unused code
2026-04-07 19:52:51 +08:00
sunxin
ae2f9f4d22
[BugFix] Enable moe_gate_fp32 using FD_ENABLE_RL ( #7130 )
...
* rl gate fp32
* clean
2026-04-06 21:07:38 -07:00
周周周
18f012457d
[OP][Optimization] Remove ENABLE_PREFILL template parameter in multi_query_append_attention_warp1_4_kernel ( #7201 )
2026-04-07 11:21:57 +08:00
Bingoo
2068656a85
[Optimization] merge matmul and add ( #6986 )
...
* merge matmul and add
* modify format
* using paddle.nn.functional.linear
* using _C_ops.linear
* using paddle.nn.functional.linear
* add FLAGS_use_legacy_linear env var in test case
* fix format
* add assert and remove env
* modify format
* using matmul for no bias
* modify accurate baseline
2026-04-03 18:02:03 +08:00
AIbin
1090f8b123
[Models]support GLM4.7 Flash && Ernie_MLA ( #7139 )
...
* support GLM4.7 Flash && Ernie_MLA
2026-04-03 17:41:33 +08:00
lizexu123
5f612a348d
[BugFix] fix flashinfer-cutedsl moe nvfp4 ( #7120 )
...
* fix nvfp4
* fix
* add document
* fix nvfp4
* support eb5
* support bka
* support eb5
* support xpu
* fix
* fix
* add import cutedsl
* fix
* fix
* fix test
* fix for H-series GPUs
* update document
* fix
* update document
* update document
* fix
2026-04-03 15:43:19 +08:00
fxyfxy777
9f3b3ce7f5
[Optimization] merge_allreduce ( #7039 )
2026-04-02 19:52:13 +08:00
Bingoo
410988d9ec
[OP] support DeepGEMM for sm103 ( #7073 )
...
* support DeepGEMM for sm103
* add assert
* modify code style
* add assert
* modify sm version condition
* remove assert
2026-04-01 21:01:09 +08:00
cmcamdy
7a2e33098f
[XPU] Refactor pre process ( #6993 )
...
* [XPU] support speculate_pre_process
* merge develop
* fix codestyle
* fix mtp, support cu_seqlens_q_output
* fix mtp, support cu_seqlens_q_output
* fix test
---------
Co-authored-by: lizan1999 <lizan03@baidu.com >
2026-04-01 20:29:55 +08:00
yzwu
ceaf5df350
[Iluvatar] Fix cuda graph error for tp > 1 in ernie models ( #7126 )
2026-04-01 19:13:34 +08:00
sunxin
c29e86fc9d
[Feature] Support mtp overlap schedule ( #7001 )
2026-04-01 14:24:26 +08:00
YilongGuo
dd61e7e421
[Qwen3VL] Add clear_grpah_opt_backend method to Qwen3VLForConditionalGeneration ( #7086 )
...
Add clear_grpah_opt_backend method that delegates to the underlying model
to clear cuda graph optimization backend.
Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com >
2026-03-31 13:48:25 +08:00
yzwu
8789329457
[Iluvatar] Support wi4a16 group_gemm ( #7078 )
2026-03-30 19:03:51 +08:00
zhangbo9674
5c60e2fc6f
fix bug in cudagraph ( #7069 )
2026-03-30 14:24:23 +08:00
mpgemm
1a1d048774
[Feature] Support NVFP4 Flashinfer-cutedsl MoE on SM100 ( #6963 )
2026-03-30 11:37:04 +08:00
Longzhi Wang
2eea6fa97a
[BugFix] Fix kv cache int8 dynamic quant on flash and flash_mask backend ( #7028 )
...
* [BugFix] Fix kv cache int8 dynamic quant on flash and flash_mask backend
* add constexpr and code style clean
* add test
* fix code style
* fix test
2026-03-30 11:17:15 +08:00
mpgemm
7a20eaebe8
[Feature] Support cute cpp Encoder FA4 ( #7016 )
...
* add cute cpp fa4
* remove comments
* fix merge error
* move sm_version inside the function
* fix CI error
2026-03-30 10:54:56 +08:00
huicongyao
25d64efdc4
[Speculative Decoding] Refactor Eagle MTP hidden states copy ( #6812 )
...
* reformat eagle_get_hidden_states & eagle_get_self_hidden_states
* readability
* fix xpu bug
* fix coverage failure
* change launch params & parallelize position_map compute
* Fix MTP-related bugs in FastDeploy centralized inference
* fix
* refactor mtp hidden_states process
* fix
* add unittest & optimize kernel
* remove useless code
* fix
2026-03-25 22:54:31 -07:00
freeliuzc
7a6c28781b
[Speculative Decoding] Optimize attn_mask_offset and fix mtp bug ( #7005 )
...
* optimize attn_mask_offset and optimize mtp usage
* delete useless branch
* fix kernel format
* fix kernel runner
2026-03-25 01:52:06 -07:00
SUN Dong
6cff780fdb
[RL] Support moe_topk_select using Paddle native operators and Add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization and swiglu-fp8-quant op for DeepGemmFusedMoE for training alignment ( #6850 )
...
* [RL] Add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization
* update
* update
* update
* support custom topk in DeepGemmFusedMoeMethod apply_tp
* apply_ep_prefill support moe_topk_select
* update
* add ut
* add ut
* add ut
* modify doc
* fix env and docs
* add ut
---------
Co-authored-by: zhanghonggeng <zhanghonggeng@baidu.com >
2026-03-24 11:12:39 +08:00
Nyakku Shigure
8b6bbb3504
[Optimization] Use a separate driver when using Triton with Paddle ( #6897 )
...
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2026-03-24 10:56:00 +08:00
freeliuzc
e87ce4b8cd
[Speculative Decoding] refactor MTP and optimize spec-decoding postprocess ( #6973 )
...
* support new mtp
* refactor(speculate_decoding and mtp): optimize mtp structure logic. Update spec-branch status-process
* fix cuda-graph for spec-decoding
* fix xpu mtp and fix some note
* fix unittest and optimize note
* fix model status update in eos-branch
2026-03-24 10:19:01 +08:00
周周周
5416da8c6e
remove assert ( #6970 )
...
Co-authored-by: “liuruian” <liuruian@baidu.com >
2026-03-23 14:22:03 +08:00
jackyYang6
634d23a38a
[Bugfix] Align thinking_budget behavior with ERNIE reasoning flow ( #6934 )
...
* [Bugfix] Align thinking_budget behavior with ERNIE reasoning flow
* [Docs] Fix thinking_budget markdown formatting
* [Test] Align ernie thinking budget test with process_request_dict
2026-03-23 14:15:55 +08:00
jackyYang6
00eb12f656
[BugFix][Models] Unify PaddleFormers fused QKV TP loading and stabilize fallback TP path ( #6555 )
...
* [BugFix][Models] avoid custom all-reduce in PaddleFormers fallback TP path and tighten TP-aware layout matching
* [BugFix][Models] unify PaddleFormers fused QKV TP loading and align fallback tests
2026-03-20 16:37:58 +08:00