Commit Graph

738 Commits

Author SHA1 Message Date
Echo-Nie 8819a039c9 [Others] Fix typo (#7280)
* typo

* typo

* typo

* typo
2026-04-14 17:28:22 +08:00
xiaoxiaohehe001 abba29b348 [BugFix] fix mm rope (#7274) 2026-04-14 11:36:08 +08:00
zhupengyang 27b00cf385 [XPU] glm-4.5-air (#7071) 2026-04-14 11:31:49 +08:00
周周周 73bd4ab318 [Feature] Add explicit hidden_size parameter support to FusedMoE (#7361)
[Feature] Add explicit hidden_size parameter support to FusedMoE
2026-04-13 20:24:58 +08:00
freeliuzc 31e2a8bbad [Speculative Decoding] Support mtp super ultra overlap in pd-split mode with insert_task overlap (#7323)
* support mtp overlap in pd-split mode with insert_task overlap
2026-04-13 19:41:17 +08:00
AIbin 1fb8194191 [OP][Models][Optimization] Optimize RoPE CUDA kernel and update DeepSeek V3 config (#7359)
* dsk del prefill mask

* dsk support 1M+ seq_len rope

* update rope tests

* Replace max_position_embeddings with max_model_len

* 1D grid: gridDim.x has a maximum size of 2^31-1, far exceeding the actual number of tokens.
2026-04-13 19:12:36 +08:00
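The 1D-grid remark in the commit above can be sketched in a few lines (a hypothetical helper, not the actual FastDeploy launch code): `gridDim.x` alone allows up to 2^31-1 blocks, so a one-dimensional launch comfortably covers any realistic token count.

```python
# Illustrative sketch of the 1D-grid sizing argument; helper name is invented.
MAX_GRID_DIM_X = 2**31 - 1  # CUDA limit for gridDim.x

def launch_config_1d(num_tokens: int, block_size: int = 256):
    # One block per `block_size` tokens along a single grid dimension.
    grid_x = (num_tokens + block_size - 1) // block_size
    assert grid_x <= MAX_GRID_DIM_X, "token count exceeds 1D grid capacity"
    return grid_x, block_size
```

Even a 1M-token sequence needs only about 4K blocks, far below the 2^31-1 ceiling.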
liuruyan b34708604c [TI-consistent] support quant use pow2scale (#7308)
* support quant use pow2scale

* fix

* fix
2026-04-13 00:01:53 -07:00
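The pow2scale idea referenced above amounts to rounding a quantization scale to a power of two, so multiplying or dividing by the scale only adjusts the float exponent, which helps training/inference consistency. A minimal sketch under that assumption (function name and qmax value are illustrative, not the kernel's API):

```python
import math
import numpy as np

def pow2_quant_scale(x: np.ndarray, qmax: float = 448.0) -> float:
    # Round the dequant scale up to the nearest power of two so that
    # scaling by it is an exact exponent shift.
    amax = float(np.max(np.abs(x)))
    if amax == 0.0:
        return 1.0
    return 2.0 ** math.ceil(math.log2(amax / qmax))
```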
AIbin 6213ad5340 [Docs][BugFix] fix mla log (#7243)
* [Docs] Fix Chinese punctuation issues
2026-04-13 12:15:43 +08:00
Nyako Shigure d659099415 [Cleanup] Replace torch proxy alias with public compat API (#7348) 2026-04-13 11:43:26 +08:00
Jiajun Ji cb03958b52 [XPU] Refactor get_padding_offset to single kernel. (#7029)
* [XPU] Refactor get_padding_offset to single kernel.

* add unittest.

* fix codestyle.

* remove cum_offsets_now.

* remove max_len.
2026-04-13 11:04:50 +08:00
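For context on what a get_padding_offset-style kernel computes, here is a NumPy reference under an assumed semantics (how many padding slots precede each valid token in a padded [bs, max_len] layout); the actual XPU kernel's contract may differ:

```python
import numpy as np

def padding_offset_ref(seq_lens, max_len):
    # Assumed semantics: cum_offsets[b] counts the padding slots before
    # batch b's tokens; padding_offset repeats that value per valid token.
    seq_lens = np.asarray(seq_lens)
    pads = max_len - seq_lens
    cum_offsets = np.concatenate(([0], np.cumsum(pads)[:-1]))
    padding_offset = np.repeat(cum_offsets, seq_lens)
    return padding_offset, cum_offsets
```

Fusing this into a single kernel, as the PR does, avoids materializing intermediates like the removed `cum_offsets_now`.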
周周周 225fc8d222 use self.hidden_size not use self.fd_config.model_config.hidden_size (#7340) 2026-04-11 22:39:43 +08:00
chen 4982aa000e [RL]moe bf16 ep support paddle batch_gemm (#7337)
* moe bf16 ep support paddle batch_gemm
2026-04-11 21:51:12 +08:00
AIbin ba01d7a823 [Optimization] [OP] [Models] dsk del prefill mask (#7313)
* dsk del prefill mask

* dsk support 1M+ seq_len rope

* update rope tests
2026-04-11 19:32:27 +08:00
JYChen 076ab07528 [RL] change glm rope_emb calculation (#7316)
* change glm rope_emb calculation

* glm without EnforceFmulRN

* fix ci
2026-04-11 18:36:28 +08:00
zhangbo9674 627f0d9cc8 [RL] change rms norm for glm (#7269)
* change rms norm for glm

* refine code

* refine code

* refine code
2026-04-10 01:02:37 -07:00
K11OntheBoat 870dbac370 Use triton qk_norm both in Prefill and Decode (#7213)
Co-authored-by: "liuruian" <liuruian@baidu.com>
2026-04-10 15:44:01 +08:00
bukejiyu 14d46181b8 [Loader] add multi-thread model loading (#6877)
* multi-thread-loader

* fix ut
2026-04-09 23:40:15 -07:00
GoldPancake c1fb3112f8 [FDConfig] Support CLI args for quantization params and add cudagraph validation (#7281)
* refactor quant cli param
2026-04-10 14:13:42 +08:00
lizexu123 613f92ee8f [Feature] support nvfp4 tbo (#7259) 2026-04-09 17:29:39 +08:00
fxyfxy777 39ff38aba1 [OP]Unify MoE op with moe_permute path for bf16 GLM (#7164) 2026-04-09 16:17:56 +08:00
xiaoxiaohehe001 51efe27d76 [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn (#7210)
* [BugFix] fix_flash_mask_attn_sm90

* [BugFix] fix_flash_mask_attn_sm90

* [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn

* [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn
2026-04-09 11:05:10 +08:00
JYChen 43ace7af25 [RL] support moe-topk use topk_reduce_func (#7218)
* support moe-topk use topk_reduce_func

* fix ep error

* fix ut

* fix ut
2026-04-09 11:01:03 +08:00
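The moe-topk selection being routed through a reduce function above boils down to per-token top-k expert routing. A plain NumPy sketch of that selection (illustrative only, not the fused op):

```python
import numpy as np

def moe_topk(gate_logits: np.ndarray, k: int):
    # Pick the k highest-scoring experts per token, then softmax the
    # selected scores to get normalized routing weights.
    idx = np.argpartition(-gate_logits, k - 1, axis=-1)[..., :k]
    vals = np.take_along_axis(gate_logits, idx, axis=-1)
    e = np.exp(vals - vals.max(axis=-1, keepdims=True))
    return idx, e / e.sum(axis=-1, keepdims=True)
```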
ShaneGZhu 7005404ce3 [DeepSeekV3.2][Graph Optimization]Remove synchronous operation to avoid capture fail and unnecessary contiguous in DSA Backend (#7253)
* Delete contiguous ops.

* fix scale

* Delete unnecessary comments

* fix style
2026-04-09 11:00:13 +08:00
AIbin 48d2bbeb74 fix dsa (#7252) 2026-04-08 20:21:38 +08:00
K11OntheBoat bb48bcbaa2 Split enable_mm (#7183)
Co-authored-by: liuruian <liuruian@MacBook-Pro.local>
2026-04-08 11:25:41 +08:00
lizhenyun01 446b26bbc0 [Feature] support blackwell gemm in ht (#7053)
* [Feature] support blackwell gemm in ht

* [Feature] support ops for convert

* fix cuda error 716

* fix cuda error

* opt memory

* remove unused code
2026-04-07 19:52:51 +08:00
sunxin ae2f9f4d22 [BugFix] Enable moe_gate_fp32 using FD_ENABLE_RL (#7130)
* rl gate fp32

* clean
2026-04-06 21:07:38 -07:00
周周周 18f012457d [OP][Optimization] Remove ENABLE_PREFILL template parameter in multi_query_append_attention_warp1_4_kernel (#7201) 2026-04-07 11:21:57 +08:00
Bingoo 2068656a85 [Optimization] merge matmul and add (#6986)
* merge matmul and add

* modify format

* using paddle.nn.functional.linear

* using _C_ops.linear

* using paddle.nn.functional.linear

* add FLAGS_use_legacy_linear env var in test case

* fix format

* add assert and remove env

* modify format

* using matmul for no bias

* modify accurate baseline
2026-04-03 18:02:03 +08:00
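The matmul+add merge in that PR rests on the identity linear(x, W, b) = x @ W + b, so one fused linear call replaces two kernel launches (Paddle's `paddle.nn.functional.linear` takes the weight in this [in_features, out_features] layout). A NumPy sketch of the equivalence, with invented helper names standing in for the two paths:

```python
import numpy as np

def matmul_then_add(x, w, b):
    # The unfused two-kernel path the PR replaces: matmul, then a
    # broadcasted bias add.
    return np.matmul(x, w) + b

def linear_ref(x, w, b):
    # Reference for the fused call: numerically identical single expression.
    return x @ w + b
```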
AIbin 1090f8b123 [Models]support GLM4.7 Flash && Ernie_MLA (#7139)
* support GLM4.7 Flash && Ernie_MLA
2026-04-03 17:41:33 +08:00
lizexu123 5f612a348d [BugFix] fix flashinfer-cutedsl moe nvfp4 (#7120)
* fix nvfp4

* fix

* add document

* fix nvfp4

* support eb5

* support bka

* support eb5

* support xpu

* fix

* fix

* add import cutedsl

* fix

* fix

* fix test

* fix H-series GPUs

* update document

* fix

* update document

* update document

* fix
2026-04-03 15:43:19 +08:00
fxyfxy777 9f3b3ce7f5 [Optimization] merge_allreduce (#7039) 2026-04-02 19:52:13 +08:00
Bingoo 410988d9ec [OP] support deepgemm for sm103 (#7073)
* support deepgemm for sm103

* add assert

* modify code style

* add assert

* modify sm version condition

* remove assert
2026-04-01 21:01:09 +08:00
cmcamdy 7a2e33098f [XPU] Refactor pre process (#6993)
* [XPU] support speculate_pre_process

* merge develop

* fix codestype

* fix mtp, support cu_seqlens_q_output

* fix mtp, support cu_seqlens_q_output

* fix test

---------

Co-authored-by: lizan1999 <lizan03@baidu.com>
2026-04-01 20:29:55 +08:00
yzwu ceaf5df350 [Iluvatar] Fix cuda graph error for tp > 1 in ernie models (#7126) 2026-04-01 19:13:34 +08:00
sunxin c29e86fc9d [Feature] Support mtp overlap schedule (#7001) 2026-04-01 14:24:26 +08:00
YilongGuo dd61e7e421 [Qwen3VL] Add clear_grpah_opt_backend method to Qwen3VLForConditionalGeneration (#7086)
Add clear_grpah_opt_backend method that delegates to the underlying model
to clear cuda graph optimization backend.

Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com>
2026-03-31 13:48:25 +08:00
yzwu 8789329457 [Iluvatar] Support wi4a16 group_gemm (#7078) 2026-03-30 19:03:51 +08:00
zhangbo9674 5c60e2fc6f fix bug in cudagraph (#7069) 2026-03-30 14:24:23 +08:00
mpgemm 1a1d048774 [Feature] Support NVFP4 Flashinfer-cutedsl MoE on SM100 (#6963) 2026-03-30 11:37:04 +08:00
Longzhi Wang 2eea6fa97a [BugFix] Fix kv cache int8 dynamic quant on flash and flash_mask backend (#7028)
* [BugFix] Fix kv cache int8 dynamic quant on flash and flash_mask backend

* add constexpr and code style clean

* add test

* fix code style

* fix test
2026-03-30 11:17:15 +08:00
mpgemm 7a20eaebe8 [Feature] Support cute cpp Encoder FA4 (#7016)
* add cute cpp fa4

* remove comments

* fix merge errors

* move sm_version into the function

* fix CI errors
2026-03-30 10:54:56 +08:00
huicongyao 25d64efdc4 [Speculative Decoding] Refactor Eagle MTP hidden states copy (#6812)
* reformat eagle_get_hidden_states & eagle_get_self_hidden_states

* readability

* fix xpu bug

* fix coverage failure

* change launch params & parallelize position_map compute

* Fix MTP-related bugs in FastDeploy centralized inference

* fix

* refactor mtp hidden_states process

* fix

* add unittest & optimize kernel

* remove useless code

* fix
2026-03-25 22:54:31 -07:00
freeliuzc 7a6c28781b [Speculative Decoding] Optimize attn_mask_offset and fix mtp bug (#7005)
* optimize attn_mask_offset and optimize mtp usage

* delete useless branch

* fix kernel format

* fix kernel runner
2026-03-25 01:52:06 -07:00
SUN Dong 6cff780fdb [RL] Support moe_topk_select using Paddle native operators and Add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization and swiglu-fp8-quant op for DeepGemmFusedMoE for training alignment (#6850)
* [RL] Add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization

* update

* update

* update

* support custom topk in DeepGemmFusedMoeMethod apply_tp

* apply_ep_prefill support moe_topk_select

* update

* add ut

* add ut

* add ut

* modify doc

* fix env and docs

* add ut

---------

Co-authored-by: zhanghonggeng <zhanghonggeng@baidu.com>
2026-03-24 11:12:39 +08:00
Nyakku Shigure 8b6bbb3504 [Optimization] Use a separate driver when using Triton with Paddle (#6897)
---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-03-24 10:56:00 +08:00
freeliuzc e87ce4b8cd [Speculative Decoding] refactor MTP and optimize spec-decoding postprocess (#6973)
* support new mtp

* refactor(speculate_decoding and mtp): optimize mtp structure logic. Update spec-branch status-process

* fix cuda-graph for spec-decoding

* fix xpu mtp and fix some note

* fix unittest and optimize note

* fix model status update in eos-branch
2026-03-24 10:19:01 +08:00
周周周 5416da8c6e remove assert (#6970)
Co-authored-by: "liuruian" <liuruian@baidu.com>
2026-03-23 14:22:03 +08:00
jackyYang6 634d23a38a [Bugfix] Align thinking_budget behavior with ERNIE reasoning flow (#6934)
* [Bugfix] Align thinking_budget behavior with ERNIE reasoning flow

* [Docs] Fix thinking_budget markdown formatting

* [Test] Align ernie thinking budget test with process_request_dict
2026-03-23 14:15:55 +08:00
jackyYang6 00eb12f656 [BugFix][Models] Unify PaddleFormers fused QKV TP loading and stabilize fallback TP path (#6555)
* [BugFix][Models] avoid custom all-reduce in PaddleFormers fallback TP path and tighten TP-aware layout matching

* [BugFix][Models] unify PaddleFormers fused QKV TP loading and align fallback tests
2026-03-20 16:37:58 +08:00