Longzhi Wang
2eea6fa97a
[BugFix] Fix kv cache int8 dynamic quant on flash and flash_mask backend ( #7028 )
...
* [BugFix] Fix kv cache int8 dynamic quant on flash and flash_mask backend
* add constexpr and code style clean
* add test
* fix code style
* fix test
2026-03-30 11:17:15 +08:00
mpgemm
7a20eaebe8
[Feature] Support cute cpp Encoder FA4 ( #7016 )
...
* add cute cpp fa4
* 删掉注释
* 修正合并错误
* sm_version放到函数内
* ci错误
2026-03-30 10:54:56 +08:00
Nyakku Shigure
8b6bbb3504
[Optimization] Use a separate driver when using Triton with Paddle ( #6897 )
...
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2026-03-24 10:56:00 +08:00
周周周
5416da8c6e
remove assert ( #6970 )
...
Co-authored-by: “liuruian” <liuruian@baidu.com >
2026-03-23 14:22:03 +08:00
周周周
1c38da2118
Make seq_lens_this_time/decoder/encoder equal shape ( #6942 )
2026-03-20 15:31:52 +08:00
yzwu
8b890c0d72
[Iluvatar] refactor attn and moe code ( #6887 )
2026-03-18 10:31:00 +08:00
gongweibao
a6351dea0b
[BugFix][Optimization] Replace silent failures with catchable exceptions and informative error messages ( #6533 )
...
* init
* init
* fix format
* add
* add files
* add ut
* fix some
* add ut
* add more
* add
* fix pre-commit
* fix pre-commit
* fix cover
* skip long seq
* add
* add
* fix
* remove not need
* fix set attr
* fix comments
* fix comments
* fix failed tests
---------
Co-authored-by: gongweibao <gognweibao@baidu.com >
2026-03-16 21:32:43 +08:00
ming1753
bb925c605f
[Other] Adjust GPUModelRunner to enhance compatibility ( #6851 )
2026-03-16 14:49:19 +08:00
gongweibao
3fabba0dc7
[Feature] Add Triton unified attention kernel for deterministic inference ( #6795 )
...
* [Feature] Add Triton unified attention kernel for deterministic inference
Add a Triton-based unified extend attention kernel that processes both
prefix (cached) and extend (new) KV tokens through a single kernel with
unified kv_indices, ensuring identical accumulation order regardless of
cache hit/miss patterns.
Key components:
- _fwd_kernel_unified: Triton JIT kernel with online softmax, paged KV
cache support, and causal masking for prefix+extend
- Index building utilities: triton_cumsum_with_zero_prefix,
build_kv_indices_from_block_tables, build_unified_kv_indices,
_scatter_extend_kv_indices_kernel (all CUDA Graph compatible)
- pre_cache_len_concat_triton: GPU-only replacement for C++ op
- Reference implementations (_ref variants) for correctness validation
- Comprehensive tests: kernel correctness, split invariance,
determinism, production-scale, cross-validation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com >
* Vectorize causal mask in test references for ~26x speedup
Replace triple Python for-loop with paddle.where vectorized mask in
naive_attention and _build_causal_mask. seq4096 test: 2m39s -> 6s.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com >
* fix cover
---------
Co-authored-by: gongweibao <gognweibao@baidu.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-16 14:29:45 +08:00
周周周
091e3c815d
Dsa clean code,add dsk_attn_write_cache baseline ( #6855 )
2026-03-16 11:01:14 +08:00
周周周
820eb60ec6
[Others] clean code ( #6839 )
...
Co-authored-by: “liuruian” <liuruian@baidu.com >
2026-03-14 11:09:28 +08:00
周周周
8c1a2827d3
DSA clean code ( #6827 )
2026-03-13 16:39:47 +08:00
RichardWooSJTU
9f0778f991
[Feature] Support EP prefill with num_worst_tokens ( #6574 )
...
* support num worst tokens
* support num worst tokens
* fix build error
* support num worst tokens: fix errors
* support num worst tokens: fix feild
* support num worst tokens: delete requiements
* replace permute and depermute op by pure cuda
* replace permute and depermute op by pure cuda
* fix ci
* fix op
* fix nan
* fix code style
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2026-03-11 17:09:07 +08:00
freeliuzc
cf7934a4b2
[Speculative Decoding] Unify Spec and non-spec branch ( #6685 )
...
* optimize spec-inference architecture
* delete debug log
* optimize spec_method usage && fix unit_test
* add claude unit-test skill
* fix some ugly bug
* enhance robustness and bounds check
* unify method & spec_method to method to avoid bug
* activate CI
* fix unit test
* Unify logprobs computation for naive and speculative decoding, fix CUDA kernel
* fix logprob bug && optimize verify kernel
* fix exist_decode() judge
2026-03-10 23:58:44 -07:00
AIbin
c3aceb6bdc
[Models][OP][Optimization] Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM ( #6689 )
...
* Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM
2026-03-10 15:05:14 +08:00
gongweibao
30f9f33f34
[Feature][BugFix][OP] Enhance Deterministic Inference Mode with Kernel-level Fixes and Batch-invariant BMM ( #6610 )
...
* add fa deter
* add ut
* add long sentence
* fix basic
* fix bugs
* fix adn
* fix first
* fix single
* fix single
* fix single test
* refine
* add more test
* refine comments
* add comments of bmm
* fix ci
* remove probe
* add
* remove not need
* refine tests
* fix comments and refine code
* refine code
* refine test
* refine test
* mv 4cards tests
* fix tests
* add
* fix comments
* fix cover
* fix cover
---------
Co-authored-by: gongweibao <gognweibao@baidu.com >
2026-03-09 10:27:53 +08:00
ming1753
81e04bf5d1
[BugFix] fix flash attn mtp rope emb bug ( #6649 )
2026-03-04 21:19:12 +08:00
yzwu
3345641f4e
[Iluvatar][CI] fix the dim error of seq_lens_encoder and seq_lens_decoder ( #6637 )
2026-03-04 14:00:40 +08:00
ming1753
02d32eea3b
Revert "[Bug Fix] Fix MM mtp incorrect rope emb ( #6581 )" ( #6631 )
...
This reverts commit c5eb6b65e7 .
2026-03-04 11:23:28 +08:00
ming1753
c5eb6b65e7
[Bug Fix] Fix MM mtp incorrect rope emb ( #6581 )
...
* [Bug Fix] Fix MM mtp incorrect rope emb
2026-03-03 19:28:59 +08:00
周周周
3cc09418f1
support dsv3 use flashmla ( #6593 )
2026-03-03 11:09:43 +08:00
yzwu
6674131b0b
[Iluvatar] Support CudaGraph and optimize flash_attn_unpadded and fused_neox_rope_embedding ( #6553 )
2026-03-02 14:07:17 +08:00
周周周
d957ccd46d
seq_lens related tensor shape -> [max_num_seqs] ( #6535 )
2026-03-02 11:18:30 +08:00
chen
5382fb2c60
[BugFix] lazy enable_torch_proxy for cutlass ( #6523 )
...
* lazy enable_torch_proxy for cutlass
* test init_flash_attn_version
2026-03-02 10:43:58 +08:00
AIbin
59b578c337
[Feature]Supports SWA based on appendattn ( #6547 )
2026-03-01 19:02:08 +08:00
chen
2d1531f3cb
dev opensource model support fa4/flashmasV2/V3 ( #6518 )
2026-02-26 17:46:05 +08:00
chen
d937d6ebfd
check ( #6424 )
2026-02-10 15:55:17 +08:00
chen
a8ffcaa068
fix fa4 test ( #6408 )
2026-02-10 10:57:21 +08:00
bukejiyu
5bfc0938e2
[BugFix] PD reorder fix and add ut ( #6375 )
2026-02-09 04:42:48 -08:00
周周周
2b4748de4f
[MTP] refactor MTP pre_process ( #6358 )
2026-02-09 10:47:15 +08:00
K11OntheBoat
116e2aea7a
Support Norm before Rope ( #6332 )
...
Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com ”>
2026-02-05 15:28:52 +08:00
chen
29a313a402
[Optimization] Support FA2/FA3/FA4 with attn_mask_q ( #6354 )
...
* support FA4 sm100
* flash attn backend support mask
* flash attn backend run flashmask correct
* add test for flash_attn_backend and flash_attn_func
* check
* add test for fa4
* requirements.txt add fa4 whl
* check test on sm100
* fix CI conflict
* add enable_torch_proxy for flash_mask
* lazy import fa4
* check
* fix tests import
* check test_load_mpt import
2026-02-05 14:39:00 +08:00
GoldPancake
183b8d325a
[RL] Support GLM MTP RL Model ( #6267 )
2026-02-04 20:14:35 +08:00
bukejiyu
12d4b4cb87
[Feature]Support reorder ids to split prefill and decodes ( #5779 )
...
* support reorder ids
* perfect code
* fix
* fix unittest
* delete code
* fix
* add python api
* delete custom op
* update algorithm
* fix swap
* support condense
* support condense
* support mtp
* delete code
* update
* update
* update
* update
* update for other platfrom
* update
* fix
* fix mtp
* fix ut
* update
* fix ut
* update ut
* fix
* fix encoder_cache
* fix ci
* fix
* fix vl
* Fix performance regression
* fix
* fix
* fix mtp
* fix index->req_id mapping
* fix ut
---------
Co-authored-by: root <root@yqlcc01-sys-rpm12rzmwjd.yqlcc01.baidu.com >
Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com ”>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2026-02-03 00:28:02 -08:00
Yuanle Liu
8b05774fad
[Others] enhance deep_ep import and support mixed mode flash_mask_attn ( #6238 )
...
* support flashmaskattn mixed and enhance deepep import
* update
* fix
2026-01-28 00:02:02 +08:00
周周周
aa57864c5b
remove unneeded para from flash_mask_attention ( #6218 )
2026-01-27 14:04:27 +08:00
xiaoxiaohehe001
7ffa88bb01
[BugFix] fix mask_attn ( #6214 )
...
* [BugFix] fix mask attn
* [BugFix] fix mask attn
2026-01-26 07:46:51 -08:00
sunxin
adc69c15d0
[Model Runner] Prepare token count and move FA3 initialization into the graph ( #6170 )
...
* prepare for token num and put FA3 init in graph
2026-01-26 12:16:57 +08:00
Haonan Luo
82057cb71f
Support MXFP4 for GPT-OSS ( #5435 )
...
* support mxfp4 in gpt-oss
* support mxfp4 in gpt-oss
* add scope for flashinfer
* remove torch code
* update envs.FD_MXFP4_BACKEND
* update process_weights_after_loading
* update env name
* support tp in gpt-oss, add e2e test
* add flashinfer-python-paddle in requirements
* fix import error
* add test
* add test
* add test
* add test
2026-01-22 14:21:01 +08:00
Ryan
dda27e50f5
[Graph Optimization] remove static_op_get_block_shape_and_split_kv_block from cudagraph ( #6081 )
...
* rm static_op_get_block_shape_and_split_kv_block from cudagraph
* update max_capture_shape
* fallback: zeros -> empty to avoid coverage check
* check graph_opt_config exists
* add max_capture_shape_dy2st && full_cuda_graph: false -> true in 28B vl test
* add use_cudagraph flag to control step_use_cudagraph
2026-01-20 14:05:18 +08:00
jackyYang6
988e0bc338
[Feature] Add PaddleFormers fallback backend ( #5999 )
...
* feat(paddleformers): add dense text model fallback backend
* docs(paddleformers): add user guide and fix code review issues
* add fallback unit test
* precommit format
* fix pre-commit
* fix: address code review feedback
* docs: add PaddleFormers backend documentation (EN) and simplify installation
---------
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com >
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2026-01-19 21:50:50 +08:00
Ryan
0d1a5e70bc
[Graph Optimization] Add full_cuda_graph to control subgraph split ( #6027 )
2026-01-14 11:43:59 +08:00
周周周
b8d9daa785
MLA clean code ( #5979 )
2026-01-10 21:05:00 +08:00
zccjjj
20de04e249
[XPU] move xpu_attn_backend.py to FastDeploy/fastdeploy/model_executor/layers/backends/xpu ( #5878 )
2026-01-09 16:34:57 +08:00
lizhenyun01
2be8656c29
[BugFix] fix mtp split kv attetion ( #5920 )
...
* [BugFix] fix mtp split kv attetion
* clean code
* clean code
2026-01-07 04:07:31 -08:00
周周周
03363cab4c
make flash_mask attention pybind ( #5783 )
2025-12-26 14:31:35 +08:00
yzwu
ac013803f3
[Iluvatar] Support V1_KVCACHE_SCHEDULER and paddleocr-vl rope mode ( #5555 )
2025-12-18 02:14:25 -08:00
Longzhi Wang
d8587e987e
[Model] tp+ep support v1_loader ( #5465 )
...
* [Model] tp+ep support v1_loader
* fix
* fix mtp_linear
* fix mtp_linear
* fix
* fix
* fix v0 loader
* fix
* Add get_tensor for ep
* fix linear weight_loader
* fix typo
* fix
2025-12-18 14:31:54 +08:00
Ryan
d01cb274d6
[Graph Optimization][CI] Add ERNIE45T 21B sot test ( #5538 )
CE Compile Job / ce_job_pre_check (push) Has been cancelled
Deploy GitHub Pages / deploy (push) Has been cancelled
CE Compile Job / print_ce_job_pre_check_outputs (push) Has been cancelled
CE Compile Job / FD-Clone-Linux (push) Has been cancelled
CE Compile Job / Show Code Archive Output (push) Has been cancelled
CE Compile Job / BUILD_SM8090 (push) Has been cancelled
CE Compile Job / BUILD_SM8689 (push) Has been cancelled
CE Compile Job / CE_UPLOAD (push) Has been cancelled
Publish Job / publish_pre_check (push) Has been cancelled
Publish Job / print_publish_pre_check_outputs (push) Has been cancelled
Publish Job / FD-Clone-Linux (push) Has been cancelled
Publish Job / Show Code Archive Output (push) Has been cancelled
Publish Job / BUILD_SM8090 (push) Has been cancelled
Publish Job / BUILD_SM8689 (push) Has been cancelled
Publish Job / PADDLE_PYPI_UPLOAD_8090 (push) Has been cancelled
Publish Job / PADDLE_PYPI_UPLOAD_8689 (push) Has been cancelled
Publish Job / Run FD Image Build (push) Has been cancelled
Publish Job / Run FastDeploy Unit Tests and Coverage (push) Has been cancelled
Publish Job / Run FastDeploy LogProb Tests (push) Has been cancelled
Publish Job / Extracted partial CE model tasks to run in CI. (push) Has been cancelled
Publish Job / Run Base Tests (push) Has been cancelled
Publish Job / Run Accuracy Tests (push) Has been cancelled
Publish Job / Run Stable Tests (push) Has been cancelled
CI Images Build / FD-Clone-Linux (push) Has been cancelled
CI Images Build / Show Code Archive Output (push) Has been cancelled
CI Images Build / CI Images Build (push) Has been cancelled
CI Images Build / BUILD_SM8090 (push) Has been cancelled
CI Images Build / Run FastDeploy Unit Tests and Coverage (push) Has been cancelled
CI Images Build / Run FastDeploy LogProb Tests (push) Has been cancelled
CI Images Build / Extracted partial CE model tasks to run in CI. (push) Has been cancelled
CI Images Build / Run Base Tests (push) Has been cancelled
CI Images Build / Publish Docker Images Pre Check (push) Has been cancelled
2025-12-13 00:43:15 +08:00
Lucas
888c4b992d
[XPU] refactor of block_attn param 'pos_emb_type' ( #5511 )
2025-12-12 14:30:09 +08:00