FastDeploy

mirror of https://github.com/PaddlePaddle/FastDeploy.git synced 2026-04-23 00:17:25 +08:00

Author	SHA1	Message	Date
Longzhi Wang	2eea6fa97a	[BugFix] Fix kv cache int8 dynamic quant on flash and flash_mask backend (#7028 ) * [BugFix] Fix kv cache int8 dynamic quant on flash and flash_mask backend * add constexpr and code style clean * add test * fix code style * fix test	2026-03-30 11:17:15 +08:00
mpgemm	7a20eaebe8	[Feature] Support cute cpp Encoder FA4 (#7016 ) * add cute cpp fa4 * remove comments * fix merge error * move sm_version inside the function * CI error	2026-03-30 10:54:56 +08:00
Nyakku Shigure	8b6bbb3504	[Optimization] Use a separate driver when using Triton with Paddle (#6897 ) --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-03-24 10:56:00 +08:00
周周周	5416da8c6e	remove assert (#6970 ) Co-authored-by: “liuruian” <liuruian@baidu.com>	2026-03-23 14:22:03 +08:00
周周周	1c38da2118	Make seq_lens_this_time/decoder/encoder equal shape (#6942 )	2026-03-20 15:31:52 +08:00
yzwu	8b890c0d72	[Iluvatar] refactor attn and moe code (#6887 )	2026-03-18 10:31:00 +08:00
gongweibao	a6351dea0b	[BugFix][Optimization] Replace silent failures with catchable exceptions and informative error messages (#6533 ) * init * init * fix format * add * add files * add ut * fix some * add ut * add more * add * fix pre-commit * fix pre-commit * fix cover * skip long seq * add * add * fix * remove not need * fix set attr * fix comments * fix comments * fix failed tests --------- Co-authored-by: gongweibao <gognweibao@baidu.com>	2026-03-16 21:32:43 +08:00
ming1753	bb925c605f	[Other] Adjust GPUModelRunner to enhance compatibility (#6851 )	2026-03-16 14:49:19 +08:00
gongweibao	3fabba0dc7	[Feature] Add Triton unified attention kernel for deterministic inference (#6795 ) * [Feature] Add Triton unified attention kernel for deterministic inference Add a Triton-based unified extend attention kernel that processes both prefix (cached) and extend (new) KV tokens through a single kernel with unified kv_indices, ensuring identical accumulation order regardless of cache hit/miss patterns. Key components: - _fwd_kernel_unified: Triton JIT kernel with online softmax, paged KV cache support, and causal masking for prefix+extend - Index building utilities: triton_cumsum_with_zero_prefix, build_kv_indices_from_block_tables, build_unified_kv_indices, _scatter_extend_kv_indices_kernel (all CUDA Graph compatible) - pre_cache_len_concat_triton: GPU-only replacement for C++ op - Reference implementations (_ref variants) for correctness validation - Comprehensive tests: kernel correctness, split invariance, determinism, production-scale, cross-validation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Vectorize causal mask in test references for ~26x speedup Replace triple Python for-loop with paddle.where vectorized mask in naive_attention and _build_causal_mask. seq4096 test: 2m39s -> 6s. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix cover --------- Co-authored-by: gongweibao <gognweibao@baidu.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-16 14:29:45 +08:00
周周周	091e3c815d	DSA clean code, add dsk_attn_write_cache baseline (#6855 )	2026-03-16 11:01:14 +08:00
周周周	820eb60ec6	[Others] clean code (#6839 ) Co-authored-by: “liuruian” <liuruian@baidu.com>	2026-03-14 11:09:28 +08:00
周周周	8c1a2827d3	DSA clean code (#6827 )	2026-03-13 16:39:47 +08:00
RichardWooSJTU	9f0778f991	[Feature] Support EP prefill with num_worst_tokens (#6574 ) * support num worst tokens * support num worst tokens * fix build error * support num worst tokens: fix errors * support num worst tokens: fix feild * support num worst tokens: delete requiements * replace permute and depermute op by pure cuda * replace permute and depermute op by pure cuda * fix ci * fix op * fix nan * fix code style --------- Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>	2026-03-11 17:09:07 +08:00
freeliuzc	cf7934a4b2	[Speculative Decoding] Unify Spec and non-spec branch (#6685 ) * optimize spec-inference architecture * delete debug log * optimize spec_method usage && fix unit_test * add claude unit-test skill * fix some ugly bug * enhance robustness and bounds check * unify method & spec_method to method to avoid bug * activate CI * fix unit test * Unify logprobs computation for naive and speculative decoding, fix CUDA kernel * fix logprob bug && optimize verify kernel * fix exist_decode() judge	2026-03-10 23:58:44 -07:00
AIbin	c3aceb6bdc	[Models][OP][Optimization] Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM (#6689 ) * Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM	2026-03-10 15:05:14 +08:00
gongweibao	30f9f33f34	[Feature][BugFix][OP] Enhance Deterministic Inference Mode with Kernel-level Fixes and Batch-invariant BMM (#6610 ) * add fa deter * add ut * add long sentence * fix basic * fix bugs * fix adn * fix first * fix single * fix single * fix single test * refine * add more test * refine comments * add comments of bmm * fix ci * remove probe * add * remove not need * refine tests * fix comments and refine code * refine code * refine test * refine test * mv 4cards tests * fix tests * add * fix comments * fix cover * fix cover --------- Co-authored-by: gongweibao <gognweibao@baidu.com>	2026-03-09 10:27:53 +08:00
ming1753	81e04bf5d1	[BugFix] fix flash attn mtp rope emb bug (#6649 )	2026-03-04 21:19:12 +08:00
yzwu	3345641f4e	[Iluvatar][CI] fix the dim error of seq_lens_encoder and seq_lens_decoder (#6637 )	2026-03-04 14:00:40 +08:00
ming1753	02d32eea3b	Revert "[Bug Fix] Fix MM mtp incorrect rope emb (#6581 )" (#6631 ) This reverts commit `c5eb6b65e7`.	2026-03-04 11:23:28 +08:00
ming1753	c5eb6b65e7	[Bug Fix] Fix MM mtp incorrect rope emb (#6581 ) * [Bug Fix] Fix MM mtp incorrect rope emb	2026-03-03 19:28:59 +08:00
周周周	3cc09418f1	support dsv3 use flashmla (#6593 )	2026-03-03 11:09:43 +08:00
yzwu	6674131b0b	[Iluvatar] Support CudaGraph and optimize flash_attn_unpadded and fused_neox_rope_embedding (#6553 )	2026-03-02 14:07:17 +08:00
周周周	d957ccd46d	seq_lens related tensor shape -> [max_num_seqs] (#6535 )	2026-03-02 11:18:30 +08:00
chen	5382fb2c60	[BugFix] lazy enable_torch_proxy for cutlass (#6523 ) * lazy enable_torch_proxy for cutlass * test init_flash_attn_version	2026-03-02 10:43:58 +08:00
AIbin	59b578c337	[Feature]Supports SWA based on appendattn (#6547 )	2026-03-01 19:02:08 +08:00
chen	2d1531f3cb	dev opensource model support fa4/flashmasV2/V3 (#6518 )	2026-02-26 17:46:05 +08:00
chen	d937d6ebfd	check (#6424 )	2026-02-10 15:55:17 +08:00
chen	a8ffcaa068	fix fa4 test (#6408 )	2026-02-10 10:57:21 +08:00
bukejiyu	5bfc0938e2	[BugFix] PD reorder fix and add ut (#6375 )	2026-02-09 04:42:48 -08:00
周周周	2b4748de4f	[MTP] refactor MTP pre_process (#6358 )	2026-02-09 10:47:15 +08:00
K11OntheBoat	116e2aea7a	Support Norm before Rope (#6332 ) Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com”>	2026-02-05 15:28:52 +08:00
chen	29a313a402	[Optimization] Support FA2/FA3/FA4 with attn_mask_q (#6354 ) * support FA4 sm100 * flash attn backend support mask * flash attn backend run flashmask correct * add test for flash_attn_backend and flash_attn_func * check * add test for fa4 * requirements.txt add fa4 whl * check test on sm100 * fix CI conflict * add enable_torch_proxy for flash_mask * lazy import fa4 * check * fix tests import * check test_load_mpt import	2026-02-05 14:39:00 +08:00
GoldPancake	183b8d325a	[RL] Support GLM MTP RL Model (#6267 )	2026-02-04 20:14:35 +08:00
bukejiyu	12d4b4cb87	[Feature]Support reorder ids to split prefill and decodes (#5779 ) * support reorder ids * perfect code * fix * fix unittest * delete code * fix * add python api * delete custom op * update algorithm * fix swap * support condense * support condense * support mtp * delete code * update * update * update * update * update for other platfrom * update * fix * fix mtp * fix ut * update * fix ut * update ut * fix * fix encoder_cache * fix ci * fix * fix vl * Fix performance regression * fix * fix * fix mtp * fix index->req_id mapping * fix ut --------- Co-authored-by: root <root@yqlcc01-sys-rpm12rzmwjd.yqlcc01.baidu.com> Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com”> Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>	2026-02-03 00:28:02 -08:00
Yuanle Liu	8b05774fad	[Others] enhance deep_ep import and support mixed mode flash_mask_attn (#6238 ) * support flashmaskattn mixed and enhance deepep import * update * fix	2026-01-28 00:02:02 +08:00
周周周	aa57864c5b	remove unneeded para from flash_mask_attention (#6218 )	2026-01-27 14:04:27 +08:00
xiaoxiaohehe001	7ffa88bb01	[BugFix] fix mask_attn (#6214 ) * [BugFix] fix mask attn * [BugFix] fix mask attn	2026-01-26 07:46:51 -08:00
sunxin	adc69c15d0	[Model Runner] Prepare token count and move FA3 initialization into the graph (#6170 ) * prepare for token num and put FA3 init in graph	2026-01-26 12:16:57 +08:00
Haonan Luo	82057cb71f	Support MXFP4 for GPT-OSS (#5435 ) * support mxfp4 in gpt-oss * support mxfp4 in gpt-oss * add scope for flashinfer * remove torch code * update envs.FD_MXFP4_BACKEND * update process_weights_after_loading * update env name * support tp in gpt-oss, add e2e test * add flashinfer-python-paddle in requirements * fix import error * add test * add test * add test * add test	2026-01-22 14:21:01 +08:00
Ryan	dda27e50f5	[Graph Optimization] remove static_op_get_block_shape_and_split_kv_block from cudagraph (#6081 ) * rm static_op_get_block_shape_and_split_kv_block from cudagraph * update max_capture_shape * fallback: zeros -> empty to avoid coverage check * check graph_opt_config exists * add max_capture_shape_dy2st && full_cuda_graph: false -> true in 28B vl test * add use_cudagraph flag to control step_use_cudagraph	2026-01-20 14:05:18 +08:00
jackyYang6	988e0bc338	[Feature] Add PaddleFormers fallback backend (#5999 ) * feat(paddleformers): add dense text model fallback backend * docs(paddleformers): add user guide and fix code review issues * add fallback unit test * precommit format * fix pre-commit * fix: address code review feedback * docs: add PaddleFormers backend documentation (EN) and simplify installation --------- Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>	2026-01-19 21:50:50 +08:00
Ryan	0d1a5e70bc	[Graph Optimization] Add `full_cuda_graph` to control subgraph split (#6027 )	2026-01-14 11:43:59 +08:00
周周周	b8d9daa785	MLA clean code (#5979 )	2026-01-10 21:05:00 +08:00
zccjjj	20de04e249	[XPU] move xpu_attn_backend.py to FastDeploy/fastdeploy/model_executor/layers/backends/xpu (#5878 )	2026-01-09 16:34:57 +08:00
lizhenyun01	2be8656c29	[BugFix] fix mtp split kv attetion (#5920 ) * [BugFix] fix mtp split kv attetion * clean code * clean code	2026-01-07 04:07:31 -08:00
周周周	03363cab4c	make flash_mask attention pybind (#5783 )	2025-12-26 14:31:35 +08:00
yzwu	ac013803f3	[Iluvatar] Support V1_KVCACHE_SCHEDULER and paddleocr-vl rope mode (#5555 )	2025-12-18 02:14:25 -08:00
Longzhi Wang	d8587e987e	[Model] tp+ep support v1_loader (#5465 ) * [Model] tp+ep support v1_loader * fix * fix mtp_linear * fix mtp_linear * fix * fix * fix v0 loader * fix * Add get_tensor for ep * fix linear weight_loader * fix typo * fix	2025-12-18 14:31:54 +08:00
Ryan	d01cb274d6	[Graph Optimization][CI] Add ERNIE45T 21B sot test (#5538 )	2025-12-13 00:43:15 +08:00
Lucas	888c4b992d	[XPU] refactor of block_attn param 'pos_emb_type' (#5511 )	2025-12-12 14:30:09 +08:00


139 Commits