Commit Graph

97 Commits

Author SHA1 Message Date
周周周 18f012457d [OP][Optimization] Remove ENABLE_PREFILL template parameter in multi_query_append_attention_warp1_4_kernel (#7201) 2026-04-07 11:21:57 +08:00
周周周 fd44bb7cbf cpmmot (#7105)
Co-authored-by: “liuruian” <liuruian@baidu.com>
2026-03-31 16:13:44 +08:00
周周周 76cf5e9496 [append attention] clean code (#7062) 2026-03-30 15:07:53 +08:00
Longzhi Wang 2eea6fa97a [BugFix] Fix kv cache int8 dynamic quant on flash and flash_mask backend (#7028)
* [BugFix] Fix kv cache int8 dynamic quant on flash and flash_mask backend

* add constexpr and code style clean

* add test

* fix code style

* fix test
2026-03-30 11:17:15 +08:00
chen 1502b6f43e add instantiations for decoder rope enfore_fmul_rn=true (#7009) 2026-03-25 22:22:10 +08:00
chen c92e277cf1 [RL] RoPE without fmad opt (#6901)
* env FD_ENABLE_RL=1 do fmul_rn(a*b) in rope
2026-03-24 21:19:53 +08:00
周周周 820eb60ec6 [Others] clean code (#6839)
Co-authored-by: “liuruian” <liuruian@baidu.com>
2026-03-14 11:09:28 +08:00
周周周 8c1a2827d3 DSA clean code (#6827) 2026-03-13 16:39:47 +08:00
gongweibao 8906e09e0f [Feature][OP] Add batch-invariant RMSNorm kernel and TP embedding Custom AR path (#6749)
* [Feature] Add batch-invariant RMSNorm kernel and TP embedding Custom AR path

- Add Triton-based rms_norm_batch_invariant kernel for M-invariant RMSNorm
- Add linear/linear_v2 tracking wrappers in batch_invariant_mode
- Route TP VocabParallelEmbedding through Custom AR instead of NCCL
- Increase FD_CUSTOM_AR_MAX_SIZE_MB default from 8 to 64
- Add unit tests for RMSNorm and TP embedding invariance

* [Fix] Fix test tolerances for bfloat16 RMSNorm and custom AR buffer size

- Relax bfloat16 atol from 1e-3 to 1e-2 for D=3584 in RMSNorm numerical
  correctness test (0.0078125 diff is expected at bfloat16 precision)
- Update test_communication expected buffer size from 8MB to 64MB to match
  FD_CUSTOM_AR_MAX_SIZE_MB default change in envs.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add RMSNorm layer batch_invariant_mode unit test for coverage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add pragma no cover for Triton kernel and multi-GPU embedding path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 14:34:44 +08:00
AIbin c3aceb6bdc [Models][OP][Optimization] Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM (#6689)
* Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM
2026-03-10 15:05:14 +08:00
gongweibao 30f9f33f34 [Feature][BugFix][OP] Enhance Deterministic Inference Mode with Kernel-level Fixes and Batch-invariant BMM (#6610)
* add fa deter

* add ut

* add long sentence

* fix basic

* fix bugs

* fix adn

* fix first

* fix single

* fix single

* fix single test

* refine

* add more test

* refine comments

* add comments of bmm

* fix ci

* remove probe

* add

* remove not need

* refine tests

* fix comments and refine code

* refine code

* refine test

* refine test

* mv 4cards tests

* fix tests

* add

* fix comments

* fix cover

* fix cover

---------

Co-authored-by: gongweibao <gognweibao@baidu.com>
2026-03-09 10:27:53 +08:00
gongweibao ddb06ff83f init (#6642)
Co-authored-by: gongweibao <gognweibao@baidu.com>
2026-03-04 21:55:31 +08:00
周周周 3cc09418f1 support dsv3 use flashmla (#6593) 2026-03-03 11:09:43 +08:00
AIbin 59b578c337 [Feature]Supports SWA based on appendattn (#6547) 2026-03-01 19:02:08 +08:00
AIbin 983be007f5 [Feature]support swa & sink Based on appendattn (#6410)
* support swa & sink Based on  appendattn
2026-02-10 18:28:03 +08:00
sunxin adc69c15d0 [Model Runner] Prepare token count and move FA3 initialization into the graph (#6170)
* prepare for token num and put FA3 init in graph
2026-01-26 12:16:57 +08:00
GoldPancake bda38aa519 [Speculative Decoding] Support MTP for GLM-4.5-Air (#6047)
* glm mtp
* add spec neox partial rope
2026-01-16 14:35:24 +08:00
周周周 ad8d05a8de [Optimization] Do not compute ATTN padding part in CUDA graph mode (#5985) 2026-01-13 11:32:27 +08:00
sunxin 17ef3920f3 remove decoder_num_blocks_device memset (#5982) 2026-01-10 21:22:06 +08:00
周周周 b8d9daa785 MLA clean code (#5979) 2026-01-10 21:05:00 +08:00
lizhenyun01 2be8656c29 [BugFix] fix mtp split kv attention (#5920)
* [BugFix] fix mtp split kv attention

* clean code

* clean code
2026-01-07 04:07:31 -08:00
chen ac39c0f887 support fa3 qwen-vl rope (#5869) 2026-01-05 15:29:34 +08:00
chen 27ef3610b5 support glm fa3 (#5586) 2025-12-16 19:33:27 +08:00
freeliuzc 532f9ba227 [BugFix][Speculative Decoding](Spent many days to solve) Fix write qknorm cache bug in speculative decoding (#5491)
* [liuzichang spent 10 days] fix write qknorm cache bug

* fix 'fix cachekv bug''
2025-12-15 18:27:11 +08:00
chen a389bb7c5c [Feature][Optimization] Qwen Support Dynamic block_wise_fp8 cache (#5486) 2025-12-12 17:10:17 +08:00
lzy 99f607eef5 [Others] Maintain the mtp branch temporarily. (#5446) 2025-12-09 19:17:53 +08:00
xiaozude df67379bc3 [Metax] modify wrapSize to WARP_SIZE (#5442) 2025-12-09 01:44:02 -08:00
周周周 31410415db FA3 support qwen3 (#5441) 2025-12-09 16:16:16 +08:00
周周周 2aea8a3a60 [Others] Remove useless code (#5404) 2025-12-08 13:59:46 +08:00
lzy c71a44c7e5 supports mtp split_kv_attn (#5343) 2025-12-03 12:40:16 +08:00
K11OntheBoat 2e1680838f [PD Disaggregation] Support PD deployment of DeepSeekv3. (#5251)
* Support deepseekv3 cache transfer for PD deploy

* clean some log info

---------

Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com”>
2025-12-02 14:11:50 +08:00
lizhenyun01 aba4fc657f [Feature] support flash_mask_attention backend (#5134)
* [Feature] support flash_mask_attention backend

* fix unittest

* clean code
2025-11-28 10:12:16 +08:00
freeliuzc 2d1dade5e2 [Speculative Decoding][MTP] Support static CacheKV C8 quantization and optimize memory usage (#5155)
* support static cachekv c8 quantization in mtp mode

* optimize memory allocation
2025-11-21 15:10:13 +08:00
周周周 385fe6dade [Others] clean code (#5133) 2025-11-20 18:44:08 +08:00
周周周 6fa34102e8 [Others]get_block_shape_and_split_kv_block clean code (#5123) 2025-11-20 16:40:04 +08:00
chen d58c1db8a0 [Feature][OP] Append Attn Support CUDA-PDL (#5072) 2025-11-17 20:47:33 +08:00
周周周 b23e684b67 revert group size 3 (#5079) 2025-11-17 18:54:13 +08:00
Sunny-bot1 8a4ddb29df Revert "[BugFix] Revert skip capture (#5023)" (#5080) 2025-11-17 16:14:55 +08:00
Sunny-bot1 249feca65a [BugFix] Revert skip capture (#5023)
* Revert "[BugFix][Metax] Fix metax compile issue in get_block_shape_and_split_kv_block (#5000)"

This reverts commit 05da8e34c0.

* Revert "skip DtoH capture (#4988)"

This reverts commit 5b24013d46.
2025-11-13 23:52:51 -08:00
周周周 c0a4393d72 [ATTENTION] unitest (#4962) 2025-11-14 13:45:53 +08:00
carryyu 6c3d1da62f fix conflicts 2025-11-13 20:30:29 +08:00
Sunny-bot1 05da8e34c0 [BugFix][Metax] Fix metax compile issue in get_block_shape_and_split_kv_block (#5000)
* fix metax compile

* fix
2025-11-13 00:55:06 -08:00
Sunny-bot1 5b24013d46 skip DtoH capture (#4988) 2025-11-13 10:57:44 +08:00
周周周 6e01be28e0 format code (#4720)
2025-11-01 19:13:50 +08:00
Zhenghai Zhang 1712e1351b 【Hackathon 9th No.86】autogen MoeFastHardamardImplWrapper template_instantiation (#4592)
* autogen MoeFastHardamardImplWrapper template_instantiation

* fix codestyle

* fix codestyle

* add impl cu files
2025-10-30 10:28:36 +08:00
RAM 86d5006a57 [Graph Optimization][Speculative Decoding] Update yaml and fix typo (#4612) 2025-10-28 11:43:26 +08:00
xiaozude f7069b8057 [Metax] adapt DeepSeek (#4498) 2025-10-24 10:14:53 +08:00
Sunny-bot1 8718fa34b2 support static C8 (#4568)
2025-10-23 22:01:03 +08:00
RAM 9dc5c3e370 [Graph Optimization] Support CUDAGraph Padding + MTP (#4545)
* Support CUDAGraph Padding + MTP

* support other write cache kernel
2025-10-23 20:57:26 +08:00
Haonan Luo 1b9f351d21 Support GPT-OSS-BF16 (#4240)
* [Feature] AppendAtten support sinks & HEAD_DIM=64

* fix bug

* fix bug

* fix bug

* fix bug

* [Feature] support gpt-oss

* fix bug

* add mask

* support-gpt-oss

* support-gpt-oss

* fix long seq

* support wint8

* support wint8

* support wint8

* update test

* change sliding windows init pos

---------

Co-authored-by: ming1753 <ideaminghp@163.com>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
Co-authored-by: ming1753 <61511741+ming1753@users.noreply.github.com>
2025-10-20 14:44:58 +08:00