AIbin
cb6819d086
[Optimization][OP]support per_token_group_fp8_quant cuda kernel ( #6865 )
...
* support per_token_group_fp8_quant cuda kernel
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com >
* update code
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com >
2026-03-17 19:17:51 +08:00
RichardWooSJTU
4ed483d20b
[BugFix] Fix ep compatibility issues & Optimize permute operator ( #6821 )
...
* fix ep compatibility issues & optimize permute operator
* fix ut
* fix ut
2026-03-17 10:32:11 +08:00
gongweibao
a6351dea0b
[BugFix][Optimization] Replace silent failures with catchable exceptions and informative error messages ( #6533 )
...
* init
* init
* fix format
* add
* add files
* add ut
* fix some
* add ut
* add more
* add
* fix pre-commit
* fix pre-commit
* fix cover
* skip long seq
* add
* add
* fix
* remove not need
* fix set attr
* fix comments
* fix comments
* fix failed tests
---------
Co-authored-by: gongweibao <gognweibao@baidu.com >
2026-03-16 21:32:43 +08:00
AIbin
c9f7f5234e
[Optimization][BugFix]Optimize Deepseek networking code ( #6861 )
...
* update dsk model
* update dsk model
2026-03-16 16:52:43 +08:00
mayang002
72ff7bf4cd
[XPU] Fix wrapper files ( #6830 )
...
- Add WRAPPER_CHECK_PTR for pointer validity checks
- Add WRAPPER_ASSERT_GT/GE/LE for parameter range validation
- Simplify wrapper function calls to direct return pattern
2026-03-16 14:39:40 +08:00
Yonghua Li
7c8c0a3c02
[BugFix] replace ftok with custom_ftok in get_output/save_output ops ( #6822 )
...
* [BugFix] replace ftok with custom_ftok in get_output/save_output ops
* [Test] add unit test for custom_ftok
* [Chore] create custom_ftok.h
* [Chore] reorganize header file
* [Fix] fix cache messager msg_queue_id+rank_id conflict
2026-03-16 14:22:18 +08:00
周周周
820eb60ec6
[Others] clean code ( #6839 )
...
Co-authored-by: “liuruian” <liuruian@baidu.com >
2026-03-14 11:09:28 +08:00
cmcamdy
7591e0d6bc
fix eb5 mtp(mix) ( #6800 )
2026-03-13 17:36:57 +08:00
周周周
8c1a2827d3
DSA clean code ( #6827 )
2026-03-13 16:39:47 +08:00
freeliuzc
12f412448b
[Speculative Decoding] Fix speculate stop_seqs and fix accept_num in eos branch ( #6825 )
2026-03-12 23:48:24 -07:00
gongweibao
8906e09e0f
[Feature][OP] Add batch-invariant RMSNorm kernel and TP embedding Custom AR path ( #6749 )
...
* [Feature] Add batch-invariant RMSNorm kernel and TP embedding Custom AR path
- Add Triton-based rms_norm_batch_invariant kernel for M-invariant RMSNorm
- Add linear/linear_v2 tracking wrappers in batch_invariant_mode
- Route TP VocabParallelEmbedding through Custom AR instead of NCCL
- Increase FD_CUSTOM_AR_MAX_SIZE_MB default from 8 to 64
- Add unit tests for RMSNorm and TP embedding invariance
* [Fix] Fix test tolerances for bfloat16 RMSNorm and custom AR buffer size
- Relax bfloat16 atol from 1e-3 to 1e-2 for D=3584 in RMSNorm numerical
correctness test (0.0078125 diff is expected at bfloat16 precision)
- Update test_communication expected buffer size from 8MB to 64MB to match
FD_CUSTOM_AR_MAX_SIZE_MB default change in envs.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com >
* Add RMSNorm layer batch_invariant_mode unit test for coverage
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com >
* Add pragma no cover for Triton kernel and multi-GPU embedding path
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com >
---------
Co-authored-by: gongweibao <gognweibao@baidu.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-13 14:34:44 +08:00
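A note on the tolerance change in the [Fix] item above: 0.0078125 equals 2**-7, which is the gap between adjacent bfloat16 values in the range [1, 2). The sketch below only checks that arithmetic. Reading the observed diff as a single-step rounding difference near 1.0 is an assumption, and the helper name bf16_spacing is hypothetical, not something from the repository.

```python
import math

# bfloat16 stores 7 mantissa bits (plus 1 implicit leading bit).
MANTISSA_BITS = 7


def bf16_spacing(x: float) -> float:
    """Gap between adjacent bfloat16 values near x (normal range, x != 0)."""
    exponent = math.floor(math.log2(abs(x)))
    return 2.0 ** (exponent - MANTISSA_BITS)


# Spacing for values in [1, 2): one representable step.
spacing_near_one = bf16_spacing(1.0)

# Tolerances named in the commit message.
old_atol, new_atol = 1e-3, 1e-2

# A one-step difference exceeds the old tolerance but fits the new one.
fits_old = spacing_near_one <= old_atol
fits_new = spacing_near_one <= new_atol
```

Under that reading, an absolute tolerance of 1e-3 is tighter than a single bfloat16 step near 1.0, while 1e-2 leaves room for it.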
mayang002
1f9f889e37
[XPU] refactor: XPU plugin namespace migration ( #6799 )
...
* [XPU] refactor: XPU plugin namespace migration
- Migrate wrapper layer namespace from baidu::xpu::api::plugin to fastdeploy::plugin
- Migrate kernel layer namespace from xpu3::plugin to fd_xpu3
- Add api:: prefix for types (Context, SUCCESS, XPUIndexType, ctx_guard)
- Remove XPU2 support, keep only XPU3
- Update ops/ directory to use new namespace
Total: 137 files changed
* [XPU] fix: add return value check and correct error messages
- Add PADDLE_ENFORCE_XDNN_SUCCESS check for speculate_get_logits and update_attn_mask_offsets
- Fix empty error message in draft_model_postprocess
- Correct function name in speculate_schedule_cache error message
- Update error messages from 'xpu::plugin::' to 'fastdeploy::plugin::'
2026-03-13 10:21:51 +08:00
huicongyao
2e63d88f7a
[Optimization][Speculative Decoding]Fuse padding sampling params ( #6765 )
...
* optimize speculate pre process unit test
* Add CUDA kernel for building sampling params in speculative decoding
* init infer seed in device
* format code
* add unittest & fix
* fix
* format-code
* format-code
* fix rebase
* .
* fix unit test
2026-03-12 05:05:15 -07:00
yzwu
901b38c936
[Iluvatar] Optimize decode group_gemm and Support cuda graph for ernie ( #6803 )
2026-03-12 19:21:17 +08:00
cmcamdy
3543088d3e
[XPU] rm stop nums ( #6651 )
...
* rm stop nums
* fix conflict
---------
Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com >
2026-03-12 14:05:58 +08:00
Jiajun Ji
88c4fbf8e1
[XPU] Add speculate_limit_thinking_content_length Op. ( #6627 )
...
* [XPU] Add speculate_limit_thinking_content_length OP for xpu.
* add unittest.
* format codes.
* format codes.
* format codes.
* Fix unused kernel launch return value.
---------
Co-authored-by: cmcamdy <1027740945@qq.com >
2026-03-11 17:30:17 +08:00
RichardWooSJTU
9f0778f991
[Feature] Support EP prefill with num_worst_tokens ( #6574 )
...
* support num worst tokens
* support num worst tokens
* fix build error
* support num worst tokens: fix errors
* support num worst tokens: fix field
* support num worst tokens: delete requirements
* replace permute and depermute op by pure cuda
* replace permute and depermute op by pure cuda
* fix ci
* fix op
* fix nan
* fix code style
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2026-03-11 17:09:07 +08:00
AIbin
1118351b27
[Optimization] Update Deepseekv3.2 model and dsa-indexer networking and add some unit tests ( #6762 )
...
* add deepseek model doc
* update deepseek model doc
* update deepseek model doc
* update deepseek model doc
* cwb support DSK_V32 Model
* update DSK_V32_DSA modeling
* Ibin Support DSK_DSA
* update kernel
* update yaml
* update requirements
* update pre_commit
* update model-runner
* fix CI bug
* del start.sh
* fix iluvatar_model_runner
* update DSA & add unit test
* update import deep_gemm
2026-03-11 15:52:54 +08:00
freeliuzc
cf7934a4b2
[Speculative Decoding] Unify Spec and non-spec branch ( #6685 )
...
* optimize spec-inference architecture
* delete debug log
* optimize spec_method usage && fix unit_test
* add claude unit-test skill
* fix some ugly bug
* enhance robustness and bounds check
* unify method & spec_method to method to avoid bug
* activate CI
* fix unit test
* Unify logprobs computation for naive and speculative decoding, fix CUDA kernel
* fix logprob bug && optimize verify kernel
* fix exist_decode() judge
2026-03-10 23:58:44 -07:00
wangyifei
b57c960837
cuda13.0, implement changes to CCCL ( #6751 )
2026-03-10 16:47:02 +08:00
AIbin
c3aceb6bdc
[Models][OP][Optimization] Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM ( #6689 )
...
* Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM
2026-03-10 15:05:14 +08:00
mayang002
ecc5032176
[XPU] Add return value checks for all XPU kernel launches ( #6666 )
...
* [XPU] Add return value checks for all XPU kernel launches
- Add -fxpu-launch-return compiler flag in CMakeLists.txt to enable
kernel launch return values
- Add KERNEL_ASSERT_SUCCESS(ctx, ret_xre) checks after every XPU
kernel launch across 45 wrapper files (55 launch sites total)
- Covers both main wrapper/ and mtp_wrapper/ directories
- Properly handles multiple kernel launches in the same function
scope by reusing the ret_xre variable
* [XPU] code style fix
2026-03-10 10:45:18 +08:00
gongweibao
30f9f33f34
[Feature][BugFix][OP] Enhance Deterministic Inference Mode with Kernel-level Fixes and Batch-invariant BMM ( #6610 )
...
* add fa deter
* add ut
* add long sentence
* fix basic
* fix bugs
* fix adn
* fix first
* fix single
* fix single
* fix single test
* refine
* add more test
* refine comments
* add comments of bmm
* fix ci
* remove probe
* add
* remove not need
* refine tests
* fix comments and refine code
* refine code
* refine test
* refine test
* mv 4cards tests
* fix tests
* add
* fix comments
* fix cover
* fix cover
---------
Co-authored-by: gongweibao <gognweibao@baidu.com >
2026-03-09 10:27:53 +08:00
sunxin
0dc7034ce0
[Model Runner] Deprecate not_need_stop ( #6356 )
...
* Deprecate not_need_stop
2026-03-05 10:55:42 +08:00
gongweibao
ddb06ff83f
init ( #6642 )
...
Co-authored-by: gongweibao <gognweibao@baidu.com >
2026-03-04 21:55:31 +08:00
MingkunZhang
e8e18cecce
[Metax][Fix] fix CI error based on PR #6501 ( #6636 )
2026-03-04 11:09:57 +08:00
lizan1999
c637692427
[XPU] support MTP Step > 1 ( #6609 )
...
Co-authored-by: lizan1999 <lizan03@baidu.com >
2026-03-04 10:07:37 +08:00
Jiajun Ji
4ff3f4212f
[XPU] Add update_attn_mask_offsets op for xpu. ( #6556 )
...
* add update_attn_mask_offsets op for xpu.
* format code style.
* format codes with pre-commit.
2026-03-03 18:00:05 +08:00
周周周
3cc09418f1
support dsv3 use flashmla ( #6593 )
2026-03-03 11:09:43 +08:00
huicongyao
0f718baaf2
[Speculative Decoding]Reformat input preprocess for spec decode ( #6501 )
...
* add speculate_pre_process kernel
* reduce one slice
* make d2h async && fix mtp bug for new pre_process
* fix
* add unit test
* fix: code style formatting
* fix
* fix: thread race in speculate_preprocess && rename d2h event
2026-03-03 10:22:07 +08:00
yzwu
6674131b0b
[Iluvatar] Support CudaGraph and optimize flash_attn_unpadded and fused_neox_rope_embedding ( #6553 )
2026-03-02 14:07:17 +08:00
AIbin
59b578c337
[Feature]Supports SWA based on appendattn ( #6547 )
2026-03-01 19:02:08 +08:00
ming1753
97eee75677
[Feature] GPU Memory Optimization and Retirement of V0 Scheduler ( #6407 )
...
* Optim GPU Mem Usage
---------
Co-authored-by: huzesen <huzesen@baidu.com >
2026-02-28 15:07:43 +08:00
cmcamdy
13447279aa
[XPU] Fix PD + MTP ( #6495 )
...
* fix pd + mtp
* fix code style
* fix PD + MTP, D gets P's first token
* add annotation for GPU (speculate_update)
* update draft insertv1
* fix wrapper & kernel
* fix wrapper
* fix code style
2026-02-27 19:07:35 +08:00
gongweibao
edd31e8849
[Feature] Add Deterministic Inference Support ( #6476 )
...
* add
* [tests] Add Paddle attention determinism tests and refactor resource manager
Add comprehensive determinism tests for Paddle attention layer and refactor
resource manager for deterministic mode support.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com >
* add
* add
* add
* add
* add more
* add more
* fix some
* fix some
* fix bugs
* fix bugs
* only in gpu
* add docs
* fix comments
* fix some
* fix some
* fix comments
* add more
* fix potential problem
* remove not need
* remove not need
* remove no need
* fix bug
* fix bugs
* fix comments
* fix comments
* Update tests/ce/deterministic/test_determinism_verification.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/inter_communicator/test_ipc_signal.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/layers/test_paddle_attention_determinism.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/engine/test_sampling_params_determinism.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/layers/test_paddle_attention_determinism.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/layers/test_paddle_attention_determinism_standalone.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* fix comments
* fix import error
* fix a bug
* fix bugs
* fix bugs
* fix coverage
* refine codes
* refine code
* fix comments
* fix comments
* fix comments
* rm not need
* fix allreduce large tensor bug
* mv log files
* mv log files
* add files
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2026-02-26 19:31:51 -08:00
GoldPancake
2178f2829b
[Speculative Decoding] Support suffix decoding ( #6403 )
...
* support suffix decoding
2026-02-26 11:42:05 +08:00
Yuanle Liu
6d3fede240
[OP][Feature] Unify the limit_thinking_content_length CUDA operator, supporting response length limits and injected sequences ( #6493 )
...
* Initial plan
* Migrate PRs #6311 , #6129 , #6305 to develop and merge unit tests
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
* fix
* update
* fix
* fix ci
* fix ci
* Initial plan
* test: add test_chat_with_response_max_tokens to test_EB_VL_Lite_serving.py
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
* test: add disable-thinking case to test_chat_with_response_max_tokens
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
* test: add both reasoning_max_tokens and response_max_tokens case
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
* fix ci
* fix ci
* fix ci
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
2026-02-25 21:36:50 +08:00
sunxin
51f812aaa4
fix empty get_padding_offset ( #6462 )
2026-02-12 12:34:23 +08:00
AIbin
983be007f5
[Feature]support swa & sink Based on appendattn ( #6410 )
...
* support swa & sink Based on appendattn
2026-02-10 18:28:03 +08:00
yzwu
60e75ea8e8
[Iluvatar][CI] Fix cannot import get_stop ( #6165 )
2026-02-10 16:57:23 +08:00
GoldPancake
8b1dd0f360
fix bug ( #6422 )
2026-02-10 14:58:50 +08:00
Mattheliu
c776d483e4
[BugFix]fix handle 4 return values from noaux_tc_redundant op ( #6384 )
...
* fix: handle 4 return values from noaux_tc_redundant op
The noaux_tc_redundant CUDA op is defined with 4 outputs in PD_BUILD_STATIC_OP:
- output_tensor (scores)
- topk_values
- topk_indices
- tokens_per_expert_stats_list_out (inplace updated)
The Python code was only unpacking 3 values, causing:
ValueError: too many values to unpack (expected 3)
This fix correctly unpacks all 4 return values, ignoring the inplace
updated tensor which is the same as the input tokens_per_expert_stats_list.
Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com >
* fix: make noaux_tc_redundant return 4 values to match OP definition
The PD_BUILD_STATIC_OP defines 4 outputs but the function only returned 3,
causing inconsistent behavior across different Paddle framework versions.
This fix explicitly returns 4 values:
- scores (inplace modified)
- topk_values
- topk_indices
- tokens_per_expert_stats_list (inplace modified via atomicAdd)
Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com >
---------
Co-authored-by: Claude (Claude Opus 4.5) <noreply@anthropic.com >
2026-02-09 13:17:47 +08:00
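The unpacking failure described in the commit body above can be reproduced in plain Python. This is a minimal sketch: op_with_four_outputs is a hypothetical stand-in that returns placeholder strings, not the real noaux_tc_redundant op, and the two caller functions are illustrative names only.

```python
def op_with_four_outputs():
    # Hypothetical stand-in for an op declared with four outputs.
    return "scores", "topk_values", "topk_indices", "stats_list"


def unpack_three():
    # Caller written for three outputs: raises ValueError at runtime.
    scores, topk_values, topk_indices = op_with_four_outputs()
    return scores, topk_values, topk_indices


def unpack_four():
    # Caller that accepts all four outputs and discards the last one,
    # which the commit says is the in-place updated input tensor.
    scores, topk_values, topk_indices, _ = op_with_four_outputs()
    return scores, topk_values, topk_indices


try:
    unpack_three()
    error_message = None
except ValueError as exc:
    error_message = str(exc)

result = unpack_four()
```

Binding the fourth value to `_` keeps the call site in step with the declared output count without carrying an unused name forward.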
周周周
2b4748de4f
[MTP] refactor MTP pre_process ( #6358 )
2026-02-09 10:47:15 +08:00
jc
d6b3c722c1
[KVCache] Storage cache supports c8 model ( #6298 )
...
* Refine cache transfer manager
* Storage cache supports c8 model
2026-02-06 12:01:17 +08:00
周周周
e3fb8796b4
Remove useless MTP rebuild_padding code ( #6336 )
2026-02-05 16:28:44 +08:00
chen
29a313a402
[Optimization] Support FA2/FA3/FA4 with attn_mask_q ( #6354 )
...
* support FA4 sm100
* flash attn backend support mask
* flash attn backend run flashmask correct
* add test for flash_attn_backend and flash_attn_func
* check
* add test for fa4
* requirements.txt add fa4 whl
* check test on sm100
* fix CI conflict
* add enable_torch_proxy for flash_mask
* lazy import fa4
* check
* fix tests import
* check test_load_mpt import
2026-02-05 14:39:00 +08:00
lizan1999
72edd394d9
[XPU] support noaux_tc ( #6326 )
2026-02-05 12:04:16 +08:00
fxyfxy777
36547cfdb3
[Feature] FD_USE_PHI_FP8_QUANT ( #6320 )
...
* add ut
* add use_fd_quant env
* rm mask_per_token_quant
* add make ops list
* USE_FD_FP8_QUANT -> FD_USE_PHI_FP8_QUANT, defaults to true
* modify comments
* use bool type
* Add function declaration
2026-02-03 22:33:03 -08:00
周周周
6225439778
add PADDLE_ENFORCE ( #6321 )
2026-02-04 10:47:19 +08:00
JYChen
c745a22420
[Feature] Support Ernie FP8 on sm100 (the fixed version) ( #6304 )
2026-02-03 17:47:38 +08:00