Commit Graph

4818 Commits

Author SHA1 Message Date
gongweibao a6351dea0b [BugFix][Optimization] Replace silent failures with catchable exceptions and informative error messages (#6533)
* init

* init

* fix format

* add

* add files

* add ut

* fix some

* add ut

* add more

* add

* fix pre-commit

* fix pre-commit

* fix cover

* skip long seq

* add

* add

* fix

* remove unneeded code

* fix set attr

* fix comments

* fix comments

* fix failed tests

---------

Co-authored-by: gongweibao <gognweibao@baidu.com>
2026-03-16 21:32:43 +08:00
Jiang-Jia-Jun d113397b09 Simplify available_blocks assignment logic (#6819) 2026-03-16 20:12:30 +08:00
Longzhi Wang 5c92f4d0cd [Feature] Add deepgemm bias epilogue for SM100 (#6857)
* [Feature] Add deepgemm bias epilogue for SM100

* fix
2026-03-16 20:12:00 +08:00
Jiang-Jia-Jun bd4b6092dd Update title and activity section in README_CN.md 2026-03-16 19:21:50 +08:00
Jiang-Jia-Jun c5f402e7aa Update title and release note in README_CN.md 2026-03-16 19:17:38 +08:00
AIbin c9f7f5234e [Optimization][BugFix]Optimize Deepseek networking code (#6861)
* update dsk model

* update dsk model
2026-03-16 16:52:43 +08:00
ming1753 bb925c605f [Other] Adjust GPUModelRunner to enhance compatibility (#6851) 2026-03-16 14:49:19 +08:00
jc 04fde3b227 [PD Disaggregation] Prefill and decode support cache storage (#6768)
* Prefill and decode support cache storage

* up

* up

* update docs and refine mooncake store

* up
2026-03-16 14:44:49 +08:00
mayang002 72ff7bf4cd [XPU] Fix wrapper files (#6830)
- Add WRAPPER_CHECK_PTR for pointer validity checks
- Add WRAPPER_ASSERT_GT/GE/LE for parameter range validation
- Simplify wrapper function calls to direct return pattern
2026-03-16 14:39:40 +08:00
gongweibao 3fabba0dc7 [Feature] Add Triton unified attention kernel for deterministic inference (#6795)
* [Feature] Add Triton unified attention kernel for deterministic inference

Add a Triton-based unified extend attention kernel that processes both
prefix (cached) and extend (new) KV tokens through a single kernel with
unified kv_indices, ensuring identical accumulation order regardless of
cache hit/miss patterns.

Key components:
- _fwd_kernel_unified: Triton JIT kernel with online softmax, paged KV
  cache support, and causal masking for prefix+extend
- Index building utilities: triton_cumsum_with_zero_prefix,
  build_kv_indices_from_block_tables, build_unified_kv_indices,
  _scatter_extend_kv_indices_kernel (all CUDA Graph compatible)
- pre_cache_len_concat_triton: GPU-only replacement for C++ op
- Reference implementations (_ref variants) for correctness validation
- Comprehensive tests: kernel correctness, split invariance,
  determinism, production-scale, cross-validation
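The index-building step listed above can be pictured with a small pure-Python reference. This is a sketch under assumptions, not the FastDeploy or Triton code: the helper names `cumsum_with_zero_prefix` and `build_kv_indices` are hypothetical, and it assumes the usual paged-cache layout in which a token at logical position `pos` lives in slot `block_id * block_size + pos % block_size`, with `block_id = block_tables[req][pos // block_size]`. Because a request's prefix (cached) and extend tokens go through the same mapping, they land in one flat, unified index list.

```python
# Hypothetical pure-Python reference for building unified KV indices.
# Assumption (not taken from the FastDeploy source): physical slot =
# block_id * block_size + offset inside the block, and each request's
# indices are stored back-to-back in one flat list.


def cumsum_with_zero_prefix(lengths):
    """Return [0, l0, l0+l1, ...]; entry i is where request i's slice starts."""
    out = [0]
    for n in lengths:
        out.append(out[-1] + n)
    return out


def build_kv_indices(block_tables, seq_lens, block_size):
    """Map every logical KV position of every request to a physical cache slot."""
    starts = cumsum_with_zero_prefix(seq_lens)
    kv_indices = [0] * starts[-1]
    for req, seq_len in enumerate(seq_lens):
        for pos in range(seq_len):
            # Look up the physical block holding this token, then add the offset.
            block_id = block_tables[req][pos // block_size]
            kv_indices[starts[req] + pos] = block_id * block_size + pos % block_size
    return starts, kv_indices
```

For two requests with `block_size=4`, where request 0 owns physical blocks 2 and 5 and request 1 owns block 0, `build_kv_indices([[2, 5], [0]], seq_lens=[6, 3], block_size=4)` returns starts `[0, 6, 9]` and indices `[8, 9, 10, 11, 20, 21, 0, 1, 2]`.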

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Vectorize causal mask in test references for ~26x speedup

Replace triple Python for-loop with paddle.where vectorized mask in
naive_attention and _build_causal_mask. seq4096 test: 2m39s -> 6s.
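The causal rule and single-pass accumulation this commit relies on can be illustrated in pure Python. The sketch is hypothetical: it assumes a key at unified index `j` is visible to extend query `i` when `j <= prefix_len + i` (every prefix key visible, extend keys causal), and it handles one query and one head. `online_softmax_attention` walks the keys once, in a fixed order, with a running max and running sum (the online-softmax idea the commit names for `_fwd_kernel_unified`); `naive_attention` is a dense reference in the spirit of the test helper of the same name, not a copy of it.

```python
import math

# Hypothetical sketch of prefix+extend causal attention for a single query.
# Assumed visibility rule: key j is visible to extend query i iff j <= prefix_len + i.


def is_visible(j, i, prefix_len):
    # Prefix keys (j < prefix_len) are always visible; extend keys are causal.
    return j <= prefix_len + i


def naive_attention(q, keys, values, i, prefix_len):
    """Dense reference: mask, softmax over all scores, then weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) * scale
              if is_visible(j, i, prefix_len) else -math.inf
              for j, k in enumerate(keys)]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]  # masked keys give exp(-inf) == 0.0
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def online_softmax_attention(q, keys, values, i, prefix_len):
    """One pass over the keys in a fixed order, keeping a running max and sum."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for j, (k, v) in enumerate(zip(keys, values)):
        if not is_visible(j, i, prefix_len):
            continue
        s = sum(a * b for a, b in zip(q, k)) * scale
        new_max = max(running_max, s)
        correction = math.exp(running_max - new_max)  # rescale earlier partial sums
        p = math.exp(s - new_max)
        running_sum = running_sum * correction + p
        acc = [a * correction + p * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

For a fixed key order the two functions agree to within floating-point rounding. A kernel that always traverses the keys in the same order, whether or not they were cache hits, also performs the same sequence of floating-point operations, which is the property the commit ties to deterministic output.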

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix cover

---------

Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 14:29:45 +08:00
Yonghua Li 7c8c0a3c02 [BugFix] replace ftok with custom_ftok in get_output/save_output ops (#6822)
* [BugFix] replace ftok with custom_ftok in get_output/save_output ops

* [Test] add unit test for custom_ftok

* [Chore] create custom_ftok.h

* [Chore] reorganize header file

* [Fix] fix cache messager msg_queue_id+rank_id conflict
2026-03-16 14:22:18 +08:00
fxyfxy777 4d39232553 [BugFix] add ut for fused_moe_degemm (#6840)
* add ut

* add skip
2026-03-16 12:22:18 +08:00
周周周 091e3c815d DSA clean code, add dsk_attn_write_cache baseline (#6855) 2026-03-16 11:01:14 +08:00
周周周 820eb60ec6 [Others] clean code (#6839)
Co-authored-by: “liuruian” <liuruian@baidu.com>
2026-03-14 11:09:28 +08:00
yinwei 3f4441b4b7 [XPU]add mtp cudagraph support (#6831) 2026-03-13 19:46:53 +08:00
cmcamdy 7591e0d6bc fix eb5 mtp(mix) (#6800) 2026-03-13 17:36:57 +08:00
周周周 8c1a2827d3 DSA clean code (#6827) 2026-03-13 16:39:47 +08:00
mouxin 49fe68a518 [Docs] Update Golang Router FAQ (#6829) 2026-03-13 15:48:36 +08:00
freeliuzc 12f412448b [Speculative Decoding] Fix speculate stop_seqs and fix accept_num in eos branch (#6825) 2026-03-12 23:48:24 -07:00
gongweibao 8906e09e0f [Feature][OP] Add batch-invariant RMSNorm kernel and TP embedding Custom AR path (#6749)
* [Feature] Add batch-invariant RMSNorm kernel and TP embedding Custom AR path

- Add Triton-based rms_norm_batch_invariant kernel for M-invariant RMSNorm
- Add linear/linear_v2 tracking wrappers in batch_invariant_mode
- Route TP VocabParallelEmbedding through Custom AR instead of NCCL
- Increase FD_CUSTOM_AR_MAX_SIZE_MB default from 8 to 64
- Add unit tests for RMSNorm and TP embedding invariance
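Batch (M) invariance for RMSNorm means a row's output is bitwise identical whether that row is normalized alone or inside a larger batch. A minimal pure-Python sketch, assuming the standard formula `y = x * w / sqrt(mean(x^2) + eps)`: each row is reduced on its own in a fixed order, so the batch size has no way to change its result. The function names are hypothetical; this is not the Triton `rms_norm_batch_invariant` kernel.

```python
import math

# Hypothetical sketch of an M-invariant RMSNorm. Each row is reduced by
# itself, left to right, so its result never depends on how many rows (M)
# are in the batch.


def rms_norm_row(x, weight, eps=1e-6):
    sq_sum = 0.0
    for v in x:  # fixed reduction order, independent of batch size
        sq_sum += v * v
    inv_rms = 1.0 / math.sqrt(sq_sum / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def rms_norm_batch(rows, weight, eps=1e-6):
    # A GPU kernel that changed its reduction strategy with M (for example,
    # splitting one row's sum across several blocks) could round differently
    # for different batch sizes; this sketch never splits a row.
    return [rms_norm_row(r, weight, eps) for r in rows]
```

Comparing `rms_norm_batch(rows, w)[k]` with `rms_norm_row(rows[k], w)` using exact equality, rather than a tolerance, is the kind of check an invariance test needs.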

* [Fix] Fix test tolerances for bfloat16 RMSNorm and custom AR buffer size

- Relax bfloat16 atol from 1e-3 to 1e-2 for D=3584 in RMSNorm numerical
  correctness test (0.0078125 diff is expected at bfloat16 precision)
- Update test_communication expected buffer size from 8MB to 64MB to match
  FD_CUSTOM_AR_MAX_SIZE_MB default change in envs.py
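The 0.0078125 quoted above equals 2^-7, the gap between adjacent bfloat16 values in [1, 2), because bfloat16 keeps only 7 explicit mantissa bits. Assuming the differing outputs have magnitude in that range, a single-ULP disagreement already exceeds atol=1e-3 but stays under 1e-2. The helper below is a hypothetical illustration (not part of FastDeploy) that rounds a float to bfloat16 with round-half-to-even using only the standard library.

```python
import struct


def to_bfloat16(x):
    """Round a Python float to the nearest bfloat16 value (ties to even).

    Sketch only: NaN and Inf handling is omitted.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    lsb = (bits >> 16) & 1                 # lowest bit that bfloat16 keeps
    bits += 0x7FFF + lsb                   # round to nearest, ties to even
    bits &= 0xFFFF0000                     # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

With this helper, `to_bfloat16(1.003)` is `1.0` and `to_bfloat16(1.005)` is `1.0078125`: inputs only 0.002 apart can land one ULP (2^-7) apart after rounding.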

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add RMSNorm layer batch_invariant_mode unit test for coverage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add pragma no cover for Triton kernel and multi-GPU embedding path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 14:34:44 +08:00
fxyfxy777 8eb177147c [BugFix]rm draft code for glm (#6810)
* rm draft code for glm

* fix baseline

* fix baseline 2
2026-03-12 23:26:05 -07:00
AIbin 2b8a5b0d81 update indexer model (#6791) 2026-03-13 14:11:39 +08:00
kesmeey d935752be7 [CI] [Hackathon 10th Spring No.20] Add unit tests for the fastdeploy/engine/common_engine.py module (#6292)
* style: format tests/engine/test_common_engine.py with black

* test: expand common engine coverage

* test: add coverage helper for common_engine

* style: format test_common_engine with pre-commit

* Remove test_force_coverage_for_common_engine test

* Update common engine coverage tests

Expand common engine tests and helpers while
aligning setup and cleanup behavior.


* Fix test_schedule_request_to_worker_v1 by mocking num_tasks to return 0

* Sync test_common_engine with branch 26

* chore: fix codestyle in common engine tests

---------

Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com>
2026-03-13 13:16:07 +08:00
liufengwei0103 62110045f3 [RL] add stream guard (#6814)
* add stream guard

* format
2026-03-13 11:22:26 +08:00
bukejiyu 586e6f38b1 [Others]Limit transformers version (#6806) 2026-03-12 20:20:15 -07:00
MingkunZhang cb5a742298 [Metax][Test] enable paddleocr using cudagraph (#6820) 2026-03-13 10:47:25 +08:00
mayang002 1f9f889e37 [XPU] refactor: XPU plugin namespace migration (#6799)
* [XPU] refactor: XPU plugin namespace migration

- Migrate wrapper layer namespace from baidu::xpu::api::plugin to fastdeploy::plugin
- Migrate kernel layer namespace from xpu3::plugin to fd_xpu3
- Add api:: prefix for types (Context, SUCCESS, XPUIndexType, ctx_guard)
- Remove XPU2 support, keep only XPU3
- Update ops/ directory to use new namespace

Total: 137 files changed

* [XPU] fix: add return value check and correct error messages

- Add PADDLE_ENFORCE_XDNN_SUCCESS check for speculate_get_logits and update_attn_mask_offsets
- Fix empty error message in draft_model_postprocess
- Correct function name in speculate_schedule_cache error message
- Update error messages from 'xpu::plugin::' to 'fastdeploy::plugin::'
2026-03-13 10:21:51 +08:00
YuBaoku d73fd876ba [CI] Add daily build_linux jobs for CUDA 13.0 (#6809) 2026-03-12 22:04:58 +08:00
YuBaoku ab0eacb1ab [CI] Update _build_linux_rl.yml to use Paddle installation method with URL 2026-03-12 20:37:51 +08:00
huicongyao 2e63d88f7a [Optimization][Speculative Decoding]Fuse padding sampling params (#6765)
* optimize speculate pre process unit test

* Add CUDA kernel for building sampling params in speculative decoding

* init infer seed in device

* format code

* add unittest & fix

* fix

* format-code

* format-code

* fix rebase

* .

* fix unit test
2026-03-12 05:05:15 -07:00
MingkunZhang a9ace998db [Metax][Fix] fix CI error caused by PR #6685, based on PR #6805 (#6807) 2026-03-12 19:30:16 +08:00
yzwu 901b38c936 [Iluvatar] Optimize decode group_gemm and Support cuda graph for ernie (#6803) 2026-03-12 19:21:17 +08:00
fxyfxy777 250ce40b40 [Feature] use phi permute/unpermute & rm swiglu (#6361)
* TP text output is correct

* eb5 mini text output is correct on B cards

* eb5mini EP on B cards: text output is correct

* default use phi moe op

* stash

* TP works correctly on H cards

* ep ok

* rm debug

* rm debug tool

* rm del ffn_out

* rm swiglu

* add envs to swiglu

* merge dev

* fix ci baseline

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix ci baseline 2

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 02:01:57 -07:00
Jiaxin Sui a3d7979711 [XPU][CI]Rename test_ep4tp1_online.py to run_ep4tp1_online.py (#6805) 2026-03-12 16:16:20 +08:00
RAM cdaf6dd400 [RL][Cherry-Pick] Support Fully Async and PrefixCache (#6599)
* cherry-pick Support Fully Async and PrefixCache step 1

* copy routing_indices_cache.py from 2.4

* cherry-pick [RL] R3 Fix the bug for determining the end of a request (#6388)

* cherry-pick [RL] Clear Requests status of R3 (#6569)

* delete code

* fix rename bug

* fix status shape bug

* fix ci
2026-03-12 01:13:30 -07:00
mouxin 1ed6073d94 [Feature] Update logging for Golang Router (#6801) 2026-03-12 15:18:31 +08:00
qwes5s5 e0febf36be fix debug log (#6766) 2026-03-12 14:46:01 +08:00
cmcamdy 3543088d3e [XPU] rm stop nums (#6651)
* rm stop nums

* fix conflict

---------

Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
2026-03-12 14:05:58 +08:00
yinwei 7d31a728d1 Add PD+EP cudagraph Support 2026-03-12 13:20:59 +08:00
Jiang-Jia-Jun 1fef825997 Fix environment variable name for KV cache lock 2026-03-12 11:24:07 +08:00
YuBaoku deff121a5f [CI] Update _build_linux_rl.yml to use cu129 nightly 2026-03-11 23:58:07 +08:00
yzwu f0ab8ee793 [Iluvatar][CI] add triton in requirements_iluvatar.txt (#6788) 2026-03-11 20:39:03 +08:00
Jiajun Ji 88c4fbf8e1 [XPU] Add speculate_limit_thinking_content_length Op. (#6627)
* [XPU] Add speculate_limit_thinking_content_length OP for xpu.

* add unittest.

* format code.

* format code.

* format code.

* Fix unused kernel launch return value.

---------

Co-authored-by: cmcamdy <1027740945@qq.com>
2026-03-11 17:30:17 +08:00
RichardWooSJTU 9f0778f991 [Feature] Support EP prefill with num_worst_tokens (#6574)
* support num worst tokens

* support num worst tokens

* fix build error

* support num worst tokens: fix errors

* support num worst tokens: fix field

* support num worst tokens: delete requirements

* replace permute and depermute ops with pure CUDA

* replace permute and depermute ops with pure CUDA

* fix ci

* fix op

* fix nan

* fix code style

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-03-11 17:09:07 +08:00
jc 0466c7e8a8 Set MC_TCP_BIND_ADDRESS for mooncake store (#6782) 2026-03-11 16:56:39 +08:00
AIbin 1118351b27 [Optimization] Update Deepseekv3.2 model and dsa-indexer networking and add some unit tests (#6762)
* add deepseek model doc

* update deepseek model doc

* update deepseek model doc

* update deepseek model doc

* cwb support DSK_V32 model

* update DSK_V32_DSA modeling

* Ibin Support DSK_DSA

* update kernel

* update yaml

* update requirements

* update pre_commit

* update model-runner

* fix CI bug

* del start.sh

* fix iluvatar_model_runner

* update DSA & add unit tests

* update import deep_gemm
2026-03-11 15:52:54 +08:00
CSWYF3634076 97a4b3631e [Processor]add qwen3vl prompt_token_ids support (#6764)
* [Processor]add qwen3vl prompt_token_ids support

* [Processor]add qwen3vl prompt_token_ids support unittest

* [Processor]add qwen3vl prompt_token_ids support precommit
2026-03-11 15:08:56 +08:00
bukejiyu cffa8c246c [Others]update paddleformer 1.0.0 (#6496)
* update paddleformer 1.0.0

* update
2026-03-11 15:06:29 +08:00
Yonghua Li 7811eeccaa [fix] resolve get_save_output_v1 socket name conflicts between multiple instances (#6758) 2026-03-11 15:02:32 +08:00
freeliuzc cf7934a4b2 [Speculative Decoding] Unify Spec and non-spec branch (#6685)
* optimize spec-inference architecture

* delete debug log

* optimize spec_method usage && fix unit_test

* add claude unit-test skill

* fix some ugly bugs

* enhance robustness and bounds check

* unify method & spec_method to method to avoid bug

* activate CI

* fix unit test

* Unify logprobs computation for naive and speculative decoding, fix CUDA kernel

* fix logprob bug && optimize verify kernel

* fix exist_decode() check
2026-03-10 23:58:44 -07:00