461 Commits

Author SHA1 Message Date
sunxin 7a78001be2 fix execute_model_normal in empty run (#6968) 2026-03-23 14:07:46 +08:00
周周周 1c38da2118 Make seq_lens_this_time/decoder/encoder equal shape (#6942) 2026-03-20 15:31:52 +08:00
yzwu 8b890c0d72 [Iluvatar] refactor attn and moe code (#6887) 2026-03-18 10:31:00 +08:00
qwes5s5 3b7507a4c2 test_abort (#6743) 2026-03-17 14:06:40 +08:00
huicongyao eab429d05e fix performance drop while no spec (#6866) 2026-03-17 13:06:36 +08:00
gongweibao a6351dea0b [BugFix][Optimization] Replace silent failures with catchable exceptions and informative error messages (#6533)
* init

* init

* fix format

* add

* add files

* add ut

* fix some

* add ut

* add more

* add

* fix pre-commit

* fix pre-commit

* fix cover

* skip long seq

* add

* add

* fix

* remove not need

* fix set attr

* fix comments

* fix comments

* fix failed tests

---------

Co-authored-by: gongweibao <gognweibao@baidu.com>
2026-03-16 21:32:43 +08:00
ming1753 bb925c605f [Other] Adjust GPUModelRunner to enhance compatibility (#6851) 2026-03-16 14:49:19 +08:00
huicongyao 2e63d88f7a [Optimization][Speculative Decoding]Fuse padding sampling params (#6765)
* optimize speculate pre process unit test

* Add CUDA kernel for building sampling params in speculative decoding

* init infer seed in device

* format code

* add unittest & fix

* fix

* format-code

* format-code

* fix rebase

* .

* fix unittest
2026-03-12 05:05:15 -07:00
MingkunZhang a9ace998db [Metax][Fix] fix ci error based pr#6805 caused by pr#6685 (#6807) 2026-03-12 19:30:16 +08:00
RAM cdaf6dd400 [RL][Cherry-Pick] Support Fully Async and PrefixCache (#6599)
* cherry-pick  Support Fully Async and PrefixCache step 1

* copy routing_indices_cache.py from 2.4

* cherry-pick [RL] R3 Fix the bug for determining the end of a request (#6388)

* cherry-pick [RL] Clear Requests status of R3 (#6569)

* delete code

* fix rename bug

* fix status shape bug

* fix ci
2026-03-12 01:13:30 -07:00
cmcamdy 3543088d3e [XPU] rm stop nums (#6651)
* rm stop nums

* fix conflict

---------

Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
2026-03-12 14:05:58 +08:00
RichardWooSJTU 9f0778f991 [Feature] Support EP prefill with num_worst_tokens (#6574)
* support num worst tokens

* support num worst tokens

* fix build error

* support num worst tokens: fix errors

* support num worst tokens: fix field

* support num worst tokens: delete requirements

* replace permute and depermute op by pure cuda

* replace permute and depermute op by pure cuda

* fix ci

* fix op

* fix nan

* fix code style

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-03-11 17:09:07 +08:00
Yonghua Li 7811eeccaa [fix] resolve get_save_output_v1 socket name conflicts between multiple instances (#6758) 2026-03-11 15:02:32 +08:00
freeliuzc cf7934a4b2 [Speculative Decoding] Unify Spec and non-spec branch (#6685)
* optimize spec-inference architecture

* delete debug log

* optimize spec_method usage  && fix unit_test

* add claude unit-test skill

* fix some ugly bug

* enhance robustness and bounds check

* unify method & spec_method to method to avoid bug

* activate CI

* fix unit test

* Unify logprobs computation for naive and speculative decoding, fix CUDA kernel

* fix logprob bug && optimize verify kernel

* fix exist_decode() judge
2026-03-10 23:58:44 -07:00
Jiang-Jia-Jun b05a6c4206 [BugFix][KVCache] Add inter-process lock to fix NaN error under DP+EP (#6724)
* [BugFix] Support  to fix NaN bug in EP

* Optimize notion for all the funs

* Fix potential lock contention failure issues

* Update fastdeploy/inter_communicator/ipc_signal.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update envs.py

* Update default value for USE_KVCACHE_LOCK

Change default value of USE_KVCACHE_LOCK from 1 to 0.

* Update worker_process.py

* Fix suffix wrong

* Update test_prefix_cache_manager.py

---------

Co-authored-by: Jiang-Jia-Jun <jiangjiajun@baidu.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-03-10 21:55:32 +08:00
sunxin 812657beee fix pd overlap (#6753) 2026-03-10 20:29:54 +08:00
zhupengyang 18b0716ddb [XPU] fix wint4 (#6757) 2026-03-10 19:50:31 +08:00
jc 79ad949594 [BugFix] Fix updating weight when enable cache storage (#6719)
* Fix updating weight when enable cache storage

* up

* up
2026-03-10 16:49:16 +08:00
AIbin 54581b8653 [BugFix]fix iluvatar_model_runner about dsa_cache (#6733)
* fix iluvatar_model_runner
2026-03-10 16:10:35 +08:00
AIbin c3aceb6bdc [Models][OP][Optimization] Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM (#6689)
* Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM
2026-03-10 15:05:14 +08:00
sunxin 28f7727a3d [Feature] Set overlap schedule as default (#6668)
* overlap default
2026-03-09 22:34:54 +08:00
zccjjj ae71ada6fe reduce warmup input_length for cudagraph (#6701) 2026-03-09 14:06:43 +08:00
gongweibao 30f9f33f34 [Feature][BugFix][OP] Enhance Deterministic Inference Mode with Kernel-level Fixes and Batch-invariant BMM (#6610)
* add fa deter

* add ut

* add long sentence

* fix basic

* fix bugs

* fix adn

* fix first

* fix single

* fix single

* fix single test

* refine

* add more test

* refine comments

* add comments of bmm

* fix ci

* remove probe

* add

* remove not need

* refine tests

* fix comments and refine code

* refine code

* refine test

* refine test

* mv 4cards tests

* fix tests

* add

* fix comments

* fix cover

* fix cover

---------

Co-authored-by: gongweibao <gognweibao@baidu.com>
2026-03-09 10:27:53 +08:00
yzwu 81acdb62bd [Iluvatar][CI] Do not specify FD_LOG_DIR (#6665) 2026-03-06 11:54:44 +08:00
jc b0fd242add [BugFix] Fix error in dynamic c8 cache (#6544)
* [BugFix] Fix error in dynamic c8 cache

* fix device id
2026-03-06 10:11:23 +08:00
sunxin 839bc834eb [BugFix] Fix EB5 model runner compatibility check in worker process (#6673) 2026-03-05 19:49:28 +08:00
sunxin a79b82ce68 [BugFix] fix seq_lens_this_time init (#6670) 2026-03-05 17:07:26 +08:00
sunxin 0dc7034ce0 [Model Runner] Deprecate not_need_stop (#6356)
* Deprecate not_need_stop
2026-03-05 10:55:42 +08:00
ming1753 02d32eea3b Revert "[Bug Fix] Fix MM mtp incorrect rope emb (#6581)" (#6631)
This reverts commit c5eb6b65e7.
2026-03-04 11:23:28 +08:00
sunxin aee97e3aae fix exist_prefill_flag when preempted task (#6629) 2026-03-04 11:11:40 +08:00
MingkunZhang e8e18cecce [Metax][Fix] fix ci error based pr#6501 (#6636) 2026-03-04 11:09:57 +08:00
cmcamdy 29d9cb10e9 fix tp4 dp1 (#6624) 2026-03-04 10:12:34 +08:00
ming1753 c5eb6b65e7 [Bug Fix] Fix MM mtp incorrect rope emb (#6581)
* [Bug Fix] Fix MM mtp incorrect rope emb
2026-03-03 19:28:59 +08:00
qwes5s5 375b5b7b21 [Feature]Log Format Normalization and Trace Log Optimization (#6370)
* log refactor

* log refactor 2

* log refactor 3
2026-03-03 11:31:45 +08:00
周周周 3cc09418f1 support dsv3 use flashmla (#6593) 2026-03-03 11:09:43 +08:00
huicongyao 0f718baaf2 [Speculative Decoding]Reformat input preprocess for spec decode (#6501)
* add speculate_pre_process kernel

* reduce one slice

* make d2h async && fix mtp bug for new pre_process

* fix

* add unittest

* fix: code style formatting

* fix

* fix: thread race in speculate_preprocess && rename d2h event
2026-03-03 10:22:07 +08:00
ming1753 344db8c8af [BugFix] Fix mtp when token_ids_all is None (#6591)
* [BugFix] Fix mtp when token_ids_all is None

* fix bug
2026-03-02 01:23:44 -08:00
yzwu 6674131b0b [Iluvatar] Support CudaGraph and optimize flash_attn_unpadded and fused_neox_rope_embedding (#6553) 2026-03-02 14:07:17 +08:00
周周周 d957ccd46d seq_lens related tensor shape -> [max_num_seqs] (#6535) 2026-03-02 11:18:30 +08:00
MingkunZhang 16a2a323eb [Metax][Fix] fix error based pr#6407 (#6584) 2026-03-02 10:55:39 +08:00
zccjjj a2072fe20c [XPU] support warmup with ep & remove apply_tp_fused_op (#6289) 2026-02-28 15:40:36 +08:00
ming1753 97eee75677 [Feature] GPU Memory Optimization and Retirement of V0 Scheduler (#6407)
* Optim GPU Mem Usage

---------

Co-authored-by: huzesen <huzesen@baidu.com>
2026-02-28 15:07:43 +08:00
cmcamdy 13447279aa [XPU] Fix PD + MTP (#6495)
* fix pd + mtp

* fix code style

* fix PD + MTP, D get P's first token

* add anno for gpu(speculate_update)

* update draft insertv1

* fix wrapper & kernel

* fix wrapper

* fix code style
2026-02-27 19:07:35 +08:00
sunxin 53aaac69da [Optimization] Enable BF16 gate computation for GLM and Qwen (#6457)
* gate bf16

* add gate-fp32

* fix

* update baseline

* update

* update

* fix
2026-02-26 21:08:46 -08:00
gongweibao edd31e8849 [Feature] Add Deterministic Inference Support (#6476)
* add

* [tests] Add Paddle attention determinism tests and refactor resource manager

Add comprehensive determinism tests for Paddle attention layer and refactor
resource manager for deterministic mode support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* add

* add

* add

* add

* add more

* add more

* fix some

* fix some

* fix bugs

* fix bugs

* only in gpu

* add docs

* fix comments

* fix some

* fix some

* fix comments

* add more

* fix potential problem

* remove not need

* remove not need

* remove no need

* fix bug

* fix bugs

* fix comments

* fix comments

* Update tests/ce/deterministic/test_determinism_verification.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tests/inter_communicator/test_ipc_signal.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tests/layers/test_paddle_attention_determinism.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tests/engine/test_sampling_params_determinism.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tests/layers/test_paddle_attention_determinism.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tests/layers/test_paddle_attention_determinism_standalone.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix comments

* fix import error

* fix a bug

* fix bugs

* fix bugs

* fix coverage

* refine codes

* refine code

* fix comments

* fix comments

* fix comments

* rm not need

* fix allreduce large tensor bug

* mv log files

* mv log files

* add files

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-02-26 19:31:51 -08:00
MingkunZhang c369f7139f [Metax][Fix] fix error based pr #6493 (#6521) 2026-02-26 18:41:35 +08:00
GoldPancake 2178f2829b [Speculative Decoding] Support suffix decoding (#6403)
* support suffix decoding
2026-02-26 11:42:05 +08:00
Yuanle Liu 6d3fede240 [OP][Feature] Unify the limit_thinking_content_length CUDA operator, supporting response length limits and injected sequences (#6493)
* Initial plan

* Migrate PRs #6311, #6129, #6305 to develop and merge unit tests

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* fix

* update

* fix

* fix ci

* fix ci

* Initial plan

* test: add test_chat_with_response_max_tokens to test_EB_VL_Lite_serving.py

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* test: add disable-thinking case to test_chat_with_response_max_tokens

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* test: add both reasoning_max_tokens and response_max_tokens case

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* fix ci

* fix ci

* fix ci

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>
2026-02-25 21:36:50 +08:00
jackyYang6 a29ee57e15 [Feature] Support ThinkingBudget Logits processor to control thinking content length (#6367)
* feat: add thinking budget logits processor

* add unittest

* fix pre-commit

* add unittest

* docs: clarify operator-level vs logits processor usage and conflict guidance

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-02-25 14:17:09 +08:00
Yonghua Li e2332a1112 [BugFix] fix num_cpu_blocks computation (#6438)
* [BugFix] fix num_cpu_blocks computation

* [fix] fix syntax and log

* [fix] pre-commit

* [fix] use getattr

* [fix] ci test
2026-02-13 11:05:14 +08:00