sunxin
7a78001be2
fix execute_model_normal in empty run ( #6968 )
2026-03-23 14:07:46 +08:00
周周周
1c38da2118
Make seq_lens_this_time/decoder/encoder equal shape ( #6942 )
2026-03-20 15:31:52 +08:00
yzwu
8b890c0d72
[Iluvatar] refactor attn and moe code ( #6887 )
2026-03-18 10:31:00 +08:00
qwes5s5
3b7507a4c2
test_abort ( #6743 )
2026-03-17 14:06:40 +08:00
huicongyao
eab429d05e
fix performance drop while no spec ( #6866 )
2026-03-17 13:06:36 +08:00
gongweibao
a6351dea0b
[BugFix][Optimization] Replace silent failures with catchable exceptions and informative error messages ( #6533 )
...
* init
* init
* fix format
* add
* add files
* add ut
* fix some
* add ut
* add more
* add
* fix pre-commit
* fix pre-commit
* fix cover
* skip long seq
* add
* add
* fix
* remove not need
* fix set attr
* fix comments
* fix comments
* fix failed tests
---------
Co-authored-by: gongweibao <gognweibao@baidu.com >
2026-03-16 21:32:43 +08:00
ming1753
bb925c605f
[Other] Adjust GPUModelRunner to enhance compatibility ( #6851 )
2026-03-16 14:49:19 +08:00
huicongyao
2e63d88f7a
[Optimization][Speculative Decoding]Fuse padding sampling params ( #6765 )
...
* optimize speculate pre process unit test
* Add CUDA kernel for building sampling params in speculative decoding
* init infer seed in device
* format code
* add unittest & fix
* fix
* format-code
* format-code
* fix rebase
* .
* fix unit test
2026-03-12 05:05:15 -07:00
MingkunZhang
a9ace998db
[Metax][Fix] fix ci error based pr#6805 caused by pr#6685 ( #6807 )
2026-03-12 19:30:16 +08:00
RAM
cdaf6dd400
[RL][Cherry-Pick] Support Fully Async and PrefixCache ( #6599 )
...
* cherry-pick Support Fully Async and PrefixCache step 1
* copy routing_indices_cache.py from 2.4
* cherry-pick [RL] R3 Fix the bug for determining the end of a request (#6388 )
* cherry-pick [RL] Clear Requests status of R3 (#6569 )
* delete code
* fix rename bug
* fix status shape bug
* fix ci
2026-03-12 01:13:30 -07:00
cmcamdy
3543088d3e
[XPU] rm stop nums ( #6651 )
...
* rm stop nums
* fix conflict
---------
Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com >
2026-03-12 14:05:58 +08:00
RichardWooSJTU
9f0778f991
[Feature] Support EP prefill with num_worst_tokens ( #6574 )
...
* support num worst tokens
* support num worst tokens
* fix build error
* support num worst tokens: fix errors
* support num worst tokens: fix field
* support num worst tokens: delete requirements
* replace permute and depermute op by pure cuda
* replace permute and depermute op by pure cuda
* fix ci
* fix op
* fix nan
* fix code style
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2026-03-11 17:09:07 +08:00
Yonghua Li
7811eeccaa
[fix] resolve get_save_output_v1 socket name conflicts between multiple instances ( #6758 )
2026-03-11 15:02:32 +08:00
freeliuzc
cf7934a4b2
[Speculative Decoding] Unify Spec and non-spec branch ( #6685 )
...
* optimize spec-inference architecture
* delete debug log
* optimize spec_method usage && fix unit_test
* add claude unit-test skill
* fix some ugly bug
* enhance robustness and bounds check
* unify method & spec_method to method to avoid bug
* activate CI
* fix unit test
* Unify logprobs computation for naive and speculative decoding, fix CUDA kernel
* fix logprob bug && optimize verify kernel
* fix exist_decode() judge
2026-03-10 23:58:44 -07:00
Jiang-Jia-Jun
b05a6c4206
[BugFix][KVCache] Add inter-process lock to fix NaN error under DP+EP ( #6724 )
...
* [BugFix] Support to fix NaN bug in EP
* Optimize notion for all the functions
* Fix potential lock contention failure issues
* Update fastdeploy/inter_communicator/ipc_signal.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update envs.py
* Update default value for USE_KVCACHE_LOCK
Change default value of USE_KVCACHE_LOCK from 1 to 0.
* Update worker_process.py
* Fix wrong suffix
* Update test_prefix_cache_manager.py
---------
Co-authored-by: Jiang-Jia-Jun <jiangjiajun@baidu.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2026-03-10 21:55:32 +08:00
sunxin
812657beee
fix pd overlap ( #6753 )
2026-03-10 20:29:54 +08:00
zhupengyang
18b0716ddb
[XPU] fix wint4 ( #6757 )
2026-03-10 19:50:31 +08:00
jc
79ad949594
[BugFix] Fix updating weight when cache storage is enabled ( #6719 )
...
* Fix updating weight when cache storage is enabled
* up
* up
2026-03-10 16:49:16 +08:00
AIbin
54581b8653
[BugFix]fix iluvatar_model_runner about dsa_cache ( #6733 )
...
* fix iluvatar_model_runner
2026-03-10 16:10:35 +08:00
AIbin
c3aceb6bdc
[Models][OP][Optimization] Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM ( #6689 )
...
* Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM
2026-03-10 15:05:14 +08:00
sunxin
28f7727a3d
[Feature] Set overlap schedule as default ( #6668 )
...
* overlap default
2026-03-09 22:34:54 +08:00
zccjjj
ae71ada6fe
reduce warmup input_length for cudagraph ( #6701 )
2026-03-09 14:06:43 +08:00
gongweibao
30f9f33f34
[Feature][BugFix][OP] Enhance Deterministic Inference Mode with Kernel-level Fixes and Batch-invariant BMM ( #6610 )
...
* add fa deter
* add ut
* add long sentence
* fix basic
* fix bugs
* fix adn
* fix first
* fix single
* fix single
* fix single test
* refine
* add more test
* refine comments
* add comments of bmm
* fix ci
* remove probe
* add
* remove not need
* refine tests
* fix comments and refine code
* refine code
* refine test
* refine test
* mv 4cards tests
* fix tests
* add
* fix comments
* fix cover
* fix cover
---------
Co-authored-by: gongweibao <gognweibao@baidu.com >
2026-03-09 10:27:53 +08:00
yzwu
81acdb62bd
[Iluvatar][CI] Do not specify FD_LOG_DIR ( #6665 )
2026-03-06 11:54:44 +08:00
jc
b0fd242add
[BugFix] Fix error in dynamic c8 cache ( #6544 )
...
* [BugFix] Fix error in dynamic c8 cache
* fix device id
2026-03-06 10:11:23 +08:00
sunxin
839bc834eb
[BugFix] Fix EB5 model runner compatibility check in worker process ( #6673 )
2026-03-05 19:49:28 +08:00
sunxin
a79b82ce68
[BugFix] fix seq_lens_this_time init ( #6670 )
2026-03-05 17:07:26 +08:00
sunxin
0dc7034ce0
[Model Runner] Deprecate not_need_stop ( #6356 )
...
* Deprecate not_need_stop
2026-03-05 10:55:42 +08:00
ming1753
02d32eea3b
Revert "[Bug Fix] Fix MM mtp incorrect rope emb ( #6581 )" ( #6631 )
...
This reverts commit c5eb6b65e7 .
2026-03-04 11:23:28 +08:00
sunxin
aee97e3aae
fix exist_prefill_flag when preempted task ( #6629 )
2026-03-04 11:11:40 +08:00
MingkunZhang
e8e18cecce
[Metax][Fix] fix ci error based pr#6501 ( #6636 )
2026-03-04 11:09:57 +08:00
cmcamdy
29d9cb10e9
fix tp4 dp1 ( #6624 )
2026-03-04 10:12:34 +08:00
ming1753
c5eb6b65e7
[Bug Fix] Fix MM mtp incorrect rope emb ( #6581 )
...
* [Bug Fix] Fix MM mtp incorrect rope emb
2026-03-03 19:28:59 +08:00
qwes5s5
375b5b7b21
[Feature]Log Format Normalization and Trace Log Optimization ( #6370 )
...
* log refactor
* log refactor 2
* log refactor 3
2026-03-03 11:31:45 +08:00
周周周
3cc09418f1
support dsv3 use flashmla ( #6593 )
2026-03-03 11:09:43 +08:00
huicongyao
0f718baaf2
[Speculative Decoding]Reformat input preprocess for spec decode ( #6501 )
...
* add speculate_pre_process kernel
* reduce one slice
* make d2h async && fix mtp bug for new pre_process
* fix
* add unit test
* fix: code style formatting
* fix
* fix: thread race in speculate_preprocess && rename d2h event
2026-03-03 10:22:07 +08:00
ming1753
344db8c8af
[BugFix] Fix mtp when token_ids_all is None ( #6591 )
...
* [BugFix] Fix mtp when token_ids_all is None
* fix bug
2026-03-02 01:23:44 -08:00
yzwu
6674131b0b
[Iluvatar] Support CudaGraph and optimize flash_attn_unpadded and fused_neox_rope_embedding ( #6553 )
2026-03-02 14:07:17 +08:00
周周周
d957ccd46d
seq_lens related tensor shape -> [max_num_seqs] ( #6535 )
2026-03-02 11:18:30 +08:00
MingkunZhang
16a2a323eb
[Metax][Fix] fix error based pr#6407 ( #6584 )
2026-03-02 10:55:39 +08:00
zccjjj
a2072fe20c
[XPU] support warmup with ep & remove apply_tp_fused_op ( #6289 )
2026-02-28 15:40:36 +08:00
ming1753
97eee75677
[Feature] GPU Memory Optimization and Retirement of V0 Scheduler ( #6407 )
...
* Optim GPU Mem Usage
---------
Co-authored-by: huzesen <huzesen@baidu.com >
2026-02-28 15:07:43 +08:00
cmcamdy
13447279aa
[XPU] Fix PD + MTP ( #6495 )
...
* fix pd + mtp
* fix code style
* fix PD + MTP, D get P's first token
* add anno for gpu(speculate_update)
* update draft insertv1
* fix wrapper & kernel
* fix wrapper
* fix code style
2026-02-27 19:07:35 +08:00
sunxin
53aaac69da
[Optimization] Enable BF16 gate computation for GLM and Qwen ( #6457 )
...
* gate bf16
* add gate-fp32
* fix
* update baseline
* update
* update
* fix
2026-02-26 21:08:46 -08:00
gongweibao
edd31e8849
[Feature] Add Deterministic Inference Support ( #6476 )
...
* add
* [tests] Add Paddle attention determinism tests and refactor resource manager
Add comprehensive determinism tests for Paddle attention layer and refactor
resource manager for deterministic mode support.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com >
* add
* add
* add
* add
* add more
* add more
* fix some
* fix some
* fix bugs
* fix bugs
* only in gpu
* add docs
* fix comments
* fix some
* fix some
* fix comments
* add more
* fix potential problem
* remove not need
* remove not need
* remove no need
* fix bug
* fix bugs
* fix comments
* fix comments
* Update tests/ce/deterministic/test_determinism_verification.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/inter_communicator/test_ipc_signal.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/layers/test_paddle_attention_determinism.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/engine/test_sampling_params_determinism.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/layers/test_paddle_attention_determinism.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/layers/test_paddle_attention_determinism_standalone.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* fix comments
* fix import error
* fix a bug
* fix bugs
* fix bugs
* fix coverage
* refine codes
* refine code
* fix comments
* fix comments
* fix comments
* rm not need
* fix allreduce large tensor bug
* mv log files
* mv log files
* add files
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2026-02-26 19:31:51 -08:00
MingkunZhang
c369f7139f
[Metax][Fix] fix error based pr #6493 ( #6521 )
2026-02-26 18:41:35 +08:00
GoldPancake
2178f2829b
[Speculative Decoding] Support suffix decoding ( #6403 )
...
* support suffix decoding
2026-02-26 11:42:05 +08:00
Yuanle Liu
6d3fede240
[OP][Feature] Unify the limit_thinking_content_length CUDA operator, with support for response length limits and injected sequences ( #6493 )
...
* Initial plan
* Migrate PRs #6311 , #6129 , #6305 to develop and merge unit tests
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
* fix
* update
* fix
* fix ci
* fix ci
* Initial plan
* test: add test_chat_with_response_max_tokens to test_EB_VL_Lite_serving.py
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
* test: add disable-thinking case to test_chat_with_response_max_tokens
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
* test: add both reasoning_max_tokens and response_max_tokens case
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
* fix ci
* fix ci
* fix ci
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com >
2026-02-25 21:36:50 +08:00
jackyYang6
a29ee57e15
[Feature] Support ThinkingBudget Logits processor to control thinking content length ( #6367 )
...
* feat: add thinking budget logits processor
* add unittest
* fix pre-commit
* add unittest
* docs: clarify operator-level vs logits processor usage and conflict guidance
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2026-02-25 14:17:09 +08:00
Yonghua Li
e2332a1112
[BugFix] fix num_cpu_blocks computation ( #6438 )
...
* [BugFix] fix num_cpu_blocks computation
* [fix] fix syntax and log
* [fix] pre-commit
* [fix] use getattr
* [fix] ci test
2026-02-13 11:05:14 +08:00