sunxin
c29e86fc9d
[Feature] Support mtp overlap schedule ( #7001 )
2026-04-01 14:24:26 +08:00
huicongyao
25d64efdc4
[Speculative Decoding] Refactor Eagle MTP hidden states copy ( #6812 )
...
* reformat eagle_get_hidden_states & eagle_get_self_hidden_states
* readability
* fix xpu bug
* fix coverage failure
* change launch params & parallelize position_map compute
* Fix MTP-related bugs in FastDeploy centralized inference
* fix
* refactor mtp hidden_states process
* fix
* add unittest & optimize kernel
* remove useless code
* fix
2026-03-25 22:54:31 -07:00
Nyakku Shigure
8b6bbb3504
[Optimization] Use a separate driver when using Triton with Paddle ( #6897 )
...
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2026-03-24 10:56:00 +08:00
freeliuzc
e87ce4b8cd
[Speculative Decoding] refactor MTP and optimize spec-decoding postprocess ( #6973 )
...
* support new mtp
* refactor(speculate_decoding and mtp): optimize mtp structure logic. Update spec-branch status-process
* fix cuda-graph for spec-decoding
* fix xpu mtp and fix some note
* fix unittest and optimize note
* fix model status update in eos-branch
2026-03-24 10:19:01 +08:00
sunxin
33e01f22a8
[Feature][Sampling] Extend top-k_top-p sampling to all backends and unify greedy decoding with top_k=1 ( #6894 )
...
* update sampling
* fix
* fix
* fix mtp
* fix test
2026-03-19 01:43:10 -07:00
gongweibao
fb6c56dfd5
[BugFix][DataProcessor] Force top_k=1 for greedy decoding when temperature=0 ( #6748 )
...
* [BugFix] Force top_k=1 for greedy decoding when temperature=0
When temperature is set to 0 (greedy decoding), only setting temperature
to a small epsilon is insufficient — the sampling kernel may still pick
non-top-1 tokens. Explicitly set top_k=1 in all processors to guarantee
argmax behavior.
Additionally, add argmax fast-path in top_k_top_p_sampling() under
FD_DETERMINISTIC_MODE to handle non-rejection sampling backends that
ignore top_k parameter.
* Extract greedy decoding from FD_DETERMINISTIC_MODE guard
top_k=1 → argmax is a correctness optimization, not deterministic-specific.
Remove the FD_DETERMINISTIC_MODE guard so all-greedy fast-path and
mixed-batch override work unconditionally.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com >
* Update test_torch_model.py
---------
Co-authored-by: gongweibao <gognweibao@baidu.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2026-03-18 17:36:43 +08:00
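Note on #6748 above: the commit message argues that clamping temperature to a small epsilon does not by itself guarantee argmax output, so top_k=1 is forced as well. A minimal sketch of that idea in plain Python follows. The names `normalize_sampling_params`, `sample`, and `EPS` are hypothetical and chosen for illustration only; this is not the FastDeploy implementation or its API.

```python
import math
import random

EPS = 1e-5  # hypothetical temperature floor, not a FastDeploy constant


def normalize_sampling_params(temperature, top_k):
    """Treat temperature == 0 as greedy decoding by also forcing top_k = 1."""
    if temperature == 0:
        return EPS, 1
    return temperature, top_k


def sample(logits, temperature, top_k, rng):
    """Pick a token id from the top_k highest logits (top_k <= 0 keeps all)."""
    if top_k == 1:
        # Argmax fast path: no random draw, so the result is deterministic.
        return max(range(len(logits)), key=lambda i: logits[i])
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = order[:top_k] if top_k > 0 else order
    peak = logits[kept[0]]
    # Subtract the peak logit before exponentiating for numerical stability.
    weights = [math.exp((logits[i] - peak) / temperature) for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


logits = [1.0, 3.0, 2.9, 0.5]
temperature, top_k = normalize_sampling_params(0, 50)
token = sample(logits, temperature, top_k, random.Random(0))
```

With `top_k` forced to 1 the sampler never reaches the random draw, which is the behavior the commit message describes as the all-greedy fast path.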
cmcamdy
7591e0d6bc
fix eb5 mtp(mix) ( #6800 )
2026-03-13 17:36:57 +08:00
huicongyao
2e63d88f7a
[Optimization][Speculative Decoding]Fuse padding sampling params ( #6765 )
...
* optimize speculate pre process unit test
* Add CUDA kernel for building sampling params in speculative decoding
* init infer seed in device
* format code
* add unittest & fix
* fix
* format-code
* format-code
* fix rebase
* .
* fix unittest
2026-03-12 05:05:15 -07:00
cmcamdy
3543088d3e
[XPU] rm stop nums ( #6651 )
...
* rm stop nums
* fix conflict
---------
Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com >
2026-03-12 14:05:58 +08:00
freeliuzc
cf7934a4b2
[Speculative Decoding] Unify Spec and non-spec branch ( #6685 )
...
* optimize spec-inference architecture
* delete debug log
* optimize spec_method usage && fix unit_test
* add claude unit-test skill
* fix some ugly bug
* enhance robustness and bounds check
* unify method & spec_method to method to avoid bug
* activate CI
* fix unit test
* Unify logprobs computation for naive and speculative decoding, fix CUDA kernel
* fix logprob bug && optimize verify kernel
* fix exist_decode() judge
2026-03-10 23:58:44 -07:00
sunxin
28f7727a3d
[Feature] Set overlap schedule as default ( #6668 )
...
* overlap default
2026-03-09 22:34:54 +08:00
gongweibao
30f9f33f34
[Feature][BugFix][OP] Enhance Deterministic Inference Mode with Kernel-level Fixes and Batch-invariant BMM ( #6610 )
...
* add fa deter
* add ut
* add long sentence
* fix basic
* fix bugs
* fix adn
* fix first
* fix single
* fix single
* fix single test
* refine
* add more test
* refine comments
* add comments of bmm
* fix ci
* remove probe
* add
* remove not need
* refine tests
* fix comments and refine code
* refine code
* refine test
* refine test
* mv 4cards tests
* fix tests
* add
* fix comments
* fix cover
* fix cover
---------
Co-authored-by: gongweibao <gognweibao@baidu.com >
2026-03-09 10:27:53 +08:00
ming1753
33d6d2403c
[BugFix] fix bug when seq_lens_this_time is 2D ( #6613 )
2026-03-02 23:52:03 +08:00
ming1753
344db8c8af
[BugFix] Fix mtp when token_ids_all is None ( #6591 )
...
* [BugFix] Fix mtp when token_ids_all is None
* fix bug
2026-03-02 01:23:44 -08:00
周周周
d957ccd46d
seq_lens related tensor shape -> [max_num_seqs] ( #6535 )
2026-03-02 11:18:30 +08:00
ming1753
97eee75677
[Feature] GPU Memory Optimization and Retirement of V0 Scheduler ( #6407 )
...
* Optim GPU Mem Usage
---------
Co-authored-by: huzesen <huzesen@baidu.com >
2026-02-28 15:07:43 +08:00
yzwu
60e75ea8e8
[Iluvatar][CI] Fix cannot import get_stop ( #6165 )
2026-02-10 16:57:23 +08:00
sunxin
783d56e28a
[Optimization] Support logprob async copy ( #6362 )
...
* support logprob async copy
* fix prompt logprob
* fix xpu
2026-02-09 17:32:12 +08:00
周周周
2b4748de4f
[MTP] refactor MTP pre_process ( #6358 )
2026-02-09 10:47:15 +08:00
xiaozude
030647521a
[Metax] adapt to the latest develop ( #6282 )
2026-01-29 23:21:20 -08:00
MingkunZhang
c4abb01f9c
[Metax][Fix] fix 'get_token_penalty_multi_scores' input error based (PaddlePaddle#6069) ( #6266 )
2026-01-29 19:24:36 +08:00
GoldPancake
7d6c87c29e
[Others] Support constrained decoding when enable_thinking is false ( #6248 )
...
* support constrained decoding when enable_thinking is false
* fix
* fix
* fix
2026-01-28 00:05:17 -08:00
freeliuzc
ce06c6dfb3
[BugFix] Fix token_penalty kernel ( #6069 )
...
* fix token_penalty kernel
* try to fix xpu
* fix xpu
* fix unit test
2026-01-28 12:03:05 +08:00
freeliuzc
49617d9832
[Feature]Support tag phase token enforce generation ( #6034 )
...
* support tag phase token enforce generation
* optimize note and some feature
* fix sampler unit test
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2026-01-15 03:59:55 -08:00
GoldPancake
a1fc4e249e
[Bugfix] Fix mtp logprob hang problem when include stop_seq ( #5927 )
...
* fix mtp logprob hang when include stop_seq
2026-01-08 14:21:24 +08:00
chen
193886e745
only cuda run triton op ( #5846 )
2025-12-31 14:17:31 +08:00
GoldPancake
4e10ae5d99
[Speculative Decoding] Optimize draft logprob ( #5842 )
...
* optimize draft logprob
* fix ut
2025-12-31 13:35:56 +08:00
chen
0bcf924e10
[Optimization] Optimization for gather_logprob by 10GB ( #5817 )
...
* optimize logprobs gather_logprob, reduce device memory usage by 10GB when token_num=8k
2025-12-30 15:33:34 +08:00
GoldPancake
23d488c488
[Feature] Entropy calculation support ( #5692 )
...
* support entropy
* fix bug
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2025-12-23 21:19:47 +08:00
RuohengMa
2c3c983b96
[XPU] modify speculate_verify ( #5522 )
2025-12-23 14:50:30 +08:00
freeliuzc
15f5112ecb
[Speculative Decoding]Support different inferseed in speculate decoding ( #5568 )
...
* fix mtp entropy drop in RL
* optimize usage and fix unit test
* optimize padding_sampling_params speed(vectorized)
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2025-12-17 16:14:29 +08:00
chen
76649b45c1
[Optimization] compute real max_logprobs in batch ( #5430 )
2025-12-09 14:15:05 +08:00
cmcamdy
9f4977eb74
[xpu] support mtp for xpu(mix) ( #5274 )
...
* [XPU] support kernel for mtp(base)
* [XPU] support kernel for mtp(base)
* format
* format
* format
* fix gather next token
* fix step && add test
* fix
* mv pre/post process
* add adjust batch / gather next token for mtp
* fix code style
* fix mtp kernel name
* fix mtp kernel test
* mv xpu pre/post process
* mv xpu pre/post process
* [xpu] support mtp
* fix code style
2025-12-01 11:03:14 +08:00
Daci
eab8384da6
[Feature] ThreadPoolExecutor async fill_token_bitmask ( #5083 )
...
* ThreadPoolExecutor async fill_token_bitmask
* ThreadPoolExecutor async fill_token_bitmask logging
* fix test_guided_decoding
* Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* add fill_bitmask_parallel_batch_size ENV
* FD_FILL_BITMASK_BATCH fastdeploy.envs
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2025-11-19 10:04:16 +08:00
Daci
5fc12eddfe
[Optimization] xgrammar async compile, multi thread, speed up ( #4835 )
...
* xgrammar async compile, multi thread, speed up
* fix test_sampler.py & pre-commit err
* add redis version check && fix request.llm_engine_recv_req_timestamp
* xgrammar prefill & decode & v0
* fix test_gpu_prompt_logprobs.py
* add test_guided_decoding.py
* Update fastdeploy/scheduler/splitwise_scheduler.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update fastdeploy/model_executor/guided_decoding/xgrammar_backend.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update fastdeploy/model_executor/guided_decoding/xgrammar_backend.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* fix torch xgrammar unittest env
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2025-11-14 18:05:26 +08:00
SunLei
3098aee05f
[Perf] Support tensor transmission between work and engine with zero-copy to improve efficiency ( #4839 )
...
* feat(zmq): support tensor transmission with zero-copy for improved efficiency
* perf: zmq.send disable copy
* zmq recv data for debug
* convert logprobs tensor to cpu
2025-11-11 15:43:11 +08:00
chen
1c3ca48128
[Feature][Executor] GPU Model Runner Supports prompt_logprobs and max_logprobs ( #4769 )
2025-11-05 10:43:25 +08:00
GoldPancake
1f3ce65b58
[Feature] support mtp distribution equivalence verification ( #4699 )
2025-10-31 11:45:04 +08:00
GoldPancake
fddda50cb9
Add ut for speculative sampler ( #4650 )
2025-10-30 10:37:49 +08:00
李泳桦
a012e3608b
[Feature] support logits processors ( #4515 )
...
* [feat] provide an interface for logits processors and a builtin LogitBiasLogitsProcessor
* [chore] fix code style
* [fix] add unit test & fix existing bugs
* [feat] add engine/worker arg --logits-processors
* [fix] redefine user args as logits_processors_args and fix some bugs
* [fix] fix test_sampler
* Update fastdeploy/model_executor/logits_processor/builtin.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update fastdeploy/model_executor/logits_processor/__init__.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update tests/model_executor/test_logits_processor.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* [fix] fix typo
* Update fastdeploy/engine/sampling_params.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* [fix] fix bracelet
* [chore] redefine logits processor interface: pass the entire share_inputs into LP, do not copy share_inputs and logits
* [doc] add docs
* [fix] fix logit bias processor not applied when decoding is too fast & add docs and tests
* [fix] fix redundant code
* [feat] skip apply() if no bias is specified
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2025-10-29 00:08:53 +08:00
RAM
25a983ba9c
1.fix the bug of draft model with ep 2.fix sampler bug ( #4589 )
2025-10-27 17:47:34 +08:00
chen
5c63a089f6
[Feature] Support logprobs_mode ( #4567 )
2025-10-27 14:27:48 +08:00
GoldPancake
47595a2480
[Feature] support mtp logprob ( #4464 )
...
* support mtp logprob
* fix unittest
2025-10-20 15:18:12 +08:00
Jianyu Li
3bbe99eae7
[Intel HPU] Enable dist sampler on intel hpu platform ( #4445 )
2025-10-16 19:02:27 +08:00
RAM
aa27b03bc0
[Executor]CUDAGraph support Speculate Decode ( #3769 )
...
* success run ngram
* Revert "[Code Simplification] remove cum_offsets (#3410 )"
This reverts commit 32b39620bc .
* success run ngram5 tp4 42bs
* success run ngram5 tp4 42bs
* mtp draft commit
* add decorator for target model
* enable draft model in cudagraph v0.5
* revert revrt cum_offset
* enable target model in cudagraph v0.9 And clean debug code
* Revert "success run ngram"
This reverts commit 8351e83993 .
* add reverted code
* enable target model in cudagraph v0.9
* solve comment
* fix bid < 0
* Enable Target Model Padding And Draft Model in cudagraph
* solve problem
* delete rebuild padding debug note
* fast compile
* Add capture list for mtp
* success run 256 tp1 mtp
* Enable Lite TP2 Bsz256
* really enable tp2 bsz 256
* fix problem
* Solve problem for Draft model in cudagraph
* Solve comment
* replace emptytensor as zeros
* Solve comments
* Revert "fast compile"
This reverts commit 834639a7ff .
* fix bug
* fix merge bug
* fix typo
* fix bug
---------
Co-authored-by: lizexu <2694294196@qq.com >
Co-authored-by: littledgg <1658565283@qq.com >
Co-authored-by: zeroRains <linjunlu@zerorains.top >
Co-authored-by: gongshaotian <gstain5555@outlook.com >
2025-10-09 21:18:29 +08:00
fmiao2372
f1b5392e20
[Intel HPU] Support intel hpu platform ( #4161 )
...
* [Intel HPU] Support intel hpu platform
* fix some issues
* apply precommit and move AttentionBackend_HPU
* fix format issue
* correct ops import
* fix ci issue
* update code in layers
* fix code style issue
* remove dense tp moe ep mode
* fix enc_dec_block_num
* fix rebase issue
* rename hpu to gaudi in readme
* rename ForwardMeta_HPU to HPUForwardMeta
2025-09-24 12:27:50 +08:00
YuanRisheng
2e9e53ff7e
[FDConfig]Remove max_num_batched_tokens/max_num_seqs in parallel config ( #4116 )
...
* remove max_num_batched_tokens in parallel config
* remove max_num_seqs
* update test case
* fix test
* fix
---------
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com >
2025-09-17 10:43:35 +08:00
co63oc
8466219ec8
fix typos ( #3840 )
...
* fix typos
* ci
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2025-09-12 11:04:38 +08:00
kevin
1908465542
[Feature] mm and thinking model support structured output ( #2749 )
...
* mm support structured output
* update code
* update code
* update format
* update code
* update code
* add enable_thinking default
* update code
* add structured_outputs test case
* add ci install xgrammar
* add ci timeout time
* update test for structured_outputs
* update code
* add error traceback info
* update error msg
* update structured output code
* update code
* update code
* update config
* update torch version
---------
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com >
2025-09-02 16:21:09 +08:00
co63oc
d6369b4d51
fix typos ( #3684 )
2025-09-01 17:50:17 +08:00