Commit Graph

875 Commits

Author SHA1 Message Date
周周周 522d12c25a add deepep precision test (#6984) 2026-03-24 19:51:33 +08:00
SUN Dong 6cff780fdb [RL] Support moe_topk_select using Paddle native operators and Add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization and swiglu-fp8-quant op for DeepGemmFusedMoE for training alignment (#6850)
* [RL] Add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization

* update

* update

* update

* support custom topk in DeepGemmFusedMoeMethod apply_tp

* apply_ep_prefill support moe_topk_select

* update

* add ut

* add ut

* add ut

* modify doc

* fix env and docs

* add ut

---------

Co-authored-by: zhanghonggeng <zhanghonggeng@baidu.com>
2026-03-24 11:12:39 +08:00
freeliuzc e87ce4b8cd [Speculative Decoding] refactor MTP and optimize spec-decoding postprocess (#6973)
* support new mtp

* refactor(speculate_decoding and mtp): optimize mtp structure logic. Update spec-branch status-process

* fix cuda-graph for spec-decoding

* fix xpu mtp and fix some note

* fix unittest and optimize note

* fix model status update in eos-branch
2026-03-24 10:19:01 +08:00
bukejiyu c62f6b4ea5 [Others] Fix PD reorder for MTP (#6792)
* fix pd reorder in mtp

* add ut

* update

* fix mtp
2026-03-23 21:10:22 +08:00
wikilsh 5e469fc901 [RL][BugFix][Optimization] Support chunked part files loading and fix model path format in IPC snapshot strategy (#6852)
* [RL] Support chunked part files loading in IPC snapshot strategy

## Motivation

When using IPC snapshot for elastic recovery in RL training, loading a single large pdparams file causes a significant memory spike. This PR refactors `_update_ipc_snapshot` to support loading chunked part files to avoid the memory spike.

## Modifications

Refactored `_update_ipc_snapshot` in `fastdeploy/rl/dynamic_weight_manager.py` with a three-level loading priority:

1. **Chunked part files** (`model_state.tpR{id}.part{N}.pdparams`): Load multiple smaller shards sequentially, freeing memory between each chunk via `gc.collect()` to avoid memory spike.
2. **Single full file** (`model_state.tpR{id}.pdparams`): Legacy single-file loading path (preserved for backward compatibility).
3. **Shared fallback directory** (`/shared_ipc_meta/...`): Oldest legacy fallback path (preserved for backward compatibility).

Also fixed the rank ID in the file name pattern from hardcoded `tp0` to dynamic `paddle.distributed.get_rank()`.

## Checklist

- [ ] Add at least a tag in the PR title.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Co-Authored-By: lishuaihui <lishuaihui@baidu.com>

* [RL][BugFix] Fix ambiguous model path format and add legacy fallback in IPC snapshot

## Motivation
The previous snapshot file naming `model_state.tp{rank}{id}` concatenated
rank and id without a separator, causing ambiguity (e.g., rank=1, id=234
and rank=12, id=34 both produce `tp1234`). Additionally, after the naming
format is updated, existing checkpoints saved in the old format would fail
to load during elastic recovery, causing unnecessary failures.

## Modifications
- Add dot separator between rank and id in snapshot file name:
  `model_state.tp{rank}{id}` → `model_state.tp{rank}.{id}`
- Add Priority 3 legacy fallback to load old-format files
  (`model_state.tp0{id}.pdparams`) for backward compatibility during
  rolling upgrades
- Update docstring and error message to reflect the new 4-level priority
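
The ambiguity described in the Motivation above can be reproduced with plain string formatting. This is a minimal sketch: `old_name` and `new_name` are hypothetical helpers written for illustration, not functions from `dynamic_weight_manager.py`.

```python
def old_name(rank: int, snapshot_id: int) -> str:
    # Old format: rank and id are concatenated with no separator.
    return f"model_state.tp{rank}{snapshot_id}"


def new_name(rank: int, snapshot_id: int) -> str:
    # New format: a dot separates rank from id.
    return f"model_state.tp{rank}.{snapshot_id}"


# Two different (rank, id) pairs collide under the old format.
collision = old_name(1, 234) == old_name(12, 34)

# The dot separator keeps them apart.
distinct = new_name(1, 234) != new_name(12, 34)
```

Both pairs map to `model_state.tp1234` in the old scheme, while the new scheme yields `model_state.tp1.234` and `model_state.tp12.34`.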

Co-Authored-By: lishuaihui <lishuaihui@baidu.com>

* [RL][Test] Add unit tests for DynamicWeightManager._update_ipc_snapshot

Cover all 4 loading priority branches (chunked part files, single full
pdparams, legacy format, shared directory fallback) with mock-based
tests to verify correct behavior without filesystem or GPU dependencies.

Co-Authored-By: lishuaihui <lishuaihui@baidu.com>

* [RL][Test] Remove unused import 'call' in test_update_ipc_snapshot.py

Co-Authored-By: lishuaihui <lishuaihui@baidu.com>

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* [RL] Fix snapshot part index to match filename numbering

Parse part index from filename (e.g. .part0.) instead of using
enumerate index, so that logs and src_type stay consistent with
the actual file naming convention.

Co-Authored-By: wikilsh <wiki_hui@qq.com>

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-23 16:17:41 +08:00
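
The loading priority described in #6852 (chunked part files, single full file, legacy name, shared directory) might be resolved roughly as sketched below. This is an illustration under stated assumptions, not the code in `_update_ipc_snapshot`: `resolve_snapshot`, `part_index`, and `load_chunks` are hypothetical names, the dotted `tp{rank}.{id}` file name is assumed, the shared directory is passed in by the caller, and the actual weight loading is left to caller-supplied callbacks.

```python
import gc
import re
from pathlib import Path
from typing import Callable, List, Tuple

PART_RE = re.compile(r"\.part(\d+)\.pdparams$")


def part_index(path: Path) -> int:
    """Parse the part index from the file name rather than from list position."""
    match = PART_RE.search(path.name)
    if match is None:
        raise ValueError(f"not a part file: {path.name}")
    return int(match.group(1))


def resolve_snapshot(model_dir: Path, shared_dir: Path,
                     rank: int, snapshot_id: str) -> Tuple[str, List[Path]]:
    """Return (source_type, paths) for the highest-priority snapshot found."""
    base = f"model_state.tp{rank}.{snapshot_id}"

    # Priority 1: chunked part files, ordered by the index in the file name.
    parts = [p for p in model_dir.glob(f"{base}.part*.pdparams")
             if PART_RE.search(p.name)]
    if parts:
        return "chunked", sorted(parts, key=part_index)

    # Priority 2: single full file.
    full = model_dir / f"{base}.pdparams"
    if full.is_file():
        return "full", [full]

    # Priority 3: legacy name without the dot separator.
    legacy = model_dir / f"model_state.tp0{snapshot_id}.pdparams"
    if legacy.is_file():
        return "legacy", [legacy]

    # Priority 4: shared fallback directory (layout inside it is not assumed).
    if shared_dir.is_dir():
        return "shared", [shared_dir]

    raise FileNotFoundError(f"no snapshot found for {base}")


def load_chunks(paths: List[Path],
                load_fn: Callable[[Path], dict],
                apply_fn: Callable[[dict], None]) -> None:
    """Load shards one at a time and release each before loading the next."""
    for path in paths:
        state = load_fn(path)
        apply_fn(state)
        del state
        gc.collect()
```

Sorting by the parsed index keeps `part10` after `part2`, which a plain lexicographic sort would not.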
jc bb881c2c0a [PD Disaggregation] pd + cache_storage support vl model (#6906)
* pd + cache_storage support vl model

* support vl model

* fix test
2026-03-23 15:35:20 +08:00
jackyYang6 634d23a38a [Bugfix] Align thinking_budget behavior with ERNIE reasoning flow (#6934)
* [Bugfix] Align thinking_budget behavior with ERNIE reasoning flow

* [Docs] Fix thinking_budget markdown formatting

* [Test] Align ernie thinking budget test with process_request_dict
2026-03-23 14:15:55 +08:00
YuBaoku 0b4c1cba9b [CI] Change 21b ep4 to tp1_dp4 in 4_cards_tests (#6745)
* [CI] Change 21b ep4 to tp1_dp4 in 4_cards_tests
2026-03-20 20:42:23 +08:00
jackyYang6 00eb12f656 [BugFix][Models] Unify PaddleFormers fused QKV TP loading and stabilize fallback TP path (#6555)
* [BugFix][Models] avoid custom all-reduce in PaddleFormers fallback TP path and tighten TP-aware layout matching

* [BugFix][Models] unify PaddleFormers fused QKV TP loading and align fallback tests
2026-03-20 16:37:58 +08:00
AIbin bf7e2424d0 [Optimization][Feature]Supports multiple batches of DSK-DSA. (#6930)
* support DSA_MUTI_BATCH

* update test topk

* update dsk-dsa
2026-03-20 15:59:22 +08:00
cloudforge1 aca733b95c [CI]【Hackathon 10th Spring No.32】load_weight_utils unit test (#6740)
* 【Hackathon 10th Spring No.32】Unit test for load_weight_utils.py

* [CI]【Hackathon 10th Spring No.32】rewrite load_weight_utils unit test

* [CI]【Hackathon 10th Spring No.32】improve load_weight_utils coverage to 83%

- Add test_load_ep_checkpoint_basic: exercises EP checkpoint loading with minimal fixture
- Add test_composite_ep_branch: covers EP path in load_composite_checkpoint
- Add test_get_weight_iterator_unordered: covers unordered sharded safetensors path

* [CI]【Hackathon 10th Spring No.32】align load_weight_utils test with gold standard (tmp_path, split tests)

* [CI]【Hackathon 10th Spring No.32】add coverage tests for load_weight_utils

- Add test_is_layers_grouped: test layers_are_grouped() with grouped, interleaved, and no-layer keys
- Add test_save_model_bf16_cache: exercise save_model decorator with is_checkpoint_bf16=True
- Add test_composite_checkpoint_ep: test load_composite_checkpoint use_ep=True branch
- Add test_composite_checkpoint_rank_mismatch: test tp_size != rank_dirs ValueError
- Add test_composite_checkpoint_kv_quant: test float8_e4m3fn kv_cache path
- Add __main__ block for direct execution

* [CI]【Hackathon 10th Spring No.32】raise load_weight_utils test delta

* [CI]【Hackathon 10th Spring No.32】cover TP sequence-parallel MoE load branches

* test: add load_reordered_experts, pre-sharded, and empty-state tests


---------

Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com>
2026-03-20 13:14:30 +08:00
luukunn f4a79d4c00 [Optimization]Unified data processing for online and offline (#6891)
* remove process_request

* fix chat

* fix unit test

* remove process response

* fix unit test

* fix offline decode

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* fix sampling_params

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-19 21:56:09 +08:00
luukunn c3d8db85c4 [Optimization] Update ZMQ server (#6735)
* add batch zmq send response

* update

* Revert "update"

This reverts commit 0234a25b47.

* update

* remove lock

* fix unit test

* add unit test

* add unit test

* pre commit

* add unit test

* fix unit test

* add unit test

* fix worker>1

* update zmq_worker_pid

* fix unit test

* fix unit test

* fix unit test

* add unit test

* fix unit test

* fix first token time

* fix logprobs

* add unit test

* op

* remove debug log

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-03-19 21:53:16 +08:00
cloudforge1 9148562ed0 [CI]【Hackathon 10th Spring No.35】Add unit tests for resource_manager (#6734)
* [CI]【Hackathon 10th Spring No.35】Add unit tests for resource_manager

* [CI]【Hackathon 10th Spring No.35】Add unit tests for resource_manager

* [CI]【Hackathon 10th Spring No.35】add __main__ block

---------

Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com>
Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com>
2026-03-19 17:45:21 +08:00
周周周 b1c800b64b remove load_up_proj_weight_first (#6932) 2026-03-19 17:21:34 +08:00
sunxin 33e01f22a8 [Feature][Sampling] Extend top-k_top-p sampling to all backends and unify greedy decoding with top_k=1 (#6894)
* update sampling

* fix

* fix

* fix mtp

* fix test
2026-03-19 01:43:10 -07:00
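
For readers unfamiliar with the technique named in #6894, the sketch below shows one common convention for top-k followed by top-p filtering, and why `top_k=1` reduces to greedy decoding. It is a plain-Python illustration, not the sampling kernel in this repository; `top_k_top_p_filter` is a hypothetical name, and real backends may order or renormalize the two filters differently.

```python
from typing import List


def top_k_top_p_filter(probs: List[float], top_k: int, top_p: float) -> List[float]:
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose cumulative probability reaches top_p, and renormalize."""
    # Token indices sorted by probability, most likely first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:  # top_k <= 0 is treated here as "no top-k limit"
        order = order[:top_k]

    kept = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered


# With top_k=1 only the argmax token survives, so sampling becomes greedy.
greedy = top_k_top_p_filter([0.2, 0.5, 0.3], top_k=1, top_p=1.0)
```

Because the filtered distribution is one-hot when `top_k=1`, any sampler drawing from it returns the argmax token.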
YuBaoku 2b84a4276e [CI] Optimize CI: add timeout and cancel on PR close (#6933) 2026-03-19 15:54:30 +08:00
jc dd55cda3c8 [CI] Add test for pd and cache storage (#6876)
* Add test for pd and cache storage

* up

* up

* fix bug

* fix bug

* up docker image

* up
2026-03-19 10:38:27 +08:00
gongweibao fb6c56dfd5 [BugFix][DataProcessor] Force top_k=1 for greedy decoding when temperature=0 (#6748)
* [BugFix] Force top_k=1 for greedy decoding when temperature=0

When temperature is set to 0 (greedy decoding), only setting temperature
to a small epsilon is insufficient — the sampling kernel may still pick
non-top-1 tokens. Explicitly set top_k=1 in all processors to guarantee
argmax behavior.

Additionally, add argmax fast-path in top_k_top_p_sampling() under
FD_DETERMINISTIC_MODE to handle non-rejection sampling backends that
ignore top_k parameter.

* Extract greedy decoding from FD_DETERMINISTIC_MODE guard

top_k=1 → argmax is a correctness optimization, not deterministic-specific.
Remove the FD_DETERMINISTIC_MODE guard so all-greedy fast-path and
mixed-batch override work unconditionally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update test_torch_model.py

---------

Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-03-18 17:36:43 +08:00
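
The reasoning in #6748 can be illustrated with a small numeric example: replacing `temperature=0` by a small epsilon still leaves near-tied logits with almost equal probabilities, whereas `top_k=1` forces the argmax. This is a sketch only; the epsilon value `1e-6` and the helper `normalize_sampling` are assumptions made for illustration, not values or names taken from the processors.

```python
import math
from typing import List, Tuple

ASSUMED_EPSILON = 1e-6  # assumed stand-in for the "small epsilon" temperature


def softmax(logits: List[float], temperature: float) -> List[float]:
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def normalize_sampling(temperature: float, top_k: int) -> Tuple[float, int]:
    """Map temperature == 0 to (epsilon, top_k=1) so decoding is truly greedy."""
    if temperature == 0:
        return ASSUMED_EPSILON, 1
    return temperature, top_k


# Two logits that differ by only 1e-9.
logits = [5.0, 5.0 - 1e-9]

# Epsilon temperature alone: the runner-up keeps roughly half the mass.
probs = softmax(logits, ASSUMED_EPSILON)

# With top_k forced to 1, only the argmax can be chosen.
temperature, top_k = normalize_sampling(0.0, top_k=50)
greedy_token = max(range(len(logits)), key=logits.__getitem__)
```

Here the second token still has a probability just under 0.5 after temperature scaling, so a sampler could pick it; restricting to `top_k=1` removes that possibility.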
YuBaoku 0359794e08 [CI] Sync _log_softmax_batch_invariant with paddle update (#6893) 2026-03-17 23:03:57 +08:00
Longzhi Wang daaf498213 [Feature] support compute shared experts before combine for better overlap (#6697)
* [Feature] support compute shared experts before combine for better overlap

* fix test

* fix xpu

* fix
2026-03-17 15:18:51 +08:00
jc 950366e58d [PD Disaggregation][RL] Register to router with version and support rdma eager connect for pd (#6718)
* [Feature] Register to router with version info for PD disaggregation

Add RegisterManager for PD (Prefill-Decode) disaggregated deployment:
- All instances (Prefill/Decode) register to Router with heartbeat
- Prefill instances fetch Decode instance list from Router
- Prefill instances establish eager RDMA connections to Decode instances
- Register info includes: host_ip, port, role, version, is_paused, connected_decodes

Changes:
- Add RegisterManager class for managing PD registration and RDMA connections
- Add version field to ModelConfig for model version tracking
- Add connected_decodes to register_info for tracking connected Decode instances
- Add FD_ENABLE_PD_RDMA_EAGER_CONNECT environment variable

Test fixes:
- Add None checks for load_config in FDConfig.__init__
- Add version attribute to test mock model configs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refine

* remove test

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 14:43:35 +08:00
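
The register info fields listed in #6718 could be modeled as a small record that is serialized for each heartbeat. This is a hypothetical sketch: the `RegisterInfo` dataclass, the `to_payload` helper, the JSON encoding, and the example role strings are assumptions for illustration and are not taken from `RegisterManager`.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class RegisterInfo:
    # Field names follow the list in the commit message.
    host_ip: str
    port: int
    role: str      # e.g. "prefill" or "decode" (assumed spelling)
    version: str   # model version reported to the Router
    is_paused: bool = False
    connected_decodes: List[str] = field(default_factory=list)


def to_payload(info: RegisterInfo) -> str:
    """Serialize the record; JSON is an assumed wire format."""
    return json.dumps(asdict(info), sort_keys=True)


prefill = RegisterInfo(host_ip="10.0.0.1", port=8000, role="prefill", version="v1")
prefill.connected_decodes.append("10.0.0.2:8001")
payload = to_payload(prefill)
```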
YuBaoku b152baeeee [CI] disable test_batch_invariance_op_logsoftmax.py in unit_test 2026-03-17 14:43:14 +08:00
qwes5s5 3b7507a4c2 test_abort (#6743) 2026-03-17 14:06:40 +08:00
luukunn fe8d58a094 [Optimization]update request in tool parser&reasoning parser (#6858)
* update request in tool parser&reasoning parser
2026-03-17 11:51:12 +08:00
RichardWooSJTU 4ed483d20b [BugFix] Fix ep compatibility issues & Optimize permute operator (#6821)
* fix ep compatibility issues & optimize permute operator

* fix ut

* fix ut
2026-03-17 10:32:11 +08:00
gongweibao a6351dea0b [BugFix][Optimization] Replace silent failures with catchable exceptions and informative error messages (#6533)
* init

* init

* fix format

* add

* add files

* add ut

* fix some

* add ut

* add more

* add

* fix pre-commit

* fix pre-commit

* fix cover

* skip long seq

* add

* add

* fix

* remove not need

* fix set attr

* fix comments

* fix comments

* fix failed tests

---------

Co-authored-by: gongweibao <gognweibao@baidu.com>
2026-03-16 21:32:43 +08:00
ming1753 bb925c605f [Other] Adjust GPUModelRunner to enhance compatibility (#6851) 2026-03-16 14:49:19 +08:00
gongweibao 3fabba0dc7 [Feature] Add Triton unified attention kernel for deterministic inference (#6795)
* [Feature] Add Triton unified attention kernel for deterministic inference

Add a Triton-based unified extend attention kernel that processes both
prefix (cached) and extend (new) KV tokens through a single kernel with
unified kv_indices, ensuring identical accumulation order regardless of
cache hit/miss patterns.

Key components:
- _fwd_kernel_unified: Triton JIT kernel with online softmax, paged KV
  cache support, and causal masking for prefix+extend
- Index building utilities: triton_cumsum_with_zero_prefix,
  build_kv_indices_from_block_tables, build_unified_kv_indices,
  _scatter_extend_kv_indices_kernel (all CUDA Graph compatible)
- pre_cache_len_concat_triton: GPU-only replacement for C++ op
- Reference implementations (_ref variants) for correctness validation
- Comprehensive tests: kernel correctness, split invariance,
  determinism, production-scale, cross-validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Vectorize causal mask in test references for ~26x speedup

Replace triple Python for-loop with paddle.where vectorized mask in
naive_attention and _build_causal_mask. seq4096 test: 2m39s -> 6s.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix cover

---------

Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 14:29:45 +08:00
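
The idea behind #6795, feeding prefix and extend tokens through one index list so the accumulation order never depends on the cache split, can be sketched with a scalar online softmax. This is plain Python for a single query with scalar values, written as an illustration under those simplifying assumptions; it is not the Triton kernel, and `attention_online` and `unified_attention` are hypothetical names.

```python
import math
from typing import List


def attention_reference(scores: List[float], values: List[float]) -> float:
    """Direct softmax-weighted sum, used only for checking."""
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_online(scores: List[float], values: List[float], block: int = 2) -> float:
    """Online softmax: process KV in fixed-size blocks with a running max."""
    running_max = float("-inf")
    denom = 0.0
    acc = 0.0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale earlier partial sums; exp(-inf) is 0.0 on the first block.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max
    return acc / denom


def unified_attention(prefix_idx: List[int], extend_idx: List[int],
                      scores: List[float], values: List[float]) -> float:
    """Concatenate cached and new KV indices, then run one pass over them."""
    kv_indices = prefix_idx + extend_idx
    return attention_online([scores[i] for i in kv_indices],
                            [values[i] for i in kv_indices])


scores = [0.3, -1.2, 2.5, 0.7, 1.1]
values = [1.0, 2.0, 3.0, 4.0, 5.0]

# Same tokens, different cache hit patterns: all cached but one, or none cached.
mostly_cached = unified_attention([0, 1, 2, 3], [4], scores, values)
nothing_cached = unified_attention([], [0, 1, 2, 3, 4], scores, values)
```

Because both calls traverse the same unified index list in the same block order, the floating-point operations are identical and the results match exactly, not just within tolerance.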
fxyfxy777 4d39232553 [BugFix] add ut for fused_moe_degemm (#6840)
* add ut

* add skip
2026-03-16 12:22:18 +08:00
周周周 091e3c815d DSA clean code, add dsk_attn_write_cache baseline (#6855) 2026-03-16 11:01:14 +08:00
周周周 820eb60ec6 [Others] clean code (#6839)
Co-authored-by: “liuruian” <liuruian@baidu.com>
2026-03-14 11:09:28 +08:00
yinwei 3f4441b4b7 [XPU]add mtp cudagraph support (#6831) 2026-03-13 19:46:53 +08:00
周周周 8c1a2827d3 DSA clean code (#6827) 2026-03-13 16:39:47 +08:00
freeliuzc 12f412448b [Speculative Decoding] Fix speculate stop_seqs and fix accept_num in eos branch (#6825) 2026-03-12 23:48:24 -07:00
gongweibao 8906e09e0f [Feature][OP] Add batch-invariant RMSNorm kernel and TP embedding Custom AR path (#6749)
* [Feature] Add batch-invariant RMSNorm kernel and TP embedding Custom AR path

- Add Triton-based rms_norm_batch_invariant kernel for M-invariant RMSNorm
- Add linear/linear_v2 tracking wrappers in batch_invariant_mode
- Route TP VocabParallelEmbedding through Custom AR instead of NCCL
- Increase FD_CUSTOM_AR_MAX_SIZE_MB default from 8 to 64
- Add unit tests for RMSNorm and TP embedding invariance

* [Fix] Fix test tolerances for bfloat16 RMSNorm and custom AR buffer size

- Relax bfloat16 atol from 1e-3 to 1e-2 for D=3584 in RMSNorm numerical
  correctness test (0.0078125 diff is expected at bfloat16 precision)
- Update test_communication expected buffer size from 8MB to 64MB to match
  FD_CUSTOM_AR_MAX_SIZE_MB default change in envs.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add RMSNorm layer batch_invariant_mode unit test for coverage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add pragma no cover for Triton kernel and multi-GPU embedding path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 14:34:44 +08:00
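
As background for #6749, RMSNorm scales each row by the reciprocal of its root mean square, and a batch-invariant version reduces each row in an order that does not depend on how many rows are in the batch. The sketch below is plain Python written for illustration, not the Triton `rms_norm_batch_invariant` kernel; `rms_norm_row` and `rms_norm` are hypothetical names and the default `eps` is an assumption.

```python
import math
from typing import List


def rms_norm_row(row: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """y_i = x_i / sqrt(mean(x^2) + eps) * w_i, with a fixed reduction order."""
    sum_sq = 0.0
    for x in row:  # always left to right, independent of batch size
        sum_sq += x * x
    inv_rms = 1.0 / math.sqrt(sum_sq / len(row) + eps)
    return [x * inv_rms * w for x, w in zip(row, weight)]


def rms_norm(batch: List[List[float]], weight: List[float], eps: float = 1e-6) -> List[List[float]]:
    # Each row is normalized on its own, so M (the number of rows) cannot
    # change the arithmetic performed for any single row.
    return [rms_norm_row(row, weight, eps) for row in batch]


weight = [1.0, 0.5, 2.0, 1.5]
row = [0.1, -2.0, 3.5, 0.25]

alone = rms_norm([row], weight)[0]
in_batch = rms_norm([[9.0, 8.0, 7.0, 6.0], row, [1.0, 1.0, 1.0, 1.0]], weight)[1]
```

On the tolerance change in the same PR: 0.0078125 equals 2^-7, the spacing between adjacent bfloat16 values in [1, 2), since bfloat16 stores 7 explicit mantissa bits.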
fxyfxy777 8eb177147c [BugFix]rm draft code for glm (#6810)
* rm draft code for glm

* fix baseline

* fix baseline 2
2026-03-12 23:26:05 -07:00
kesmeey d935752be7 [CI] 【Hackathon 10th Spring No.20】Add unit tests for functional module fastdeploy/engine/common_engine.py (#6292)
* style: format tests/engine/test_common_engine.py with black

* test: expand common engine coverage

* test: add coverage helper for common_engine

* style: format test_common_engine with pre-commit

* Remove test_force_coverage_for_common_engine test

* Update common engine coverage tests

Expand common engine tests and helpers while
aligning setup and cleanup behavior.


* Fix test_schedule_request_to_worker_v1 by mocking num_tasks to return 0

* Sync test_common_engine with branch 26

* chore: fix codestyle in common engine tests

---------

Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com>
2026-03-13 13:16:07 +08:00
MingkunZhang cb5a742298 [Metax][Test] enable paddleocr using cudagraph (#6820) 2026-03-13 10:47:25 +08:00
huicongyao 2e63d88f7a [Optimization][Speculative Decoding]Fuse padding sampling params (#6765)
* optimize speculate pre process unit test

* Add CUDA kernel for building sampling params in speculative decoding

* init infer seed in device

* format code

* add unittest & fix

* fix

* format-code

* format-code

* fix rebase

* .

* fix unittest
2026-03-12 05:05:15 -07:00
fxyfxy777 250ce40b40 [Feature] use phi permute/unpermute & rm swiglu (#6361)
* tp: text output is normal

* B eb5 mini: text output is normal

* eb5mini ep on B card: text output is normal

* default use phi moe op

* stash

* tp on H card is normal

* ep ok

* rm debug

* rm debug tool

* rm del ffn_out

* rm swiglu

* add envs to swiglu

* merge dev

* fix ci baseline

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix ci baseline 2

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 02:01:57 -07:00
Jiaxin Sui a3d7979711 [XPU][CI]Rename test_ep4tp1_online.py to run_ep4tp1_online.py (#6805) 2026-03-12 16:16:20 +08:00
RAM cdaf6dd400 [RL][Cherry-Pick] Support Fully Async and PrefixCache (#6599)
* cherry-pick  Support Fully Async and PrefixCache step 1

* copy routing_indices_cache.py from 2.4

* cherry-pick [RL] R3 Fix the bug for determining the end of a request (#6388)

* cherry-pick [RL] Clear Requests status of R3 (#6569)

* delete code

* fix rename bug

* fix status shape bug

* fix ci
2026-03-12 01:13:30 -07:00
yinwei 7d31a728d1 Add PD+EP cudagraph Support 2026-03-12 13:20:59 +08:00
RichardWooSJTU 9f0778f991 [Feature] Support EP prefill with num_worst_tokens (#6574)
* support num worst tokens

* support num worst tokens

* fix build error

* support num worst tokens: fix errors

* support num worst tokens: fix field

* support num worst tokens: delete requirements

* replace permute and depermute op by pure cuda

* replace permute and depermute op by pure cuda

* fix ci

* fix op

* fix nan

* fix code style

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-03-11 17:09:07 +08:00
AIbin 1118351b27 [Optimization] Update Deepseekv3.2 model and dsa-indexer networking and add some unittest (#6762)
* add deepseek model doc

* update deepseek model doc

* update deepseek model doc

* update deepseek model doc

* cwb support DSK_V32 Model

* update DSK_V32_DSA modeling

* Ibin Support DSK_DSA

* update kernel

* update yaml

* update requirements

* update pre_commit

* update model-runner

* fix CI bug

* del start.sh

* fix iluvatar_model_runner

* update DSA & add unittest

* update import deep_gemm
2026-03-11 15:52:54 +08:00
CSWYF3634076 97a4b3631e [Processor]add qwen3vl prompt_token_ids support (#6764)
* [Processor]add qwen3vl prompt_token_ids support

* [Processor]add qwen3vl prompt_token_ids support unittest

* [Processor]add qwen3vl prompt_token_ids support precommit
2026-03-11 15:08:56 +08:00
bukejiyu cffa8c246c [Others]update paddleformer 1.0.0 (#6496)
* update paddleformer 1.0.0

* update
2026-03-11 15:06:29 +08:00
Yonghua Li 7811eeccaa [fix] resolve get_save_output_v1 socket name conflicts between multiple instances (#6758) 2026-03-11 15:02:32 +08:00
freeliuzc cf7934a4b2 [Speculative Decoding] Unify Spec and non-spec branch (#6685)
* optimize spec-inference architecture

* delete debug log

* optimize spec_method usage  && fix unit_test

* add claude unit-test skill

* fix some ugly bug

* enhance robustness and bounds check

* unify method & spec_method to method to avoid bug

* activate CI

* fix unit test

* Unify logprobs computation for naive and speculative decoding, fix CUDA kernel

* fix logprob bug && optimize verify kernel

* fix exist_decode() judge
2026-03-10 23:58:44 -07:00