Commit Graph

67 Commits

Author SHA1 Message Date
freeliuzc 22a4f6019d [Speculative Decoding][BugFix] Fix apply repeat times penalty kernel and change spec default verify strategy (#7467)
* fix repeat_time kernel and change default spec verify strategy

* fix unit_test
2026-04-18 00:38:01 +08:00
GoldPancake df3b4e12f4 [Speculative Decoding] Add MTP logprob support for PD disaggregation (#7442)
* support mtp logprob in pd

* fix

* fix

* fix

* fix xpu bugs
2026-04-17 21:37:38 +08:00
lonelygsh e83d45833f [Speculate Decoding] Fix step_idx semantics in limit_thinking and set_stop_value kernels (#7166)
- speculate_limit_thinking_content_length: update current_base_step to
  step_idx+1 (step_idx now records history count before current round);
  remove incorrect step_idx decrement on accept_num truncation; mark
  step_idx param as const.
- speculate_set_stop_value_multi_seqs: fix can_stop gate to use
  step_idx_now+accept_num>=min_token_limit; fix skip check and pre_ids_idx
  formula (remove stale -accept_num offset); use <= condition so accept_idx
  maps directly to the accepted token that ends the stop sequence; fix
  accept_tokens index (remove -1).
- Update unit tests for speculate_set_stop_value_multi_seqs kernel.
2026-04-13 20:53:42 +08:00
cloudforge1 c529c2ad98 [Optimization]【Hackathon 10th Spring No.49】GPU ngram_match: BlockScan Phase 2 -optimized (#7136)
* Port ngram_match and hybrid_mtp_ngram kernels to CUDA

Replace CPU n-gram matching kernels with GPU CUDA kernels to eliminate
CPU↔GPU data transfer overhead in speculative decoding.

Key changes:
- ngram_match.cc → ngram_match.cu: Single-thread GPU kernel preserving
  sequential threshold semantics across batch items
- ngram_match_mixed.cu: Replace CPU function with __global__ kernel
- ngram.py: Remove ~10 .cpu() tensor copies, pass GPU tensors directly
- mtp.py: Remove .cpu()/.cuda() round-trips and CUDAPinnedPlace copies

Design: <<<1,1>>> single-thread kernels (same approach as TensorRT-LLM).
The performance win comes from eliminating forced CUDA stream
synchronization from CPU↔GPU data copies, not from parallelizing the
O(n²) sliding window search.
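The sliding-window n-gram match the commit describes can be sketched in plain Python (illustrative names and parameters, not the kernel's actual signature): take the sequence's trailing n-gram, scan the history for its leftmost earlier occurrence, and propose the tokens that followed it as draft tokens.

```python
def ngram_propose(tokens, max_ngram=3, max_draft=4):
    """Sketch of the sliding-window n-gram search (hypothetical names).

    Prefer longer n-grams; on a match, the tokens following the leftmost
    occurrence become the speculative draft tokens.
    """
    n = len(tokens)
    for ngram in range(max_ngram, 0, -1):     # prefer longer n-grams
        if n < ngram + 1:
            continue
        tail = tokens[n - ngram:]             # the sequence's trailing n-gram
        # slide over the history; leftmost match wins
        for start in range(0, n - ngram):
            if tokens[start:start + ngram] == tail:
                follow = tokens[start + ngram:start + ngram + max_draft]
                if follow:
                    return follow
    return []
```

This per-position independence is what later commits exploit when parallelizing the search across threads.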

* Add correctness + latency test for GPU ngram kernels

* Fix test data: step_idx semantics and ngram-matchable patterns

* fix: add CPU fallback path for ngram_match and hybrid_mtp_ngram ops

Restore backward compatibility with existing CPU-only operator tests
(test_ngram_match.py, test_hybrid_mtp_ngram.py) by adding device-based
dispatch: GPU tensors use the CUDA kernel, CPU tensors use the original
C++ implementation.

* fix(test): wrap imported ops with staticmethod to prevent self-binding

Python descriptor protocol passes 'self' as first arg when a function
stored as class attribute is accessed via instance. Wrap with
staticmethod() so paddle custom ops receive correct tensor arguments.
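The self-binding pitfall the fix addresses is plain Python descriptor behavior, independent of paddle; a minimal reproduction (with a stand-in function, not the real op):

```python
def custom_op(x):
    """Stand-in for an imported custom op: expects the data arg first."""
    return x * 2

class Proposer:
    raw_op = custom_op                  # plain function attribute: binds self on access
    safe_op = staticmethod(custom_op)   # staticmethod: no self-binding

p = Proposer()
p.safe_op(3)        # calls custom_op(3) as intended
# p.raw_op(3) would call custom_op(p, 3) and raise TypeError,
# because accessing a function via an instance produces a bound method.
```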

* fix(test): ensure max_model_len >= input_len to prevent broadcast error in latency test

* fix: keep input_ids_len on CPU in __init__, move to GPU in _run_impl

Reverts line 39 to match develop (keeps .cpu()) so diff-cover
no longer flags it as an uncovered changed line. The tensor is
moved to GPU via .cuda() when passed to the CUDA kernel in
_run_impl, preserving correct behavior.

* Extract shared ngram search into __device__ helper (ngram_match_common.cuh)

Per upstream requirement (translated from Chinese): "The two kernels share largely similar logic; extract the common matching logic, with the business-specific logic layered on top."

The core ngram sliding-window search + token copy logic is now defined
once in ngram_match_common.cuh as two __device__ __forceinline__
functions:
  - ngram_search_and_copy: single-haystack sliding window match
  - ngram_search_batch_item: two-phase search (input_ids then pre_ids)

Both kernels call ngram_search_batch_item with their business-specific
parameters:
  - ngram_match_kernel: write_offset=1, min_ngram_size=1
  - ngram_match_mixed_kernel: write_offset=ori_seq_len_this_time,
    min_ngram_size=configurable

No functional change. CPU fallback paths unchanged.

* refactor: parallel CUDA kernels for ngram_match (<<<bsz,256>>> search)

Two-phase parallel architecture addressing reviewer feedback:
- Phase 1: <<<bsz, 256>>> — parallel sliding-window ngram search
  using atomicMin64 CAS loop for leftmost-match semantics
- Phase 2: <<<1, 1>>> — serial threshold + token copy (inter-batch
  dependency via running sum of seq_lens_this_time)

Phase 1 is O(bsz × seq_len × ngram_size) distributed across bsz × 256
threads.  Phase 2 is O(bsz × max_draft_tokens) — negligible.
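The leftmost-match reduction in Phase 1 can be modeled in Python (illustrative only — on device each window position is one thread's check, and the min-reduction is what the atomicMin64 CAS loop implements):

```python
def leftmost_match(haystack, needle):
    """Sketch of the parallel Phase 1 search: every window position is
    checked independently (one GPU thread each); the leftmost match is
    recovered with a min-reduction over matching positions."""
    matches = [
        i for i in range(len(haystack) - len(needle) + 1)
        if haystack[i:i + len(needle)] == needle
    ]
    return min(matches) if matches else -1
```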

Shared code extracted into ngram_match_common.cuh:
  NgramMatchResult struct, atomicMin64, parallel_ngram_search,
  4 kernel functions (search+gather for both kernel types)

Tests: 6 new large-scale correctness tests with env-var threshold
override — bsz=256/seq_len=128k, bsz=1/seq_len=128k, bsz=256/seq_len=1k
for both ngram_match and hybrid_mtp_ngram.

* fix: move __global__ kernel defs from .cuh to .cu files (fix linker multiple-def error)

Both ngram_match.cu and ngram_match_mixed.cu include ngram_match_common.cuh.
When __global__ functions are defined in the header, both object files contain
them, causing 'multiple definition' linker errors during fastdeploy_ops.so link.

Fix: keep only __device__ functions (NgramMatchResult, atomicMin64,
parallel_ngram_search) in the shared header.  Move __global__ kernel
definitions into each respective .cu file.

Net code change: +304/-304 (zero net lines).

* fix: align mixed kernel signatures with host function tensors

Fix 7 type-mismatch compilation errors in ngram_match_mixed.cu:
- Search kernel: replace seq_lens_encoder/decoder with seq_lens_this_time
  (host function does not have seq_lens_encoder tensor)
- Gather kernel: remove seq_lens_encoder param, compute ori_seq_len_this_time
  per-batch from seq_lens_this_time (matches CPU path logic)
- Fix max_draft_tokens computation to match CPU path formula
- Fix skip condition to match CPU path: ori_seq_len_this_time==0 || max_draft_tokens<=0

* 【Hackathon 9th No.49】Replace serial Phase 2 with CUB BlockScan parallel threshold

Phase 2 gather kernel now launches <<<1, 1024>>> threads with CUB
BlockScan prefix-sum for parallel threshold enforcement, replacing
the serial <<<1,1>>> loop.

Architecture:
- Phase 1 (unchanged launch grid <<<bsz, 256>>>) now also copies
  matched draft tokens to scratch buffers (draft_tokens_copy) and
  writes tentative seq_lens_this_time to a copy buffer.
- Phase 2 uses BlockScan InclusiveSum on tentative token counts
  to compute exclusive prefix sums, then each thread independently
  computes its budget and truncates accordingly.

Both ngram_match.cu and ngram_match_mixed.cu updated.
Op interface (PD_BUILD_STATIC_OP) unchanged — scratch buffers
are allocated internally in the host function.
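The Phase 2 threshold pass described above can be sketched sequentially in Python (names are illustrative, not the kernel's): an inclusive prefix sum over tentative per-item token counts — CUB BlockScan InclusiveSum on device — lets each item independently compute how much of the shared budget remains before it and truncate accordingly.

```python
from itertools import accumulate

def truncate_by_budget(tentative, budget):
    """Sketch of parallel threshold enforcement via prefix sums.

    tentative: per-batch-item tentative draft-token counts (Phase 1 output)
    budget:    total token budget across the batch
    """
    inclusive = list(accumulate(tentative))
    out = []
    for count, inc in zip(tentative, inclusive):
        exclusive = inc - count            # exclusive prefix = tokens claimed before me
        out.append(max(0, min(count, budget - exclusive)))
    return out
```

Because each item's result depends only on its exclusive prefix, all items can truncate concurrently once the scan completes — this is what removes the serial `<<<1,1>>>` loop.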

* fix: resolve Copilot/bot review comments on PR #7136

- Remove dead NgramMatchResult writes from both Phase 1 kernels
- Fix encoder-active init: default seq_lens_this_time_copy=0, set 1 for active
- Add remaining_active budget deduction to mixed gather kernel (parity)
- Add PD_CHECK(max_batch_size <= NGRAM_GATHER_THREADS) to both host functions
- Remove unused match_buf/match_results allocation from both host functions
- Pass seq_lens_encoder to Phase 2 gather for encoder-active skip
- clang-format applied

* test: add multi-scale latency benchmark (batch 32→1024)

Adds test_latency_scaling that benchmarks GPU kernel vs CPU path at
batch sizes 32, 128, 256, 512, 1024 with input_len=512.
Shows Phase 2 BlockScan scaling and per-batch-item amortization.

* cleanup: remove unused kernel params, dead struct, add benchmark env gate

- Remove unused max_draft_tokens_param from ngram_match_search_kernel
  (draft_token_num[batch_idx] already covers the constraint)
- Remove unused seq_lens_decoder from ngram_match_mixed_search_kernel
  (only used in gather kernel, not search kernel)
- Remove dead NgramMatchResult struct from ngram_match_common.cuh
- Add BENCHMARK_NGRAM env gate to test_latency and test_latency_scaling
  (prevents benchmark tests from inflating CI runtime)

* revert: remove benchmark env gate — let CI run benchmarks

* fix: address Copilot review — GPU mirror for input_ids_len, device fix in mtp, benchmark timing isolation

* fix: correct stale comment in mixed gather (at-least-ori → 1-token)

* bench: add 5-group benchmark matching NKNaN methodology

Groups: seq_len, batch_size, ngram hit pattern, threshold, threshold×batch.
Data creation outside timing loop. GPU kernel vs CPU-copy path.

* fix: rename benchmark for CI discovery, bump to 10k iterations

- Renamed benchmark_ngram_kernel.py → test_benchmark_ngram_kernel.py
  so pytest discovers it (test_*.py pattern)
- Bumped NUM_ITERS 10→10000, WARMUP 2→5 for noise-free profiling
- Gated benchmark class with RUN_NGRAM_BENCHMARKS=1 (won't bloat CI)

* fix: correct stale filename in benchmark docstring

* fix: move PD_CHECK before Phase 1 launch (fail-fast)

* bench: remove env-gate from benchmark groups, cut NUM_ITERS to 1000

Benchmark groups 1-5 now run unconditionally in CI (~9s total).
Env-gates moved to separate PR #7170.

* fix: address Copilot review — conditional return, defensive guards, GPU placement

- ngram_match.cu: add remaining<=0 early return, conditional return
  only when tokens produced (matches CPU continue behavior), include
  encoder-active items in Phase 2 threshold-budget scan
- ngram_match_mixed.cu: split max_draft_tokens into explicit steps to
  prevent negative intermediates, conditional return only when tokens
  produced, add seq_lens_decoder invariant comment
- ngram.py: explicit .cuda() on input_ids_len_gpu creation
- test_ngram_gpu_kernel.py: use CPUPlace() in latency benchmark to
  measure actual D2H/H2D roundtrip

* fix: clarify CAS comment, fix negative intermediate in CPU fallback

- Add CAS non-atomic initial read comment in atomicMin64 (#3031826678)
- Split draft_budget into explicit int64_t steps in CPU fallback (#3031240456)

* perf: A1 (1024 threads) + A2 (early-exit) + fix B1 UB in ngram_match

- NGRAM_BLOCK_THREADS 256→1024: 4× thread parallelism per block
- Add early-exit break when position exceeds current best match
- Fix __ballot_sync UB: was inside divergent if(match) + loop break,
  revert to plain atomicMin64 (contention-free since matches are rare)
- Update stale '256 threads' comments in both .cu files

* perf: template-specialize ngram search + cache scratch buffers + fix benchmark

Kernel optimizations:
- Template-specialize parallel_ngram_search for ngram_size 1,2,3:
  register-cached ngram tokens, #pragma unroll, __restrict__ hints
- Cache Phase 1→2 scratch buffers (grow-only static paddle::Tensor)
  to eliminate per-call paddle::empty allocation overhead

Benchmark fix:
- Pre-allocate output tensors once, use fill_() in timing loop
  instead of creating new paddle.zeros/ones each iteration
  (removes ~20-40µs measurement noise per iteration)

---------

Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com>
2026-04-07 01:36:25 -07:00
huicongyao 095a11d932 fix MTP bugs in TP and overlap (#7172)
* fix MTP bugs in TP and overlap

* fix
2026-04-03 14:19:11 +08:00
Yuanle Liu 1af7f80811 Revert "[BugFix][Speculative Decoding] Correct index calculation in speculate…" (#7133)
This reverts commit ba1aa1edff.
2026-04-01 06:54:23 -07:00
lonelygsh ba1aa1edff [BugFix][Speculative Decoding] Correct index calculation in speculate decoding operators (#7121)
- Fix accept_idx calculation in spec_set_value_by_stop_seqs
- Fix condition check from < to <= for token matching
- Fix accept_tokens indexing logic
- Remove unnecessary -1 in current_step comparison for max_think_len

Co-authored-by: guanshihui <guanshihui@baidu.com>
2026-04-01 05:36:53 -07:00
sunxin c29e86fc9d [Feature] Support mtp overlap schedule (#7001) 2026-04-01 14:24:26 +08:00
huicongyao dd2aa10ed4 fix cuda graph capture failure in CI test (#7094) 2026-03-31 11:05:51 +08:00
huicongyao 25d64efdc4 [Speculative Decoding] Refactor Eagle MTP hidden states copy (#6812)
* reformat eagle_get_hidden_states & eagle_get_self_hidden_states

* readability

* fix xpu bug

* fix coverage failure

* change launch params & parallelize position_map compute

* Fix MTP-related bugs in FastDeploy centralized inference

* fix

* refactor mtp hidden_states process

* fix

* add unittest & optimize kernel

* remove useless code

* fix
2026-03-25 22:54:31 -07:00
freeliuzc 7a6c28781b [Speculative Decoding] Optimize attn_mask_offset and fix mtp bug (#7005)
* optimize attn_mask_offset and optimize mtp usage

* delete useless branch

* fix kernel format

* fix kernel runner
2026-03-25 01:52:06 -07:00
freeliuzc e87ce4b8cd [Speculative Decoding] refactor MTP and optimize spec-decoding postprocess (#6973)
* support new mtp

* refactor(speculate_decoding and mtp): optimize mtp structure logic. Update spec-branch status-process

* fix cuda-graph for spec-decoding

* fix xpu mtp and fix some note

* fix unittest and optimize note

* fix model status update in eos-branch
2026-03-24 10:19:01 +08:00
Yonghua Li 7c8c0a3c02 [BugFix] replace ftok with custom_ftok in get_output/save_output ops (#6822)
* [BugFix] replace ftok with custom_ftok in get_output/save_output ops

* [Test] add unit test for custom_ftok

* [Chore] create custom_ftok.h

* [Chore] reorganize header file

* [Fix] fix cache messager msg_queue_id+rank_id conflict
2026-03-16 14:22:18 +08:00
freeliuzc 12f412448b [Speculative Decoding] Fix speculate stop_seqs and fix accept_num in eos branch (#6825) 2026-03-12 23:48:24 -07:00
huicongyao 2e63d88f7a [Optimization][Speculative Decoding]Fuse padding sampling params (#6765)
* optimize speculate pre process unit test

* Add CUDA kernel for building sampling params in speculative decoding

* init infer seed in device

* format code

* add unittest & fix

* fix

* format-code

* format-code

* fix rebase

* .

* fix unitest
2026-03-12 05:05:15 -07:00
freeliuzc cf7934a4b2 [Speculative Decoding] Unify Spec and non-spec branch (#6685)
* optimize spec-inference architecture

* delete debug log

* optimize spec_method usage  && fix unit_test

* add claude unit-test skill

* fix some ugly bug

* enhance robustness and bounds check

* unify method & spec_method to method to avoid bug

* activate CI

* fix unit test

* Unify logprobs computation for naive and speculative decoding, fix CUDA kernel

* fix logprob bug && optimize verify kernel

* fix exist_decode() judge
2026-03-10 23:58:44 -07:00
wangyifei b57c960837 cuda13.0, implement changes to CCCL (#6751) 2026-03-10 16:47:02 +08:00
sunxin 0dc7034ce0 [Model Runner] Deprecate not_need_stop (#6356)
* Deprecate not_need_stop
2026-03-05 10:55:42 +08:00
gongweibao ddb06ff83f init (#6642)
Co-authored-by: gongweibao <gognweibao@baidu.com>
2026-03-04 21:55:31 +08:00
huicongyao 0f718baaf2 [Speculative Decoding]Reformat input preprocess for spec decode (#6501)
* add speculate_pre_process kernel

* reduce one slice

* make d2h async && fix mtp bug for new pre_process

* fix

* add unitest

* fix: code style formatting

* fix

* fix: thread race in speculate_preprocess && rename d2h event
2026-03-03 10:22:07 +08:00
ming1753 97eee75677 [Feature] GPU Memory Optimization and Retirement of V0 Scheduler (#6407)
* Optim GPU Mem Usage

---------

Co-authored-by: huzesen <huzesen@baidu.com>
2026-02-28 15:07:43 +08:00
cmcamdy 13447279aa [XPU] Fix PD + MTP (#6495)
* fix pd + mtp

* fix code style

* fix PD + MTP, D get P's first token

* add anno for gpu(speculate_update)

* update draft insertv1

* fix wrapper & kernel

* fix wrapper

* fix code style
2026-02-27 19:07:35 +08:00
Yuanle Liu 6d3fede240 [OP][Feature] Unify the limit_thinking_content_length CUDA operator; support response-length limits and injected sequences (#6493)
* Initial plan

* Migrate PRs #6311, #6129, #6305 to develop and merge unit tests

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* fix

* update

* fix

* fix ci

* fix ci

* Initial plan

* test: add test_chat_with_response_max_tokens to test_EB_VL_Lite_serving.py

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* test: add disable-thinking case to test_chat_with_response_max_tokens

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* test: add both reasoning_max_tokens and response_max_tokens case

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* fix ci

* fix ci

* fix ci

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>
2026-02-25 21:36:50 +08:00
周周周 2b4748de4f [MTP] refactor MTP pre_process (#6358) 2026-02-09 10:47:15 +08:00
周周周 8277b95fa6 remove speculate_get_padding_offset op (#6308) 2026-02-03 15:18:12 +08:00
xiaozude 030647521a [Metax] adapt to the latest develop (#6282) 2026-01-29 23:21:20 -08:00
freeliuzc ce06c6dfb3 [BugFix] Fix token_penalty kernel (#6069)
* fix token_penalty kernel

* try to fix xpu

* fix xpu

* fix unit test
2026-01-28 12:03:05 +08:00
周周周 0966df78dc [Others] remove stop_nums (#6182) 2026-01-26 12:12:47 +08:00
freeliuzc 49617d9832 [Feature]Support tag phase token enforce generation (#6034)
* support tag phase token enforce generation

* optimize note and some feature

* fix sampler unit test

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-15 03:59:55 -08:00
chenjian 74d0f1c01f [Optim] Robust sync status when preempted happens (#5796)
* [Bug fix] Sync status for caching output cache

* fix

* fix

* fix bug

* fix

* fix

* support xpu

* fix

* fix

* fix

* fix

* fix

* fix ci

* fix ci

* fix xpu

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-14 12:07:33 +08:00
GoldPancake a1fc4e249e [Bugfix] Fix mtp logprob hang problem when include stop_seq (#5927)
* fix mtp logprob hang when include stop_seq
2026-01-08 14:21:24 +08:00
freeliuzc 9018ccf74e [Speculative Decoding] Fix attn_mask_offset for multi-step MTP in mixed and PD-split modes (#5738)
* fix attn_mask_offset in mtp with multi-step and pd-split-mode

* fix xpu operater register

* update pmtp multi-step mtp strategy in d-split mode

* add note

* fix xpu register
2025-12-25 01:54:59 -08:00
Yuanle Liu 867803ae10 [BugFix] fix speculate_limit_thinking_content_length (#5590)
* fix speculate_limit_thinking_content_length

* update
2025-12-16 04:31:45 -08:00
Copilot e38709b499 [BugFix] Fix limit_thinking early return logic in CUDA kernels (#5471)
* Initial plan

* [BugFix] Fix limit_thinking bug - change AND to OR in condition checks

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* Update Chinese comments to reflect OR logic instead of AND

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>
2025-12-10 11:03:19 +08:00
lizexu123 95eab9f9ee [Feature] support stop_token_ids (#5399)
* support stop_token_ids

* fix

* delete chinese

* support both

* delete print
2025-12-09 17:49:12 +08:00
xiaozude df67379bc3 [Metax] modify wrapSize to WARP_SIZE (#5442) 2025-12-09 01:44:02 -08:00
K11OntheBoat 8d99bac532 Remove CUDA ERROR 9 of inputs of get_padding_offset kernel (#5440)
Co-authored-by: K11OntheBoat <ruianmaidanglao@163.com>
2025-12-09 14:17:30 +08:00
GoldPancake 8545b705ed fix top_p_candidates (#5400)
Co-authored-by: freeliuzc <lzc842650834@gmail.com>
2025-12-05 20:01:05 +08:00
GoldPancake cfc5b0ccf9 [BugFix] fix mtp logprob bugs in chunk prefill (#5244)
* fix mtp logprob bugs in chunk prefill

* fix

* fix
2025-11-27 11:31:29 +08:00
freeliuzc f1e36ff2f7 [Speculative Decoding][MTP]Support stop_seqs and pd-split mode (#5029)
* support multi_stop_seqs in speculative decoding

* support mtp tp with ep split

* fix custom op register

* fix spec stop_seqs params
2025-11-20 15:26:01 +08:00
freeliuzc 11398790d3 [Speculative Decoding][MTP]Support attn mask offset (#4641)
* [MTP]Merge support attn (#4591)

* support mask_offset in speculate decoding

* fix dummy run output

* add unit test

* fix unit test import

* support attn_mask_offset in mtp mode

* add update_attn_mask op

* fix unit test && fix code-style
2025-11-03 10:08:01 +08:00
freeliuzc f44f4bafd1 support mtp in splitewise and scheduler_v1 mode (#4743) 2025-11-03 10:07:15 +08:00
Yuanle Liu b301bd6c31 [BugFix] fix thinking bug (#4710)
* fix thinking bug

* fix ut

* update

* fix
2025-10-31 22:00:31 +08:00
GoldPancake 1f3ce65b58 [Feature] support mtp distribution equivalence verification (#4699)
2025-10-31 11:45:04 +08:00
Yuanle Liu cef3164c3b Optimizing the performance of think length limit using custom operators (#4279)
* delete impl

* delete min_length&max_length

* support limit thinking content strategy

* fix

* fix

* fix

* update

* fix set_value_by_flags_and_idx

* fix

* fix

* fix

* fix

* update

* fix

* fix

* fix typo

* fix ci

* fix

* fix

* support mtp

* fix

* fix

* update

* update
2025-10-20 21:09:13 +08:00
GoldPancake 47595a2480 [Feature] support mtp logprob (#4464)
* support mtp logprob

* fix unitest
2025-10-20 15:18:12 +08:00
freeliuzc 582aebd48b [MTP]support mtp chunk_prefill_v1 (#4366)
* support mtp chunk_prefill_v1

* fix mtp chunkprefill output, fix unit test

* fix unit test

* fix save_output
2025-10-15 13:21:32 +08:00
freeliuzc 365601ea5a [MTP]support more branchs in topp kernel (#4352) 2025-10-11 11:33:52 +08:00
RAM aa27b03bc0 [Executor]CUDAGraph support Speculate Decode (#3769)
* success run ngram

* Revert "[Code Simplification] remove cum_offsets (#3410)"

This reverts commit 32b39620bc.

* success run ngram5 tp4 42bs

* success run ngram5 tp4 42bs

* mtp draft commit

* add decorator for target model

* enable draft model in cudagraph v0.5

* revert revrt cum_offset

* enable target model in cudagraph v0.9 And clean debug code

* Revert "success run ngram"

This reverts commit 8351e83993.

* add reverted code

* enable target model in cudagraph v0.9

* solve comment

* fix bid < 0

* Enable Target Model Padding And Draft Model in cudagraph

* solve problem

* delete rebuild padding debug note

* fast compile

* Add capture list for mtp

* success run 256 tp1 mtp

* Enable Lite TP2 Bsz256

* realy enable tp2 bsz 256

* fix problem

* Solve problem for Draft model in cudagraph

* Solve comment

* replace emptytensor as zeros

* Solve comments

* Revert "fast compile"

This reverts commit 834639a7ff.

* fix bug

* fix merge bug

* fix typo

* fix bug

---------

Co-authored-by: lizexu <2694294196@qq.com>
Co-authored-by: littledgg <1658565283@qq.com>
Co-authored-by: zeroRains <linjunlu@zerorains.top>
Co-authored-by: gongshaotian <gstain5555@outlook.com>
2025-10-09 21:18:29 +08:00
co63oc 30a1c1783f rename eagle_get_base_model_hidden_states.cu (#3753)
2025-09-07 10:24:58 +08:00