* support new mtp
* refactor(speculate_decoding and mtp): optimize mtp structure logic; update spec-branch status processing
* fix cuda-graph for spec-decoding
* fix xpu mtp and fix some comments
* fix unittest and optimize comments
* fix model status update in eos-branch
* [RL] Support chunked part files loading in IPC snapshot strategy
## Motivation
When using IPC snapshot for elastic recovery in RL training, loading a single large pdparams file causes a significant memory spike. This PR refactors `_update_ipc_snapshot` to support loading chunked part files to avoid the memory spike.
## Modifications
Refactored `_update_ipc_snapshot` in `fastdeploy/rl/dynamic_weight_manager.py` with a three-level loading priority:
1. **Chunked part files** (`model_state.tpR{id}.part{N}.pdparams`): Load multiple smaller shards sequentially, freeing memory between each chunk via `gc.collect()` to avoid a memory spike.
2. **Single full file** (`model_state.tpR{id}.pdparams`): Legacy single-file loading path (preserved for backward compatibility).
3. **Shared fallback directory** (`/shared_ipc_meta/...`): Oldest legacy fallback path (preserved for backward compatibility).
Also fixed the rank ID in the file name pattern from hardcoded `tp0` to dynamic `paddle.distributed.get_rank()`.
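The chunked loading path described above can be sketched as follows. This is an illustrative helper, not the code in `fastdeploy/rl/dynamic_weight_manager.py`: the filename pattern is simplified to `tp{rank}`, and `load_fn` stands in for `paddle.load`.

```python
import gc
import os
import re

def load_chunked_state(snapshot_dir, rank, load_fn):
    """Illustrative sketch of priority 1 (chunked part files).
    load_fn stands in for paddle.load; names are simplified."""
    pattern = re.compile(r"model_state\.tp%d\.part(\d+)\.pdparams$" % rank)
    parts = []
    for name in os.listdir(snapshot_dir):
        match = pattern.match(name)
        if match:
            parts.append((int(match.group(1)), name))
    state = {}
    for _part_idx, name in sorted(parts):
        chunk = load_fn(os.path.join(snapshot_dir, name))
        state.update(chunk)
        del chunk
        gc.collect()  # free each shard before reading the next one
    return state
```

Peak memory is then bounded by one shard plus the merged state so far, rather than the full checkpoint plus the merged state.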
## Checklist
- [ ] Add at least a tag in the PR title.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Co-Authored-By: lishuaihui <lishuaihui@baidu.com>
* [RL][BugFix] Fix ambiguous model path format and add legacy fallback in IPC snapshot
## Motivation
The previous snapshot file naming `model_state.tp{rank}{id}` concatenated
rank and id without a separator, causing ambiguity (e.g., rank=1, id=234
and rank=12, id=34 both produce `tp1234`). Additionally, after the naming
format is updated, existing checkpoints saved in the old format would fail
to load during elastic recovery, causing unnecessary failures.
## Modifications
- Add dot separator between rank and id in snapshot file name:
`model_state.tp{rank}{id}` → `model_state.tp{rank}.{id}`
- Add Priority 3 legacy fallback to load old-format files
(`model_state.tp0{id}.pdparams`) for backward compatibility during
rolling upgrades
- Update docstring and error message to reflect the new 4-level priority
Co-Authored-By: lishuaihui <lishuaihui@baidu.com>
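The collision and its fix can be demonstrated with a pair of illustrative helpers (these mirror the naming change described above; they are not the FastDeploy code itself):

```python
def old_snapshot_name(rank, snapshot_id):
    """Old format with no separator: (rank=1, id=234) and
    (rank=12, id=34) both produce 'model_state.tp1234'."""
    return "model_state.tp%d%d" % (rank, snapshot_id)

def snapshot_name(rank, snapshot_id):
    """New dotted format: the separator makes (rank, id) recoverable."""
    return "model_state.tp%d.%d" % (rank, snapshot_id)

def parse_snapshot_name(name):
    """Recover (rank, id) from the dotted format; this inverse does
    not exist for the old, separator-free format."""
    _prefix, tp_part, id_part = name.rsplit(".", 2)
    return int(tp_part[2:]), int(id_part)
```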
* [RL][Test] Add unit tests for DynamicWeightManager._update_ipc_snapshot
Cover all 4 loading priority branches (chunked part files, single full
pdparams, legacy format, shared directory fallback) with mock-based
tests to verify correct behavior without filesystem or GPU dependencies.
Co-Authored-By: lishuaihui <lishuaihui@baidu.com>
* [RL][Test] Remove unused import 'call' in test_update_ipc_snapshot.py
Co-Authored-By: lishuaihui <lishuaihui@baidu.com>
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* [RL] Fix snapshot part index to match filename numbering
Parse part index from filename (e.g. .part0.) instead of using
enumerate index, so that logs and src_type stay consistent with
the actual file naming convention.
Co-Authored-By: wikilsh <wiki_hui@qq.com>
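A minimal sketch of the fix: parse the index out of the filename rather than trusting the position from `enumerate()`, so a missing part number cannot silently shift the logged indices (illustrative helper, not the actual code):

```python
import re

def part_index(filename):
    """Read the part number from names like
    'model_state.tp0.part3.pdparams'."""
    match = re.search(r"\.part(\d+)\.", filename)
    if match is None:
        raise ValueError("no part index in %r" % filename)
    return int(match.group(1))
```

For a listing like `[..part0.., ..part2..]`, `enumerate` would report indices `[0, 1]` while the files on disk say `[0, 2]`; parsing keeps logs and `src_type` aligned with the actual names.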
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* 【Hackathon 10th Spring No.32】Unit test for load_weight_utils.py
* [CI]【Hackathon 10th Spring No.32】rewrite load_weight_utils unit test
* [CI]【Hackathon 10th Spring No.32】improve load_weight_utils coverage to 83%
- Add test_load_ep_checkpoint_basic: exercises EP checkpoint loading with minimal fixture
- Add test_composite_ep_branch: covers EP path in load_composite_checkpoint
- Add test_get_weight_iterator_unordered: covers unordered sharded safetensors path
* [CI]【Hackathon 10th Spring No.32】align load_weight_utils test with gold standard (tmp_path, split tests)
* [CI]【Hackathon 10th Spring No.32】add coverage tests for load_weight_utils
- Add test_is_layers_grouped: test layers_are_grouped() with grouped, interleaved, and no-layer keys
- Add test_save_model_bf16_cache: exercise save_model decorator with is_checkpoint_bf16=True
- Add test_composite_checkpoint_ep: test load_composite_checkpoint use_ep=True branch
- Add test_composite_checkpoint_rank_mismatch: test tp_size != rank_dirs ValueError
- Add test_composite_checkpoint_kv_quant: test float8_e4m3fn kv_cache path
- Add __main__ block for direct execution
* [CI]【Hackathon 10th Spring No.32】raise load_weight_utils test delta
* [CI]【Hackathon 10th Spring No.32】cover TP sequence-parallel MoE load branches
* test: add load_reordered_experts, pre-sharded, and empty-state tests
---------
Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com>
* remove process_request
* fix chat
* fix unit test
* remove process response
* fix unit test
* fix offline decode
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix sampling_params
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* add batched zmq response sending
* update
* Revert "update"
This reverts commit 0234a25b47.
* update
* remove lock
* fix unit test
* add unit test
* add unit test
* pre commit
* add unit test
* fix unit test
* add unit test
* fix worker>1
* update zmq_worker_pid
* fix unit test
* fix unit test
* fix unit test
* add unit test
* fix unit test
* fix first token time
* fix logprobs
* add unit test
* op
* remove debug log
---------
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
* [BugFix] Force top_k=1 for greedy decoding when temperature=0
When temperature is set to 0 (greedy decoding), only setting temperature
to a small epsilon is insufficient — the sampling kernel may still pick
non-top-1 tokens. Explicitly set top_k=1 in all processors to guarantee
argmax behavior.
Additionally, add argmax fast-path in top_k_top_p_sampling() under
FD_DETERMINISTIC_MODE to handle non-rejection sampling backends that
ignore top_k parameter.
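The override and the all-greedy fast path can be sketched in plain Python (illustrative names and shapes; the real processors operate on batched tensors):

```python
def apply_greedy_override(temperatures, top_ks):
    """Processor-side override: wherever temperature is 0 (greedy),
    force top_k=1 so the sampler can only return the argmax token."""
    return [1 if t == 0.0 else k for t, k in zip(temperatures, top_ks)]

def sample_greedy_fast_path(logits, temperatures, top_ks):
    """All-greedy fast path: when every request has top_k == 1 after
    the override, a plain per-row argmax is cheaper and deterministic.
    Mixed batches fall through to the normal sampler (elided here)."""
    top_ks = apply_greedy_override(temperatures, top_ks)
    if all(k == 1 for k in top_ks):
        return [max(range(len(row)), key=row.__getitem__) for row in logits]
    raise NotImplementedError("mixed-batch path elided in this sketch")
```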
* Extract greedy decoding from FD_DETERMINISTIC_MODE guard
top_k=1 → argmax is a correctness optimization, not deterministic-specific.
Remove the FD_DETERMINISTIC_MODE guard so all-greedy fast-path and
mixed-batch override work unconditionally.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Update test_torch_model.py
---------
Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
* [Feature] Register to router with version info for PD disaggregation
Add RegisterManager for PD (Prefill-Decode) disaggregated deployment:
- All instances (Prefill/Decode) register to Router with heartbeat
- Prefill instances fetch Decode instance list from Router
- Prefill instances establish eager RDMA connections to Decode instances
- Register info includes: host_ip, port, role, version, is_paused, connected_decodes
Changes:
- Add RegisterManager class for managing PD registration and RDMA connections
- Add version field to ModelConfig for model version tracking
- Add connected_decodes to register_info for tracking connected Decode instances
- Add FD_ENABLE_PD_RDMA_EAGER_CONNECT environment variable
Test fixes:
- Add None checks for load_config in FDConfig.__init__
- Add version attribute to test mock model configs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
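For reference, the registration payload listed above has roughly this shape. Only the field names come from this PR; the exact wire format and the `timestamp` field are assumptions for illustration:

```python
import time

def build_register_info(host_ip, port, role, version, is_paused,
                        connected_decodes):
    """Illustrative shape of the per-instance heartbeat payload sent
    to the Router; field list from the PR, structure assumed."""
    return {
        "host_ip": host_ip,
        "port": port,
        "role": role,  # "prefill" or "decode"
        "version": version,
        "is_paused": is_paused,
        "connected_decodes": connected_decodes,
        "timestamp": time.time(),  # hypothetical heartbeat field
    }
```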
* refine
* remove test
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* [Feature] Add Triton unified attention kernel for deterministic inference
Add a Triton-based unified extend attention kernel that processes both
prefix (cached) and extend (new) KV tokens through a single kernel with
unified kv_indices, ensuring identical accumulation order regardless of
cache hit/miss patterns.
Key components:
- _fwd_kernel_unified: Triton JIT kernel with online softmax, paged KV
cache support, and causal masking for prefix+extend
- Index building utilities: triton_cumsum_with_zero_prefix,
build_kv_indices_from_block_tables, build_unified_kv_indices,
_scatter_extend_kv_indices_kernel (all CUDA Graph compatible)
- pre_cache_len_concat_triton: GPU-only replacement for C++ op
- Reference implementations (_ref variants) for correctness validation
- Comprehensive tests: kernel correctness, split invariance,
determinism, production-scale, cross-validation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
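The index-building step can be modeled in plain Python (the Triton helpers compute this on-GPU; these sketches only show the semantics, with made-up slot numbers):

```python
def cumsum_with_zero_prefix(lens):
    """Exclusive prefix sums: per-request KV lengths become start
    offsets into one flat kv_indices array."""
    offsets = [0]
    for n in lens:
        offsets.append(offsets[-1] + n)
    return offsets

def build_unified_kv_indices(prefix_indices, extend_indices):
    """Per request, concatenate cached-prefix KV slots and new extend
    slots into one contiguous index list, so attention accumulates in
    the same order whether or not the prefix was served from cache."""
    unified, lens = [], []
    for pre, ext in zip(prefix_indices, extend_indices):
        unified += list(pre) + list(ext)
        lens.append(len(pre) + len(ext))
    return unified, cumsum_with_zero_prefix(lens)
```

Because the kernel walks one unified index list per request, a cache hit and a cache miss visit the same KV slots in the same order, which is what makes the accumulation split-invariant.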
* Vectorize causal mask in test references for ~26x speedup
Replace triple Python for-loop with paddle.where vectorized mask in
naive_attention and _build_causal_mask. seq4096 test: 2m39s -> 6s.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
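The vectorization pattern looks roughly like this numpy stand-in for the `paddle.where` version (illustrative, not the test code itself): one broadcasted comparison replaces O(q_len * kv_len) Python iterations.

```python
import numpy as np

def build_causal_mask(q_len, kv_len):
    """Vectorized causal mask: query i may attend to kv position j
    iff j <= i + (kv_len - q_len). Returns 0 where allowed, -inf
    where masked."""
    q_pos = np.arange(q_len)[:, None] + (kv_len - q_len)
    kv_pos = np.arange(kv_len)[None, :]
    return np.where(kv_pos <= q_pos, 0.0, -np.inf)
```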
* fix cover
---------
Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* [Feature] Add batch-invariant RMSNorm kernel and TP embedding Custom AR path
- Add Triton-based rms_norm_batch_invariant kernel for M-invariant RMSNorm
- Add linear/linear_v2 tracking wrappers in batch_invariant_mode
- Route TP VocabParallelEmbedding through Custom AR instead of NCCL
- Increase FD_CUSTOM_AR_MAX_SIZE_MB default from 8 to 64
- Add unit tests for RMSNorm and TP embedding invariance
* [Fix] Fix test tolerances for bfloat16 RMSNorm and custom AR buffer size
- Relax bfloat16 atol from 1e-3 to 1e-2 for D=3584 in RMSNorm numerical
correctness test (0.0078125 diff is expected at bfloat16 precision)
- Update test_communication expected buffer size from 8MB to 64MB to match
FD_CUSTOM_AR_MAX_SIZE_MB default change in envs.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add RMSNorm layer batch_invariant_mode unit test for coverage
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add pragma no cover for Triton kernel and multi-GPU embedding path
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* style: format tests/engine/test_common_engine.py with black
* test: expand common engine coverage
* test: add coverage helper for common_engine
* style: format test_common_engine with pre-commit
* Remove test_force_coverage_for_common_engine test
* Update common engine coverage tests
Expand common engine tests and helpers while
aligning setup and cleanup behavior.
* Fix test_schedule_request_to_worker_v1 by mocking num_tasks to return 0
* Sync test_common_engine with branch 26
* chore: fix codestyle in common engine tests
---------
Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com>
* optimize speculate pre-process unit test
* Add CUDA kernel for building sampling params in speculative decoding
* init infer seed in device
* format code
* add unittest & fix
* fix
* format-code
* format-code
* fix rebase
* .
* fix unittest
* cherry-pick Support Fully Async and PrefixCache step 1
* copy routing_indices_cache.py from 2.4
* cherry-pick [RL] R3 Fix the bug for determining the end of a request (#6388)
* cherry-pick [RL] Clear Requests status of R3 (#6569)
* delete code
* fix rename bug
* fix status shape bug
* fix ci
* support num worst tokens
* support num worst tokens
* fix build error
* support num worst tokens: fix errors
* support num worst tokens: fix field
* support num worst tokens: delete requirements
* replace permute and depermute op by pure cuda
* replace permute and depermute op by pure cuda
* fix ci
* fix op
* fix nan
* fix code style
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>