mirror of
https://github.com/PaddlePaddle/FastDeploy.git
synced 2026-04-23 08:21:53 +08:00
[Speculative Decoding] Unify Spec and non-spec branch (#6685)
* optimize spec-inference architecture
* delete debug log
* optimize spec_method usage && fix unit_test
* add claude unit-test skill
* fix some ugly bug
* enhance robustness and bounds check
* unify method & spec_method to method to avoid bug
* activate CI
* fix unit test
* Unify logprobs computation for naive and speculative decoding, fix CUDA kernel
* fix logprob bug && optimize verify kernel
* fix exist_decode() judge
@@ -159,7 +159,7 @@ class CudaGraphPiecewiseBackend:
         real_shape = ids_remove_padding.shape[0]
         if self.speculative_decoding and all(self.real_bsz_to_captured_size.values()):
             seq_lens_this_time: paddle.Tensor = kwargs["forward_meta"].seq_lens_this_time
-            num_running_requests = seq_lens_this_time.flatten().nonzero(as_tuple=False)[-1].item() + 1
+            num_running_requests = int((seq_lens_this_time.flatten() > 0).sum().item())
             real_shape = self.real_bsz_to_captured_size[num_running_requests]
         exist_prefill = kwargs["forward_meta"].exist_prefill
         # Static split graph mode: use Static + CUDAGraph for prefill/mixed phase
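The changed line swaps how `num_running_requests` is derived from `seq_lens_this_time`: the old expression takes the index of the last nonzero entry plus one, while the new one counts the entries that are positive. The two disagree whenever finished requests leave zero-length gaps before the last active slot. A minimal sketch of the difference, using a plain Python list in place of a `paddle.Tensor` (the values are hypothetical, not from the PR):

```python
# Hypothetical per-request token counts for this step; slots 1 and 4
# are finished requests (length 0).
seq_lens_this_time = [3, 0, 5, 2, 0]

# Old approach: position of the last nonzero entry, plus one.
# Gaps before that position are counted as if still running.
last_nonzero = max(i for i, v in enumerate(seq_lens_this_time) if v != 0)
old_count = last_nonzero + 1  # -> 4 here, overcounting the gap at slot 1

# New approach: number of entries with a positive length,
# i.e. only requests that actually have work this step.
new_count = sum(1 for v in seq_lens_this_time if v > 0)  # -> 3 here
```

With a contiguous batch (no zero-length gaps) the two agree; the new form is the one that stays correct once requests complete out of order, which matters because the result indexes into `real_bsz_to_captured_size` to pick a captured CUDA graph shape.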