mirror of
https://github.com/PaddlePaddle/FastDeploy.git
synced 2026-04-23 08:21:53 +08:00
[Speculative Decoding] Unify Spec and non-spec branch (#6685)
* optimize spec-inference architecture
* delete debug log
* optimize spec_method usage && fix unit_test
* add claude unit-test skill
* fix some ugly bug
* enhance robustness and bounds check
* unify method & spec_method to method to avoid bug
* activate CI
* fix unit test
* Unify logprobs computation for naive and speculative decoding, fix CUDA kernel
* fix logprob bug && optimize verify kernel
* fix exist_decode() judge
@@ -159,7 +159,7 @@ class CudaGraphPiecewiseBackend:
         real_shape = ids_remove_padding.shape[0]
         if self.speculative_decoding and all(self.real_bsz_to_captured_size.values()):
             seq_lens_this_time: paddle.Tensor = kwargs["forward_meta"].seq_lens_this_time
-            num_running_requests = seq_lens_this_time.flatten().nonzero(as_tuple=False)[-1].item() + 1
+            num_running_requests = int((seq_lens_this_time.flatten() > 0).sum().item())
             real_shape = self.real_bsz_to_captured_size[num_running_requests]
         exist_prefill = kwargs["forward_meta"].exist_prefill
         # Static split graph mode: use Static + CUDAGraph for prefill/mixed phase
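The changed line swaps how `num_running_requests` is derived from `seq_lens_this_time`: the old expression takes the index of the last nonzero entry plus one, while the new one counts the entries that are positive. The two disagree whenever finished requests leave zero-length gaps before the last active slot. A minimal sketch of the difference, using a plain Python list in place of a `paddle.Tensor` (the values are hypothetical, not from the PR):

```python
# Hypothetical per-request token counts for this step; slots 1 and 4
# are finished requests (length 0).
seq_lens_this_time = [3, 0, 5, 2, 0]

# Old approach: position of the last nonzero entry, plus one.
# Gaps before that position are counted as if still running.
last_nonzero = max(i for i, v in enumerate(seq_lens_this_time) if v != 0)
old_count = last_nonzero + 1  # -> 4 here, overcounting the gap at slot 1

# New approach: number of entries with a positive length,
# i.e. only requests that actually have work this step.
new_count = sum(1 for v in seq_lens_this_time if v > 0)  # -> 3 here
```

With a contiguous batch (no zero-length gaps) the two agree; the new form is the one that stays correct once requests complete out of order, which matters because the result indexes into `real_bsz_to_captured_size` to pick a captured CUDA graph shape.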