mirror of
https://github.com/PaddlePaddle/FastDeploy.git
synced 2026-04-23 00:17:25 +08:00
[Feature] Support EP prefill with num_worst_tokens (#6574)
* support num worst tokens
* fix build error
* support num worst tokens: fix errors
* support num worst tokens: fix field
* support num worst tokens: delete requirements
* replace permute and depermute op by pure cuda
* fix ci
* fix op
* fix nan
* fix code style

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
@@ -326,8 +326,9 @@ class BlockWiseFP8LinearMethod(QuantMethodBase):
             return linear_out

         if not fastdeploy.envs.FD_USE_PHI_FP8_QUANT:
             x, x_scale_tensor = fastdeploy.model_executor.ops.gpu.per_token_quant_padding(
-                x, self.quant_config.weight_block_size[0]
+                x, self.quant_config.weight_block_size[0], self.quant_config.deepgemm_scale_ue8m0
             )
             x_scale_tensor = x_scale_tensor[: x.shape[0], ...]
         else:
             x, x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
                 x,
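For context, the diff above threads a `deepgemm_scale_ue8m0` flag into the per-token quantization op. Below is a minimal NumPy sketch of what per-token blockwise FP8 quantization does in general: each token row is split into blocks, each block gets its own scale so its maximum maps onto the FP8 e4m3 range, and (hypothetically, for the ue8m0 mode) scales are rounded up to powers of two so they fit an 8-bit exponent encoding. The function name, the ue8m0 rounding, and all details here are illustrative assumptions, not FastDeploy's actual kernel.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest magnitude representable in float8 e4m3


def per_token_quant_blockwise(x, block_size, scale_ue8m0=False):
    """Illustrative per-token blockwise FP8 quantization (not the real op).

    Each row (token) of `x` is split into blocks of `block_size` along the
    hidden dimension; every block gets its own scale chosen so the block
    maximum maps to the FP8 e4m3 range.
    """
    n_tokens, hidden = x.shape
    assert hidden % block_size == 0, "hidden dim must be divisible by block size"
    blocks = x.reshape(n_tokens, hidden // block_size, block_size)
    amax = np.abs(blocks).max(axis=-1, keepdims=True)
    scale = np.maximum(amax, 1e-12) / FP8_E4M3_MAX
    if scale_ue8m0:
        # Assumed ue8m0 behavior: round each scale up to a power of two so it
        # can be stored as a bare 8-bit exponent (no sign, no mantissa).
        scale = np.exp2(np.ceil(np.log2(scale)))
    q = np.clip(blocks / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q.reshape(n_tokens, hidden), scale.squeeze(-1)
```

Dequantizing (`q * scale` per block) approximately recovers the input; the power-of-two mode trades a little precision for a cheaper scale encoding.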