[Feature] Support EP prefill with num_worst_tokens (#6574)

* support num worst tokens

* support num worst tokens

* fix build error

* support num worst tokens: fix errors

* support num worst tokens: fix field

* support num worst tokens: delete requirements

* replace permute and depermute op by pure cuda

* replace permute and depermute op by pure cuda

* fix ci

* fix op

* fix nan

* fix code style

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
Author: RichardWooSJTU
Date: 2026-03-11 17:09:07 +08:00
Committed by: GitHub
Parent: 0466c7e8a8
Commit: 9f0778f991
21 changed files with 1775 additions and 166 deletions
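The commit messages mention replacing the permute and depermute ops with pure CUDA. Those ops are not shown in this excerpt, so the following is only a minimal NumPy sketch of what a token permute/depermute pair typically does in expert-parallel MoE dispatch: group tokens by assigned expert, then restore the original order after expert compute. The function names and shapes here are illustrative, not FastDeploy's actual API.

```python
import numpy as np

def permute_tokens(tokens, expert_ids):
    """Group tokens by their assigned expert (the 'permute' step in MoE dispatch)."""
    # Stable sort keeps the relative order of tokens routed to the same expert.
    order = np.argsort(expert_ids, kind="stable")
    return tokens[order], order

def depermute_tokens(permuted, order):
    """Restore the original token order (the 'depermute' step after expert compute)."""
    restored = np.empty_like(permuted)
    restored[order] = permuted
    return restored

tokens = np.arange(8, dtype=np.float32).reshape(4, 2)  # 4 tokens, hidden size 2
expert_ids = np.array([2, 0, 1, 0])                    # expert assignment per token
permuted, order = permute_tokens(tokens, expert_ids)
assert np.array_equal(depermute_tokens(permuted, order), tokens)
```

A real CUDA kernel would fuse the gather/scatter instead of materializing `order`, but the round-trip invariant is the same.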
@@ -326,8 +326,9 @@ class BlockWiseFP8LinearMethod(QuantMethodBase):
             return linear_out
         if not fastdeploy.envs.FD_USE_PHI_FP8_QUANT:
             x, x_scale_tensor = fastdeploy.model_executor.ops.gpu.per_token_quant_padding(
-                x, self.quant_config.weight_block_size[0]
+                x, self.quant_config.weight_block_size[0], self.quant_config.deepgemm_scale_ue8m0
             )
+            x_scale_tensor = x_scale_tensor[: x.shape[0], ...]
         else:
             x, x_scale_tensor = paddle.incubate.nn.functional.fp8_quant_blockwise(
                 x,
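The hunk above quantizes activations per token, with the kernel padding rows to a block multiple and the caller slicing the scale tensor back to match. A minimal NumPy sketch of that pattern follows; `FP8_MAX`, the padding scheme, and the scale formula are assumptions for illustration, not the `per_token_quant_padding` op's actual semantics.

```python
import numpy as np

FP8_MAX = 448.0  # max finite value of float8 e4m3 (assumed target format)

def per_token_quant_padding(x, block):
    """Sketch: per-token quantization with the row count padded to a block multiple.

    The kernel works on full blocks of rows, so both the quantized tensor and
    the per-token scales come back padded; the caller slices the scales to the
    row count it actually needs, as the diff does with
    `x_scale_tensor[: x.shape[0], ...]`.
    """
    rows = x.shape[0]
    padded_rows = (rows + block - 1) // block * block
    xp = np.zeros((padded_rows, x.shape[1]), dtype=np.float32)
    xp[:rows] = x
    # One scale per token (row), chosen so the row's max magnitude maps to FP8_MAX.
    scale = np.maximum(np.abs(xp).max(axis=1, keepdims=True), 1e-12) / FP8_MAX
    # Stand-in for the fp8 cast: scale, round, clip to the representable range.
    xq = np.clip(np.round(xp / scale), -FP8_MAX, FP8_MAX)
    return xq, scale

x = np.random.randn(5, 16).astype(np.float32)
xq, scale = per_token_quant_padding(x, block=4)   # rows padded 5 -> 8
scale = scale[: x.shape[0]]                        # slice padding off, as in the diff
assert scale.shape == (5, 1)
```

Padding the row dimension lets the kernel avoid bounds checks inside a block; the cost is the one-line slice on the scale tensor that this commit adds.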