# Quantization
FastDeploy supports quantized inference at a range of precisions, including FP8, INT8, INT4, and 2-bit. Weights, activations, and KVCache tensors can each use a different precision, meeting the requirements of scenarios such as low cost, low latency, and long context.
## 1. Precision Support List
| Quantization Method | Weight Precision | Activation Precision | KVCache Precision | Online/Offline | Supported Hardware |
|---|---|---|---|---|---|
| WINT8 | INT8 | BF16 | BF16 | Online | GPU, XPU |
| WINT4 | INT4 | BF16 | BF16 | Online | GPU, XPU |
| Block-wise FP8 | block-wise static FP8 | token-wise dynamic FP8 | BF16 | Online | GPU |
| WINT2 | 2-bit | BF16 | BF16 | Offline | GPU |
| MixQuant | INT4/INT8 | INT8/BF16 | INT8/BF16 | Offline | GPU, XPU |
Notes:
- Quantization Method: corresponds to the `quantization` field in the quantization configuration file (see the launch sketch after these notes).
- Online/Offline Quantization: mainly distinguishes when the weights are quantized.
    - Online Quantization: the weights are quantized after being loaded into the inference engine.
    - Offline Quantization: the weights are quantized offline before inference and stored in low-bit numerical types; during inference, the pre-quantized low-bit values are loaded directly.
- Dynamic/Static Quantization: mainly distinguishes how the activations are quantized.
    - Static Quantization: quantization coefficients are computed and stored before inference, and the pre-calculated coefficients are loaded at inference time. Because the coefficients remain fixed (static) during inference, this is called static quantization.
    - Dynamic Quantization: quantization coefficients for the current batch are computed in real time during inference. Because the coefficients change dynamically during inference, this is called dynamic quantization.
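As a concrete illustration of selecting a quantization method, the sketch below enables online WINT8 quantization through FastDeploy's offline inference API. It assumes a vLLM-style `LLM` entry point that accepts a `quantization` argument whose value matches the "Quantization Method" column above; the model path, parallelism setting, and sampling options are placeholders, so check the deployment documentation of your FastDeploy version for the exact supported arguments.

```python
# Minimal sketch (not verbatim from the FastDeploy docs): enable online WINT8
# quantization when loading a model. Assumes `fastdeploy.LLM` accepts a
# `quantization` argument matching the "Quantization Method" column above;
# the model path and parallelism setting are placeholders.
from fastdeploy import LLM, SamplingParams

llm = LLM(
    model="baidu/ERNIE-4.5-300B-A47B-Paddle",  # placeholder model path
    quantization="wint8",                      # weights quantized to INT8 after loading (online)
    tensor_parallel_size=8,
)

outputs = llm.generate(
    ["Explain quantization in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=64),
)
for out in outputs:
    print(out)  # the exact structure of the returned object may vary by version
```

For offline-quantized checkpoints, the method is typically recorded in the model's quantization configuration file and picked up when the weights are loaded, so a launch-time argument should not be needed; again, verify this against the documentation for your version.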
## 2. Model Support List
| Model Name | Supported Quantization Precision |
|---|---|
| ERNIE-4.5-300B-A47B | WINT8, WINT4, Block-wise FP8, MixQuant |
## 3. Quantization Precision Terminology
FastDeploy names quantization precisions using the following pattern (a small decoding sketch follows the examples below):
`{tensor abbreviation}{numerical type}{tensor abbreviation}{numerical type}{tensor abbreviation}{numerical type}`
Examples:
- W8A8C8: W=weights, A=activations, C=KVCache; 8 defaults to INT8
- W8A8C16: 16 defaults to BF16, others same as above
- W4A16C16 / WInt4 / weight-only int4: 4 defaults to INT4
- WNF4A8C8: NF4 refers to the 4-bit norm-float (NormalFloat) numerical type
- Wfp8Afp8: Both weights and activations are FP8 precision
- W4Afp8: Weights are INT4, activations are FP8
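Purely to illustrate the naming pattern above (this helper is not part of FastDeploy), a short sketch that decodes such a precision string into per-tensor numerical types, using the defaults listed in the examples, could look like this:

```python
import re

# Illustrative only -- not a FastDeploy API. Decodes precision names such as
# "W8A8C8" or "W4Afp8" into per-tensor numerical types, using the defaults
# listed above (8 -> INT8, 16 -> BF16, 4 -> INT4).
TENSOR_NAMES = {"W": "weights", "A": "activations", "C": "KVCache"}
DEFAULT_TYPES = {"8": "INT8", "16": "BF16", "4": "INT4"}

def decode_precision(name: str) -> dict:
    decoded = {}
    # Each component is a tensor letter (W/A/C) followed by a numerical type,
    # e.g. "W8", "Afp8", "WNF4".
    for tensor, num_type in re.findall(r"([WAC])((?:nf|fp)?\d+)", name, flags=re.IGNORECASE):
        decoded[TENSOR_NAMES[tensor.upper()]] = DEFAULT_TYPES.get(num_type, num_type.upper())
    return decoded

print(decode_precision("W8A8C8"))    # {'weights': 'INT8', 'activations': 'INT8', 'KVCache': 'INT8'}
print(decode_precision("W4Afp8"))    # {'weights': 'INT4', 'activations': 'FP8'}
print(decode_precision("WNF4A8C8"))  # {'weights': 'NF4', 'activations': 'INT8', 'KVCache': 'INT8'}
```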