Files
FastDeploy/docs/quantization/nvfp4.md
T
lizexu123 5f612a348d [BugFix] fix flashinfer-cutedsl moe nvfp4 (#7120)
* fix nvfp4

* fix

* add document

* fix nvfp4

* support eb5

* support bka

* support eb5

* support xpu

* fix

* fix

* add import cutedsl

* fix

* fix

* fix test

* fix H卡

* update document

* fix

* update document

* update document

* fix
2026-04-03 15:43:19 +08:00

4.2 KiB

简体中文

NVFP4 Quantization

NVFP4 is an innovative 4-bit floating-point format introduced by NVIDIA. For detailed information, please refer to Introducing NVFP4 for Efficient and Accurate Low-Precision Inference.

Based on FlashInfer, Fastdeploy supports NVFP4 quantized model inference in the format produced by Modelopt.

  • Note: Currently, this feature only supports FP4 quantized models of Ernie/Qwen series.

How to Use

Environment Setup

Supported Environment

  • Supported Hardware: GPU sm >= 100
  • PaddlePaddle Version: 3.3.0 or higher
  • Fastdeploy Version: 2.5.0 or higher

FastDeploy Installation

Please ensure that FastDeploy is installed with NVIDIA GPU support. Follow the official guide to set up the base environment: Fastdeploy NVIDIA GPU Environment Installation Guide.

FlashInfer-cutedsl backend

PaddlePaddle Compatibility Patches for FlashInfer

Due to compatibility issues between FlashInfer and PaddlePaddle, you need to apply the following patches in miniconda/envs/<your_env>/lib/python3.10/site-packages/:

  1. nvidia_cutlass_dsl/python_packages/cutlass/torch.py

    Replace torch.device with "torch.device" (as a string to avoid conflicts).

  2. flashinfer/utils.py

Modify the get_compute_capability function:

@functools.cache
def get_compute_capability(device: torch.device) -> Tuple[int, int]:
    return torch.cuda.get_device_capability(device)
    if device.type != "cuda":
        raise ValueError("device must be a cuda device")
    return torch.cuda.get_device_capability(device.index)
  1. flashinfer/cute_dsl/blockscaled_gemm.py

Replace cutlass_torch.current_stream() with:

cuda.CUstream(torch.cuda.current_stream().stream_base.raw_stream)

Running Inference Service

flashinfer-cutlass backend:

python -m fastdeploy.entrypoints.openai.api_server \
    --model nv-community/Qwen3-30B-A3B-FP4 \
    --port 8180 \
    --metrics-port 8181 \
    --engine-worker-queue-port 8182 \
    --cache-queue-port 8183 \
    --tensor-parallel-size 1 \
    --max-model-len  32768 \
    --max-num-seqs 128

flashinfer-cutedsl backend:

python -m fastdeploy.entrypoints.openai.multi_api_server \
       --ports "9811,9812,9813,9814" \
       --num-servers 4 \
       --model ERNIE-4.5-21B-A3B-FP4 \
       --disable-custom-all-reduce \
       --tensor-parallel-size 1 \
       --data-parallel-size 4 \
       --no-enable-prefix-caching \
       --max-model-len 65536 \
       --enable-expert-parallel \
       --num-gpu-blocks-override 8192 \
       --max-num-seqs 4 \
       --gpu-memory-utilization 0.9 \
       --max-num-batched-tokens 512 \
       --ep-prefill-use-worst-num-tokens \
       --graph-optimization-config '{"use_cudagraph":false}'

API Access

Make service requests using the following command

curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "user", "content": "把李白的静夜思改写为现代诗"}
  ]
}'
curl -X POST "http://0.0.0.0:9811/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "user", "content": "把李白的静夜思改写为现代诗"}
  ]
}'

FastDeploy service interface is compatible with OpenAI protocol. You can make service requests using the following Python code.

import openai
host = "0.0.0.0"
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "system", "content": "I'm a helpful AI assistant."},
        {"role": "user", "content": "把李白的静夜思改写为现代诗"},
    ],
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta:
        print(chunk.choices[0].delta.content, end='')
print('\n')