
简体中文

# NVFP4 Quantization

NVFP4 is an innovative 4-bit floating-point format introduced by NVIDIA. For detailed information, please refer to *Introducing NVFP4 for Efficient and Accurate Low-Precision Inference*.
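In NVFP4, each block of 16 elements shares one scale factor: elements are stored as 4-bit E2M1 values, the per-block scale as FP8 (E4M3), and a global FP32 scale rescales the whole tensor. The following is a minimal NumPy sketch of that block-wise scheme for intuition only (scales are kept in full precision here for simplicity, rather than in FP8 as the real format stores them):

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 value (sign handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_quant_dequant(block: np.ndarray) -> np.ndarray:
    """Quantize one 16-element block to E2M1 with a shared scale, then dequantize."""
    assert block.size == 16
    scale = np.abs(block).max() / E2M1_GRID.max()  # stored as FP8 E4M3 in the real format
    if scale == 0.0:
        return np.zeros_like(block)
    scaled = block / scale
    # Snap each element to the nearest representable E2M1 value of matching sign.
    candidates = np.sign(scaled)[:, None] * E2M1_GRID[None, :]
    nearest = candidates[np.arange(16), np.abs(scaled[:, None] - candidates).argmin(axis=1)]
    return nearest * scale

x = np.random.randn(16).astype(np.float32)
print("max quantization error:", np.abs(x - nvfp4_quant_dequant(x)).max())
```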

Built on FlashInfer, FastDeploy supports inference of NVFP4-quantized models in the checkpoint format produced by NVIDIA ModelOpt.

- **Note**: Currently, this feature only supports FP4-quantized models from the ERNIE and Qwen series.

## How to Use

### Environment Setup

#### Supported Environment

- Supported Hardware: NVIDIA GPU with compute capability SM >= 100 (Blackwell)
- PaddlePaddle Version: 3.3.0 or higher
- FastDeploy Version: 2.5.0 or higher
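You can confirm the environment before launching with a quick check along these lines (a minimal sketch; it assumes `paddle` and `fastdeploy` are importable, and that `fastdeploy` exposes a `__version__` attribute):

```python
import paddle
import fastdeploy

# Compute capability as (major, minor); SM >= 100 means major >= 10.
major, minor = paddle.device.cuda.get_device_capability()
print(f"SM{major}{minor}:", "supported" if major >= 10 else "unsupported for NVFP4")

print("PaddlePaddle:", paddle.__version__)    # requires >= 3.3.0
print("FastDeploy:", fastdeploy.__version__)  # requires >= 2.5.0 (assumed attribute)
```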

#### FastDeploy Installation

Please ensure that FastDeploy is installed with NVIDIA GPU support. Follow the official guide to set up the base environment: FastDeploy NVIDIA GPU Environment Installation Guide.

### Running the Inference Service

Start the OpenAI-compatible service with the following command:

```shell
python -m fastdeploy.entrypoints.openai.api_server \
    --model nv-community/Qwen3-30B-A3B-FP4 \
    --port 8180 \
    --metrics-port 8181 \
    --engine-worker-queue-port 8182 \
    --cache-queue-port 8183 \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --max-num-seqs 128
```
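Model loading can take a while. A small sketch like the one below (a plain TCP probe with a hypothetical `wait_for_port` helper; no FastDeploy-specific endpoint is assumed) waits until the service port accepts connections before you start sending requests:

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 600.0) -> None:
    """Block until (host, port) accepts TCP connections, or raise on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return  # service is accepting connections
        except OSError:
            time.sleep(2)  # server still starting up; retry

    raise TimeoutError(f"{host}:{port} did not come up within {timeout}s")

wait_for_port("127.0.0.1", 8180)  # matches --port above
```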

### API Access

Send a request to the service with the following command:

```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "user", "content": "Rewrite the classic poem Quiet Night Thoughts by Li Bai as a modern poem"}
  ]
}'
```

The FastDeploy service API is compatible with the OpenAI protocol, so you can also make requests with the following Python code:

```python
import openai

# Connect to the locally deployed FastDeploy service.
host = "0.0.0.0"
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

# Request a streamed chat completion; the model name is not used by the server here.
response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "system", "content": "I'm a helpful AI assistant."},
        {"role": "user", "content": "Rewrite the classic poem Quiet Night Thoughts by Li Bai as a modern poem"},
    ],
    stream=True,
)
# Print the generated text chunk by chunk as it arrives.
for chunk in response:
    if chunk.choices[0].delta:
        print(chunk.choices[0].delta.content, end='')
print('\n')
```
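To get a single complete response instead of a stream, pass `stream=False`; the generated text is then available as `response.choices[0].message.content`.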