mirror of
https://github.com/PaddlePaddle/FastDeploy.git
synced 2026-04-23 00:17:25 +08:00
NVFP4 Quantization
NVFP4 is a 4-bit floating-point format introduced by NVIDIA. For details, see NVIDIA's article "Introducing NVFP4 for Efficient and Accurate Low-Precision Inference".
Built on FlashInfer, FastDeploy supports inference for NVFP4-quantized models in the format produced by ModelOpt.
- Note: This feature currently supports only FP4-quantized models from the ERNIE and Qwen series.
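To give a feel for what a 4-bit floating-point format with block scaling means, here is a minimal pure-Python sketch. It assumes the E2M1 value grid and the 16-element scaling block described in NVIDIA's public material on NVFP4, and it simplifies the scale handling (the scale stays an ordinary float instead of being stored in FP8, and no per-tensor scale is applied). The function names are hypothetical; this is not how FastDeploy, FlashInfer, or ModelOpt implement quantization.

```python
# Magnitudes representable by E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumption: one shared scale per 16 consecutive values


def quantize_block(values):
    """Return (scale, codes) where each code is a signed grid value and x ~= code * scale."""
    amax = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto the top of the grid (6.0).
    # Simplification: the scale is kept as a Python float here.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in values:
        # Pick the grid magnitude closest to the scaled input.
        magnitude = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(-magnitude if v < 0 else magnitude)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from the shared scale and the 4-bit codes."""
    return [c * scale for c in codes]


block = [0.1 * i - 0.8 for i in range(BLOCK_SIZE)]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.4f} max_error={max_error:.4f}")
```

Because the grid is coarser near its top (the gap between 4.0 and 6.0 is 2.0), the largest values in a block carry the largest rounding error, which is why a small block size with its own scale helps accuracy.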
How to Use
Environment Setup
Supported Environment
- Supported Hardware: NVIDIA GPUs with sm >= 100 (compute capability 10.0 or higher)
- PaddlePaddle Version: 3.3.0 or higher
- FastDeploy Version: 2.5.0 or higher
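The requirements above can be checked before launching anything. The sketch below is a hypothetical helper, not part of FastDeploy: it takes the SM number and the two version strings as plain arguments and reports which requirement is not met. How you obtain those values is left to you; one assumed route is `paddle.device.cuda.get_device_capability()` for the GPU and `paddle.__version__` for PaddlePaddle.

```python
def parse_version(text):
    """Turn '3.3.0', '3.3' or '2.5.0.post1' into a 3-part integer tuple such as (3, 3, 0)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'post1'
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)  # pad so that '3.3' compares equal to '3.3.0'
    return tuple(parts[:3])


def check_environment(sm, paddle_version, fastdeploy_version):
    """Return a list of problems; an empty list means every requirement is met."""
    problems = []
    if sm < 100:
        problems.append(f"GPU sm {sm} is below the required 100")
    if parse_version(paddle_version) < (3, 3, 0):
        problems.append(f"PaddlePaddle {paddle_version} is older than 3.3.0")
    if parse_version(fastdeploy_version) < (2, 5, 0):
        problems.append(f"FastDeploy {fastdeploy_version} is older than 2.5.0")
    return problems


# Example values; in practice sm would be major * 10 + minor of the device capability.
for problem in check_environment(sm=90, paddle_version="3.2.2", fastdeploy_version="2.5.0"):
    print(problem)
```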
FastDeploy Installation
Make sure FastDeploy is installed with NVIDIA GPU support. Follow the official guide to set up the base environment: FastDeploy NVIDIA GPU Environment Installation Guide.
Running Inference Service
python -m fastdeploy.entrypoints.openai.api_server \
    --model nv-community/Qwen3-30B-A3B-FP4 \
    --port 8180 \
    --metrics-port 8181 \
    --engine-worker-queue-port 8182 \
    --cache-queue-port 8183 \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --max-num-seqs 128
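If you start the service from a script rather than a shell, the same command can be assembled as an argument list. The sketch below uses only the flags shown in the command above; the helper name and the choice to derive the three auxiliary ports from one base port are assumptions made for illustration, mirroring the consecutive ports in the example.

```python
import sys


def build_launch_command(model, base_port=8180, tensor_parallel_size=1,
                         max_model_len=32768, max_num_seqs=128):
    """Return the api_server launch command as a list suitable for subprocess."""
    return [
        sys.executable, "-m", "fastdeploy.entrypoints.openai.api_server",
        "--model", model,
        "--port", str(base_port),
        # Assumption: auxiliary ports follow the base port, as in the example command.
        "--metrics-port", str(base_port + 1),
        "--engine-worker-queue-port", str(base_port + 2),
        "--cache-queue-port", str(base_port + 3),
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
    ]


command = build_launch_command("nv-community/Qwen3-30B-A3B-FP4")
print(" ".join(command[1:]))
# To actually start the server: subprocess.Popen(command)
```

Passing a list to `subprocess.Popen` avoids shell quoting issues and makes it easy to run several instances on different base ports.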
API Access
Send a request to the service with the following command:
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": "Rewrite the poem Quiet Night Thought by Li Bai as a modern poem"}
        ]
    }'
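The same request can be sent without any third-party package. The sketch below builds an equivalent POST with Python's standard library; the helper names are hypothetical, and the shape of the reply mentioned in the final comment is an assumption based on the OpenAI chat completions format.

```python
import json
import urllib.request


def build_chat_request(prompt, host="0.0.0.0", port=8180):
    """Build a POST request equivalent to the curl command above."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the decoded JSON body (requires a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_chat_request("Rewrite the poem Quiet Night Thought by Li Bai as a modern poem")
print(request.get_method(), request.full_url)
# With the service running:
#   reply = send_chat_request(request)
#   print(reply["choices"][0]["message"]["content"])  # assumed OpenAI-style reply
```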
The FastDeploy service interface is compatible with the OpenAI protocol, so you can also send requests with the following Python code:
import openai

host = "0.0.0.0"
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "system", "content": "I'm a helpful AI assistant."},
        {"role": "user", "content": "Rewrite the poem Quiet Night Thought by Li Bai as a modern poem"},
    ],
    stream=True,
)
# Print each streamed text fragment as it arrives; skip chunks that carry no text.
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
print("\n")
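When the full reply is needed as one string rather than printed piece by piece, the streamed chunks can be joined. The sketch below is a hypothetical helper that works on any iterable of chunk-like objects exposing `choices[0].delta.content`, the attribute path used in the streaming loop of the client example; the stand-in chunks exist only so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text deltas of a streamed chat completion into one string."""
    pieces = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks may carry no choices at all
        content = chunk.choices[0].delta.content
        if content:  # content can be None, e.g. on role-only or final chunks
            pieces.append(content)
    return "".join(pieces)


def fake_chunk(content):
    """Build a stand-in object with the same attribute path as a streamed chunk."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_chunks = [fake_chunk("Moon"), fake_chunk(None), SimpleNamespace(choices=[]), fake_chunk("light")]
text = collect_stream(demo_chunks)
print(text)
```

With a live service, the `response` object returned by `client.chat.completions.create(..., stream=True)` can be passed to `collect_stream` in place of `demo_chunks`.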