[BugFix] fix flashinfer-cutedsl moe nvfp4 (#7120)

* fix nvfp4

* fix

* add document

* fix nvfp4

* support eb5

* support bka

* support eb5

* support xpu

* fix

* fix

* add import cutedsl

* fix

* fix

* fix test

* fix H-series (Hopper) GPUs

* update document

* fix

* update document

* update document

* fix
This commit is contained in:
lizexu123
2026-04-03 15:43:19 +08:00
committed by GitHub
parent 095a11d932
commit 5f612a348d
8 changed files with 317 additions and 90 deletions
@@ -18,7 +18,38 @@ Based on [FlashInfer](https://github.com/flashinfer-ai/flashinfer), Fastdeploy s
Please ensure that FastDeploy is installed with NVIDIA GPU support.
Follow the official guide to set up the base environment: [Fastdeploy NVIDIA GPU Environment Installation Guide](https://paddlepaddle.github.io/FastDeploy/get_started/installation/nvidia_gpu/).
### Running Inference Service
### FlashInfer-cutedsl backend
#### PaddlePaddle Compatibility Patches for FlashInfer
Due to compatibility issues between FlashInfer and PaddlePaddle, you need to apply the following patches in `miniconda/envs/<your_env>/lib/python3.10/site-packages/`:
1. **nvidia_cutlass_dsl/python_packages/cutlass/torch.py**
Replace `torch.device` with `"torch.device"` (quoting the annotation as a string defers its evaluation, avoiding the conflict).
2. **flashinfer/utils.py**
Modify the `get_compute_capability` function:
```python
@functools.cache
def get_compute_capability(device: torch.device) -> Tuple[int, int]:
    # Added early return: report the capability directly and skip the
    # original cuda-only check below, which becomes unreachable.
    return torch.cuda.get_device_capability(device)
    if device.type != "cuda":
        raise ValueError("device must be a cuda device")
    return torch.cuda.get_device_capability(device.index)
```
3. **flashinfer/cute_dsl/blockscaled_gemm.py**
Replace `cutlass_torch.current_stream()` with:
```python
cuda.CUstream(torch.cuda.current_stream().stream_base.raw_stream)
```
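Patch 1 works because a quoted annotation is stored as a plain string and never evaluated at function-definition time, so a `torch` shim that lacks a real `device` attribute no longer raises. A minimal, self-contained illustration (the `FakeTorch` stand-in is hypothetical, not from FlashInfer or PaddlePaddle):

```python
class FakeTorch:
    """Hypothetical stand-in for a torch shim without a device attribute."""
    pass

torch = FakeTorch()

# An unquoted annotation (d: torch.device) would evaluate torch.device at
# definition time and raise AttributeError; the quoted form is stored as a
# string and left unevaluated.
def move(d: "torch.device"):
    return d

print(move("cuda:0"))  # → cuda:0 (annotation never evaluated)
```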
#### Running Inference Service
flashinfer-cutlass backend:
```bash
python -m fastdeploy.entrypoints.openai.api_server \
--model nv-community/Qwen3-30B-A3B-FP4 \
@@ -31,6 +62,26 @@ python -m fastdeploy.entrypoints.openai.api_server \
--max-num-seqs 128
```
flashinfer-cutedsl backend:
```bash
python -m fastdeploy.entrypoints.openai.multi_api_server \
--ports "9811,9812,9813,9814" \
--num-servers 4 \
--model ERNIE-4.5-21B-A3B-FP4 \
--disable-custom-all-reduce \
--tensor-parallel-size 1 \
--data-parallel-size 4 \
--no-enable-prefix-caching \
--max-model-len 65536 \
--enable-expert-parallel \
--num-gpu-blocks-override 8192 \
--max-num-seqs 4 \
--gpu-memory-utilization 0.9 \
--max-num-batched-tokens 512 \
--ep-prefill-use-worst-num-tokens \
--graph-optimization-config '{"use_cudagraph":false}'
```
### API Access
Make service requests using the following command:
@@ -43,6 +94,15 @@ curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
]
}'
```
```shell
curl -X POST "http://0.0.0.0:9811/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Rewrite the poem Quiet Night Thoughts by Li Bai as a modern poem"}
]
}'
```
The FastDeploy service interface is compatible with the OpenAI protocol. You can make service requests using the following Python code.
@@ -64,4 +124,4 @@ for chunk in response:
if chunk.choices[0].delta:
print(chunk.choices[0].delta.content, end='')
print('\n')
```.
```
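The diff tail above belongs to a streaming loop over chat-completion chunks. Its delta-handling logic can be sketched locally with hypothetical stand-ins for the OpenAI SDK chunk objects (no running server required; `SimpleNamespace` merely mimics the SDK's attribute shape):

```python
from types import SimpleNamespace

def collect_stream(chunks):
    """Concatenate delta content from streamed chat-completion chunks,
    skipping chunks whose delta is empty (mirrors the loop in the doc)."""
    out = []
    for chunk in chunks:
        delta = chunk.choices[0].delta
        if delta and delta.content:
            out.append(delta.content)
    return "".join(out)

# Hypothetical chunks mimicking the SDK's object shape.
chunks = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="Hello"))]),
    SimpleNamespace(choices=[SimpleNamespace(delta=None)]),
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=" world"))]),
]
print(collect_stream(chunks))  # → Hello world
```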