[MetaxGPU] adapt to the latest fastdeploy on metax gpu (#3492)

2026-04-23 00:17:25 +08:00 · 2025-08-25 17:44:20 +08:00
parent c13c904971
commit 2ae7ab28d2
8 changed files with 338 additions and 115 deletions
@@ -0,0 +1,82 @@
+# 使用 Metax GPU C550 运行ERNIE 4.5 系列模型
+
+FastDeploy在Metax C550上对ERNIE 4.5系列模型进行了深度适配和优化，实现了推理入口和GPU的统一，无需修改即可完成推理任务的迁移。
+
+环境准备：
+- Python >= 3.10
+- Linux X86_64
+
+| Chip Type | Driver Version | KMD Version |
+| :---: | :---: | :---: |
+| MetaX C550 | 3.0.0.1  | 2.14.6 |
+
+## 1. 容器镜像获取
+
+```shell
+docker login --username=cr_temp_user --password=eyJpbnN0YW5jZUlkIjoiY3JpLXpxYTIzejI2YTU5M3R3M2QiLCJ0aW1lIjoiMTc1NTUxODEwODAwMCIsInR5cGUiOiJzdWIiLCJ1c2VySWQiOiIyMDcwOTQwMTA1NjYzNDE3OTIifQ:8226ca50ce5476c42062e24d3c465545de1c1780 cr.metax-tech.com && docker pull cr.metax-tech.com/public-library/maca-native:3.0.0.4-ubuntu20.04-amd64
+```
+
+## 2. 预安装
+
+```shell
+1）pip install paddlepaddle==3.0.0.dev20250729 -i https://www.paddlepaddle.org.cn/packages/nightly/cpu/
+2）pip install paddle-metax-gpu==3.0.0.dev20250807 -i https://www.paddlepaddle.org.cn/packages/nightly/maca/
+```
+
+## 3. FastDeploy代码下载并编译
+
+```shell
+git clone https://github.com/PaddlePaddle/FastDeploy
+cd FastDeploy
+bash build.sh
+```
+The built packages will be in the ```FastDeploy/dist``` directory.
+
+## 4. 环境验证
+
+After installation, verify the environment with this Python code:
+```python
+import paddle
+from paddle.jit.marker import unified
+# Verify GPU availability
+paddle.utils.run_check()
+# Verify FastDeploy custom operators compilation
+from fastdeploy.model_executor.ops.gpu import beam_search_softmax
+```
+If the above code executes successfully, the environment is ready.
+
+## 5. 示例
+from fastdeploy import LLM, SamplingParams
+
+prompts = [
+    "Hello. My name is",
+]
+
+sampling_params = SamplingParams(top_p=0.95, max_tokens=32, temperature=0.6)
+
+llm = LLM(model="/root/model/ERNIE-4.5-21B-A3B-Paddle", tensor_parallel_size=1, max_model_len=256, engine_worker_queue_port=9135, quantization='wint8', static_decode_blocks=0, gpu_memory_utilization=0.9)
+
+outputs = llm.generate(prompts, sampling_params)
+
+print(f"Generated {len(outputs)} outputs")
+print("=" * 50 + "\n")
+
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs.text
+    print(prompt)
+    print(generated_text)
+    print("-" * 50)
+
+输出：
+INFO     2025-08-18 10:54:18,455 416822 engine.py[line:202] Waiting worker processes ready...
+Loading Weights: 100%|█████████████████████████████████████████████████████████████████████████| 100/100 [03:33<00:00,  2.14s/it]
+Loading Layers: 100%|██████████████████████████████████████████████████████████████████████████| 100/100 [00:18<00:00,  5.54it/s]
+INFO     2025-08-18 10:58:16,149 416822 engine.py[line:247] Worker processes are launched with 240.08204197883606 seconds.
+Processed prompts: 100%|███████████████████████| 1/1 [00:21<00:00, 21.84s/it, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
+Generated 1 outputs
+==================================================
+
+Hello. My name is
+Alice and I'm here to help you. What can I do for you today?
+Hello Alice! I'm trying to organize a small party