apps/FastDeploy

Fork 0

mirror of https://github.com/PaddlePaddle/FastDeploy.git synced 2026-04-23 00:17:25 +08:00

Files

T

kxz2002 24b85b752b

CE Compile Job / ce_job_pre_check (push) Has been cancelled

Details

CE Compile Job / print_ce_job_pre_check_outputs (push) Has been cancelled

Details

CE Compile Job / FD-Clone-Linux (push) Has been cancelled

Details

CE Compile Job / Show Code Archive Output (push) Has been cancelled

Details

CE Compile Job / BUILD_SM8090 (push) Has been cancelled

Details

CE Compile Job / BUILD_SM8689 (push) Has been cancelled

Details

CE Compile Job / CE_UPLOAD (push) Has been cancelled

Details

[Cherry-Pick] Unify the registration name recognition for tool_parser and reasoning_parser to “-” (#4668 ) (#4737 )

* [Feature] add a new reasoning parser (#4571)

* add new reasoning_parser initial commit

* add parser file content

* add register

* ernie_test_reasoning_parser

* support <tool_call> token and add tool_parser

* add and fix unit tests

* modify reasoning_parser

* modify reasoning parser and tool parser

* modify unit tests

* modify reasoning_parser and tool_parser

* modify unit tests

* fix tool_parser

* modify the logic of reasoning_parser and tool_parser

* add and modify unit tests

* standardize code style

* simplify reasoning_parser and tool_parser

* modify unit test

* [BugFix] Fix finish reason in _create_chat_completion_choice  (#4582)

* fix n_param _create_chat_completion_choicel

* fix unit test

* fix final_res

* modify unit tests

* [BugFix] fix offline llm chat "enable_thinking" is always "False" (#4686)

* fix enable_thinking

* recover ernie4_5_vl_processor

* [Feature] Unify the registration name recognition for tool_parser and reasoning_parser to “-” (#4668)

* parser register name unify

* change ernie_x1 to ernie-x1

* change ernie4_5_vl to ernie-45-vl

* fix unit test

2025-10-31 23:27:21 +08:00

4.9 KiB

Raw Blame History

English

ERNIE-4.5-21B-A3B-Thinking

一、环境准备

1.1 支持情况

ERNIE-4.5-21B-A3B 各量化精度，在下列硬件上部署所需要的最小卡数如下：

	WINT8	WINT4	FP8
H800 80GB	1	1	1
A800 80GB	1	1	/
H20 96GB	1	1	1
L20 48GB	1	1	1
A30 40GB	2	1	/

注：

在启动命令后指定--tensor-parallel-size 2 即可修改部署卡数
表格中未列出的硬件，可根据显存大小进行预估是否可以部署
ERNIE-4.5-21B-A3B-Thinking 需要FastDeploy 2.2及以上版本支持

1.2 安装fastdeploy

安装，请参考Fastdeploy Installation完成安装。
模型下载，请参考支持模型列表。

二、如何使用

2.1 基础：启动服务

通过下列命令启动服务

python -m fastdeploy.entrypoints.openai.api_server \
       --model baidu/ERNIE-4.5-21B-A3B-Thinking \
       --load-choices "default_v1" \
       --tensor-parallel-size 1 \
       --max-model-len 131072 \
       --quantization wint8 \
       --reasoning-parser ernie-x1 \
       --tool-call-parser ernie-x1 \
       --max-num-seqs 32

其中：

--quantization: 表示模型采用的量化策略。不同量化策略，模型的性能和精度也会不同。可选值包括：wint8 / wint4 / block_wise_fp8(需要Hopper架构)。
--max-model-len：表示当前部署的服务所支持的最长Token数量。设置得越大，模型可支持的上下文长度也越大，但相应占用的显存也越多，可能影响并发数。
--load-choices: 表示loader的版本，"default_v1"表示启用v1版本的loader，具有更快的加载速度和更少的内存使用。
--reasoning-parser 、 --tool-call-parser: 表示对应调用的思考内容和工具调用解析器

更多的参数含义与默认设置，请参见FastDeploy参数说明。

2.2 进阶：如何获取更优性能

2.2.1 评估应用场景，正确设置参数

结合应用场景，评估平均输入长度、平均输出长度、最大上下文长度。例如，平均输入长度为2000，输出长度为80000，那么建议设置为 32768

根据最大上下文长度，设置max-model-len

2.2.2 Prefix Caching

原理： Prefix Caching的核心思想是通过缓存输入序列的中间计算结果（KV Cache），避免重复计算，从而加速具有相同前缀的多个请求的响应速度。具体参考prefix-cache

启用方式： 自2.2版本开始（包括develop分支），Prefix Caching已经默认开启。

对于2.1及更早的版本，需要手动开启。其中--enable-prefix-caching表示启用前缀缓存，--swap-space表示在GPU缓存的基础上，额外开启CPU缓存，大小为GB，应根据机器实际情况调整。建议取值为(机器总内存 - 模型大小) * 20%。如果因为其他程序占用内存等原因导致服务启动失败，可以尝试减小--swap-space的值。

--enable-prefix-caching
--swap-space 50

2.2.3 Chunked Prefill

原理： 采用分块策略，将预填充（Prefill）阶段请求拆解为小规模子任务，与解码（Decode）请求混合批处理执行。可以更好地平衡计算密集型（Prefill）和访存密集型（Decode）操作，优化GPU资源利用率，减少单次Prefill的计算量和显存占用，从而降低显存峰值，避免显存不足的问题。具体请参考Chunked Prefill

启用方式： 自2.2版本开始（包括develop分支），Chunked Prefill已经默认开启。

对于2.1及更早的版本，需要手动开启。

--enable-chunked-prefill

2.2.4 CUDAGraph

原理： CUDAGraph 是 NVIDIA 提供的一项 GPU 计算加速技术，通过将 CUDA 操作序列捕获（capture）为图结构（graph），实现 GPU 任务的高效执行和优化。CUDAGraph 的核心思想是将一系列 GPU 计算和内存操作封装为一个可重复执行的图，从而减少 CPU-GPU 通信开销、降低内核启动延迟，并提升整体计算性能。

启用方式： 在启动命令中增加

--use-cudagraph

注：

通常情况下不需要额外设置其他参数，但CUDAGraph会产生一些额外的显存开销，在一些显存受限的场景下可能需要调整。详细的参数调整请参考GraphOptimizationBackend 相关配置参数说明

2.2.5 拒绝采样

原理： 拒绝采样即从一个易于采样的提议分布（proposal distribution）中生成样本，避免显式排序从而达到提升采样速度的效果，对小尺寸的模型有较明显的提升。

启用方式： 启动前增加下列环境变量

export FD_SAMPLING_CLASS=rejection

三、常见问题FAQ

如果您在使用过程中遇到问题，可以在FAQ中查阅。

4.9 KiB Raw Blame History

ERNIE-4.5-21B-A3B-Thinking

一、环境准备

1.1 支持情况

1.2 安装fastdeploy

二、如何使用

2.1 基础：启动服务

2.2 进阶：如何获取更优性能

2.2.1 评估应用场景，正确设置参数

2.2.2 Prefix Caching

2.2.3 Chunked Prefill

2.2.4 CUDAGraph

2.2.5 拒绝采样

三、常见问题FAQ

4.9 KiB

Raw Blame History