[Docs] Add Doc for Online Quantization (#6399)

* add doc for dynamic quant

* check
This commit is contained in:
chen
2026-02-09 14:09:18 +08:00
committed by GitHub
parent 8bb83b2239
commit 29a270bb38
2 changed files with 2 additions and 2 deletions
@@ -24,7 +24,7 @@ When using FastDeploy to deploy models (including offline inference and service
| ```use_warmup``` | `int` | Whether to perform warmup at startup, will automatically generate maximum length data for warmup, default: 0 (disabled) |
| ```limit_mm_per_prompt``` | `dict[str]` | Limit the amount of multimodal data per prompt, e.g.: {"image": 10, "video": 3}, default: 1 for all |
| ```enable_mm``` | `bool` | __[DEPRECATED]__ Whether to support multimodal data (for multimodal models only), model architecture automatically detects multimodal models, no manual setting needed |
| ```quantization``` | `str` | Model quantization strategy, when loading BF16 CKPT, specifying wint4 or wint8 supports lossless online 4bit/8bit quantization |
| ```quantization``` | `str` | Model quantization strategy. When loading a BF16 checkpoint (CKPT), specifying `wint4`, `wint8`, `block_wise_fp8`, or `wfp8afp8` enables lossless online 4-bit/8-bit weight quantization; the KVCache is not quantized by default. If the parameter parses as a dictionary (dict), `mix_quant` (mixed quantization) can be specified, where `dense_quant_type`, `moe_quant_type`, and `kv_cache_quant_type` set the quantization types for DenseGEMM, MoE, and KVCache respectively; modules whose type is omitted are left unquantized, e.g. `'{"quantization":"mix_quant","dense_quant_type":"wint8","moe_quant_type":"wint4","kv_cache_quant_type":"block_wise_fp8"}'`. Note: online quantization of the KVCache to `block_wise_fp8` is supported only by the AppendAttn backend. |
| ```gpu_memory_utilization``` | `float` | GPU memory utilization, default: 0.9 |
| ```num_gpu_blocks_override``` | `int` | Preallocated KVCache blocks, this parameter can be automatically calculated by FastDeploy based on memory situation, no need for user configuration, default: None |
| ```max_num_batched_tokens``` | `int` | Maximum batch token count in Prefill phase, default: None (same as max_model_len) |
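The dict form of `quantization` described above can be sketched with a small example. The JSON string and its keys (`dense_quant_type`, `moe_quant_type`, `kv_cache_quant_type`) come from the table; the `resolve_quant_types` helper is hypothetical, added only to illustrate the "omitted key means no quantization" rule:

```python
import json

# Example mix_quant configuration, taken verbatim from the table above.
mix_quant_arg = (
    '{"quantization":"mix_quant","dense_quant_type":"wint8",'
    '"moe_quant_type":"wint4","kv_cache_quant_type":"block_wise_fp8"}'
)

cfg = json.loads(mix_quant_arg)
assert cfg["quantization"] == "mix_quant"

def resolve_quant_types(cfg: dict) -> dict:
    """Hypothetical helper: map each module to its quantization type.

    A key absent from the dict yields None, i.e. the corresponding
    module is left unquantized, matching the behavior in the table.
    """
    return {
        "DenseGEMM": cfg.get("dense_quant_type"),   # e.g. "wint8"
        "MoE": cfg.get("moe_quant_type"),           # e.g. "wint4"
        "KVCache": cfg.get("kv_cache_quant_type"),  # e.g. "block_wise_fp8"
    }

print(resolve_quant_types(cfg))
# → {'DenseGEMM': 'wint8', 'MoE': 'wint4', 'KVCache': 'block_wise_fp8'}
```

Passing only `{"quantization":"mix_quant","dense_quant_type":"wint8"}` would quantize DenseGEMM while leaving MoE and the KVCache untouched.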
@@ -22,7 +22,7 @@
| ```use_warmup``` | `int` | Whether to perform warmup at startup; maximum-length data is generated automatically for warmup, default: 0 (disabled) |
| ```limit_mm_per_prompt``` | `dict[str]` | Limit the amount of multimodal data per prompt, e.g.: {"image": 10, "video": 3}, default: 1 for each |
| ```enable_mm``` | `bool` | __[DEPRECATED]__ Whether to support multimodal data (multimodal models only); the model architecture detects multimodal models automatically, no manual setting needed |
| ```quantization``` | `str` | Model quantization strategy; when loading a BF16 CKPT, specifying wint4 or wint8 enables lossless online 4-bit/8-bit quantization |
| ```quantization``` | `str` | Model quantization strategy. When loading a BF16 CKPT, specifying `wint4`, `wint8`, `block_wise_fp8`, or `wfp8afp8` enables lossless online 4-bit/8-bit weight quantization; the KVCache is not quantized by default. If the parameter parses as a dict, `mix_quant` (mixed quantization) can be specified, where `dense_quant_type`, `moe_quant_type`, and `kv_cache_quant_type` set the quantization types for DenseGEMM, MoE, and KVCache respectively; modules whose type is omitted are left unquantized, e.g. `'{"quantization":"mix_quant","dense_quant_type":"wint8","moe_quant_type":"wint4","kv_cache_quant_type":"block_wise_fp8"}'`. Note: online quantization of the KVCache to `block_wise_fp8` is supported only by the AppendAttn backend. |
| ```gpu_memory_utilization``` | `float` | GPU memory utilization, default: 0.9 |
| ```num_gpu_blocks_override``` | `int` | Number of preallocated KVCache blocks; FastDeploy can compute this automatically based on available memory, no user configuration needed, default: None |
| ```max_num_batched_tokens``` | `int` | Maximum batch token count in the Prefill phase, default: None (same as max_model_len) |