FastDeploy

mirror of https://github.com/PaddlePaddle/FastDeploy.git synced 2026-04-23 17:11:21 +08:00

RichardWooSJTU 61789febb9 [Quantization] Support to load static quant ue8m0 scale of DeepGEMM via v0_loader (#6433)

* support to load static quant ue8m0 scale of deepgemm via v0_loader

* [Fix] Fix ue8m0 scale pack dimension calculation and block size validation

1. Fix pack dimension calculation in fused_moe_triton_backend.py:
   - Changed from `ceil_div(...) // 4` to `(num_scales + 3) // 4` for correct ceiling division
   - This ensures sufficient pack allocation when num_scales is not a multiple of 4

2. Fix block size hardcoding in block_wise_fp8.py:
   - Use `self.quant_config.weight_block_size` instead of hardcoded `[128, 128]`
   - Add assertion to ensure weight_block_size is `[128, 128]` for ue8m0
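
The two fixes above can be illustrated with a minimal Python sketch. It assumes that four 8-bit ue8m0 scales are packed into one 32-bit word, which is where the divisor 4 comes from. The helper names (`packs_floor`, `packs_ceil`, `check_ue8m0_block_size`) and the example sizes are hypothetical and are not taken from fused_moe_triton_backend.py or block_wise_fp8.py.

```python
SCALES_PER_PACK = 4  # assumption: four 8-bit ue8m0 scales per 32-bit word


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return (a + b - 1) // b


def packs_floor(num_scales: int) -> int:
    """Floor division: under-allocates when num_scales is not a multiple of 4."""
    return num_scales // SCALES_PER_PACK


def packs_ceil(num_scales: int) -> int:
    """Ceiling division, written as in the commit message: (num_scales + 3) // 4."""
    return (num_scales + 3) // SCALES_PER_PACK


def check_ue8m0_block_size(weight_block_size) -> None:
    """Reject any block size other than [128, 128], mirroring the added assertion."""
    assert list(weight_block_size) == [128, 128], (
        f"ue8m0 scales expect weight_block_size [128, 128], got {weight_block_size}"
    )


if __name__ == "__main__":
    # Hypothetical dimension of 640 with block size 128 gives 5 scales.
    num_scales = ceil_div(640, 128)
    print(packs_floor(num_scales))  # 1 pack: room for only 4 of the 5 scales
    print(packs_ceil(num_scales))   # 2 packs: enough for all 5 scales
    check_ue8m0_block_size([128, 128])  # passes silently
```

When `num_scales` is already a multiple of 4 both helpers agree, so the ceiling form only changes the allocation in the cases that were previously too small.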

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-03 11:32:35 +08:00

graph_optimization

[BugFix] fix bug when seq_lens_this_time is 2D (#6613)

2026-03-02 23:52:03 +08:00

guided_decoding

[Feature] Guided Decoding add LLguidance backend (#5124)

2025-12-03 20:23:57 +08:00

layers

[Quantization] Support to load static quant ue8m0 scale of DeepGEMM via v0_loader (#6433)

2026-03-03 11:32:35 +08:00

logits_processor

[Feature] Support ThinkingBudget Logits processor to control thinking content length (#6367)