* cherry-pick Support Fully Async and PrefixCache step 1
* copy routing_indices_cache.py from 2.4
* cherry-pick [RL] R3 Fix the bug for determining the end of a request (#6388)
* cherry-pick [RL] Clear Requests status of R3 (#6569)
* delete code
* fix rename bug
* fix status shape bug
* fix ci
* support num worst tokens
* support num worst tokens
* fix build error
* support num worst tokens: fix errors
* support num worst tokens: fix field
* support num worst tokens: delete requirements
* replace permute and depermute op by pure cuda
* replace permute and depermute op by pure cuda
* fix ci
* fix op
* fix nan
* fix code style
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
* support to load static quant ue8m0 scale of deepgemm via v0_loader
* [Fix] Fix ue8m0 scale pack dimension calculation and block size validation
1. Fix pack dimension calculation in fused_moe_triton_backend.py:
- Changed from `ceil_div(...) // 4` to `(num_scales + 3) // 4` for correct ceiling division
- This ensures sufficient pack allocation when num_scales is not a multiple of 4
2. Fix block size hardcoding in block_wise_fp8.py:
- Use `self.quant_config.weight_block_size` instead of hardcoded `[128, 128]`
- Add assertion to ensure weight_block_size is `[128, 128]` for ue8m0
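
The arithmetic behind item 1 can be sketched as follows. This is a minimal illustration, not the code in fused_moe_triton_backend.py: it assumes four scale values are packed into each storage element, and the helper names `packed_len_floor` and `packed_len_ceil`, as well as the example sizes, are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return (a + b - 1) // b


def packed_len_floor(num_scales: int) -> int:
    # Floor division by 4: silently drops a partial last pack.
    return num_scales // 4


def packed_len_ceil(num_scales: int) -> int:
    # Ceiling division by 4: always leaves room for a partial last pack.
    return (num_scales + 3) // 4


# Hypothetical example: a dimension of 640 with block size 128 gives 5 scales.
num_scales = ceil_div(640, 128)

floor_packs = packed_len_floor(num_scales)  # under-allocates when 5 % 4 != 0
ceil_packs = packed_len_ceil(num_scales)    # enough packs for all 5 scales
```

When `num_scales` is a multiple of 4 the two forms agree, which is why the problem only shows up for shapes that do not divide evenly.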
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* fix: handle 4 return values from noaux_tc_redundant op
The noaux_tc_redundant CUDA op is defined with 4 outputs in PD_BUILD_STATIC_OP:
- output_tensor (scores)
- topk_values
- topk_indices
- tokens_per_expert_stats_list_out (inplace updated)
The Python code was only unpacking 3 values, causing:
ValueError: too many values to unpack (expected 3)
This fix unpacks all 4 return values and discards the fourth, the in-place
updated tensor, which is the same as the input tokens_per_expert_stats_list.
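
The failure mode and the fix can be reproduced in plain Python. This is only a sketch: `fake_noaux_tc_redundant` is a hypothetical stand-in that returns four placeholder strings, not the real CUDA op or its signature.

```python
def fake_noaux_tc_redundant():
    # Hypothetical stand-in for the op: four outputs, mirroring the
    # output list described in the commit message above.
    return "scores", "topk_values", "topk_indices", "tokens_per_expert_stats"


# Old call site: three targets for four return values raises ValueError.
error_message = None
try:
    scores, topk_values, topk_indices = fake_noaux_tc_redundant()
except ValueError as exc:
    error_message = str(exc)

# Fixed call site: the fourth value is unpacked and discarded with `_`,
# since the caller already holds the tensor that was updated in place.
scores, topk_values, topk_indices, _ = fake_noaux_tc_redundant()
```

Discarding with `_` keeps the call site aligned with the op's declared outputs without introducing a second name for a tensor the caller already owns.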
Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com>
* fix: make noaux_tc_redundant return 4 values to match OP definition
The PD_BUILD_STATIC_OP defines 4 outputs but the function only returned 3,
causing inconsistent behavior across different Paddle framework versions.
This fix explicitly returns 4 values:
- scores (inplace modified)
- topk_values
- topk_indices
- tokens_per_expert_stats_list (inplace modified via atomicAdd)
Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com>
---------
Co-authored-by: Claude (Claude Opus 4.5) <noreply@anthropic.com>
* fp4 dense
* [WIP] support nvfp4, dense part
* [wip] developing loading qwen model
* loading
* update
* dense fp4 OK, cudagraph error
* [WIP] moe forward part
* with flashinfer-backend
* qwen3_moe_fp4
* update
* support flashinfer-cutlass moe, qwen3-moe-fp4 OK
* support ernie4.5-fp4
* fix load error
* add some ut
* add docs
* fix CLA, test
* fix the apply() in ModelOptNvFp4FusedMoE
* fix CodeStyle
* del the PADDLE_COMPATIBLE_API
* fix broken url: nvidia_gpu.md
* fix docs
* fix token_ids
* fix CI in Hopper
* move flashinfer imports inside the function
* fix model_runner
Removed the logic for generating random padding IDs.
* Remove skip condition for CUDA version in nvfp4 test
* add test for nvfp4
* fix according to review
* Add Chinese translation link to NVFP4 documentation
* del flashinfer.py
* fix unittest
---------
Co-authored-by: zoooo0820 <zoooo0820@qq.com>
Co-authored-by: bukejiyu <395822456@qq.com>