fxyfxy777
4c92035f2d
[Feature] Unify fp8 block_wise quant ops ( #5991 )
...
* quant stash
* blockwise_quant
* precommit
* rm tensor.cut
* tp ok
* add swiglu
* rm outdated code
* fix activate ut
* change baseline
* fix baseline error
2026-01-15 05:50:37 -08:00
lizexu123
6619298b50
【Optim】Optimize grid dimensions using max_tokens_per_expert for MoE models ( #6007 )
...
* update w4afp8
* build.sh ok
* support cuda_graph
* fix
* add test
* fix max_tokens_per_expert
* >=70
* fix
* compute_max_tokens_from_prefix_sum in w4afp8
* compute_max_tokens use cub
2026-01-15 19:18:42 +08:00
Cheng Yanfei
fbcccaa750
[Intel HPU] enable MoE EP for hpu ( #5855 )
...
* enable HPU MoE EP
* MoE intermediate_scale stack
* enable loader_v1 esp for tensor_wise_fp8 TP or EP
* modify activation_scale name
2026-01-15 13:08:00 +08:00
RAM
b3f59fd9b5
[RL][CI] Support Async R3 And Add Accuracy Test ( #5937 )
...
* add bs1 r3 test case
* async put
* r3 test case 1.0
* success run eb5
* refine test case
* pre-commit
* add eb45 & glm testcase
* format code
* add p2pstore requirements
* support only last turn
* R3 use worker log
* refine code &fix ci bug
* refine error mesg
* fix empty input bug
* Success set acc ci of eb45 and glm45
* refine code
* fix bug
2026-01-14 04:25:06 -08:00
xiaoxiaohehe001
00a01ae024
[Feature] Support redundant expert for eplb ( #5918 )
...
* [BugFix] support redundant expert for eplb
* support redundant expert for eplb
* support redundant expert for eplb
* update
* fix ci eplb
2026-01-09 17:13:24 +08:00
Ryan
3e74bacc5e
add m_grouped_gemm_fp8_fp8_bf16_nt_contiguous_custom_python_op ( #5847 )
2026-01-07 16:17:55 +08:00
lizexu123
1d3ae7c024
[BugFix] fix w4afp8 tp=8 ( #5868 )
...
* fix w4afp8 tp=8
* fix
2026-01-05 18:59:02 +08:00
ming1753
f50e1bcc16
[Others] enable use PFCC deep_ep ( #5822 )
...
* upstream deep_ep
* fix bug
* fix bug
* modify env name
2026-01-05 02:07:01 -08:00
周周周
dc13344ab8
[Optimization] add del to decrease peak memory in MoE prefill ( #5863 )
2026-01-05 14:01:48 +08:00
lizexu123
44a13e4557
[Feature] support w4afp8 v1_loader and v0_loader(tp>1) ( #5757 )
...
* support
* fix
* support w4afp8 v1_loader and v0_loader
* fix
* fix test
* fix test
* fix test
* fix moe.py
* add test_ernie_4_5_w4afp8
* add test
* delete tensor
* fix test
* fix
* add
* fix test
2025-12-30 14:11:52 +08:00
Ryan
eb782a0225
[BugFix] Fix return value inconsistency for ep_moe_expert_combine op ( #5812 )
2025-12-29 16:44:00 +08:00
Nyakku Shigure
11227e00bb
[GraphOptimization] Wrap deep gemm and triton as python op ( #5673 )
...
* [GraphOptimization] Wrap deep gemm and triton as python op
* add unitest to _base_test && compatibility
* paddle.static.MetaTensor -> "paddle.static.MetaTensor"
* mv register_custom_python_op
* rename yaml
---------
Co-authored-by: DrRyanHuang <zihaohuang@aliyun.com >
2025-12-24 15:23:46 +08:00
bukejiyu
d1c6e57341
[Others] upgrade paddleformer to 0.4.0 ( #5599 )
2025-12-23 05:08:01 -08:00
Sunny-bot1
04035e4ebf
support w4afp8 two stage ( #5608 )
2025-12-22 15:13:05 +08:00
Sunny-bot1
40f3897a4e
support w4afp8 moe offline permute & load ( #5613 )
2025-12-22 15:12:57 +08:00
Longzhi Wang
d8587e987e
[Model] tp+ep support v1_loader ( #5465 )
...
* [Model] tp+ep support v1_loader
* fix
* fix mtp_linear
* fix mtp_linear
* fix
* fix
* fix v0 loader
* fix
* Add get_tensor for ep
* fix linear weight_loader
* fix typo
* fix
2025-12-18 14:31:54 +08:00
zhupengyang
8735cb5045
[XPU] refactor moe ffn ( #5501 )
...
- remove BKCL_DISPATCH_ALL_GATHER
- support sparse mode
- support moe quant_method
2025-12-18 14:14:05 +08:00
fmiao2372
404cf0ece4
[Intel HPU] enable tensor_wise_fp8 ( #5324 )
...
* [Intel HPU] enable tensor_wise_fp8
* update code based on comments
* fix code style issue
* fix bug about RP 5138
* mv kv_cache modifications to HPU backend
* fix FP8 Precision Issues
* fix FP8 Precision Issues
* Add quantization UT
---------
Co-authored-by: yanfeich <yanfei.cheng@intel.com >
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2025-12-17 16:45:03 +08:00
RAM
6fc5eccf83
[RL] R3 Support RDMA Store ( #5467 )
...
* [RL] R3 support rdma store
* refine notes
* refine code
* disable prefix cache
* support preempted task and put cpu tensor
2025-12-16 16:50:13 +08:00
bukejiyu
4066dfb4a6
RL fix ( #5503 )
2025-12-11 19:25:27 +08:00
周周周
ff353b922f
[Others] update tbo related code ( #5485 )
2025-12-11 12:34:46 +08:00
Sunny-bot1
364197c4b5
support w4afp8 mtp ( #5429 )
2025-12-08 20:24:00 +08:00
RAM
b2908b8e82
[New][RL] Support Rollout Routing Replay ( #5405 )
...
* [RL] Support Rollout Routing Replay
* add routing indices cache
* fix config bug and moe forward bug
* R3 Support GLM
* support eb4.5
* fix merge bug
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* add routing replay ci
* support glm topk
* support other top_k
* fix ci bug
* pre-commit
* only support chatcmpl
* Revert "Revert "[RL] Support Rollout Routing Replay (#5321 )" (#5402 )"
This reverts commit c45e064f3d .
* Fix XPU and NPU bug
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
Co-authored-by: Yuanle Liu <yuanlehome@163.com >
2025-12-05 22:06:26 +08:00
Jiang-Jia-Jun
c45e064f3d
Revert "[RL] Support Rollout Routing Replay ( #5321 )" ( #5402 )
...
This reverts commit 96d2d4877b .
2025-12-05 20:19:39 +08:00
RAM
96d2d4877b
[RL] Support Rollout Routing Replay ( #5321 )
...
* [RL] Support Rollout Routing Replay
* add routing indices cache
* fix config bug and moe forward bug
* R3 Support GLM
* support eb4.5
* fix merge bug
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* add routing replay ci
* support glm topk
* support other top_k
* fix ci bug
* pre-commit
* only support chatcmpl
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
Co-authored-by: Yuanle Liu <yuanlehome@163.com >
2025-12-05 20:01:33 +08:00
周周周
c83dc58105
[Feature] support Two batch overlap, mainly used in Prefill ( #5078 )
2025-12-05 14:58:50 +08:00
Longzhi Wang
5cd17fd662
[Models] Add forward_meta to moe models' forward function ( #5138 )
...
* [Models] Add forward_meta to moe models' forward function
* fix missing param
* fix
* fix
* fix forward_meta
* fix test and remove chunked MoE related in config
* fix test
* fix
* fix
2025-12-04 13:26:58 +08:00
fmiao2372
209006e6a6
[Intel HPU] fix memory fragmentation issue due to warmup process and fix moe all_reduce issue ( #5357 )
2025-12-04 11:29:41 +08:00
lzy
690bcb8e50
[Optimization] 1.fix tp+ep moe_forward; 2.set max_prefill_batch=env.MAX_PREFILL_NUM ( #5315 )
2025-12-03 13:33:15 +08:00
Sunny-bot1
3629db4129
[Quantization] Support w4afp8 MoE dynamic quantization ( #5282 )
...
* support dynamic activation quant for w4afp8
* support dynamic w4afp8
* add test
* fix
* fix
---------
Co-authored-by: zhoutianzi666 <17801055074@163.com >
2025-12-02 18:56:16 +08:00
K11OntheBoat
2e1680838f
[PD Disaggregation] Support PD deployment of DeepSeekv3. ( #5251 )
...
* Support deepseekv3 cache transfer for PD deploy
* clean some log info
---------
Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com ”>
2025-12-02 14:11:50 +08:00
chen
aa35ce449d
[Optimization] EP empty_input_forward Remove Communication ( #5254 )
2025-12-01 21:10:40 +08:00
Longzhi Wang
add524d80c
[Feature] support chunked moe ( #4575 )
...
* [Feature] support chunked moe
* update
* update
* fix and add test
* update
* fix conflict and modify test
* fix fused_moe
* fix fused_moe
* fix docstring
* fix
* fix typo
* fix test
* fix
* fix
* fix test
* fix test
2025-12-01 15:17:18 +08:00
fmiao2372
2c7683d551
[Intel HPU] change MoE weights and scales from list to tensor and add… ( #5289 )
...
* [Intel HPU] change MoE weights and scales from list to tensor and add q/k rms norm
* update doc
* move HPU_CHUNK_SIZE into envs
2025-11-28 19:17:05 +08:00
Yuanle Liu
cb56d46694
[Optimization] Refine row parallel bias and nranks and moe all_reduce ( #5247 )
...
* rename nranks to tp_size and fix bias in v1 loader
* fix
* update
2025-11-26 05:09:09 -08:00
chen
209970836e
[BugFix] BF16 MoE Cutlass Backend Support EP ( #5242 )
2025-11-26 19:16:22 +08:00
xiaoxiaohehe001
e150a418d4
support moe offline quant ( #5142 )
2025-11-24 18:59:18 +08:00
xiaoxiaohehe001
95f3c8c641
[Fix] Fix eplb bug and support fp8 load weight ( #5178 )
...
* fix eplb part2
* fix eplb part2
* fix eplb part2
2025-11-24 15:31:37 +08:00
xiaoxiaohehe001
6471dade4a
[Fix] Fix noaux ep test ( #5161 )
...
* support noaux eplb
* noaux_eplb
* noaux_eplb
* noaux_eplb
* noaux_eplb
2025-11-21 16:36:41 +08:00
xiaoxiaohehe001
6ca2651995
[Feature] Support noaux for eplb ( #5143 )
...
* support noaux eplb
* noaux_eplb
* noaux_eplb
* noaux_eplb
2025-11-21 14:10:32 +08:00
Ryan
0857099191
mv import ( #5146 )
2025-11-20 19:25:56 +08:00
Sunny-bot1
bde97e09f7
support dynamic activation quant for w4afp8 ( #5117 )
2025-11-19 21:11:16 +08:00
Sunny-bot1
43f0c7557e
[Feature] Add an unquantized option for MoE and Dense quant type ( #4813 )
2025-11-19 16:24:03 +08:00
bukejiyu
a82f25ea7b
[RL]Resolve shape mismatch problems in RL-related modules ( #5032 )
...
* RL fix
* update
2025-11-19 11:12:48 +08:00
MingkunZhang
a36c958c66
[Metax] support default_v1 loader based #4988 ( #5001 )
2025-11-18 09:44:30 +08:00
yangjianfengo1
3afb717995
【Fix】fix deepep dispatch ( #5036 )
...
* fix dispatch
* fix dispatch
---------
Co-authored-by: yuanxiaolan <yuanxiaolan01@baidu.com >
2025-11-17 10:34:01 +08:00
yzwu
3b80a799ab
[Iluvatar][CI] Fix moe_expert_dispatch cannot support dequant_scale ( #5012 )
...
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com >
2025-11-17 10:18:42 +08:00
yangjianfengo1
ae7bee8122
【New Feature】W4afp8 supports per group quantization ( #4987 )
...
* w4afp8 supports per group
* code style
* fix transpose
* revert fast hadamard
---------
Co-authored-by: yuanxiaolan <yuanxiaolan01@baidu.com >
Co-authored-by: plusNew001 <95567040+plusNew001@users.noreply.github.com >
2025-11-13 19:17:27 +08:00
ming1753
3148dbca06
[BugFix] fix VL fp8 bug when moe token_num is 0 ( #4928 )
...
* [BugFix] fix VL fp8 bug when moe token_num is 0
* fix bug
* format
* fix bug
2025-11-12 21:19:36 +08:00
yzwu
76e60e98f8
[Iluvatar][CI] fix safetensors_rust.SafetensorError: framework paddle is invalid ( #4972 )
2025-11-12 14:13:40 +08:00