lizhenyun01
|
446b26bbc0
|
[Feature] support blackwell gemm in ht (#7053)
* [Feature] support blackwell gemm in ht
* [Feature] support ops for convert
* fix cuda error 716
* fix cuda error
* opt memory
* remove unused code
|
2026-04-07 19:52:51 +08:00 |
|
Bingoo
|
410988d9ec
|
[OP] support deepgemm for sm103 (#7073)
* support deepgemm for sm103
* add assert
* modify code style
* add assert
* modify sm version condition
* remove assert
|
2026-04-01 21:01:09 +08:00 |
|
SUN Dong
|
6cff780fdb
|
[RL] Support moe_topk_select using Paddle native operators; add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization and a swiglu-fp8-quant op for DeepGemmFusedMoE, for training alignment (#6850)
* [RL] Add fused stack-transpose-quant for BlockWiseFP8 MoE weight quantization
* update
* update
* update
* support custom topk in DeepGemmFusedMoeMethod apply_tp
* apply_ep_prefill support moe_topk_select
* update
* add ut
* add ut
* add ut
* modify doc
* fix env and docs
* add ut
---------
Co-authored-by: zhanghonggeng <zhanghonggeng@baidu.com>
|
2026-03-24 11:12:39 +08:00 |
|
AIbin
|
cb6819d086
|
[Optimization][OP]support per_token_group_fp8_quant cuda kernel (#6865)
* support per_token_group_fp8_quant cuda kernel
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* update code
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
|
2026-03-17 19:17:51 +08:00 |
|
AIbin
|
c3aceb6bdc
|
[Models][OP][Optimization] Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM (#6689)
* Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM
|
2026-03-10 15:05:14 +08:00 |
|
Weiguo Zhu
|
8fb24122b8
|
fix reshard error (#6536)
|
2026-02-27 22:22:37 +08:00 |
|
JYChen
|
c6d8fbe526
|
[BugFix] fix log with paddlefleet.ops (#6528)
|
2026-02-27 14:34:29 +08:00 |
|
AIbin
|
0eb87467f8
|
[BugFix]fix RL bug about blockwisefp8 (#6466)
* fix RL bug about blockwisefp8
* fix moe same bug
* fix RL FP8 bug
|
2026-02-12 09:15:29 +08:00 |
|
JYChen
|
40c952e7b5
|
fix deepgemm import (#6451)
Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
|
2026-02-11 20:10:01 +08:00 |
|
JYChen
|
9bcd863902
|
[Others] support import deepgemm/deepep from fleet ops (#6351)
* update paddleformers to v1.0
* only change import fleetpath
|
2026-02-09 11:53:13 +08:00 |
|
JYChen
|
c745a22420
|
[Feature] Support Ernie FP8 on sm100 ( the fixed version) (#6304)
|
2026-02-03 17:47:38 +08:00 |
|
JYChen
|
6c685c9474
|
Revert "[Feature] Support Ernie FP8 on sm100 (#5593)" (#6275)
This reverts commit eb80724b71.
|
2026-01-30 11:22:01 +08:00 |
|
JYChen
|
eb80724b71
|
[Feature] Support Ernie FP8 on sm100 (#5593)
* Deepgemm provisionally working version
* dense part e8m0 OK
* version where the EB model runs end-to-end with E8M0
* code check
* support 21b-tp2, dev_paddle
* single-node 4.5T EP working version
* restore deleted code; single-node 4.5T EP (non-cudagraph)
* eb tp
* Support SM100 block-wise FP8 inference
* refine codes, support deepgemm on sm100
* add thirdparty PFCC/DeepGEMM
* fix ep decode
* use deepep ue8m0 to resolve precision issues
* fix FP8 TP precision
* upgrade Deepgemm to adapt to the Hopper logic
* add ue8m0 kernel
* add ue8m0 kernel
* fix custom_ops/gpu_ops/cpp_extensions.cc
* eb output is normal
* eb5 text is right
* precision looks consistent on visual inspection
* self-tested precision aligned
* replace masked_per_token_quant; EP precision OK
* performance improved by about 30%
* EP temporarily runs end-to-end but still has issues
* self-test consistent
* rm test fun
* fix ep event
* update Deepgemm for graph-optimization ops
* fix build
* temporarily work around the deepgemm CI build issue
* select the deepgemm version by SM architecture
* remove useless code
---------
Co-authored-by: ckl117 <ckl117@163.com>
Co-authored-by: K11OntheBoat <ruianmaidanglao@163.com>
Co-authored-by: fxyfxy777 <fxyfxy777@163.com>
|
2026-01-29 13:49:54 +08:00 |
|