RichardWooSJTU
9f0778f991
[Feature] Support EP prefill with num_worst_tokens (#6574)
...
* support num worst tokens
* support num worst tokens
* fix build error
* support num worst tokens: fix errors
* support num worst tokens: fix field
* support num worst tokens: delete requirements
* replace permute and depermute op by pure cuda
* replace permute and depermute op by pure cuda
* fix ci
* fix op
* fix nan
* fix code style
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-03-11 17:09:07 +08:00
fxyfxy777
36547cfdb3
[Feature] FD_USE_PHI_FP8_QUANT (#6320)
...
* add ut
* add use_fd_quant env
* rm mask_per_token_quant
* add make ops list
* USE_FD_FP8_QUANT -> FD_USE_PHI_FP8_QUANT, default is true
* modify comments
* use bool type
* Add function declaration
2026-02-03 22:33:03 -08:00
fxyfxy777
2ada119a38
[Optimize] optimize mask_quant & swiglu (#6222)
...
* optimize mask_quant op (1.5x speedup)
* fix calculate sequence
* add fused
* rm log
* push kernel code
* add ut
* accuracy ok
* add ue8m0
* add ut
* add merge develop
* rm ut of mask_per_token_quant
2026-02-02 13:52:38 +08:00
JYChen
6c685c9474
Revert "[Feature] Support Ernie FP8 on sm100 (#5593)" (#6275)
...
This reverts commit eb80724b71.
2026-01-30 11:22:01 +08:00
JYChen
eb80724b71
[Feature] Support Ernie FP8 on sm100 (#5593)
...
* temporarily usable Deepgemm version
* dense part: e8m0 ok
* version in which the EB model runs end to end with E8M0
* code check
* support 21b-tp2, dev_paddle
* version with single-node 4.5T ep working
* restore deleted code; single-node 4.5T ep (non-cudagraph)
* eb tp
* Support SM100 block-wise FP8 inference
* refine codes, support deepgemm on sm100
* add thirdparty PFCC/DeepGEMM
* fix ep decode
* use deepep ue8m0 to fix the accuracy issue
* fix FP8 TP accuracy
* adapt Hopper logic for the Deepgemm upgrade
* add ue8m0 kernel
* add ue8m0 kernel
* fix custom_ops/gpu_ops/cpp_extensions.cc
* eb output is normal
* eb5 text is right
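* accuracy looks consistent on visual inspection
* accuracy aligned in self-testing
* replace masked_per_token_quant; ep accuracy OK
* performance improved by about 30%
* ep runs for now, but still has issues
* self-test results consistent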
* rm test fun
* fix ep event
* update graph-optimization ops for Deepgemm
* fix build
* temporarily work around the deepgemm CI build issue
* select the deepgemm version based on SM
* remove useless code
---------
Co-authored-by: ckl117 <ckl117@163.com>
Co-authored-by: K11OntheBoat <“ruianmaidanglao@163.com”>
Co-authored-by: fxyfxy777 <fxyfxy777@163.com>
2026-01-29 13:49:54 +08:00
fxyfxy777
4c92035f2d
[Feature] Unify fp8 block_wise quant ops (#5991)
...
* quant stash
* blockwise_quant
* precommit
* rm tensor.cut
* tp ok
* add swiglu
* rm outdated code
* fix activate ut
* change baseline
* fix baseline error
2026-01-15 05:50:37 -08:00
Ryan
724045c426
add some op infershape&dtype (#5762)
2025-12-26 16:17:39 +08:00
Yuanle Liu
cdc0004894
Revert "[Feature] add ue8m0 for per_token_quant_fp8 (#5563)" (#5611)
...
This reverts commit 73e1d6aa90.
2025-12-17 13:59:06 +08:00
fxyfxy777
73e1d6aa90
[Feature] add ue8m0 for per_token_quant_fp8 (#5563)
...
* ue8m0
* add default arg
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2025-12-16 18:40:12 +08:00
周周周
95243f012c
[Others] add PADDLE_ENFORCE (#5288)
2025-11-28 14:23:35 +08:00
Ryan
e25c067f70
[OP] Add InferShape&InferDtype for per_token_quant_padding (#4667)
...
* add InferShape&InferDtype for per_token_quant_padding
* fix codestyle
2025-10-30 10:28:26 +08:00
周周周
76513f6416
Support 45t fp8 8 GPU (#3659)
2025-08-28 10:52:53 +08:00
RichardWooSJTU
e39159f3bd
Add switch to apply fine-grained per token quant fp8 (#3192)
...
Co-authored-by: yuanxiaolan <yuanxiaolan01@baidu.com>
2025-08-04 19:54:03 -07:00
Jiang-Jia-Jun
05c670e593
[Sync] Update to latest code (#2679)
...
* [Sync] Update to latest code
* Add new code files
* Add new code files
* update code
* Try to fix build.sh
* Try to fix build.sh
* Update code
* Update requirements.txt
* Update code
---------
Co-authored-by: Jiang-Jia-Jun <jiangjiajun@baidu.com>
2025-07-03 15:43:53 +08:00
MARD1NO
ac5f860536
use shfl_xor_sync to reduce redundant shfl broadcast
2025-06-30 13:12:21 +08:00
jiangjiajun
684703fd72
[LLM] First commit the llm deployment code
2025-06-09 19:20:15 +08:00