Commit Graph

275 Commits

Author SHA1 Message Date
fxyfxy777 36547cfdb3 [Feature] FD_USE_PHI_FP8_QUANT (#6320)
* add ut

* add use_fd_quant env

* rm mask_per_token_quant

* add make ops list

* USE_FD_FP8_QUANT -> FD_USE_PHI_FP8_QUANT, defaults to true

* modify comments

* use bool type

* Add function declaration
2026-02-03 22:33:03 -08:00
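As context for the commit above: an environment toggle like FD_USE_PHI_FP8_QUANT is typically parsed once into a bool, with the commit's stated default of true. A minimal sketch, assuming a hypothetical env_flag helper (illustrative only, not the repository's actual code):

```python
import os

def env_flag(name: str, default: bool = True) -> bool:
    """Parse a boolean switch from the environment (hypothetical helper)."""
    val = os.getenv(name)
    if val is None:
        return default
    return val.strip().lower() in ("1", "true", "yes", "on")

# Per the commit message the flag defaults to true, so the PHI FP8
# quant path is used unless it is explicitly switched off.
use_phi_fp8_quant = env_flag("FD_USE_PHI_FP8_QUANT", default=True)
```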
周周周 6225439778 add PADDLE_ENFORCE (#6321) 2026-02-04 10:47:19 +08:00
JYChen c745a22420 [Feature] Support Ernie FP8 on sm100 ( the fixed version) (#6304) 2026-02-03 17:47:38 +08:00
周周周 8277b95fa6 remove speculate_get_padding_offset op (#6308) 2026-02-03 15:18:12 +08:00
fxyfxy777 2ada119a38 [Optimize] optimize mask_quant & swiglu (#6222)
* optimize mask_quant op, ~1.5x speedup

* fix calculation sequence

* add fused

* rm log

* push kernel code

* add ut

* accuracy ok

* add ue8m0

* add ut

* merge develop

* rm ut of mask_per_token_quant
2026-02-02 13:52:38 +08:00
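For reference, the fused pattern this commit names (SwiGLU followed by masked per-token FP8 quantization) can be sketched in NumPy as below. This is an illustrative restatement of the math, not the repository's CUDA kernel; the 448.0 bound assumes the float8 e4m3 format:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest magnitude representable in float8 e4m3

def swiglu_mask_quant(gate, up, mask):
    """Fused SwiGLU + masked per-token FP8 quant (illustrative sketch).

    gate, up: [num_tokens, hidden]; mask: [num_tokens] bool (True = active token).
    Returns clipped values (kept in fp32 here) and one scale per token.
    """
    silu = gate / (1.0 + np.exp(-gate))           # SiLU(x) = x * sigmoid(x)
    y = silu * up                                 # SwiGLU activation
    amax = np.abs(y).max(axis=-1, keepdims=True)  # per-token absolute max
    scale = np.maximum(amax / FP8_E4M3_MAX, 1e-12)
    q = np.clip(y / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    q = np.where(mask[:, None], q, 0.0)           # skip masked-out tokens
    return q, scale
```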
xiaozude 030647521a [Metax] adapt to the latest develop (#6282) 2026-01-29 23:21:20 -08:00
JYChen 6c685c9474 Revert "[Feature] Support Ernie FP8 on sm100 (#5593)" (#6275)
This reverts commit eb80724b71.
2026-01-30 11:22:01 +08:00
JYChen eb80724b71 [Feature] Support Ernie FP8 on sm100 (#5593)
* Deepgemm: provisionally working version

* dense part works with e8m0

* version where the EB model runs end-to-end with E8M0

* code check

* support 21b-tp2, dev_paddle

* version with single-node 4.5T ep working

* restore the deleted code; single-node 4.5T ep (non-cudagraph)

* eb tp

* Support SM100 block-wise FP8 inference

* refine codes, support deepgemm on sm100

* add thirdparty PFCC/DeepGEMM

* fix ep decode

* use deepep ue8m0 to fix the accuracy issue

* fix FP8 TP accuracy

* upgrade Deepgemm to match the Hopper logic

* add ue8m0 kernel

* add ue8m0 kernel

* fix custom_ops/gpu_ops/cpp_extensions.cc

* eb output looks normal

* eb5 text is right

* accuracy looks consistent on inspection

* self-tested accuracy is aligned

* replace masked_per_token_quant; ep accuracy OK

* performance improved by about 30%

* ep runs for now but still has issues

* self-test results are consistent

* rm test fun

* fix ep event

* update Deepgemm in the graph-optimization ops

* fix build

* temporarily work around the deepgemm CI build issue

* select the deepgemm version by SM architecture

* remove useless code

---------

Co-authored-by: ckl117 <ckl117@163.com>
Co-authored-by: K11OntheBoat <ruianmaidanglao@163.com>
Co-authored-by: fxyfxy777 <fxyfxy777@163.com>
2026-01-29 13:49:54 +08:00
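Several bullets above mention UE8M0 scales, which DeepGEMM-style kernels require on newer architectures: the scale is stored as an unsigned 8-bit exponent, i.e. it must be a power of two. A minimal NumPy sketch of rounding ordinary quantization scales up into that format (illustrative, not the repository's kernel):

```python
import numpy as np

def round_scale_to_ue8m0(scale):
    """Round scales up to the next power of two, since a UE8M0 scale
    stores only an 8-bit exponent (illustrative sketch)."""
    safe = np.maximum(scale, np.finfo(np.float32).tiny)
    return np.exp2(np.ceil(np.log2(safe))).astype(np.float32)

# Rounding *up* keeps quantized values inside the FP8 range,
# at the cost of slightly coarser scales.
print(round_scale_to_ue8m0(np.array([0.3, 1.0, 5.0])))  # [0.5 1. 8.]
```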
jc 7da5f54fb3 [CI] Add unit test for swap_layout && remove unit test of splitwise_scheduler (#6250)
* Add unit test for swap_layout

* remove splitwise_scheduler test
2026-01-28 19:20:20 +08:00
GoldPancake 7d6c87c29e [Others] Support constrained decoding when enable_thinking is false (#6248)
* support constrained decoding when enable_thinking is false

* fix

* fix

* fix
2026-01-28 00:05:17 -08:00
sunxin 27f8799f04 [Model Runner] Refactor execute_model for GPU async scheduling (#6176) 2026-01-28 14:19:33 +08:00
freeliuzc ce06c6dfb3 [BugFix] Fix token_penalty kernel (#6069)
* fix token_penalty kernel

* try to fix xpu

* fix xpu

* fix unit test
2026-01-28 12:03:05 +08:00
周周周 aa57864c5b remove unneeded param from flash_mask_attention (#6218) 2026-01-27 14:04:27 +08:00
yangjianfengo1 b3627b59f8 [Bug Fix] fix mask attention (#6216) 2026-01-26 07:46:26 -08:00
sunxin adc69c15d0 [Model Runner] Prepare token count and move FA3 initialization into the graph (#6170)
* prepare for token num and put FA3 init in graph
2026-01-26 12:16:57 +08:00
周周周 0966df78dc [Others] remove stop_nums (#6182) 2026-01-26 12:12:47 +08:00
jc 309c7d9764 router support divided rollout (#6150) 2026-01-22 10:39:39 +08:00
lizexu123 f4902fe42d [BugFix] fix wint2 (#6109)
* fix

* fix

* fix
2026-01-20 21:46:21 +08:00
sunxin a4144e0b8e [Optimization] Avoid unnecessary penalty computation (#6078) 2026-01-19 15:24:12 +08:00
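The optimization named in this commit is a common sampler fast path: when every request in the batch uses neutral penalty parameters, the penalty kernel need not run at all. A hedged NumPy sketch of the idea (not the repository's implementation):

```python
import numpy as np

def apply_repetition_penalty(logits, seen_mask, penalty):
    """logits: [batch, vocab]; seen_mask: [batch, vocab] bool; penalty: [batch].

    Skips all work when every penalty is the neutral value 1.0
    (illustrative sketch of the fast path, not the repo's kernel).
    """
    if np.all(penalty == 1.0):
        return logits  # fast path: nothing to compute
    p = penalty[:, None]
    penalized = np.where(logits > 0, logits / p, logits * p)
    return np.where(seen_mask, penalized, logits)
```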
GoldPancake bda38aa519 [Speculative Decoding] Support MTP for GLM-4.5-Air (#6047)
* glm mtp
* add spec neox partial rope
2026-01-16 14:35:24 +08:00
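The "spec neox partial rope" bullet refers to NeoX-style rotary embeddings applied to only the first rotary_dim channels of each head, with the remaining channels passed through unchanged. A NumPy sketch under those assumptions (not the repository's kernel):

```python
import numpy as np

def neox_partial_rope(x, positions, rotary_dim, base=10000.0):
    """x: [seq, head_dim]; positions: [seq]. Rotates x[:, :rotary_dim]
    NeoX-style (first/second half pairing) and passes the rest through."""
    rot, rest = x[:, :rotary_dim], x[:, rotary_dim:]
    half = rotary_dim // 2
    inv_freq = base ** (-np.arange(half) / half)     # theta_i = base^(-2i/rotary_dim)
    angles = positions[:, None] * inv_freq[None, :]  # [seq, half]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = rot[:, :half], rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, rest], axis=-1)
```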
fxyfxy777 4c92035f2d [Feature] Unify fp8 block_wise quant ops (#5991)
* quant stash

* blockwise_quant

* precommit

* rm tensor.cut

* tp ok

* add swiglu

* rm outdate code

* fix activate ut

* change baseline

* fix baseline error
2026-01-15 05:50:37 -08:00
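For orientation, "fp8 block_wise quant" here means quantizing with one scale per contiguous group of channels (128 is the group size DeepGEMM-style kernels commonly use) rather than per tensor or per token. A NumPy sketch under that assumption (not the unified op itself):

```python
import numpy as np

FP8_E4M3_MAX = 448.0
BLOCK = 128  # assumed group size along the hidden dimension

def blockwise_fp8_quant(x):
    """x: [tokens, hidden] with hidden divisible by BLOCK.
    Returns clipped values (fp32 here) and scales shaped [tokens, hidden // BLOCK]."""
    t, h = x.shape
    g = x.reshape(t, h // BLOCK, BLOCK)
    scale = np.maximum(np.abs(g).max(axis=-1, keepdims=True) / FP8_E4M3_MAX, 1e-12)
    q = np.clip(g / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q.reshape(t, h), scale.squeeze(-1)
```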
freeliuzc 49617d9832 [Feature] Support tag phase token enforce generation (#6034)
* support tag phase token enforce generation

* optimize notes and some features

* fix sampler unit test

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-15 03:59:55 -08:00
lizexu123 6619298b50 [Optim] Optimize grid dimensions using max_tokens_per_expert for MoE models (#6007)
* update w4afp8

* build.sh ok

* support cuda_graph

* fix

* add test

* fix max_tokens_per_expert

* >=70

* fix

* compute_max_tokens_from_prefix_sum in w4afp8

* compute_max_tokens uses cub
2026-01-15 19:18:42 +08:00
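The last two bullets describe sizing the launch grid from the actual busiest expert rather than a worst-case bound. Given the inclusive prefix sum of per-expert token counts that MoE dispatch already produces, the maximum is one pass of adjacent differences; an illustrative sketch (the commit's version runs on GPU with CUB):

```python
import numpy as np

def max_tokens_from_prefix_sum(prefix_sum):
    """Recover the largest per-expert token count from an inclusive
    prefix sum (illustrative sketch of the CPU-side math)."""
    counts = np.diff(prefix_sum, prepend=0)
    return int(counts.max())

# counts [3, 0, 7, 2] -> prefix [3, 3, 10, 12] -> busiest expert has 7 tokens
assert max_tokens_from_prefix_sum(np.array([3, 3, 10, 12])) == 7
```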
Daci e10b51b8c6 [Feature] get_output_kv_signal blocking read mode & send_first_token (#5836)
* get_output_kv_signal blocking read mode

* send first token before recycle

* xpu get_output_kv_signal blocking read mode

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-01-15 14:11:03 +08:00
chenjian 74d0f1c01f [Optim] Robust sync status when preemption happens (#5796)
* [Bug fix] Sync status when caching output

* fix

* fix

* fix bug

* fix

* fix

* support xpu

* fix

* fix

* fix

* fix

* fix

* fix ci

* fix ci

* fix xpu

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-01-14 12:07:33 +08:00
周周周 ad8d05a8de [Optimization] Do not compute the ATTN padding part in CUDA graph mode (#5985) 2026-01-13 11:32:27 +08:00
lzy 223b2f5d86 Support setting communication groups in custom_allreduce and the all-to-all/transpose fused operator during the decoding phase (#5917) 2026-01-12 14:09:39 +08:00
sunxin 17ef3920f3 remove decoder_num_blocks_device memset (#5982) 2026-01-10 21:22:06 +08:00
周周周 b8d9daa785 MLA clean code (#5979) 2026-01-10 21:05:00 +08:00
xiaoxiaohehe001 00a01ae024 [Feature] Support redundant expert for eplb (#5918)
* [BugFix] support redundant expert for eplb

* support redundant expert for eplb

* support redundant expert for eplb

* update

* fix ci eplb
2026-01-09 17:13:24 +08:00
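For context on EPLB (expert-parallel load balancing) with redundant experts: hot logical experts get extra physical replicas, so routing needs a physical-to-logical map. A deliberately simplified sketch (names and layout are hypothetical, not the repository's):

```python
def build_physical_expert_map(num_logical, extra_replicas):
    """extra_replicas: {logical_id: count of redundant copies}.
    Returns a list mapping each physical expert slot to its logical expert."""
    phys_to_logical = list(range(num_logical))        # one base copy per expert
    for logical_id, extra in sorted(extra_replicas.items()):
        phys_to_logical.extend([logical_id] * extra)  # redundant copies of hot experts
    return phys_to_logical

# e.g. 4 logical experts, expert 2 duplicated once -> 5 physical slots
print(build_physical_expert_map(4, {2: 1}))  # [0, 1, 2, 3, 2]
```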
GoldPancake a1fc4e249e [Bugfix] Fix mtp logprob hang when stop_seq is included (#5927)
* fix mtp logprob hang when stop_seq is included
2026-01-08 14:21:24 +08:00
lizhenyun01 2be8656c29 [BugFix] fix mtp split kv attention (#5920)
* [BugFix] fix mtp split kv attention

* clean code

* clean code
2026-01-07 04:07:31 -08:00
kevin a76e8ae40c [Feature] support rdma pd dy-c8 (#5788)
* add rdma pd dy-c8

* update code
2026-01-07 14:55:25 +08:00
周周周 f15df1ec89 Revert cuda check (#5915)
* commit

* commit
2026-01-07 14:40:18 +08:00
yangjianfengo1 59523b27de opt w4afp8 (#5853) 2026-01-07 12:22:35 +08:00
周周周 83ae59431e [BugFix] fix BatchMLAWithPagedKVCacheKernel O_tmp (#5895) 2026-01-06 15:39:06 +08:00
Yuanle Liu 5e729bc2ba [OPs] ep_moe_expert_dispatch.cu dispatch num_experts_per_rank 5 (#5890) 2026-01-06 10:39:35 +08:00
周周周 ab553b3b8b revert cuda_check (#5883) 2026-01-05 20:51:31 +08:00
lizexu123 1d3ae7c024 [BugFix] fix w4afp8 tp=8 (#5868)
* fix w4afp8 tp=8

* fix
2026-01-05 18:59:02 +08:00
chen ac39c0f887 support fa3 qwen-vl rope (#5869) 2026-01-05 15:29:34 +08:00
sunxin adb91dcacc [BugFix] Fix wint4 ep issue caused by empty run (#5870) 2026-01-05 14:24:37 +08:00
周周周 e3957a5ebc [Others] remove template NUM_EXPERTS_PER_RANK in permute_x_fp8_kernel (#5620) 2026-01-04 11:21:15 +08:00
Sunny-bot1 598d292a69 w4afp8 fix quant (#5830) 2025-12-30 21:16:13 +08:00
Yonghua Li a8d3e3ba12 [BugFix] fix shm opened but not closed in set_data_ipc (#5826) 2025-12-29 23:35:07 +08:00
CSWYF3634076 9286403570 [Models] Add Qwen3-VL Model Support (#5763)
* support v1 loader

* remove useless code

* remove useless

* [Model] support Qwen3VL images success

* [Model] support Qwen3VL rope_3d

* [Model] support Qwen3VL remove log

* [Model] support Qwen3VL RL

* [Model] support Qwen3VL tp

* [Model] support Qwen3VL video

* [Model] support Qwen3VL fix ernievl

* [Model] support Qwen3VL fix get_image_boundaries.cc array out of bounds

* [Model] support Qwen3VL fix multi card

* [Model] support Qwen3VL file close

* [Model] support Qwen3VL fix ce

* [Model] support Qwen3VL fix unittest

* [Model] support Qwen3VL add unittest

---------

Co-authored-by: Ayakouji <yuhongh@qq.com>
2025-12-29 17:39:33 +08:00
周周周 a3f0696e35 [BugFix] fix compile error in sm89 (#5809) 2025-12-29 16:55:52 +08:00
Longzhi Wang 11329ee35e [Model] support mode config for expert_dispatch (#5748) 2025-12-29 13:37:20 +08:00
Ryan 09229d8953 change count_tokens_per_expert_func declaration: Tensor -> vector<Tensor> (#5794) 2025-12-26 19:02:28 +08:00
Ryan 724045c426 add some op infershape&dtype (#5762) 2025-12-26 16:17:39 +08:00
周周周 03363cab4c make flash_mask attention pybind (#5783) 2025-12-26 14:31:35 +08:00