Commit Graph

377 Commits

Author SHA1 Message Date
chen 6a4efa011a update attn_mask_q 2 (#7372) 2026-04-13 23:34:21 +08:00
K11OntheBoat 10a5e1c7c3 Check optional params before .get() call in gqa_rope_write_cache (#7311)
Co-authored-by: K11OntheBoat <"ruianmaidanglao@163.com">
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-04-13 19:05:43 +08:00
JYChen ffe2cf10f9 [Cherry-Pick][RL] change glm rope_emb calculation #7316 (#7317)
* change glm rope_emb calculation

* glm without EnforceFmulRN

* fix ci
2026-04-11 18:37:27 +08:00
YuBaoku 9985b192b4 [XPU][CI]Update xtdk version in download_dependencies.sh (#7320) (#7321)
Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
2026-04-11 00:27:32 +08:00
YuBaoku 1e0ab318e0 [BugFix] Fix Async D2H copy bug & flash mask attn cache V out-of-bound bug (#7221) (#7294)
Co-authored-by: ming1753 <61511741+ming1753@users.noreply.github.com>
2026-04-10 13:54:09 +08:00
fxyfxy777 7f55586e63 [OP]Unify MoE op with moe_permute path for bf16 GLM (#7164) (#7282) 2026-04-09 21:37:53 +08:00
YuBaoku 19cac90117 [XPU][CI] lock xvllm version for fix bug (#7264) (#7265)
* Remove duplicate NICs from environment variables

* Update version for xvllm in download_dependencies.sh

Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
2026-04-09 12:46:37 +08:00
xiaoxiaohehe001 2647d80699 [Cherry-Pick][BugFix] Fix flash mask attn 2.5 (#7249)
* [CherryPick] Fix flash_mask_attn

* [CherryPick] Fix flash_mask_attn
2026-04-09 11:05:16 +08:00
Bingoo 1fa58fcb34 support moe for sm103 (#7239) 2026-04-08 20:57:07 +08:00
JYChen 566699303c solve conflict (#7135)
Co-authored-by: wangyifei <mitu626@163.com>
2026-04-02 10:55:15 +08:00
cmcamdy d854e4ee4b [Cherry-Pick][XPU] Fix speculate schedule(#7049) (#7051)
* [BugFix] xpu fix speculate schedule cache kernel

* fix code style

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-03-27 18:30:06 +08:00
chen 2efab46261 add instantiations for decoder rope enforce_fmul_rn=true (#7010) 2026-03-25 20:06:52 +08:00
chen 49c2310854 [RL][Cherry-Pick] RoPE without fmad opt (#6901) (#6902)
* [RL] RoPE without fmad opt (#6901)

* env FD_ENABLE_RL=1 do fmul_rn(a*b) in rope

* pre_commit
2026-03-25 10:42:16 +08:00
GoldPancake cf0df470cf [Cherry-Pick][Speculative Decoding] Support suffix decoding (#6403) (#6967) 2026-03-23 17:33:58 +08:00
Yonghua Li a4f36cc8db [Cherry-Pick] [BugFix] replace ftok with custom_ftok in get_output/save_output ops (#6822) (#6824)
* [BugFix] replace ftok with custom_ftok in get_output/save_output ops

* [Test] add unit test for custom_ftok

* [Chore] create custom_ftok.h

* [Chore] reorganize header file

* [Fix] fix syntax

* [Fix] fix cache messager msg_queue_id+rank_id conflict
2026-03-16 14:22:30 +08:00
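The commit above replaces System V `ftok` with a `custom_ftok` to avoid key conflicts between cache messager queues. The motivation can be sketched in Python: `ftok` only folds the low 8 bits of `proj_id` with inode/device bits, so distinct (msg_queue_id, rank_id) pairs can map to the same IPC key, while a hash over the full identifiers does not. All names, the path, and the hashing scheme below are illustrative assumptions, not the repository's actual implementation:

```python
import hashlib

def custom_ftok(path: str, msg_queue_id: int, rank_id: int) -> int:
    """Derive a System V-style IPC key from the full identifiers.

    Unlike ftok(), which keeps only 8 bits of its proj_id argument,
    this hashes the whole (path, msg_queue_id, rank_id) triple.
    """
    digest = hashlib.sha256(f"{path}:{msg_queue_id}:{rank_id}".encode()).digest()
    # Truncate to a positive 31-bit value, since IPC keys are C ints.
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF

# Distinct (msg_queue_id, rank_id) pairs get distinct keys with
# overwhelming probability, where ftok(path, queue_id + rank_id)
# would collide whenever the sums agree modulo 256.
k1 = custom_ftok("/dev/shm/cache", 1, 0)
k2 = custom_ftok("/dev/shm/cache", 0, 1)
print(k1 != k2)
```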
yinwei f103a143db [XPU][CI]Cherry-Pick PR and Update CI Case (#6619)
* [XPU] Fix PD + MTP (#6495)

* fix pd + mtp

* fix code style

* fix PD + MTP, D get P's first token

* add anno for gpu(speculate_update)

* update draft insertv1

* fix wrapper & kernel

* fix wrapper

* fix code style

* fix tp4 dp1 (#6624)

* update paddle whl package

---------

Co-authored-by: cmcamdy <1027740945@qq.com>
2026-03-11 10:57:30 +08:00
AIbin 01e6ca734a [Cherry-Pick][Feature]Supports SWA based on appendattn #6547 (#6594)
* support SWA V1
2026-03-02 20:15:23 +08:00
Yuanle Liu 0a5ad26f6f [Cherry-Pick][OP][Feature] Unify the limit_thinking_content_length CUDA operator, supporting response length limits and injected sequences (#6511)
* [OP][Feature] Unify the limit_thinking_content_length CUDA operator, supporting response length limits and injected sequences (#6493)

* Initial plan

* Migrate PRs #6311, #6129, #6305 to develop and merge unit tests

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* fix

* update

* fix

* fix ci

* fix ci

* Initial plan

* test: add test_chat_with_response_max_tokens to test_EB_VL_Lite_serving.py

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* test: add disable-thinking case to test_chat_with_response_max_tokens

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* test: add both reasoning_max_tokens and response_max_tokens case

Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* fix ci

* fix ci

* fix ci

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>

* Delete tests/model_executor/test_thinking_budget.py

* fix

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>
2026-02-26 13:29:38 +08:00
sunxin d36ff9ebfa [Cherry-Pick 2.5][BugFix] Fix get_padding_offset in empty run (#6460) 2026-02-11 20:21:53 +08:00
GoldPancake 375d7fbffb fix bug (#6425) 2026-02-10 17:07:51 +08:00
Mattheliu c776d483e4 [BugFix]fix handle 4 return values from noaux_tc_redundant op (#6384)
* fix: handle 4 return values from noaux_tc_redundant op

The noaux_tc_redundant CUDA op is defined with 4 outputs in PD_BUILD_STATIC_OP:
- output_tensor (scores)
- topk_values
- topk_indices
- tokens_per_expert_stats_list_out (inplace updated)

The Python code was only unpacking 3 values, causing:
  ValueError: too many values to unpack (expected 3)

This fix correctly unpacks all 4 return values, ignoring the inplace
updated tensor which is the same as the input tokens_per_expert_stats_list.

Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com>

* fix: make noaux_tc_redundant return 4 values to match OP definition

The PD_BUILD_STATIC_OP defines 4 outputs but the function only returned 3,
causing inconsistent behavior across different Paddle framework versions.

This fix explicitly returns 4 values:
- scores (inplace modified)
- topk_values
- topk_indices
- tokens_per_expert_stats_list (inplace modified via atomicAdd)

Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com>

---------

Co-authored-by: Claude (Claude Opus 4.5) <noreply@anthropic.com>
2026-02-09 13:17:47 +08:00
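The unpacking mismatch this commit describes can be reproduced in plain Python. The `noaux_tc_redundant` below is a hypothetical mock that only mimics the op's declared 4-output signature from PD_BUILD_STATIC_OP; the tensor values are placeholders, and only the unpacking pattern reflects the fix:

```python
# Hypothetical mock of the op's Python binding: PD_BUILD_STATIC_OP declares
# 4 outputs, so the binding returns a 4-tuple.
def noaux_tc_redundant(gating_logits, tokens_per_expert_stats_list):
    scores = [0.9, 0.1]   # output_tensor (scores)
    topk_values = [0.9]   # topk_values
    topk_indices = [0]    # topk_indices
    # 4th output: the inplace-updated stats tensor (same object as the input).
    return scores, topk_values, topk_indices, tokens_per_expert_stats_list

stats = [0, 0]

# Before the fix: unpacking 3 names from a 4-tuple raises
# "ValueError: too many values to unpack (expected 3)".
try:
    scores, topk_values, topk_indices = noaux_tc_redundant([1.0, 0.5], stats)
except ValueError as err:
    print(err)

# After the fix: unpack all 4 values, discarding the inplace-updated tensor
# (it is the same object as the `stats` argument passed in).
scores, topk_values, topk_indices, _ = noaux_tc_redundant([1.0, 0.5], stats)
```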
周周周 2b4748de4f [MTP] refactor MTP pre_process (#6358) 2026-02-09 10:47:15 +08:00
jc d6b3c722c1 [KVCache] Storage cache supports c8 model (#6298)
* Refine cache transfer manager
* Storage cache supports c8 model
2026-02-06 12:01:17 +08:00
周周周 e3fb8796b4 Remove useless MTP rebuild_padding code (#6336) 2026-02-05 16:28:44 +08:00
chen 29a313a402 [Optimization] Support FA2/FA3/FA4 with attn_mask_q (#6354)
* support FA4 sm100

* flash attn backend support mask

* flash attn backend run flashmask correct

* add test for flash_attn_backend and flash_attn_func

* check

* add test for fa4

* requirements.txt add fa4 whl

* check test on sm100

* fix CI conflict

* add enable_torch_proxy for flash_mask

* lazy import fa4

* check

* fix tests import

* check test_load_mpt import
2026-02-05 14:39:00 +08:00
lizan1999 72edd394d9 [XPU] support noaux_tc (#6326) 2026-02-05 12:04:16 +08:00
fxyfxy777 36547cfdb3 [Feature] FD_USE_PHI_FP8_QUANT (#6320)
* add ut

* add use_fd_quant env

* rm mask_per_token_quant

* add make ops list

* USE_FD_FP8_QUANT -> FD_USE_PHI_FP8_QUANT, defaults to true

* modify comments

* use bool type

* Add function declaration
2026-02-03 22:33:03 -08:00
周周周 6225439778 add PADDLE_ENFORCE (#6321) 2026-02-04 10:47:19 +08:00
JYChen c745a22420 [Feature] Support Ernie FP8 on sm100 ( the fixed version) (#6304) 2026-02-03 17:47:38 +08:00
周周周 8277b95fa6 remove speculate_get_padding_offset op (#6308) 2026-02-03 15:18:12 +08:00
fxyfxy777 2ada119a38 [Optimize] optimize mask_quant & swiglu (#6222)
* optimize mask_quant op, ~1.5x speedup

* fix calculate sequence

* add fused

* rm log

* push kernel code

* add ut

* accuracy ok

* add ue8m0

* add ut

* add merge develop

* rm ut of mask_per_token_quant
2026-02-02 13:52:38 +08:00
xiaozude 030647521a [Metax] adapt to the latest develop (#6282) 2026-01-29 23:21:20 -08:00
JYChen 6c685c9474 Revert "[Feature] Support Ernie FP8 on sm100 (#5593)" (#6275)
This reverts commit eb80724b71.
2026-01-30 11:22:01 +08:00
JYChen eb80724b71 [Feature] Support Ernie FP8 on sm100 (#5593)
* Temporarily working DeepGEMM version

* dense part e8m0 OK

* Version where the EB model runs with E8M0

* code check

* support 21b-tp2, dev_paddle

* Single-node 4.5T EP working version

* Restore deleted code; single-node 4.5T EP (non-cudagraph)

* eb tp

* Support SM100 block-wise FP8 inference

* refine codes, support deepgemm on sm100

* add thirdparty PFCC/DeepGEMM

* fix ep decode

* Use DeepEP ue8m0 to fix accuracy issues

* Fix FP8 TP accuracy

* Upgrade DeepGEMM to adapt to the Hopper logic

* add ue8m0 kernel

* add ue8m0 kernel

* fix custom_ops/gpu_ops/cpp_extensions.cc

* EB output is normal

* eb5 text is right

* Accuracy looks consistent on inspection

* Self-tested accuracy aligned

* Replace masked_per_token_quant; EP accuracy OK

* ~30% performance improvement

* EP runs for now but still has issues

* Self-tested consistent

* rm test fun

* fix ep event

* Update DeepGEMM in graph-optimization ops

* fix build

* Temporarily work around the DeepGEMM CI build issue

* Select DeepGEMM version by SM architecture

* remove useless code

---------

Co-authored-by: ckl117 <ckl117@163.com>
Co-authored-by: K11OntheBoat <"ruianmaidanglao@163.com">
Co-authored-by: fxyfxy777 <fxyfxy777@163.com>
2026-01-29 13:49:54 +08:00
jc 7da5f54fb3 [CI] Add unit test for swap_layout && remove unit test of splitwise_scheduler (#6250)
* Add unit test for swap_layout

* remove splitwise_scheduler test
2026-01-28 19:20:20 +08:00
GoldPancake 7d6c87c29e [Others] Support constrained decoding when enable_thinking is false (#6248)
* support constrained decoding when enable_thinking is false

* fix

* fix

* fix
2026-01-28 00:05:17 -08:00
sunxin 27f8799f04 [Model Runner] Refactor execute_model for GPU async scheduling (#6176) 2026-01-28 14:19:33 +08:00
freeliuzc ce06c6dfb3 [BugFix] Fix token_penalty kernel (#6069)
* fix token_penalty kernel

* try to fix xpu

* fix xpu

* fix unit test
2026-01-28 12:03:05 +08:00
周周周 aa57864c5b remove unneeded para from flash_mask_attention (#6218) 2026-01-27 14:04:27 +08:00
yangjianfengo1 b3627b59f8 [Bug Fix] fix mask attention (#6216) 2026-01-26 07:46:26 -08:00
sunxin adc69c15d0 [Model Runner] Prepare token count and move FA3 initialization into the graph (#6170)
* prepare for token num and put FA3 init in graph
2026-01-26 12:16:57 +08:00
周周周 0966df78dc [Others] remove stop_nums (#6182) 2026-01-26 12:12:47 +08:00
RuohengMa 976203cf60 [XPU] fix text_image_gather_scatter in cudagraph mode (#6049) 2026-01-23 19:48:43 +08:00
lizan1999 b3a48529ab [XPU] add more type for recover batch sequence (#6142) 2026-01-23 15:16:05 +08:00
jc 309c7d9764 router support divided rollout (#6150) 2026-01-22 10:39:39 +08:00
lizexu123 f4902fe42d [BugFix] fix wint2 (#6109)
* fix

* fix

* fix
2026-01-20 21:46:21 +08:00
yinwei 51a8a2ed57 [XPU] Support CudaGraph(add block attn cuda_graph support) (#6116)
* add block attn cuda_graph support
2026-01-20 19:33:11 +08:00
zhupengyang 45ebb2efb4 [XPU] support plugin model (#6092) 2026-01-20 13:00:09 +08:00
sunxin a4144e0b8e [Optimization] Avoid unnecessary penalty computation (#6078) 2026-01-19 15:24:12 +08:00
GoldPancake bda38aa519 [Speculative Decoding] Support MTP for GLM-4.5-Air (#6047)
* glm mtp
* add spec neox partial rope
2026-01-16 14:35:24 +08:00