Commit Graph

465 Commits

Author SHA1 Message Date
AIbin c3aceb6bdc [Models][OP][Optimization] Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM (#6689)
* Support DeepSeek-v3.2 model, integrate DSA & Indexer architecture with FlashMLA/DeepGEMM
2026-03-10 15:05:14 +08:00
sunxin 28f7727a3d [Feature] Set overlap schedule as default (#6668)
* overlap default
2026-03-09 22:34:54 +08:00
周周周 3897a0b4fc nvfp4 clean code (#6671) 2026-03-09 18:00:34 +08:00
gongweibao 30f9f33f34 [Feature][BugFix][OP] Enhance Deterministic Inference Mode with Kernel-level Fixes and Batch-invariant BMM (#6610)
* add fa deter

* add ut

* add long sentence

* fix basic

* fix bugs

* fix adn

* fix first

* fix single

* fix single

* fix single test

* refine

* add more test

* refine comments

* add comments of bmm

* fix ci

* remove probe

* add

* remove not need

* refine tests

* fix comments and refine code

* refine code

* refine test

* refine test

* mv 4cards tests

* fix tests

* add

* fix comments

* fix coverage

* fix coverage

---------

Co-authored-by: gongweibao <gognweibao@baidu.com>
2026-03-09 10:27:53 +08:00
周周周 cebe6f7dae clean nvfp4 related code (#6644) 2026-03-05 15:48:33 +08:00
ming1753 81e04bf5d1 [BugFix] fix flash attn mtp rope emb bug (#6649) 2026-03-04 21:19:12 +08:00
bukejiyu 598cce8545 [RL] Support SM100 FP8 quantization in RL (#6601)
* RL SM100 Fix

* update
2026-03-04 04:55:04 -08:00
zhupengyang 1256fd3806 [XPU] weight only quant method support QKVGate_proj (#6641) 2026-03-04 18:25:03 +08:00
yzwu 3345641f4e [Iluvatar][CI] fix the dim error of seq_lens_encoder and seq_lens_decoder (#6637) 2026-03-04 14:00:40 +08:00
ming1753 02d32eea3b Revert "[Bug Fix] Fix MM mtp incorrect rope emb (#6581)" (#6631)
This reverts commit c5eb6b65e7.
2026-03-04 11:23:28 +08:00
ming1753 c5eb6b65e7 [Bug Fix] Fix MM mtp incorrect rope emb (#6581)
* [Bug Fix] Fix MM mtp incorrect rope emb
2026-03-03 19:28:59 +08:00
RichardWooSJTU 61789febb9 [Quantization] Support to load static quant ue8m0 scale of DeepGEMM via v0_loader (#6433)
* support to load static quant ue8m0 scale of deepgemm via v0_loader

* [Fix] Fix ue8m0 scale pack dimension calculation and block size validation

1. Fix pack dimension calculation in fused_moe_triton_backend.py:
   - Changed from `ceil_div(...) // 4` to `(num_scales + 3) // 4` for correct ceiling division
   - This ensures sufficient pack allocation when num_scales is not a multiple of 4

2. Fix block size hardcoding in block_wise_fp8.py:
   - Use `self.quant_config.weight_block_size` instead of hardcoded `[128, 128]`
   - Add assertion to ensure weight_block_size is `[128, 128]` for ue8m0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 11:32:35 +08:00
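For reference, the ceiling-division fix described in the commit body above can be sketched in a few lines of Python (the function name and test values here are illustrative, not the repository's actual code):

```python
# Minimal sketch of the pack-dimension fix: round up when num_scales is not
# a multiple of the pack width, so enough packs are always allocated.
def packed_dim(num_scales: int, pack_width: int = 4) -> int:
    # Equivalent to ceil(num_scales / pack_width) using integer arithmetic.
    return (num_scales + pack_width - 1) // pack_width

assert packed_dim(8) == 2   # exact multiple: 8 scales -> 2 packs
assert packed_dim(5) == 2   # not a multiple: plain floor division would give 1
```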
chen 1cae7a0d53 weight only quant method support QKVGate_proj (#6612) 2026-03-03 11:19:32 +08:00
周周周 3cc09418f1 support dsv3 use flashmla (#6593) 2026-03-03 11:09:43 +08:00
ming1753 33d6d2403c [BugFix] fix bug when seq_lens_this_time is 2D (#6613) 2026-03-02 23:52:03 +08:00
MingkunZhang 3cf7c6c281 [Metax][Fix] fix ci error based pr#6535 (#6600) 2026-03-02 18:50:16 +08:00
ming1753 344db8c8af [BugFix] Fix mtp when token_ids_all is None (#6591)
* [BugFix] Fix mtp when token_ids_all is None

* fix bug
2026-03-02 01:23:44 -08:00
RichardWooSJTU 7bd86f99a5 [BugFix] Fix tbo nan (#6439) 2026-03-02 14:28:48 +08:00
yzwu 6674131b0b [Iluvatar] Support CudaGraph and optimize flash_attn_unpadded and fused_neox_rope_embedding (#6553) 2026-03-02 14:07:17 +08:00
周周周 d957ccd46d seq_lens related tensor shape -> [max_num_seqs] (#6535) 2026-03-02 11:18:30 +08:00
chen 5382fb2c60 [BugFix] lazy enable_torch_proxy for cutlass (#6523)
* lazy enable_torch_proxy for cutlass

* test init_flash_attn_version
2026-03-02 10:43:58 +08:00
RichardWooSJTU 7cfb0ffba0 fix pfcc deep ep in low latency mode (#6440) 2026-03-02 10:35:51 +08:00
AIbin 59b578c337 [Feature]Supports SWA based on appendattn (#6547) 2026-03-01 19:02:08 +08:00
zccjjj a2072fe20c [XPU] support warmup with ep & remove apply_tp_fused_op (#6289) 2026-02-28 15:40:36 +08:00
ming1753 97eee75677 [Feature] GPU Memory Optimization and Retirement of V0 Scheduler (#6407)
* Optim GPU Mem Usage

---------

Co-authored-by: huzesen <huzesen@baidu.com>
2026-02-28 15:07:43 +08:00
YuBaoku 54f7d9f621 [CI] Sync mm_batch_invariant with paddle.mm update (#6557) 2026-02-28 14:56:42 +08:00
Weiguo Zhu 8fb24122b8 fix reshard error (#6536) 2026-02-27 22:22:37 +08:00
JYChen c6d8fbe526 [BugFix] fix log with paddlefleet.ops (#6528) 2026-02-27 14:34:29 +08:00
sunxin 53aaac69da [Optimization] Enable BF16 gate computation for GLM and Qwen (#6457)
* gate bf16

* add gate-fp32

* fix

* update baseline

* update

* update

* fix
2026-02-26 21:08:46 -08:00
gongweibao edd31e8849 [Feature] Add Deterministic Inference Support (#6476)
* add

* [tests] Add Paddle attention determinism tests and refactor resource manager

Add comprehensive determinism tests for Paddle attention layer and refactor
resource manager for deterministic mode support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* add

* add

* add

* add

* add more

* add more

* fixsome

* fixsome

* fix bugs

* fix bugs

* only in gpu

* add docs

* fix comments

* fix some

* fix some

* fix comments

* add more

* fix potential problem

* remove not need

* remove not need

* remove no need

* fix bug

* fix bugs

* fix comments

* fix comments

* Update tests/ce/deterministic/test_determinism_verification.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tests/inter_communicator/test_ipc_signal.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tests/layers/test_paddle_attention_determinism.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tests/engine/test_sampling_params_determinism.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tests/layers/test_paddle_attention_determinism.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tests/layers/test_paddle_attention_determinism_standalone.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix comments

* fix import error

* fix a bug

* fix bugs

* fix bugs

* fix coverage

* refine codes

* refine code

* fix comments

* fix comments

* fix comments

* rm not need

* fix allreduce large tensor bug

* mv log files

* mv log files

* add files

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-02-26 19:31:51 -08:00
zccjjj c34cb2a8c2 [XPU] [bugfix] fix moe_ffn_quant_type_map bugs about datatype and tensorshape (#6337) 2026-02-27 09:55:41 +08:00
chen 2d1531f3cb dev opensource model support fa4/flashmasV2/V3 (#6518) 2026-02-26 17:46:05 +08:00
zhupengyang a303eacf62 [XPU] support norm before rope (#6475) 2026-02-25 18:43:44 +08:00
Longzhi Wang 22566168c3 [Feature] support qkv&gate linear fusion (#6455)
* [Feature] support qkv&gate linear fusion

* add test
2026-02-24 15:20:29 +08:00
AIbin 0eb87467f8 [BugFix]fix RL bug about blockwisefp8 (#6466)
* fix RL bug about blockwisefp8

* fix the same bug in moe

* fix RL FP8 bug
2026-02-12 09:15:29 +08:00
JYChen 40c952e7b5 fix deepgemm import (#6451)
Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
2026-02-11 20:10:01 +08:00
zhupengyang 4a8c54926b [XPU] topk_method=noaux_tc (#6355) 2026-02-11 16:12:20 +08:00
yzwu 60e75ea8e8 [Iluvatar][CI] Fix cannot import get_stop (#6165) 2026-02-10 16:57:23 +08:00
chen d937d6ebfd check (#6424) 2026-02-10 15:55:17 +08:00
chen a8ffcaa068 fix fa4 test (#6408) 2026-02-10 10:57:21 +08:00
bukejiyu 5bfc0938e2 [BugFix] PD reorder fix and add ut (#6375) 2026-02-09 04:42:48 -08:00
sunxin 783d56e28a [Optimization] Support logprob async copy (#6362)
* support logprob async copy

* fix prompt logprob

* fix xpu
2026-02-09 17:32:12 +08:00
bukejiyu dc5917289d [loader] support wint2 backend (#6139)
* support wint2

* update
2026-02-08 22:42:36 -08:00
Mattheliu c776d483e4 [BugFix]fix handle 4 return values from noaux_tc_redundant op (#6384)
* fix: handle 4 return values from noaux_tc_redundant op

The noaux_tc_redundant CUDA op is defined with 4 outputs in PD_BUILD_STATIC_OP:
- output_tensor (scores)
- topk_values
- topk_indices
- tokens_per_expert_stats_list_out (inplace updated)

The Python code was only unpacking 3 values, causing:
  ValueError: too many values to unpack (expected 3)

This fix correctly unpacks all 4 return values, ignoring the inplace
updated tensor which is the same as the input tokens_per_expert_stats_list.

Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com>

* fix: make noaux_tc_redundant return 4 values to match OP definition

The PD_BUILD_STATIC_OP defines 4 outputs but the function only returned 3,
causing inconsistent behavior across different Paddle framework versions.

This fix explicitly returns 4 values:
- scores (inplace modified)
- topk_values
- topk_indices
- tokens_per_expert_stats_list (inplace modified via atomicAdd)

Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com>

---------

Co-authored-by: Claude (Claude Opus 4.5) <noreply@anthropic.com>
2026-02-09 13:17:47 +08:00
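The unpacking mismatch described in the commit body above can be illustrated with a plain-Python stand-in for the CUDA op (the stub below is hypothetical; only the four-output contract comes from the commit message):

```python
# Stand-in for noaux_tc_redundant: returns four outputs, matching the
# PD_BUILD_STATIC_OP definition cited in the commit message.
def noaux_tc_redundant_stub(scores, tokens_per_expert_stats_list):
    topk_values, topk_indices = [0.9, 0.8], [3, 1]  # dummy top-k results
    return scores, topk_values, topk_indices, tokens_per_expert_stats_list

outs = noaux_tc_redundant_stub([0.1] * 8, [0] * 8)
# Buggy call site: scores, vals, idx = outs
#   -> ValueError: too many values to unpack (expected 3)
scores, vals, idx, _ = outs  # fix: unpack all 4, discard the inplace-updated stats
```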
JYChen 9bcd863902 [Others] support import deepgemm/deepep from fleet ops (#6351)
* update paddleformers to v1.0

* only change import fleetpath
2026-02-09 11:53:13 +08:00
周周周 2b4748de4f [MTP] refactor MTP pre_process (#6358) 2026-02-09 10:47:15 +08:00
K11OntheBoat 116e2aea7a Support Norm before Rope (#6332)
Co-authored-by: K11OntheBoat <ruianmaidanglao@163.com>
2026-02-05 15:28:52 +08:00
chen 29a313a402 [Optimization] Support FA2/FA3/FA4 with attn_mask_q (#6354)
* support FA4 sm100

* flash attn backend support mask

* flash attn backend run flashmask correct

* add test for flash_attn_backend and flash_attn_func

* check

* add test for fa4

* requirements.txt add fa4 whl

* check test on sm100

* fix CI conflict

* add enable_torch_proxy for flash_mask

* lazy import fa4

* check

* fix tests import

* check test_load_mpt import
2026-02-05 14:39:00 +08:00
GoldPancake 183b8d325a [RL] Support GLM MTP RL Model (#6267) 2026-02-04 20:14:35 +08:00
fxyfxy777 36547cfdb3 [Feature] FD_USE_PHI_FP8_QUANT (#6320)
* add ut

* add use_fd_quant env

* rm mask_per_token_quant

* add make ops list

* USE_FD_FP8_QUANT -> FD_USE_PHI_FP8_QUANT (defaults to true)

* modify comments

* use bool type

* Add function declaration
2026-02-03 22:33:03 -08:00