* support new mtp
* refactor(speculate_decoding and mtp): optimize MTP structure logic; update spec-branch status processing
* fix cuda-graph for spec-decoding
* fix xpu mtp and fix some notes
* fix unittest and optimize notes
* fix model status update in eos-branch
* fix(custom_ops): gate unsupported ops for sm70/sm75 build
* fix(custom_ops): gate deepgemm exports to sm75+ only
* [BugFix][OP] deduplicate CUDA sources to avoid moe_deepgemm multiple definition
* revert two custom_ops files to 352f922f9
* [BugFix] Cap nvcc -t threads to avoid compilation failures on high-core machines
On machines with many cores (e.g. 192), the nvcc -t flag was set to
os.cpu_count(), causing each nvcc process to spawn that many internal
threads. Combined with Paddle's ThreadPoolExecutor launching parallel
compilations (also sized by cpu_count), this led to ~28K+ threads,
resource exhaustion, and silent compilation failures. The linker then
could not find the missing .o files, but a second build succeeded because
already-compiled objects were cached.
Cap nvcc -t at 4 to keep total parallelism reasonable.
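The cap described above amounts to a one-line clamp in the build script. A minimal sketch (the function name and the exact cap location are illustrative, not the actual build-script code):

```python
import os

def nvcc_thread_count(cap=4):
    # Each nvcc process spawns up to `-t N` internal threads. Since
    # Paddle's ThreadPoolExecutor already launches ~cpu_count() nvcc
    # processes in parallel, an uncapped N of cpu_count() yields roughly
    # cpu_count()**2 threads on large machines. Clamp the per-process
    # value so total parallelism stays bounded.
    return min(cap, os.cpu_count() or 1)
```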
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
- Add WRAPPER_CHECK_PTR for pointer validity checks
- Add WRAPPER_ASSERT_GT/GE/LE for parameter range validation
- Simplify wrapper function calls to direct return pattern
* [Feature] Add batch-invariant RMSNorm kernel and TP embedding Custom AR path
- Add Triton-based rms_norm_batch_invariant kernel for M-invariant RMSNorm
- Add linear/linear_v2 tracking wrappers in batch_invariant_mode
- Route TP VocabParallelEmbedding through Custom AR instead of NCCL
- Increase FD_CUSTOM_AR_MAX_SIZE_MB default from 8 to 64
- Add unit tests for RMSNorm and TP embedding invariance
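As a reference for what "batch-invariant" means here, a plain-Python sketch (the real kernel is written in Triton; names and the eps value are illustrative): every reduction stays within one row, so a row's output does not depend on how many other rows share the batch.

```python
import math

def rms_norm_row(x, weight, eps=1e-6):
    # y_j = x_j / rms(x) * w_j, with rms reduced over this row only
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

def rms_norm(batch, weight, eps=1e-6):
    # Batch-invariant: no reduction crosses rows, so varying the batch
    # size M leaves each individual row's result bit-identical.
    return [rms_norm_row(row, weight, eps) for row in batch]
```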
* [Fix] Fix test tolerances for bfloat16 RMSNorm and custom AR buffer size
- Relax bfloat16 atol from 1e-3 to 1e-2 for D=3584 in RMSNorm numerical
correctness test (0.0078125 diff is expected at bfloat16 precision)
- Update test_communication expected buffer size from 8MB to 64MB to match
FD_CUSTOM_AR_MAX_SIZE_MB default change in envs.py
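The 0.0078125 figure matches bfloat16's precision: the format carries 7 explicit significand bits, so the spacing between representable values in [1, 2) is 2**-7, which is why a diff of exactly that size is expected rather than a bug:

```python
# bfloat16 stores an 8-bit significand (7 explicit bits), so the ulp
# for values in [1.0, 2.0) is 2**-7 = 0.0078125 -- differences of this
# size are inherent to the format, hence relaxing atol to 1e-2.
BF16_EXPLICIT_SIGNIFICAND_BITS = 7
ulp_near_one = 2.0 ** -BF16_EXPLICIT_SIGNIFICAND_BITS
```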
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add RMSNorm layer batch_invariant_mode unit test for coverage
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add pragma no cover for Triton kernel and multi-GPU embedding path
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* optimize speculate pre process unit test
* Add CUDA kernel for building sampling params in speculative decoding
* init infer seed in device
* format code
* add unittest & fix
* fix
* format-code
* format-code
* fix rebase
* .
* fix unittest
* support num worst tokens
* support num worst tokens
* fix build error
* support num worst tokens: fix errors
* support num worst tokens: fix field
* support num worst tokens: delete requirements
* replace permute and depermute op by pure cuda
* replace permute and depermute op by pure cuda
* fix ci
* fix op
* fix nan
* fix code style
---------
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
* [XPU] Add return value checks for all XPU kernel launches
- Add -fxpu-launch-return compiler flag in CMakeLists.txt to enable
kernel launch return values
- Add KERNEL_ASSERT_SUCCESS(ctx, ret_xre) checks after every XPU
kernel launch across 45 wrapper files (55 launch sites total)
- Covers both main wrapper/ and mtp_wrapper/ directories
- Properly handles multiple kernel launches in the same function
scope by reusing the ret_xre variable
* [XPU] code style fix
* fix: handle 4 return values from noaux_tc_redundant op
The noaux_tc_redundant CUDA op is defined with 4 outputs in PD_BUILD_STATIC_OP:
- output_tensor (scores)
- topk_values
- topk_indices
- tokens_per_expert_stats_list_out (inplace updated)
The Python code was only unpacking 3 values, causing:
ValueError: too many values to unpack (expected 3)
This fix correctly unpacks all 4 return values, ignoring the inplace
updated tensor which is the same as the input tokens_per_expert_stats_list.
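The caller-side fix reduces to unpacking four values and discarding the inplace-updated stats tensor. A sketch with a stub standing in for the real CUDA op (shapes and values are illustrative):

```python
def noaux_tc_redundant_stub(gating, tokens_per_expert_stats_list):
    # Stand-in for the real custom op: returns 4 outputs, the last being
    # the very same inplace-updated stats tensor that was passed in.
    scores = [g * 0.5 for g in gating]
    best = max(range(len(scores)), key=lambda i: scores[i])
    topk_values = [scores[best]]
    topk_indices = [best]
    return scores, topk_values, topk_indices, tokens_per_expert_stats_list

stats = [0, 0]
# Unpack all 4 outputs; the 4th is identical to `stats`, so ignore it.
scores, topk_values, topk_indices, same_stats = noaux_tc_redundant_stub(
    [0.2, 0.8], stats
)
```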
Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com>
* fix: make noaux_tc_redundant return 4 values to match OP definition
The PD_BUILD_STATIC_OP defines 4 outputs but the function only returned 3,
causing inconsistent behavior across different Paddle framework versions.
This fix explicitly returns 4 values:
- scores (inplace modified)
- topk_values
- topk_indices
- tokens_per_expert_stats_list (inplace modified via atomicAdd)
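On the op side, the wrapper's return list simply grows to include both inplace-modified tensors, matching the 4 declared outputs. An illustrative pure-Python sketch (the real op runs on CUDA and uses atomicAdd; names and the top-k logic here are assumptions):

```python
def noaux_tc_redundant(scores, tokens_per_expert_stats_list, k=1):
    # Compute top-k over the (inplace-modified) scores, bump the
    # per-expert stats (mimicking the kernel's atomicAdd), and return
    # all 4 declared outputs.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    topk_indices = order[:k]
    topk_values = [scores[i] for i in topk_indices]
    for i in topk_indices:
        tokens_per_expert_stats_list[i] += 1
    return scores, topk_values, topk_indices, tokens_per_expert_stats_list
```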
Co-Authored-By: Claude (Claude Opus 4.5) <noreply@anthropic.com>
---------
Co-authored-by: Claude (Claude Opus 4.5) <noreply@anthropic.com>