FastDeploy

mirror of https://github.com/PaddlePaddle/FastDeploy.git synced 2026-04-23 00:17:25 +08:00

Author	SHA1	Message	Date
google-labs-jules[bot]	ddec1b07f8	⚡ Bolt: [performance improvement] Pre-allocate np.full array for padding lists instead of using slow list concatenations in pad_batch_data The old implementation uses `[[pad_id] * (max_len - len(inst)) + list(inst) for inst in insts]` to pad list sequences. This performs an $O(N \times \text{max\_len})$ list concatenation, creating many intermediate Python lists and stressing the garbage collector, before finally passing the result to `np.array(..., dtype=np.int64)`. This change updates it to pre-allocate an empty numpy array (`np.full`) and safely populates it using numpy slicing (`padded_insts[i, :l] = inst`). The change results in a ~2x faster performance. This has been verified to be completely logically equivalent to the original un-modified processor output on a comprehensive set of test cases.	2026-04-13 15:14:37 +00:00
Yuanle Liu	0ddb6e461c	[Optimization] Remove the upper limit on num_blocks (#7241 )	2026-04-13 07:07:41 -07:00
lonelygsh	e83d45833f	[Speculate Decoding] Fix step_idx semantics in limit_thinking and set_stop_value kernels (#7166 ) - speculate_limit_thinking_content_length: update current_base_step to step_idx+1 (step_idx now records history count before current round); remove incorrect step_idx decrement on accept_num truncation; mark step_idx param as const. - speculate_set_stop_value_multi_seqs: fix can_stop gate to use step_idx_now+accept_num>=min_token_limit; fix skip check and pre_ids_idx formula (remove stale -accept_num offset); use <= condition so accept_idx maps directly to the accepted token that ends the stop sequence; fix accept_tokens index (remove -1). - Update unit tests for speculate_set_stop_value_multi_seqs kernel.	2026-04-13 20:53:42 +08:00
周周周	73bd4ab318	[Feature] Add explicit hidden_size parameter support to FusedMoE (#7361 ) [Feature] Add explicit hidden_size parameter support to FusedMoE	2026-04-13 20:24:58 +08:00
YuBaoku	1e08ee74e5	[CI] Modify 4-card container startup config and move test case (#7363 )	2026-04-13 05:23:49 -07:00
freeliuzc	31e2a8bbad	[Speculative Decoding] Support mtp super ultra overlap in pd-split mode with insert_task overlap (#7323 ) * support mtp overlap in pd-split mode with insert_task overlap	2026-04-13 19:41:17 +08:00
JYChen	5ddd1af756	remove fa4 requirements (#7143 )	2026-04-13 19:24:20 +08:00
AIbin	1fb8194191	[OP][Models][Optimization] Optimize the RoPE CUDA kernel and update the DeepSeek V3 configuration (#7359 ) * dsk del prefill mask * dsk support 1M+ seq_len rope * update rope tests * Replace max_position_embeddings with max_model_len * 1D grid: gridDim.x has a maximum size of 2^31-1, far exceeding the actual number of tokens.	2026-04-13 19:12:36 +08:00
Zhang Yulong	738c658c54	[Benchmark] Update seed argument handling in benchmark_serving.py (#7356 )	2026-04-13 16:05:50 +08:00
周周周	a6f0055d51	add ips check (#7352 ) * commit * commit --------- Co-authored-by: “liuruian” <liuruian@baidu.com>	2026-04-13 15:24:22 +08:00
liuruyan	b34708604c	[TI-consistent] support quant use pow2scale (#7308 ) * support quant use pow2scale * fix * fix	2026-04-13 00:01:53 -07:00
AIbin	6213ad5340	[Docs][BugFix] fix mla log (#7243 ) * [Docs] Fix Chinese punctuation issues	2026-04-13 12:15:43 +08:00
Nyako Shigure	d659099415	[Cleanup] Replace torch proxy alias with public compat API (#7348 )	2026-04-13 11:43:26 +08:00
Jiajun Ji	cb03958b52	[XPU] Refactor get_padding_offset to single kernel. (#7029 ) * [XPU] Refactor get_padding_offset to single kernel. * add unittest. * fix codestyle. * remove cum_offsets_now. * remove max_len.	2026-04-13 11:04:50 +08:00
Jiang-Jia-Jun	26d6a20c2f	[Optim] Remove IPCLock between CacheManager and WorkerProcess (#7299 ) * [Optim] Remove IPCLock between CacheManager and WorkerProcess * Update envs.py * Update worker_process.py --------- Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com>	2026-04-12 13:59:34 +08:00
周周周	225fc8d222	use self.hidden_size not use self.fd_config.model_config.hidden_size (#7340 )	2026-04-11 22:39:43 +08:00
chen	4982aa000e	[RL]moe bf16 ep support paddle batch_gemm (#7337 ) * moe bf16 ep support paddle batch_gemm	2026-04-11 21:51:12 +08:00
AIbin	ba01d7a823	[Optimization] [OP] [Models] dsk del prefill mask (#7313 ) * dsk del prefill mask * dsk support 1M+ seq_len rope * update rope tests	2026-04-11 19:32:27 +08:00
JYChen	076ab07528	[RL] change glm rope_emb calculation (#7316 ) * change glm rope_emb calculation * glm without EnforceFmulRN * fix ci	2026-04-11 18:36:28 +08:00
YuBaoku	fcf8b1336d	[CI] Fix nightly test error and add container cleanup in build_rl (#7335 ) * [CI] Fix nightly test error and add container cleanup in build_rl	2026-04-11 12:14:46 +08:00
Jiaxin Sui	6e5de2fd6d	[XPU][CI]Update xtdk version in download_dependencies.sh (#7320 )	2026-04-11 00:26:48 +08:00
YuBaoku	1269eda2f9	[CI] Ensure container cleanup after job to avoid resource leakage (#7315 ) * [CI] Ensure container cleanup after job to avoid resource leakage * [CI] Use prebuilt wheels to install xgrammar==0.1.19 and torch==2.6.0	2026-04-10 22:32:18 +08:00
sunxin	00005c92e0	[BugFix] Fix mtp empty run issue in overlap schedule and EP model (#7300 )	2026-04-10 03:29:45 -07:00
zhangbo9674	627f0d9cc8	[RL] change rms norm for glm (#7269 ) * change rms norm for glm * refine code * refine code * refine code	2026-04-10 01:02:37 -07:00
K11OntheBoat	870dbac370	Use triton qk_norm both in Prefill and Decode (#7213 ) Co-authored-by: “liuruian” <liuruian@baidu.com>	2026-04-10 15:44:01 +08:00
YuBaoku	5c9fa43150	[Docs] Update Release Note (#7302 )	2026-04-10 15:26:53 +08:00
yinwei	4aecaa70ba	[XPU][Docs] Update Release Note (#7262 ) * update * update docs * update docs * update commit * update commit --------- Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>	2026-04-10 15:22:16 +08:00
bukejiyu	14d46181b8	[Loader] add multi-thread model loading (#6877 ) * multi-thread-loader * fix ut	2026-04-09 23:40:15 -07:00
GoldPancake	c1fb3112f8	[FDConfig] Support CLI args for quantization params and add cudagraph validation (#7281 ) * refactor quant cli param	2026-04-10 14:13:42 +08:00
Zhang Yulong	7614175e13	Disable fixed random seed in benchmark_dataset.py (#7263 ) Commented out the random seed initialization to allow for varied randomness in benchmarks.	2026-04-10 13:56:14 +08:00
Jiang-Jia-Jun	e327673737	Update nvidia_gpu.md	2026-04-10 13:53:04 +08:00
ming1753	734fbcffde	[BugFix] Fix Async D2H copy bug & flash mash atten cache V out of bound bug (#7221 )	2026-04-10 11:31:51 +08:00
AIbin	3c54a41131	[Docs][Feature]add fastdeploy-llm-integration skill & research-report skill (#7287 ) * add fastdeploy-llm-integration skill & research-report skill	2026-04-10 11:24:23 +08:00
YuBaoku	b7b4fe6a69	[Docs][CI] Fix prebuilt wheel installation and update Docs (#7289 ) * [CI] Fix prebuilt wheel installation and update Docs * [CI] Update Dockerfile.gpu to restrict SM80/86/89/90, CUDA 12.6 and Python 3.10 * Update nvidia_gpu.md * Update nvidia_gpu.md * Revise NVIDIA GPU installation instructions Updated installation instructions for PaddlePaddle and FastDeploy to remove specific CUDA version mentions and clarify support for multiple GPU architectures. --------- Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>	2026-04-10 10:31:12 +08:00
YuBaoku	ee73623c76	[CI] Set high-risk OOM tests for sequential execution (#7268 )	2026-04-09 22:22:57 +08:00
YuBaoku	924690b791	[CI] Add no_proxy configuration for docker execution (#7283 )	2026-04-09 19:20:33 +08:00
lizexu123	613f92ee8f	[Feature] support nvfp4 tbo (#7259 )	2026-04-09 17:29:39 +08:00
AIbin	fcaf614133	[Docs]add dsk-3.2 doc (#7278 ) * add dsk-3.2 doc	2026-04-09 17:28:25 +08:00
周周周	1782872d61	add deep_ep hopper test (#7206 ) Co-authored-by: “liuruian” <liuruian@baidu.com>	2026-04-09 17:23:54 +08:00
fxyfxy777	39ff38aba1	[OP]Unify MoE op with moe_permute path for bf16 GLM (#7164 )	2026-04-09 16:17:56 +08:00
Jiang-Jia-Jun	33682c6749	[Docs] Update docs for release/2.5 (#7267 ) * Update docs for release/2.5 * Update English docs for release/2.5 - Update README_EN.md: add v2.5 news entry, reformat v2.4 entry with release link - Update docs/get_started/installation/nvidia_gpu.md: - Docker image: 2.4.0 -> 2.5.0, notice now shows SM80/86/89/90 support - paddlepaddle-gpu: 3.3.0 -> 3.3.1, add CUDA 12.9 alternatives - fastdeploy-gpu: 2.4.0 -> 2.5.0, unified arch install with CUDA 12.9 option - Update docs/zh/get_started/installation/nvidia_gpu.md: - Fix remaining paddlepaddle-gpu==3.3.0 refs in sections 4&5 -> 3.3.1 Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/fa0be381-324e-4b0d-b7a6-e2c1fa12174f Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> * Clarify --extra-index-url usage in installation docs Add note explaining that --extra-index-url is only for downloading fastdeploy-gpu dependencies; fastdeploy-gpu itself must be installed from the Paddle source specified by -i. Applied to both Chinese and English nvidia_gpu.md installation guides. Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/9fa8b3c9-7555-4eae-b9b9-026cddd7e74c Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> * Update nvidia_gpu.md --------- Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>	2026-04-09 16:07:18 +08:00
cloudforge1	85c6773e6c	[CI]【Hackathon 10th Spring No.33】add config unit tests (#6730 ) * [CI]【Hackathon 10th Spring No.33】add config unit tests * fix test_commit_config: reset fields before partial-file test * [CI]【Hackathon 10th Spring No.33】boost delta coverage for architecture helper branches * [CI]【Hackathon 10th Spring No.33】add version attr to model config mock * [CI]【Hackathon 10th Spring No.33】add mrope, runner validation, tail_layer coverage * [CI]【Hackathon 10th Spring No.33】boost: cover 96 more lines (FDConfig assertions, guided decoding, env branches) * [CI]【Hackathon 10th Spring No.33】config unit test * [CI]【Hackathon 10th Spring No.33】cover expert parallel branch * fix: reset commit hash before _load_from_version_file test; block cuda import via setitem(None) * refactor: convert to unittest.TestCase style per reviewer request --------- Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com> Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com> Co-authored-by: Tao Luo <luotao02@baidu.com>	2026-04-09 14:28:54 +08:00
cloudforge1	cefc724607	[CI]【Hackathon 10th Spring No.29】engine unit test (#6771 ) * [CI]【Hackathon 10th Spring No.29】engine unit test Merge with upstream test_engine.py (PR #7083) and add comprehensive coverage for LLMEngine: lifecycle, worker signals, requests, utils, stop_profile, and start error handling. * fix: add deploy_modality to _make_cfg() — Copilot review --------- Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com> Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com>	2026-04-09 13:45:59 +08:00
Jiaxin Sui	80d5d9fd32	[XPU][CI] lock xvllm version for fix bug (#7264 ) * Remove duplicate NICs from environment variables * Update version for xvllm in download_dependencies.sh	2026-04-09 12:44:27 +08:00
Bingoo	3d2326c1b9	[BugFix] detection jinja2 (#7251 ) * detection jinja2 * format	2026-04-09 11:30:16 +08:00
xiaoxiaohehe001	51efe27d76	[BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn (#7210 ) * [BugFix] fix_flash_mask_attn_sm90 * [BugFix] fix_flash_mask_attn_sm90 * [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn * [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn	2026-04-09 11:05:10 +08:00
JYChen	43ace7af25	[RL] support moe-topk use topk_reduce_func (#7218 ) * support moe-topk use topk_reduce_func * fix ep error * fix ut * fix ut	2026-04-09 11:01:03 +08:00
ShaneGZhu	7005404ce3	[DeepSeekV3.2][Graph Optimization]Remove synchronous operation to avoid capture fail and unnecessary contiguous in DSA Backend (#7253 ) * Delete contiguous ops. * fix scale * Delete unnecessary comments * fix style	2026-04-09 11:00:13 +08:00
AIbin	48d2bbeb74	fix dsa (#7252 )	2026-04-08 20:21:38 +08:00
Longzhi Wang	b262419db1	Revert "[Other] support video_fps args for video bench (#7077 )" (#7254 ) This reverts commit `938e7dd881`. Co-authored-by: TBD1 <798934910@qq.com>	2026-04-08 20:13:57 +08:00
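
The message of commit `ddec1b07f8` above describes replacing list-concatenation padding with a pre-allocated `np.full` array that is filled by slice assignment. The following is a minimal sketch of that idea, not the code from the commit: the function names `pad_concat_left` and `pad_prealloc` and the `side` parameter are hypothetical, introduced only for illustration. The message quotes a left-padding list expression alongside a slice of the form `padded_insts[i, :l] = inst`, which fills from the left edge, so the sketch shows both padding sides.

```python
import numpy as np


def pad_concat_left(insts, pad_id):
    """Baseline: left-pad with Python list concatenation, then convert once."""
    max_len = max((len(inst) for inst in insts), default=0)
    return np.array(
        [[pad_id] * (max_len - len(inst)) + list(inst) for inst in insts],
        dtype=np.int64,
    )


def pad_prealloc(insts, pad_id, side="left"):
    """Pre-allocate a pad-filled array, then copy each sequence in by slicing.

    `side` is a hypothetical parameter: "left" puts padding before the
    tokens, "right" puts padding after them.
    """
    max_len = max((len(inst) for inst in insts), default=0)
    padded = np.full((len(insts), max_len), pad_id, dtype=np.int64)
    for i, inst in enumerate(insts):
        n = len(inst)
        if n == 0:
            continue  # row stays entirely pad_id
        if side == "left":
            padded[i, max_len - n:] = inst
        else:
            padded[i, :n] = inst
    return padded


batch = [[1, 2, 3], [4], [5, 6]]
left = pad_prealloc(batch, pad_id=0, side="left")
right = pad_prealloc(batch, pad_id=0, side="right")
```

For a non-empty batch, `pad_prealloc(batch, 0, side="left")` produces the same array as `pad_concat_left(batch, 0)` without building an intermediate padded list per row. Whether this reproduces the roughly 2x speedup reported in the commit message depends on batch shape and is not measured here.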


5039 Commits