Commit Graph

5082 Commits

Author SHA1 Message Date
AIbin 1fb8194191 [OP][Models][Optimization] Optimize RoPE CUDA kernel and update DeepSeek V3 configuration (#7359)
* dsk del prefill mask

* dsk support 1M+ seq_len rope

* update rope tests

* Replace max_position_embeddings with max_model_len

* 1D grid: gridDim.x has a maximum size of 2^31-1, far exceeding the actual number of tokens.
2026-04-13 19:12:36 +08:00
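The 1D-grid remark above can be sanity-checked with a quick back-of-the-envelope calculation (this is an illustrative sketch, not code from the PR; `blocks_needed` is a hypothetical helper): even a 1M+ token sequence needs only a few thousand blocks, far under the gridDim.x ceiling.

```python
# gridDim.x on modern NVIDIA GPUs is limited to 2**31 - 1 blocks,
# so a flat 1D launch comfortably covers very long sequences.
MAX_GRID_DIM_X = 2**31 - 1


def blocks_needed(num_tokens: int, threads_per_block: int = 256) -> int:
    """One thread per token: ceil-divide token count by block size."""
    return (num_tokens + threads_per_block - 1) // threads_per_block


# A 1M-token sequence uses a tiny fraction of the grid limit.
print(blocks_needed(1_048_576))                     # 4096
print(blocks_needed(1_048_576) <= MAX_GRID_DIM_X)   # True
```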
Zhang Yulong 738c658c54 [Benchmark] Update seed argument handling in benchmark_serving.py (#7356) 2026-04-13 16:05:50 +08:00
周周周 a6f0055d51 add ips check (#7352)
* commit

* commit

---------

Co-authored-by: “liuruian” <liuruian@baidu.com>
2026-04-13 15:24:22 +08:00
liuruyan b34708604c [TI-consistent] support quant use pow2scale (#7308)
* support quant use pow2scale

* fix

* fix
2026-04-13 00:01:53 -07:00
AIbin 6213ad5340 [Docs][BugFix] fix mla log (#7243)
* [Docs] Fix Chinese punctuation issues
2026-04-13 12:15:43 +08:00
Nyako Shigure d659099415 [Cleanup] Replace torch proxy alias with public compat API (#7348) 2026-04-13 11:43:26 +08:00
Jiajun Ji cb03958b52 [XPU] Refactor get_padding_offset to single kernel. (#7029)
* [XPU] Refactor get_padding_offset to single kernel.

* add unittest.

* fix codestyle.

* remove cum_offsets_now.

* remove max_len.
2026-04-13 11:04:50 +08:00
Jiang-Jia-Jun 26d6a20c2f [Optim] Remove IPCLock between CacheManager and WorkerProcess (#7299)
* [Optim] Remove IPCLock between CacheManager and WorkerProcess

* Update envs.py

* Update worker_process.py

---------

Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com>
2026-04-12 13:59:34 +08:00
周周周 225fc8d222 use self.hidden_size instead of self.fd_config.model_config.hidden_size (#7340) 2026-04-11 22:39:43 +08:00
chen 4982aa000e [RL]moe bf16 ep support paddle batch_gemm (#7337)
* moe bf16 ep support paddle batch_gemm
2026-04-11 21:51:12 +08:00
AIbin ba01d7a823 [Optimization] [OP] [Models] dsk del prefill mask (#7313)
* dsk del prefill mask

* dsk support 1M+ seq_len rope

* update rope tests
2026-04-11 19:32:27 +08:00
JYChen 076ab07528 [RL] change glm rope_emb calculation (#7316)
* change glm rope_emb calculation

* glm without EnforceFmulRN

* fix ci
2026-04-11 18:36:28 +08:00
YuBaoku fcf8b1336d [CI] Fix nightly test error and add container cleanup in build_rl (#7335)
* [CI] Fix nightly test error and add container cleanup in build_rl
2026-04-11 12:14:46 +08:00
Jiaxin Sui 6e5de2fd6d [XPU][CI]Update xtdk version in download_dependencies.sh (#7320) 2026-04-11 00:26:48 +08:00
YuBaoku 1269eda2f9 [CI] Ensure container cleanup after job to avoid resource leakage (#7315)
* [CI] Ensure container cleanup after job to avoid resource leakage

* [CI] Use prebuilt wheels to install xgrammar==0.1.19 and torch==2.6.0
2026-04-10 22:32:18 +08:00
sunxin 00005c92e0 [BugFix] Fix mtp empty run issue in overlap schedule and EP model (#7300) 2026-04-10 03:29:45 -07:00
zhangbo9674 627f0d9cc8 [RL] change rms norm for glm (#7269)
* change rms norm for glm

* refine code

* refine code

* refine code
2026-04-10 01:02:37 -07:00
K11OntheBoat 870dbac370 Use triton qk_norm both in Prefill and Decode (#7213)
Co-authored-by: “liuruian” <liuruian@baidu.com>
2026-04-10 15:44:01 +08:00
YuBaoku 5c9fa43150 [Docs] Update Release Note (#7302) 2026-04-10 15:26:53 +08:00
yinwei 4aecaa70ba [XPU][Docs] Update Release Note (#7262)
* update

* update docs

* update docs

* update commit

* update commit

---------

Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
2026-04-10 15:22:16 +08:00
bukejiyu 14d46181b8 [Loader] add multi-thread model loading (#6877)
* multi-thread-loader

* fix ut
2026-04-09 23:40:15 -07:00
GoldPancake c1fb3112f8 [FDConfig] Support CLI args for quantization params and add cudagraph validation (#7281)
* refactor quant cli param
2026-04-10 14:13:42 +08:00
Zhang Yulong 7614175e13 Disable fixed random seed in benchmark_dataset.py (#7263)
Commented out the random seed initialization to allow for varied randomness in benchmarks.
2026-04-10 13:56:14 +08:00
Jiang-Jia-Jun e327673737 Update nvidia_gpu.md 2026-04-10 13:53:04 +08:00
ming1753 734fbcffde [BugFix] Fix Async D2H copy bug & flash mask attn cache V out-of-bound bug (#7221) 2026-04-10 11:31:51 +08:00
AIbin 3c54a41131 [Docs][Feature]add fastdeploy-llm-integration skill & research-report skill (#7287)
* add fastdeploy-llm-integration skill & research-report skill
2026-04-10 11:24:23 +08:00
YuBaoku b7b4fe6a69 [Docs][CI] Fix prebuilt wheel installation and update Docs (#7289)
* [CI] Fix prebuilt wheel installation and update Docs

* [CI] Update Dockerfile.gpu to restrict SM80/86/89/90, CUDA 12.6 and Python 3.10

* Update nvidia_gpu.md

* Update nvidia_gpu.md

* Revise NVIDIA GPU installation instructions

Updated installation instructions for PaddlePaddle and FastDeploy to remove specific CUDA version mentions and clarify support for multiple GPU architectures.

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-04-10 10:31:12 +08:00
YuBaoku ee73623c76 [CI] Set high-risk OOM tests for sequential execution (#7268) 2026-04-09 22:22:57 +08:00
YuBaoku 924690b791 [CI] Add no_proxy configuration for docker execution (#7283) 2026-04-09 19:20:33 +08:00
lizexu123 613f92ee8f [Feature] support nvfp4 tbo (#7259) 2026-04-09 17:29:39 +08:00
AIbin fcaf614133 [Docs]add dsk-3.2 doc (#7278)
* add dsk-3.2 doc
2026-04-09 17:28:25 +08:00
周周周 1782872d61 add deep_ep hopper test (#7206)
Co-authored-by: “liuruian” <liuruian@baidu.com>
2026-04-09 17:23:54 +08:00
fxyfxy777 39ff38aba1 [OP]Unify MoE op with moe_permute path for bf16 GLM (#7164) 2026-04-09 16:17:56 +08:00
Jiang-Jia-Jun 33682c6749 [Docs] Update docs for release/2.5 (#7267)
* Update docs for release/2.5

* Update English docs for release/2.5

- Update README_EN.md: add v2.5 news entry, reformat v2.4 entry with release link
- Update docs/get_started/installation/nvidia_gpu.md:
  - Docker image: 2.4.0 -> 2.5.0, notice now shows SM80/86/89/90 support
  - paddlepaddle-gpu: 3.3.0 -> 3.3.1, add CUDA 12.9 alternatives
  - fastdeploy-gpu: 2.4.0 -> 2.5.0, unified arch install with CUDA 12.9 option
- Update docs/zh/get_started/installation/nvidia_gpu.md:
  - Fix remaining paddlepaddle-gpu==3.3.0 refs in sections 4&5 -> 3.3.1

Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/fa0be381-324e-4b0d-b7a6-e2c1fa12174f

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* Clarify --extra-index-url usage in installation docs

Add note explaining that --extra-index-url is only for downloading
fastdeploy-gpu dependencies; fastdeploy-gpu itself must be installed
from the Paddle source specified by -i. Applied to both Chinese and
English nvidia_gpu.md installation guides.

Agent-Logs-Url: https://github.com/PaddlePaddle/FastDeploy/sessions/9fa8b3c9-7555-4eae-b9b9-026cddd7e74c

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>

* Update nvidia_gpu.md

---------

Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
2026-04-09 16:07:18 +08:00
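The `--extra-index-url` note in the commit above can be sketched as a pip invocation (the index URL below is a placeholder for illustration, not the real Paddle source):

```shell
# -i sets the primary index that serves the fastdeploy-gpu wheel itself;
# --extra-index-url only supplies its dependencies (e.g. from PyPI).
# Placeholder index URL -- substitute the Paddle source from the docs.
pip install fastdeploy-gpu \
    -i https://example.com/paddle-wheel-index/ \
    --extra-index-url https://pypi.org/simple
```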
cloudforge1 85c6773e6c [CI]【Hackathon 10th Spring No.33】Add config unit tests (#6730)
* [CI]【Hackathon 10th Spring No.33】Add config unit tests

* fix test_commit_config: reset fields before partial-file test

* [CI]【Hackathon 10th Spring No.33】boost delta coverage for architecture helper branches

* [CI]【Hackathon 10th Spring No.33】add version attr to model config mock

* [CI]【Hackathon 10th Spring No.33】add mrope, runner validation, tail_layer coverage

* [CI]【Hackathon 10th Spring No.33】boost: cover 96 more lines (FDConfig assertions, guided decoding, env branches)

* [CI]【Hackathon 10th Spring No.33】config unit test

* [CI]【Hackathon 10th Spring No.33】cover expert parallel branch

* fix: reset commit hash before _load_from_version_file test; block cuda import via setitem(None)

* refactor: convert to unittest.TestCase style per reviewer request

---------

Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com>
Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com>
Co-authored-by: Tao Luo <luotao02@baidu.com>
2026-04-09 14:28:54 +08:00
cloudforge1 cefc724607 [CI]【Hackathon 10th Spring No.29】engine unit test (#6771)
* [CI]【Hackathon 10th Spring No.29】engine unit test

Merge with upstream test_engine.py (PR #7083) and add comprehensive
coverage for LLMEngine: lifecycle, worker signals, requests, utils,
stop_profile, and start error handling.

* fix: add deploy_modality to _make_cfg() — Copilot review

---------

Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com>
Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com>
2026-04-09 13:45:59 +08:00
Jiaxin Sui 80d5d9fd32 [XPU][CI] lock xvllm version to fix a bug (#7264)
* Remove duplicate NICs from environment variables

* Update version for xvllm in download_dependencies.sh
2026-04-09 12:44:27 +08:00
Bingoo 3d2326c1b9 [BugFix] detection jinja2 (#7251)
* detection jinja2

* format
2026-04-09 11:30:16 +08:00
xiaoxiaohehe001 51efe27d76 [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn (#7210)
* [BugFix] fix_flash_mask_attn_sm90

* [BugFix] fix_flash_mask_attn_sm90

* [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn

* [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn
2026-04-09 11:05:10 +08:00
JYChen 43ace7af25 [RL] support moe-topk use topk_reduce_func (#7218)
* support moe-topk use topk_reduce_func

* fix ep error

* fix ut

* fix ut
2026-04-09 11:01:03 +08:00
ShaneGZhu 7005404ce3 [DeepSeekV3.2][Graph Optimization]Remove synchronous operation to avoid capture fail and unnecessary contiguous in DSA Backend (#7253)
* Delete contiguous ops.

* fix scale

* Delete unnecessary comments

* fix style
2026-04-09 11:00:13 +08:00
AIbin 48d2bbeb74 fix dsa (#7252) 2026-04-08 20:21:38 +08:00
Longzhi Wang b262419db1 Revert "[Other] support video_fps args for video bench (#7077)" (#7254)
This reverts commit 938e7dd881.

Co-authored-by: TBD1 <798934910@qq.com>
2026-04-08 20:13:57 +08:00
chenjian 427efadaee [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1 (#7159)
* [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1

* [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1

* fix
2026-04-08 19:30:54 +08:00
Jiajun Ji 9b970de029 [XPU] Add TP broadcast after sampling in XPU model runner to ensure consistent results across ranks. (#7096) 2026-04-08 19:26:53 +08:00
3em0 3749457476 [BugFix] fix multimodal hasher hash collision risk when ndarray shape or dtype differs (#7185)
numpy tobytes() only serializes raw element bytes without encoding shape
or dtype metadata. This means arrays with identical raw bytes but
different shapes (e.g. (6,4) vs (4,6)) or different dtypes (e.g.
float32 vs uint8 reinterpretation of same memory) produce the same
SHA-256 digest, leading to silent cache collisions in
ProcessorCacheManager / EncoderCacheManager / PrefixCacheManager.

Prepend a "{shape}|{dtype}|" header to the byte payload before hashing
so that shape and dtype participate in the digest.

Added test cases for shape and dtype sensitivity.
2026-04-08 04:26:02 -07:00
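The collision described in the commit above can be reproduced in a few lines (an illustrative sketch; `hash_with_header` is a hypothetical helper, the actual FastDeploy function name may differ):

```python
import hashlib

import numpy as np

a = np.arange(24, dtype=np.float32).reshape(6, 4)
b = a.reshape(4, 6)  # same raw bytes, different shape

# Plain tobytes() hashing collides, because shape/dtype are not serialized.
h1 = hashlib.sha256(a.tobytes()).hexdigest()
h2 = hashlib.sha256(b.tobytes()).hexdigest()
assert h1 == h2  # the silent cache collision


def hash_with_header(arr: np.ndarray) -> str:
    """Prepend a "{shape}|{dtype}|" header so both participate in the digest."""
    header = f"{arr.shape}|{arr.dtype}|".encode()
    return hashlib.sha256(header + arr.tobytes()).hexdigest()


assert hash_with_header(a) != hash_with_header(b)  # collision resolved
```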
Jiaxin Sui fbc3aa93de [XPU][CI] Remove duplicate NICs from environment variables (#7244) 2026-04-08 19:14:15 +08:00
RichardWooSJTU 771d42c90b [TBO] Apply tbo to gpu_model_runner (#7165)
* apply tbo in gpu_model_runner

* fix
2026-04-08 16:55:17 +08:00
YuBaoku 4cd574cf90 [CI] Reduce execution time for ngram kernel tests (#7242) 2026-04-08 16:54:46 +08:00
Bingoo 043f2a16e3 support moe for sm103 (#7238) 2026-04-08 15:52:39 +08:00