Commit Graph

5054 Commits

Author SHA1 Message Date
周周周 5e54770b2e [Feature] Add latent mode support to the MoE layer (#7382) 2026-04-15 13:57:07 +08:00
lonelygsh f7a2418ce2 [Speculate Decoding] Fix reasoning_phase_token_constraint call args in SpeculativeSampler (#7402) 2026-04-15 12:45:23 +08:00
AIbin 8995a38fa4 fix dsa indexer norm to layernorm (#7398) 2026-04-15 11:42:45 +08:00
AIbin bb30f88f1a [Models] support MLA gate attention (#7404)
* support mla gate attn

* support mla gate attn
2026-04-15 11:42:34 +08:00
chen 616b29ce08 check init_flash_attn_version log (#7399) 2026-04-15 11:05:10 +08:00
cmcamdy 13b9fe7299 [XPU] add verify draft tokens (#6947)
* [XPU] add verify draft tokens

* fix test

* fix code style

* use sync cpy

* fix code style

* fix kernel check

* fix random seed

* fix test

* fix check

* fix eos set

* fix verify

* fix verify
2026-04-15 10:18:33 +08:00
lonelygsh e0a1653b26 [Speculate Decoding] Fix bug of reasoning_phase_token_constraint kernel (#7349)
Co-authored-by: guanshihui <guanshihui@baidu.com>
2026-04-14 20:57:11 +08:00
sunxin 7b0baced17 fix rl moe gate type (#7393) 2026-04-14 20:04:04 +08:00
Echo-Nie 8819a039c9 [Others] Fix typo (#7280)
* typo

* typo

* typo

* typo
2026-04-14 17:28:22 +08:00
luukunn 9d9d79c457 [DataProcessor] add strict (#7307)
* add strict

* fix
2026-04-14 17:25:38 +08:00
kevin ff47701f31 [BugFix][PD Disaggregation][KVCache] Fix low cache hit rate in PD split scenario (#7364)
## Motivation

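In the PD disaggregation scenario, after receiving a request forwarded by the prefill node, the decode node did not update the cache block hit information in time,
which led to a low prefix cache hit rate and hurt inference performance.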

## Modifications

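1. In the `_free_blocks_when_stop` method, additionally exclude prefill nodes (`splitwise_role == "prefill"`)
   from cache block updates, so that prefill nodes do not update the cache repeatedly and leave it in an inconsistent state.
2. After a decode node successfully allocates a request (`_alloc_requests_with_cache`), proactively call
   `update_cache_blocks` with `need_prefill_tokens` to update the cache block information,
   so that the decode node correctly sees the prefix cache blocks that were hit.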
2026-04-14 16:15:43 +08:00
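A minimal sketch of the two modifications described in #7364, using a toy scheduler and cache manager. Only `_free_blocks_when_stop`, `_alloc_requests_with_cache`, `update_cache_blocks`, `splitwise_role`, and `need_prefill_tokens` come from the commit message. The class names, method signatures, and the always-successful allocation are hypothetical and are not taken from the project's real code.

```python
class ToyCacheManager:
    """Hypothetical stand-in that only records cache block updates."""

    def __init__(self):
        self.updates = []

    def update_cache_blocks(self, request_id, num_tokens):
        # The real signature is not shown in the commit; this one is assumed.
        self.updates.append((request_id, num_tokens))


class ToyScheduler:
    """Hypothetical scheduler illustrating the role checks from the commit."""

    def __init__(self, splitwise_role, cache_manager):
        self.splitwise_role = splitwise_role  # e.g. "prefill" or "decode"
        self.cache_manager = cache_manager

    def _free_blocks_when_stop(self, request_id, num_tokens):
        # Modification 1: prefill nodes skip the cache block update.
        if self.splitwise_role != "prefill":
            self.cache_manager.update_cache_blocks(request_id, num_tokens)

    def _alloc_requests_with_cache(self, request_id, need_prefill_tokens):
        allocated = True  # assumption: allocation always succeeds in this toy
        # Modification 2: a decode node updates cache block info right after
        # a successful allocation, using need_prefill_tokens.
        if allocated and self.splitwise_role == "decode":
            self.cache_manager.update_cache_blocks(request_id, need_prefill_tokens)
        return allocated


prefill = ToyScheduler("prefill", ToyCacheManager())
prefill._free_blocks_when_stop("req-0", 128)

decode = ToyScheduler("decode", ToyCacheManager())
decode._alloc_requests_with_cache("req-0", 128)
```

In this toy, the prefill scheduler records no update when a request stops, while the decode scheduler records one update as soon as allocation succeeds.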
Bingoo 9c23e6154c [Others] replace tool_helpers to fast_dataindex (#7353)
* replace tool_helpers to fast_dataindex

* modify others requirement
2026-04-14 15:13:54 +08:00
xiaoxiaohehe001 abba29b348 [BugFix] fix mm rope (#7274) 2026-04-14 11:36:08 +08:00
Yuanle Liu 8f21c9caa6 [BugFix] fix gitignore claude (#7381) 2026-04-13 20:32:45 -07:00
zhupengyang 27b00cf385 [XPU] glm-4.5-air (#7071) 2026-04-14 11:31:49 +08:00
chen 26c47c2afc update attn_mask_q 2 (#7371) 2026-04-13 23:06:04 +08:00
Yuanle Liu 0ddb6e461c [Optimization] Remove the upper limit on num_blocks (#7241) 2026-04-13 07:07:41 -07:00
lonelygsh e83d45833f [Speculate Decoding] Fix step_idx semantics in limit_thinking and set_stop_value kernels (#7166)
- speculate_limit_thinking_content_length: update current_base_step to
  step_idx+1 (step_idx now records history count before current round);
  remove incorrect step_idx decrement on accept_num truncation; mark
  step_idx param as const.
- speculate_set_stop_value_multi_seqs: fix can_stop gate to use
  step_idx_now+accept_num>=min_token_limit; fix skip check and pre_ids_idx
  formula (remove stale -accept_num offset); use <= condition so accept_idx
  maps directly to the accepted token that ends the stop sequence; fix
  accept_tokens index (remove -1).
- Update unit tests for speculate_set_stop_value_multi_seqs kernel.
2026-04-13 20:53:42 +08:00
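The corrected `can_stop` gate from #7166 can be sketched in plain Python, assuming (as the commit states) that `step_idx` records the history count before the current round. The function name and signature are hypothetical. The real check lives inside the `speculate_set_stop_value_multi_seqs` kernel and is not reproduced here.

```python
def can_stop(step_idx_now: int, accept_num: int, min_token_limit: int) -> bool:
    """Return True once history plus this round's accepted tokens reaches the minimum.

    step_idx_now: tokens generated before the current round (history count).
    accept_num: tokens accepted in the current speculative round.
    min_token_limit: minimum number of tokens required before stopping is allowed.
    """
    return step_idx_now + accept_num >= min_token_limit
```

For example, with 3 tokens of history and a minimum of 5, accepting 2 tokens in the current round opens the gate, while accepting only 1 does not.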
周周周 73bd4ab318 [Feature] Add explicit hidden_size parameter support to FusedMoE (#7361)
[Feature] Add explicit hidden_size parameter support to FusedMoE
2026-04-13 20:24:58 +08:00
YuBaoku 1e08ee74e5 [CI] Modify 4-card container startup config and move test case (#7363) 2026-04-13 05:23:49 -07:00
freeliuzc 31e2a8bbad [Speculative Decoding] Support mtp super ultra overlap in pd-split mode with insert_task overlap (#7323)
* support mtp overlap in pd-split mode with insert_task overlap
2026-04-13 19:41:17 +08:00
JYChen 5ddd1af756 remove fa4 requirements (#7143) 2026-04-13 19:24:20 +08:00
AIbin 1fb8194191 [OP][Models][Optimization] Optimize the RoPE CUDA kernel and update the DeepSeek V3 config (#7359)
* dsk del prefill mask

* dsk support 1M+ seq_len rope

* update rope tests

* Replace max_position_embeddings with max_model_len

* 1D grid: gridDim.x has a maximum size of 2^31-1, far exceeding the actual number of tokens.
2026-04-13 19:12:36 +08:00
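A quick arithmetic check of the 1D-grid note in #7359, assuming one block per token by default. The actual launch configuration is not shown in the commit, and `blocks_needed` and `fits_in_1d_grid` are hypothetical helpers.

```python
MAX_GRID_DIM_X = 2**31 - 1  # gridDim.x limit quoted in the commit message


def blocks_needed(num_tokens: int, tokens_per_block: int = 1) -> int:
    """Ceiling division: number of blocks needed to cover num_tokens."""
    return -(-num_tokens // tokens_per_block)


def fits_in_1d_grid(num_tokens: int, tokens_per_block: int = 1) -> bool:
    """Check whether a 1D launch over num_tokens stays within gridDim.x."""
    return blocks_needed(num_tokens, tokens_per_block) <= MAX_GRID_DIM_X
```

Under this assumption, a sequence of 1,048,576 tokens needs 1,048,576 blocks, far below the limit, which is consistent with the commit's remark that the limit greatly exceeds realistic token counts.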
Zhang Yulong 738c658c54 [Benchmark] Update seed argument handling in benchmark_serving.py (#7356) 2026-04-13 16:05:50 +08:00
周周周 a6f0055d51 add ips check (#7352)
* commit

* commit

---------

Co-authored-by: “liuruian” <liuruian@baidu.com>
2026-04-13 15:24:22 +08:00
liuruyan b34708604c [TI-consistent] support quant use pow2scale (#7308)
* support quant use pow2scale

* fix

* fix
2026-04-13 00:01:53 -07:00
AIbin 6213ad5340 [Docs][BugFix] fix mla log (#7243)
* [Docs] Fix Chinese punctuation issues
2026-04-13 12:15:43 +08:00
Nyako Shigure d659099415 [Cleanup] Replace torch proxy alias with public compat API (#7348) 2026-04-13 11:43:26 +08:00
Jiajun Ji cb03958b52 [XPU] Refactor get_padding_offset to single kernel. (#7029)
* [XPU] Refactor get_padding_offset to single kernel.

* add unittest.

* fix codestyle.

* remove cum_offsets_now.

* remove max_len.
2026-04-13 11:04:50 +08:00
Jiang-Jia-Jun 26d6a20c2f [Optim] Remove IPCLock between CacheManager and WorkerProcess (#7299)
* [Optim] Remove IPCLock between CacheManager and WorkerProcess

* Update envs.py

* Update worker_process.py

---------

Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com>
2026-04-12 13:59:34 +08:00
周周周 225fc8d222 use self.hidden_size not use self.fd_config.model_config.hidden_size (#7340) 2026-04-11 22:39:43 +08:00
chen 4982aa000e [RL]moe bf16 ep support paddle batch_gemm (#7337)
* moe bf16 ep support paddle batch_gemm
2026-04-11 21:51:12 +08:00
AIbin ba01d7a823 [Optimization] [OP] [Models] dsk del prefill mask (#7313)
* dsk del prefill mask

* dsk support 1M+ seq_len rope

* update rope tests
2026-04-11 19:32:27 +08:00
JYChen 076ab07528 [RL] change glm rope_emb calculation (#7316)
* change glm rope_emb calculation

* glm without EnforceFmulRN

* fix ci
2026-04-11 18:36:28 +08:00
YuBaoku fcf8b1336d [CI] Fix nightly test error and add container cleanup in build_rl (#7335)
* [CI] Fix nightly test error and add container cleanup in build_rl
2026-04-11 12:14:46 +08:00
Jiaxin Sui 6e5de2fd6d [XPU][CI]Update xtdk version in download_dependencies.sh (#7320) 2026-04-11 00:26:48 +08:00
YuBaoku 1269eda2f9 [CI] Ensure container cleanup after job to avoid resource leakage (#7315)
* [CI] Ensure container cleanup after job to avoid resource leakage

* [CI] Use prebuilt wheels to install xgrammar==0.1.19 and torch==2.6.0
2026-04-10 22:32:18 +08:00
sunxin 00005c92e0 [BugFix] Fix mtp empty run issue in overlap schedule and EP model (#7300) 2026-04-10 03:29:45 -07:00
zhangbo9674 627f0d9cc8 [RL] change rms norm for glm (#7269)
* change rms norm for glm

* refine code

* refine code

* refine code
2026-04-10 01:02:37 -07:00
K11OntheBoat 870dbac370 Use triton qk_norm both in Prefill and Decode (#7213)
Co-authored-by: “liuruian” <liuruian@baidu.com>
2026-04-10 15:44:01 +08:00
YuBaoku 5c9fa43150 [Docs] Update Release Note (#7302) 2026-04-10 15:26:53 +08:00
yinwei 4aecaa70ba [XPU][Docs] Update Release Note (#7262)
* update

* update docs

* update docs

* update commit

* update commit

---------

Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
2026-04-10 15:22:16 +08:00
bukejiyu 14d46181b8 [Loader] add multi-thread model loading (#6877)
* multi-thread-loader

* fix ut
2026-04-09 23:40:15 -07:00
GoldPancake c1fb3112f8 [FDConfig] Support CLI args for quantization params and add cudagraph validation (#7281)
* refactor quant cli param
2026-04-10 14:13:42 +08:00
Zhang Yulong 7614175e13 Disable fixed random seed in benchmark_dataset.py (#7263)
Commented out the random seed initialization to allow for varied randomness in benchmarks.
2026-04-10 13:56:14 +08:00
Jiang-Jia-Jun e327673737 Update nvidia_gpu.md 2026-04-10 13:53:04 +08:00
ming1753 734fbcffde [BugFix] Fix Async D2H copy bug & flash mash atten cache V out of bound bug (#7221) 2026-04-10 11:31:51 +08:00
AIbin 3c54a41131 [Docs][Feature]add fastdeploy-llm-integration skill & research-report skill (#7287)
* add fastdeploy-llm-integration skill & research-report skill
2026-04-10 11:24:23 +08:00
YuBaoku b7b4fe6a69 [Docs][CI] Fix prebuilt wheel installation and update Docs (#7289)
* [CI] Fix prebuilt wheel installation and update Docs

* [CI] Update Dockerfile.gpu to restrict SM80/86/89/90, CUDA 12.6 and Python 3.10

* Update nvidia_gpu.md

* Update nvidia_gpu.md

* Revise NVIDIA GPU installation instructions

Updated installation instructions for PaddlePaddle and FastDeploy to remove specific CUDA version mentions and clarify support for multiple GPU architectures.

---------

Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-04-10 10:31:12 +08:00
YuBaoku ee73623c76 [CI] Set high-risk OOM tests for sequential execution (#7268) 2026-04-09 22:22:57 +08:00