lonelygsh
f7a2418ce2
[Speculate Decoding] Fix reasoning_phase_token_constraint call args in SpeculativeSampler ( #7402 )
2026-04-15 12:45:23 +08:00
AIbin
8995a38fa4
fix dsa indexer norm to layernorm ( #7398 )
2026-04-15 11:42:45 +08:00
AIbin
bb30f88f1a
[Models] support MLA gate attention ( #7404 )
...
* support mla gate attn
* support mla gate attn
2026-04-15 11:42:34 +08:00
chen
616b29ce08
check init_flash_attn_version log ( #7399 )
2026-04-15 11:05:10 +08:00
cmcamdy
13b9fe7299
[XPU] add verify draft tokens ( #6947 )
...
* [XPU] add verify draft tokens
* fix test
* fix code style
* use sync cpy
* fix code style
* fix kernel check
* fix random seed
* fix test
* fix check
* fix eos set
* fix verify
* fix verify
2026-04-15 10:18:33 +08:00
lonelygsh
e0a1653b26
[Speculate Decoding] Fix bug of reasoning_phase_token_constraint kernel ( #7349 )
...
Co-authored-by: guanshihui <guanshihui@baidu.com>
2026-04-14 20:57:11 +08:00
sunxin
7b0baced17
fix rl moe gate type ( #7393 )
2026-04-14 20:04:04 +08:00
Echo-Nie
8819a039c9
[Others] Fix typo ( #7280 )
...
* typo
* typo
* typo
* typo
2026-04-14 17:28:22 +08:00
luukunn
9d9d79c457
[DataProcessor] add strict ( #7307 )
...
* add strict
* fix
2026-04-14 17:25:38 +08:00
kevin
ff47701f31
[BugFix][PD Disaggregation][KVCache] Fix low cache hit rate in PD split scenario ( #7364 )
...
## Motivation
In the PD-disaggregated scenario, after the decode node receives a request forwarded by the prefill node, it does not promptly update the cache block hit information, which lowers the prefix cache hit rate and hurts inference performance.
## Modifications
1. In `_free_blocks_when_stop`, additionally exclude prefill nodes (`splitwise_role == "prefill"`) from the cache block update, so the prefill node does not redundantly update the cache and corrupt its state.
2. After the decode node successfully allocates a request (`_alloc_requests_with_cache`), proactively call `update_cache_blocks` with `need_prefill_tokens` to update the cache block information, ensuring the decode node correctly recognizes the prefix cache hits.
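The two fixes above can be sketched as follows, assuming a much-simplified cache manager; the method names mirror the commit, but the class, its fields, and the allocation stub are hypothetical stand-ins, not FastDeploy's actual implementation:

```python
# Hedged sketch of the PD-disaggregation cache-hit fix (assumed structure,
# not FastDeploy's real CacheManager).
class CacheManagerSketch:
    def __init__(self, splitwise_role: str):
        self.splitwise_role = splitwise_role  # "prefill" or "decode"
        self.cache_hits = {}  # request_id -> matched prefix token count

    def _free_blocks_when_stop(self, request_id: str) -> None:
        # Fix 1: prefill nodes must not touch cache-hit bookkeeping here,
        # otherwise they redundantly update state owned by the decode node.
        if self.splitwise_role == "prefill":
            return
        self.cache_hits.pop(request_id, None)

    def update_cache_blocks(self, request_id: str, need_prefill_tokens: int) -> None:
        self.cache_hits[request_id] = need_prefill_tokens

    def _alloc_requests_with_cache(self, request_id: str, need_prefill_tokens: int) -> bool:
        allocated = True  # the real block allocation is elided in this sketch
        if allocated and self.splitwise_role == "decode":
            # Fix 2: record the hit info right after allocation so the decode
            # node sees the prefix cache state forwarded by the prefill node.
            self.update_cache_blocks(request_id, need_prefill_tokens)
        return allocated
```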
2026-04-14 16:15:43 +08:00
Bingoo
9c23e6154c
[Others] replace tool_helpers with fast_dataindex ( #7353 )
...
* replace tool_helpers with fast_dataindex
* modify others requirement
2026-04-14 15:13:54 +08:00
xiaoxiaohehe001
abba29b348
[BugFix] fix mm rope ( #7274 )
2026-04-14 11:36:08 +08:00
Yuanle Liu
8f21c9caa6
[BugFix] fix gitignore claude ( #7381 )
2026-04-13 20:32:45 -07:00
zhupengyang
27b00cf385
[XPU] glm-4.5-air ( #7071 )
2026-04-14 11:31:49 +08:00
chen
26c47c2afc
update attn_mask_q 2 ( #7371 )
2026-04-13 23:06:04 +08:00
Yuanle Liu
0ddb6e461c
[Optimization] Remove the upper limit on num_blocks ( #7241 )
2026-04-13 07:07:41 -07:00
lonelygsh
e83d45833f
[Speculate Decoding] Fix step_idx semantics in limit_thinking and set_stop_value kernels ( #7166 )
...
- speculate_limit_thinking_content_length: update current_base_step to
step_idx+1 (step_idx now records history count before current round);
remove incorrect step_idx decrement on accept_num truncation; mark
step_idx param as const.
- speculate_set_stop_value_multi_seqs: fix can_stop gate to use
step_idx_now+accept_num>=min_token_limit; fix skip check and pre_ids_idx
formula (remove stale -accept_num offset); use <= condition so accept_idx
maps directly to the accepted token that ends the stop sequence; fix
accept_tokens index (remove -1).
- Update unit tests for speculate_set_stop_value_multi_seqs kernel.
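The corrected semantics above can be illustrated with a small sketch; the real kernels are CUDA inside FastDeploy, and the Python below is only a hypothetical stand-in using names taken from the commit text:

```python
# Hedged sketch of the fixed semantics (assumed simplification of the
# speculate_* kernels described in the commit, not the actual CUDA code).

def update_base_step(step_idx: int) -> int:
    # speculate_limit_thinking_content_length: step_idx now records the
    # history count *before* the current round, so the base step for this
    # round is step_idx + 1 (with no decrement on accept_num truncation).
    return step_idx + 1

def can_stop(step_idx_now: int, accept_num: int, min_token_limit: int) -> bool:
    # speculate_set_stop_value_multi_seqs: a sequence may stop only once
    # the tokens accepted this round push it past the minimum length.
    return step_idx_now + accept_num >= min_token_limit
```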
2026-04-13 20:53:42 +08:00
周周周
73bd4ab318
[Feature] Add explicit hidden_size parameter support to FusedMoE ( #7361 )
...
[Feature] Add explicit hidden_size parameter support to FusedMoE
2026-04-13 20:24:58 +08:00
YuBaoku
1e08ee74e5
[CI] Modify 4-card container startup config and move test case ( #7363 )
2026-04-13 05:23:49 -07:00
freeliuzc
31e2a8bbad
[Speculative Decoding] Support mtp super ultra overlap in pd-split mode with insert_task overlap ( #7323 )
...
* support mtp overlap in pd-split mode with insert_task overlap
2026-04-13 19:41:17 +08:00
JYChen
5ddd1af756
remove fa4 requirements ( #7143 )
2026-04-13 19:24:20 +08:00
AIbin
1fb8194191
[OP][Models][Optimization] Optimize the RoPE CUDA kernel and update the DeepSeek V3 config ( #7359 )
...
* dsk del prefill mask
* dsk support 1M+ seq_len rope
* update rope tests
* Replace max_position_embeddings with max_model_len
* 1D grid: gridDim.x has a maximum size of 2^31-1, far exceeding the actual number of tokens.
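The launch-bound note above can be checked numerically; the block size and helper below are hypothetical illustrations, not FastDeploy's actual kernel configuration:

```python
# Hedged sketch: CUDA's gridDim.x limit is 2**31 - 1 blocks, so a 1-D grid
# comfortably covers 1M+ tokens. Block size 256 is an assumed example value.
CUDA_MAX_GRID_DIM_X = 2**31 - 1

def grid_size_1d(num_tokens: int, block_size: int = 256) -> int:
    blocks = (num_tokens + block_size - 1) // block_size  # ceiling division
    assert blocks <= CUDA_MAX_GRID_DIM_X, "1-D grid limit exceeded"
    return blocks
```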
2026-04-13 19:12:36 +08:00
Zhang Yulong
738c658c54
[Benchmark] Update seed argument handling in benchmark_serving.py ( #7356 )
2026-04-13 16:05:50 +08:00
周周周
a6f0055d51
add ips check ( #7352 )
...
* commit
* commit
---------
Co-authored-by: liuruian <liuruian@baidu.com>
2026-04-13 15:24:22 +08:00
liuruyan
b34708604c
[TI-consistent] support quant use pow2scale ( #7308 )
...
* support quant use pow2scale
* fix
* fix
2026-04-13 00:01:53 -07:00
AIbin
6213ad5340
[Docs][BugFix] fix mla log ( #7243 )
...
* [Docs] Fix Chinese punctuation issues
2026-04-13 12:15:43 +08:00
Nyako Shigure
d659099415
[Cleanup] Replace torch proxy alias with public compat API ( #7348 )
2026-04-13 11:43:26 +08:00
Jiajun Ji
cb03958b52
[XPU] Refactor get_padding_offset into a single kernel. ( #7029 )
...
* [XPU] Refactor get_padding_offset to single kernel.
* add unittest.
* fix codestyle.
* remove cum_offsets_now.
* remove max_len.
2026-04-13 11:04:50 +08:00
Jiang-Jia-Jun
26d6a20c2f
[Optim] Remove IPCLock between CacheManager and WorkerProcess ( #7299 )
...
* [Optim] Remove IPCLock between CacheManager and WorkerProcess
* Update envs.py
* Update worker_process.py
---------
Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com>
2026-04-12 13:59:34 +08:00
周周周
225fc8d222
use self.hidden_size instead of self.fd_config.model_config.hidden_size ( #7340 )
2026-04-11 22:39:43 +08:00
chen
4982aa000e
[RL] moe bf16 ep support paddle batch_gemm ( #7337 )
...
* moe bf16 ep support paddle batch_gemm
2026-04-11 21:51:12 +08:00
AIbin
ba01d7a823
[Optimization] [OP] [Models] dsk del prefill mask ( #7313 )
...
* dsk del prefill mask
* dsk support 1M+ seq_len rope
* update rope tests
2026-04-11 19:32:27 +08:00
JYChen
076ab07528
[RL] change glm rope_emb calculation ( #7316 )
...
* change glm rope_emb calculation
* glm without EnforceFmulRN
* fix ci
2026-04-11 18:36:28 +08:00
YuBaoku
fcf8b1336d
[CI] Fix nightly test error and add container cleanup in build_rl ( #7335 )
...
* [CI] Fix nightly test error and add container cleanup in build_rl
2026-04-11 12:14:46 +08:00
Jiaxin Sui
6e5de2fd6d
[XPU][CI]Update xtdk version in download_dependencies.sh ( #7320 )
2026-04-11 00:26:48 +08:00
YuBaoku
1269eda2f9
[CI] Ensure container cleanup after job to avoid resource leakage ( #7315 )
...
* [CI] Ensure container cleanup after job to avoid resource leakage
* [CI] Use prebuilt wheels to install xgrammar==0.1.19 and torch==2.6.0
2026-04-10 22:32:18 +08:00
sunxin
00005c92e0
[BugFix] Fix mtp empty run issue in overlap schedule and EP model ( #7300 )
2026-04-10 03:29:45 -07:00
zhangbo9674
627f0d9cc8
[RL] change rms norm for glm ( #7269 )
...
* change rms norm for glm
* refine code
* refine code
* refine code
2026-04-10 01:02:37 -07:00
K11OntheBoat
870dbac370
Use triton qk_norm in both Prefill and Decode ( #7213 )
...
Co-authored-by: liuruian <liuruian@baidu.com>
2026-04-10 15:44:01 +08:00
YuBaoku
5c9fa43150
[Docs] Update Release Note ( #7302 )
2026-04-10 15:26:53 +08:00
yinwei
4aecaa70ba
[XPU][Docs] Update Release Note ( #7262 )
...
* update
* update docs
* update docs
* update commit
* update commit
---------
Co-authored-by: Jiaxin Sui <95567040+plusNew001@users.noreply.github.com>
2026-04-10 15:22:16 +08:00
bukejiyu
14d46181b8
[Loader] add multi-thread model loading ( #6877 )
...
* multi-thread-loader
* fix ut
2026-04-09 23:40:15 -07:00
GoldPancake
c1fb3112f8
[FDConfig] Support CLI args for quantization params and add cudagraph validation ( #7281 )
...
* refactor quant cli param
2026-04-10 14:13:42 +08:00
Zhang Yulong
7614175e13
Disable fixed random seed in benchmark_dataset.py ( #7263 )
...
Commented out the random seed initialization to allow for varied randomness in benchmarks.
2026-04-10 13:56:14 +08:00
Jiang-Jia-Jun
e327673737
Update nvidia_gpu.md
2026-04-10 13:53:04 +08:00
ming1753
734fbcffde
[BugFix] Fix async D2H copy bug & flash mask attn cache-V out-of-bound bug ( #7221 )
2026-04-10 11:31:51 +08:00
AIbin
3c54a41131
[Docs][Feature]add fastdeploy-llm-integration skill & research-report skill ( #7287 )
...
* add fastdeploy-llm-integration skill & research-report skill
2026-04-10 11:24:23 +08:00
YuBaoku
b7b4fe6a69
[Docs][CI] Fix prebuilt wheel installation and update Docs ( #7289 )
...
* [CI] Fix prebuilt wheel installation and update Docs
* [CI] Update Dockerfile.gpu to restrict SM80/86/89/90, CUDA 12.6 and Python 3.10
* Update nvidia_gpu.md
* Update nvidia_gpu.md
* Revise NVIDIA GPU installation instructions
Updated installation instructions for PaddlePaddle and FastDeploy to remove specific CUDA version mentions and clarify support for multiple GPU architectures.
---------
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
2026-04-10 10:31:12 +08:00
YuBaoku
ee73623c76
[CI] Set high-risk OOM tests for sequential execution ( #7268 )
2026-04-09 22:22:57 +08:00
YuBaoku
924690b791
[CI] Add no_proxy configuration for docker execution ( #7283 )
2026-04-09 19:20:33 +08:00