luukunn
3f84d8d893
[DataProcessor] Refactor multimodal processor: extract encoding strategies and unify MM processing pipeline ( #7298 )
...
* merge mm processor
2026-04-15 19:01:06 +08:00
luukunn
14d556692b
[BugFix] fix tool call parser ( #7369 )
...
* fix tool call parser
* add unit test
* fix unit test
* add unit test
2026-04-15 16:21:46 +08:00
AIbin
8eebbcaf15
[BugFix][Scheduler] Fix FD_DISABLE_CHUNKED_PREFILL max_num_batched_tokens limit ( #7407 )
...
* fix FD_DISABLE_CHUNKED_PREFILL max_num_batched_tokens=max_model_len
* fix FD_DISABLE_CHUNKED_PREFILL max_num_batched_tokens=max_model_len
2026-04-15 15:55:11 +08:00
周周周
5e54770b2e
[Feature] Add latent mode support for the MoE layer ( #7382 )
2026-04-15 13:57:07 +08:00
lonelygsh
f7a2418ce2
[Speculate Decoding] Fix reasoning_phase_token_constraint call args in SpeculativeSampler ( #7402 )
2026-04-15 12:45:23 +08:00
AIbin
8995a38fa4
fix dsa indexer norm to layernorm ( #7398 )
2026-04-15 11:42:45 +08:00
AIbin
bb30f88f1a
[Models] support MLA gate attention ( #7404 )
...
* support mla gate attn
* support mla gate attn
2026-04-15 11:42:34 +08:00
chen
616b29ce08
check init_flash_attn_version log ( #7399 )
2026-04-15 11:05:10 +08:00
sunxin
7b0baced17
fix rl moe gate type ( #7393 )
2026-04-14 20:04:04 +08:00
Echo-Nie
8819a039c9
[Others] Fix typo ( #7280 )
...
* typo
* typo
* typo
* typo
2026-04-14 17:28:22 +08:00
luukunn
9d9d79c457
[DataProcessor] add strict ( #7307 )
...
* add strict
* fix
2026-04-14 17:25:38 +08:00
kevin
ff47701f31
[BugFix][PD Disaggregation][KVCache] Fix low cache hit rate in PD split scenario ( #7364 )
...
## Motivation
In the PD-disaggregated scenario, the decode node did not promptly update cache-block hit information after receiving requests forwarded by the prefill node, leading to a low prefix-cache hit rate and degraded inference performance.
## Modifications
1. In `_free_blocks_when_stop`, additionally exclude cache-block updates on the prefill node (`splitwise_role == "prefill"`), so the prefill node does not redundantly update the cache and corrupt its state.
2. After the decode node successfully allocates a request (`_alloc_requests_with_cache`), proactively call `update_cache_blocks` with `need_prefill_tokens` to update cache-block information, ensuring the decode node correctly tracks the prefix-cache blocks that were already hit.
2026-04-14 16:15:43 +08:00
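The fix in #7364 can be illustrated with a minimal block-level prefix-cache sketch (all names here are hypothetical illustrations, not the FastDeploy API): if the decode node never registers the blocks filled during prefill, every later request with the same prefix misses the cache.

```python
class PrefixCache:
    """Toy block-level prefix cache; each block covers block_size tokens."""

    def __init__(self, block_size: int = 4):
        self.block_size = block_size
        self.blocks = {}  # token-prefix key -> block id

    def match(self, tokens):
        """Return ids of cached blocks covering the longest prefix of tokens."""
        hits = []
        usable = len(tokens) - len(tokens) % self.block_size
        for i in range(0, usable, self.block_size):
            key = tuple(tokens[: i + self.block_size])
            if key not in self.blocks:
                break
            hits.append(self.blocks[key])
        return hits

    def update(self, tokens):
        """Register full blocks of this request as cached (the step the
        decode node was skipping after receiving a forwarded request)."""
        usable = len(tokens) - len(tokens) % self.block_size
        for i in range(0, usable, self.block_size):
            key = tuple(tokens[: i + self.block_size])
            self.blocks.setdefault(key, len(self.blocks))


cache = PrefixCache()
req = list(range(12))
assert cache.match(req) == []      # without the update: 0% hit rate
cache.update(req)                  # the fix: update after allocation succeeds
assert len(cache.match(req)) == 3  # identical prefix now hits 3 blocks
```

The real scheduler keys blocks by hashes of KV-cache contents rather than raw token tuples, but the failure mode is the same: skipping `update` on the decode side makes `match` return nothing for prefixes that are actually resident.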
xiaoxiaohehe001
abba29b348
[BugFix] fix mm rope ( #7274 )
2026-04-14 11:36:08 +08:00
zhupengyang
27b00cf385
[XPU] glm-4.5-air ( #7071 )
2026-04-14 11:31:49 +08:00
Yuanle Liu
0ddb6e461c
[Optimization] Remove the upper limit on num_blocks ( #7241 )
2026-04-13 07:07:41 -07:00
周周周
73bd4ab318
[Feature] Add explicit hidden_size parameter support to FusedMoE ( #7361 )
...
[Feature] Add explicit hidden_size parameter support to FusedMoE
2026-04-13 20:24:58 +08:00
freeliuzc
31e2a8bbad
[Speculative Decoding] Support mtp super ultra overlap in pd-split mode with insert_task overlap ( #7323 )
...
* support mtp overlap in pd-split mode with insert_task overlap
2026-04-13 19:41:17 +08:00
AIbin
1fb8194191
[OP][Models][Optimization] Optimize the RoPE CUDA kernel and update the DeepSeek V3 config ( #7359 )
...
* dsk del prefill mask
* dsk support 1M+ seq_len rope
* update rope tests
* Replace max_position_embeddings with max_model_len
* 1D grid: gridDim.x has a maximum size of 2^31-1, far exceeding the actual number of tokens.
2026-04-13 19:12:36 +08:00
周周周
a6f0055d51
add ips check ( #7352 )
...
* commit
* commit
---------
Co-authored-by: “liuruian” <liuruian@baidu.com >
2026-04-13 15:24:22 +08:00
liuruyan
b34708604c
[TI-consistent] support quant use pow2scale ( #7308 )
...
* support quant use pow2scale
* fix
* fix
2026-04-13 00:01:53 -07:00
AIbin
6213ad5340
[Docs][BugFix] fix mla log ( #7243 )
...
* [Docs] Fix Chinese punctuation issues
2026-04-13 12:15:43 +08:00
Nyako Shigure
d659099415
[Cleanup] Replace torch proxy alias with public compat API ( #7348 )
2026-04-13 11:43:26 +08:00
Jiajun Ji
cb03958b52
[XPU] Refactor get_padding_offset to single kernel. ( #7029 )
...
* [XPU] Refactor get_padding_offset to single kernel.
* add unittest.
* fix codestyle.
* remove cum_offsets_now.
* remove max_len.
2026-04-13 11:04:50 +08:00
Jiang-Jia-Jun
26d6a20c2f
[Optim] Remove IPCLock between CacheManager and WorkerProcess ( #7299 )
...
* [Optim] Remove IPCLock between CacheManager and WorkerProcess
* Update envs.py
* Update worker_process.py
---------
Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com >
2026-04-12 13:59:34 +08:00
周周周
225fc8d222
use self.hidden_size instead of self.fd_config.model_config.hidden_size ( #7340 )
2026-04-11 22:39:43 +08:00
chen
4982aa000e
[RL]moe bf16 ep support paddle batch_gemm ( #7337 )
...
* moe bf16 ep support paddle batch_gemm
2026-04-11 21:51:12 +08:00
AIbin
ba01d7a823
[Optimization] [OP] [Models] dsk del prefill mask ( #7313 )
...
* dsk del prefill mask
* dsk support 1M+ seq_len rope
* update rope tests
2026-04-11 19:32:27 +08:00
JYChen
076ab07528
[RL] change glm rope_emb calculation ( #7316 )
...
* change glm rope_emb calculation
* glm without EnforceFmulRN
* fix ci
2026-04-11 18:36:28 +08:00
sunxin
00005c92e0
[BugFix] Fix mtp empty run issue in overlap schedule and EP model ( #7300 )
2026-04-10 03:29:45 -07:00
zhangbo9674
627f0d9cc8
[RL] change rms norm for glm ( #7269 )
...
* change rms norm for glm
* refine code
* refine code
* refine code
2026-04-10 01:02:37 -07:00
K11OntheBoat
870dbac370
Use triton qk_norm both in Prefill and Decode ( #7213 )
...
Co-authored-by: “liuruian” <liuruian@baidu.com >
2026-04-10 15:44:01 +08:00
bukejiyu
14d46181b8
[Loader] add multi-thread model loading ( #6877 )
...
* multi-thread-loader
* fix ut
2026-04-09 23:40:15 -07:00
GoldPancake
c1fb3112f8
[FDConfig] Support CLI args for quantization params and add cudagraph validation ( #7281 )
...
* refactor quant cli param
2026-04-10 14:13:42 +08:00
lizexu123
613f92ee8f
[Feature] support nvfp4 tbo ( #7259 )
2026-04-09 17:29:39 +08:00
fxyfxy777
39ff38aba1
[OP]Unify MoE op with moe_permute path for bf16 GLM ( #7164 )
2026-04-09 16:17:56 +08:00
xiaoxiaohehe001
51efe27d76
[BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn ( #7210 )
...
* [BugFix] fix_flash_mask_attn_sm90
* [BugFix] fix_flash_mask_attn_sm90
* [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn
* [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn
2026-04-09 11:05:10 +08:00
JYChen
43ace7af25
[RL] support moe-topk use topk_reduce_func ( #7218 )
...
* support moe-topk use topk_reduce_func
* fix ep error
* fix ut
* fix ut
2026-04-09 11:01:03 +08:00
ShaneGZhu
7005404ce3
[DeepSeekV3.2][Graph Optimization] Remove synchronous operation to avoid capture fail and unnecessary contiguous in DSA Backend ( #7253 )
...
* Delete contiguous ops.
* fix scale
* Delete unnecessary comments
* fix style
2026-04-09 11:00:13 +08:00
AIbin
48d2bbeb74
fix dsa ( #7252 )
2026-04-08 20:21:38 +08:00
Longzhi Wang
b262419db1
Revert "[Other] support video_fps args for video bench ( #7077 )" ( #7254 )
...
This reverts commit 938e7dd881 .
Co-authored-by: TBD1 <798934910@qq.com >
2026-04-08 20:13:57 +08:00
chenjian
427efadaee
[Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1 ( #7159 )
...
* [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1
* [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1
* fix
2026-04-08 19:30:54 +08:00
Jiajun Ji
9b970de029
[XPU] Add TP broadcast after sampling in XPU model runner to ensure consistent results across ranks. ( #7096 )
2026-04-08 19:26:53 +08:00
3em0
3749457476
[BugFix] fix multimodal hasher hash collision risk when ndarray shape or dtype differs ( #7185 )
...
numpy tobytes() only serializes raw element bytes without encoding shape
or dtype metadata. This means arrays with identical raw bytes but
different shapes (e.g. (6,4) vs (4,6)) or different dtypes (e.g.
float32 vs uint8 reinterpretation of same memory) produce the same
SHA-256 digest, leading to silent cache collisions in
ProcessorCacheManager / EncoderCacheManager / PrefixCacheManager.
Prepend a "{shape}|{dtype}|" header to the byte payload before hashing
so that shape and dtype participate in the digest.
Added test cases for shape and dtype sensitivity.
2026-04-08 04:26:02 -07:00
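The collision described in #7185 is easy to reproduce, and the fix is the header-prepending scheme the commit message describes. A minimal sketch (the function name `hash_ndarray` is illustrative, not the actual FastDeploy helper):

```python
import hashlib

import numpy as np


def hash_ndarray(arr: np.ndarray) -> str:
    # tobytes() serializes only raw element bytes; two arrays with the
    # same bytes but different shapes or dtypes would otherwise collide.
    header = f"{arr.shape}|{arr.dtype}|".encode()
    return hashlib.sha256(header + arr.tobytes()).hexdigest()


a = np.arange(24, dtype=np.float32).reshape(6, 4)
b = a.reshape(4, 6)  # view over the same memory: identical raw bytes

# The old scheme collides: raw bytes alone cannot tell (6,4) from (4,6).
assert hashlib.sha256(a.tobytes()).digest() == hashlib.sha256(b.tobytes()).digest()

# With the shape/dtype header, the digests differ.
assert hash_ndarray(a) != hash_ndarray(b)
```

The same header also separates dtype reinterpretations (e.g. the same buffer viewed as `float32` vs `uint8`), which is the other collision class the commit calls out.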
RichardWooSJTU
771d42c90b
[TBO] Apply tbo to gpu_model_runner ( #7165 )
...
* apply tbo in gpu_model_runner
* fix
2026-04-08 16:55:17 +08:00
guozhuangzhuang
757bafe3bd
[Engine][DataProcessor] fix decode token ( #7102 )
2026-04-08 15:41:32 +08:00
GoldPancake
aa23e0f966
remove arctic_inference deps ( #7231 )
2026-04-08 15:25:14 +08:00
K11OntheBoat
bb48bcbaa2
Split enable_mm ( #7183 )
...
Co-authored-by: liuruian <liuruian@MacBook-Pro.local >
2026-04-08 11:25:41 +08:00
luukunn
8496ec71a6
[DataProcessor] Move image_processor to unified directory and add MultiModalProcessor ( #7109 )
...
* first commit
* step 9~10
* update multimodal
* update multimodal
* fix load tokenizer
* add unit test
* fix unit test & AdaptiveImageProcessor
* Delete unused code
2026-04-08 10:16:27 +08:00
GoldPancake
9d4fd19c3f
[Speculative Decoding] Auto-scale CUDA graph capture sizes for speculative decoding ( #7215 )
2026-04-07 20:22:28 +08:00
lizhenyun01
446b26bbc0
[Feature] support blackwell gemm in ht ( #7053 )
...
* [Feature] support blackwell gemm in ht
* [Feature] support ops for convert
* fix cuda error 716
* fix cuda error
* opt memory
* remove unused code
2026-04-07 19:52:51 +08:00