Commit Graph

2011 Commits

Author SHA1 Message Date
luukunn 3f84d8d893 [DataProcessor] Refactor multimodal processor: extract encoding strategies and unify MM processing pipeline (#7298)
* merge mm processor
2026-04-15 19:01:06 +08:00
luukunn 14d556692b [BugFix] fix tool call parser (#7369)
* fix tool call parser

* add unit test

* fix unit test

* add unit test
2026-04-15 16:21:46 +08:00
AIbin 8eebbcaf15 [BugFix][Scheduler]Fix FD_DISABLE_CHUNKED_PREFILL max_num_batched_tokens limit (#7407)
* fix FD_DISABLE_CHUNKED_PREFILL max_num_batched_tokens=max_model_len

* fix FD_DISABLE_CHUNKED_PREFILL max_num_batched_tokens=max_model_len
2026-04-15 15:55:11 +08:00
周周周 5e54770b2e [Feature] Add latent mode support for the MoE layer (#7382) 2026-04-15 13:57:07 +08:00
lonelygsh f7a2418ce2 [Speculate Decoding] Fix reasoning_phase_token_constraint call args in SpeculativeSampler (#7402) 2026-04-15 12:45:23 +08:00
AIbin 8995a38fa4 fix dsa indexer norm to layernorm (#7398) 2026-04-15 11:42:45 +08:00
AIbin bb30f88f1a [Models] support MLA gate attention (#7404)
* support mla gate attn

* support mla gate attn
2026-04-15 11:42:34 +08:00
chen 616b29ce08 check init_flash_attn_version log (#7399) 2026-04-15 11:05:10 +08:00
sunxin 7b0baced17 fix rl moe gate type (#7393) 2026-04-14 20:04:04 +08:00
Echo-Nie 8819a039c9 [Others] Fix typo (#7280)
* typo

* typo

* typo

* typo
2026-04-14 17:28:22 +08:00
luukunn 9d9d79c457 [DataProcessor] add strict (#7307)
* add strict

* fix
2026-04-14 17:25:38 +08:00
kevin ff47701f31 [BugFix][PD Disaggregation][KVCache] Fix low cache hit rate in PD split scenario (#7364)
## Motivation

In the PD-disaggregated scenario, after the decode node receives a request forwarded by the prefill node,
it does not promptly update the cache block hit information, resulting in a low prefix cache hit rate and degraded inference performance.

## Modifications

1. In the `_free_blocks_when_stop` method, additionally exclude cache block updates on prefill nodes
   (`splitwise_role == "prefill"`), so that prefill nodes do not redundantly update the cache and corrupt its state.
2. After the decode node successfully allocates a request (`_alloc_requests_with_cache`), proactively call
   `update_cache_blocks` with `need_prefill_tokens` to update the cache block information,
   ensuring the decode node correctly tracks the prefix cache blocks it has already hit.
2026-04-14 16:15:43 +08:00
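The cache-bookkeeping change above can be sketched as follows. This is a toy model, not the FastDeploy scheduler: only the names quoted in the commit (`_free_blocks_when_stop`, `_alloc_requests_with_cache`, `update_cache_blocks`, `splitwise_role`, `need_prefill_tokens`) come from the source; the class, its fields, and all logic details are hypothetical illustrations of the two rules.

```python
class PrefixCacheSketch:
    """Hypothetical stand-in for the scheduler's prefix-cache bookkeeping."""

    def __init__(self, splitwise_role: str):
        self.splitwise_role = splitwise_role  # "prefill" or "decode"
        self.cached_tokens = 0  # tokens whose cache blocks are tracked as hit

    def update_cache_blocks(self, num_tokens: int) -> None:
        # Record how many prefix tokens now have cached blocks.
        self.cached_tokens = max(self.cached_tokens, num_tokens)

    def _alloc_requests_with_cache(self, need_prefill_tokens: int) -> None:
        # Rule 2: after a decode node allocates a forwarded request, it
        # proactively records the prefix blocks so later requests hit them.
        if self.splitwise_role == "decode":
            self.update_cache_blocks(need_prefill_tokens)

    def _free_blocks_when_stop(self) -> None:
        # Rule 1: prefill nodes are excluded from the cache update here,
        # avoiding a redundant second update that corrupts cache state.
        if self.splitwise_role != "prefill":
            self.update_cache_blocks(self.cached_tokens)
```

With this sketch, a decode node that allocates a request for 128 prefill tokens ends up with those tokens tracked as cached, while a prefill node performing the same call records nothing.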
xiaoxiaohehe001 abba29b348 [BugFix] fix mm rope (#7274) 2026-04-14 11:36:08 +08:00
zhupengyang 27b00cf385 [XPU] glm-4.5-air (#7071) 2026-04-14 11:31:49 +08:00
Yuanle Liu 0ddb6e461c [Optimization] Remove the upper limit on num_blocks (#7241) 2026-04-13 07:07:41 -07:00
周周周 73bd4ab318 [Feature] Add explicit hidden_size parameter support to FusedMoE (#7361)
2026-04-13 20:24:58 +08:00
freeliuzc 31e2a8bbad [Speculative Decoding] Support mtp super ultra overlap in pd-split mode with insert_task overlap (#7323)
* support mtp overlap in pd-split mode with insert_task overlap
2026-04-13 19:41:17 +08:00
AIbin 1fb8194191 [OP][Models][Optimization] Optimize the RoPE CUDA kernel and update the DeepSeek V3 config (#7359)
* dsk del prefill mask

* dsk support 1M+ seq_len rope

* update rope tests

* Replace max_position_embeddings with max_model_len

* 1D grid: gridDim.x has a maximum size of 2^31-1, far exceeding the actual number of tokens.
2026-04-13 19:12:36 +08:00
周周周 a6f0055d51 add ips check (#7352)
* commit

* commit

---------

Co-authored-by: "liuruian" <liuruian@baidu.com>
2026-04-13 15:24:22 +08:00
liuruyan b34708604c [TI-consistent] support quant use pow2scale (#7308)
* support quant use pow2scale

* fix

* fix
2026-04-13 00:01:53 -07:00
AIbin 6213ad5340 [Docs][BugFix] fix mla log (#7243)
* [Docs] Fix Chinese punctuation issues
2026-04-13 12:15:43 +08:00
Nyako Shigure d659099415 [Cleanup] Replace torch proxy alias with public compat API (#7348) 2026-04-13 11:43:26 +08:00
Jiajun Ji cb03958b52 [XPU] Refactor get_padding_offset to single kernel. (#7029)
* [XPU] Refactor get_padding_offset to single kernel.

* add unittest.

* fix codestyle.

* remove cum_offsets_now.

* remove max_len.
2026-04-13 11:04:50 +08:00
Jiang-Jia-Jun 26d6a20c2f [Optim] Remove IPCLock between CacheManager and WorkerProcess (#7299)
* [Optim] Remove IPCLock between CacheManager and WorkerProcess

* Update envs.py

* Update worker_process.py

---------

Co-authored-by: jiang-jia-jun <jiangjiajun@baidu.com>
2026-04-12 13:59:34 +08:00
周周周 225fc8d222 use self.hidden_size not use self.fd_config.model_config.hidden_size (#7340) 2026-04-11 22:39:43 +08:00
chen 4982aa000e [RL]moe bf16 ep support paddle batch_gemm (#7337)
* moe bf16 ep support paddle batch_gemm
2026-04-11 21:51:12 +08:00
AIbin ba01d7a823 [Optimization] [OP] [Models] dsk del prefill mask (#7313)
* dsk del prefill mask

* dsk support 1M+ seq_len rope

* update rope tests
2026-04-11 19:32:27 +08:00
JYChen 076ab07528 [RL] change glm rope_emb calculation (#7316)
* change glm rope_emb calculation

* glm without EnforceFmulRN

* fix ci
2026-04-11 18:36:28 +08:00
sunxin 00005c92e0 [BugFix] Fix mtp empty run issue in overlap schedule and EP model (#7300) 2026-04-10 03:29:45 -07:00
zhangbo9674 627f0d9cc8 [RL] change rms norm for glm (#7269)
* change rms norm for glm

* refine code

* refine code

* refine code
2026-04-10 01:02:37 -07:00
K11OntheBoat 870dbac370 Use triton qk_norm both in Prefill and Decode (#7213)
Co-authored-by: "liuruian" <liuruian@baidu.com>
2026-04-10 15:44:01 +08:00
bukejiyu 14d46181b8 [Loader] add multi-thread model loading (#6877)
* multi-thread-loader

* fix ut
2026-04-09 23:40:15 -07:00
GoldPancake c1fb3112f8 [FDConfig] Support CLI args for quantization params and add cudagraph validation (#7281)
* refactor quant cli param
2026-04-10 14:13:42 +08:00
lizexu123 613f92ee8f [Feature] support nvfp4 tbo (#7259) 2026-04-09 17:29:39 +08:00
fxyfxy777 39ff38aba1 [OP]Unify MoE op with moe_permute path for bf16 GLM (#7164) 2026-04-09 16:17:56 +08:00
xiaoxiaohehe001 51efe27d76 [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn (#7210)
* [BugFix] fix_flash_mask_attn_sm90

* [BugFix] fix_flash_mask_attn_sm90

* [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn

* [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn
2026-04-09 11:05:10 +08:00
JYChen 43ace7af25 [RL] support moe-topk use topk_reduce_func (#7218)
* support moe-topk use topk_reduce_func

* fix ep error

* fix ut

* fix ut
2026-04-09 11:01:03 +08:00
ShaneGZhu 7005404ce3 [DeepSeekV3.2][Graph Optimization]Remove synchronous operation to avoid capture fail and unnecessary contiguous in DSA Backend (#7253)
* Delete contiguous ops.

* fix scale

* Delete unnecessary comments

* fix style
2026-04-09 11:00:13 +08:00
AIbin 48d2bbeb74 fix dsa (#7252) 2026-04-08 20:21:38 +08:00
Longzhi Wang b262419db1 Revert "[Other] support video_fps args for video bench (#7077)" (#7254)
This reverts commit 938e7dd881.

Co-authored-by: TBD1 <798934910@qq.com>
2026-04-08 20:13:57 +08:00
chenjian 427efadaee [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1 (#7159)
* [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1

* [Feature] Support set PREEMPTED_TOKEN_ID in GET_SAVE_OUTPUT_V1

* fix
2026-04-08 19:30:54 +08:00
Jiajun Ji 9b970de029 [XPU] Add TP broadcast after sampling in XPU model runner to ensure consistent results across ranks. (#7096) 2026-04-08 19:26:53 +08:00
3em0 3749457476 [BugFix] fix multimodal hasher hash collision risk when ndarray shape or dtype differs (#7185)
numpy tobytes() only serializes raw element bytes without encoding shape
or dtype metadata. This means arrays with identical raw bytes but
different shapes (e.g. (6,4) vs (4,6)) or different dtypes (e.g.
float32 vs uint8 reinterpretation of same memory) produce the same
SHA-256 digest, leading to silent cache collisions in
ProcessorCacheManager / EncoderCacheManager / PrefixCacheManager.

Prepend a "{shape}|{dtype}|" header to the byte payload before hashing
so that shape and dtype participate in the digest.

Added test cases for shape and dtype sensitivity.
2026-04-08 04:26:02 -07:00
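The collision described above is easy to reproduce, and the fix follows the commit message directly. A minimal sketch: the `"{shape}|{dtype}|"` header format comes from the commit, but the helper name `hash_ndarray` is a hypothetical illustration, not the actual FastDeploy function.

```python
import hashlib

import numpy as np


def hash_ndarray(arr: np.ndarray) -> str:
    # Prepend a "{shape}|{dtype}|" header so that arrays with identical
    # raw bytes but different shapes or dtypes produce different digests.
    header = f"{arr.shape}|{arr.dtype}|".encode()
    return hashlib.sha256(header + arr.tobytes()).hexdigest()


a = np.zeros((6, 4), dtype=np.float32)
b = np.zeros((4, 6), dtype=np.float32)

# tobytes() serializes only the raw element bytes, so the two arrays
# are indistinguishable to a bare SHA-256 over the payload...
assert a.tobytes() == b.tobytes()
# ...but the header-prefixed digest tells them apart.
assert hash_ndarray(a) != hash_ndarray(b)
```

The same header also separates dtype reinterpretations of identical memory, e.g. 16 `uint8` zeros versus 4 `float32` zeros.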
RichardWooSJTU 771d42c90b [TBO] Apply tbo to gpu_model_runner (#7165)
* apply tbo in gpu_model_runner

* fix
2026-04-08 16:55:17 +08:00
guozhuangzhuang 757bafe3bd [Engine][DataProcessor] fix decode token (#7102) 2026-04-08 15:41:32 +08:00
GoldPancake aa23e0f966 remove arctic_inference deps (#7231) 2026-04-08 15:25:14 +08:00
K11OntheBoat bb48bcbaa2 Split enable_mm (#7183)
Co-authored-by: liuruian <liuruian@MacBook-Pro.local>
2026-04-08 11:25:41 +08:00
luukunn 8496ec71a6 [DataProcessor] Move image_processor to unified directory and add MultiModalProcessor (#7109)
* first commit

* step 9~10

* update multimodal

* update multimodal

* fix load tokenizer

* add unit test

* fix unit test & AdaptiveImageProcessor

* Delete unused code
2026-04-08 10:16:27 +08:00
GoldPancake 9d4fd19c3f [Speculative Decoding] Auto-scale CUDA graph capture sizes for speculative decoding (#7215) 2026-04-07 20:22:28 +08:00
lizhenyun01 446b26bbc0 [Feature] support blackwell gemm in ht (#7053)
* [Feature] support blackwell gemm in ht

* [Feature] support ops for convert

* fix cuda error 716

* fix cuda error

* opt memory

* remove unused code
2026-04-07 19:52:51 +08:00