luukunn
3651113ee5
[DataProcessor]Remove ENABLE_V1_DATA_PROCESSOR ( #7052 )
...
* remove ENABLE_V1_DATA_PROCESSOR
* fix unit test
* fix unit test
2026-04-01 09:53:41 +08:00
qwes5s5
ee2b965f5f
adjust config info ( #7054 )
2026-03-31 21:26:05 +08:00
Yonghua Li
a3cc3aa777
[BugFix] reset exist tasks signal in clear_data ( #7111 )
...
* [BugFix] reset exist tasks signal in clear_data
* [Fix] fix stale exist tasks signal after weight update
* [Chore] downgrade detected new requests log to DEBUG level
* [fix] adjust continue place
2026-03-31 21:24:08 +08:00
周周周
fd44bb7cbf
cpmmot ( #7105 )
...
Co-authored-by: liuruian <liuruian@baidu.com>
2026-03-31 16:13:44 +08:00
cloudforge1
5c5dc66aa7
[CI][Hackathon 10th Spring No.34] Add async_expert_loader unit tests ( #6731 )
...
* [CI][Hackathon 10th Spring No.34] Add async_expert_loader unit tests
* [CI][Hackathon 10th Spring No.34] Add async_expert_loader unit tests
---------
Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-03-31 15:29:35 +08:00
YilongGuo
dd61e7e421
[Qwen3VL] Add clear_grpah_opt_backend method to Qwen3VLForConditionalGeneration ( #7086 )
...
Add clear_grpah_opt_backend method that delegates to the underlying model
to clear the CUDA graph optimization backend.
Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com>
2026-03-31 13:48:25 +08:00
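The delegation described in the commit above can be sketched as follows. This is a minimal, hypothetical illustration: the wrapper constructor and inner-model attribute are assumptions; only the method name `clear_grpah_opt_backend` (spelling as in the PR title) comes from the commit.

```python
# Minimal sketch of the delegation pattern described in the commit above.
# InnerModel and the .model attribute are illustrative, not FastDeploy's API.

class InnerModel:
    def __init__(self):
        self.graph_backend_cleared = False

    def clear_grpah_opt_backend(self):
        # The underlying model owns the CUDA graph optimization backend.
        self.graph_backend_cleared = True


class Qwen3VLForConditionalGeneration:
    def __init__(self, model):
        self.model = model  # underlying language model

    def clear_grpah_opt_backend(self):
        # Delegate so callers can clear the backend via the VL wrapper.
        self.model.clear_grpah_opt_backend()
```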
YuBaoku
db6e637f4f
[CI] Remove skip logic for *.txt-only changes ( #7104 )
2026-03-31 13:24:50 +08:00
huicongyao
dd2aa10ed4
fix cuda graph capture failure in CI test ( #7094 )
2026-03-31 11:05:51 +08:00
qwes5s5
daa95244f7
abort requests ( #6992 )
2026-03-31 11:02:26 +08:00
Yonghua Li
6d9739f360
[BugFix] fix speculative gauge metrics in multi api server ( #7082 )
2026-03-31 10:52:50 +08:00
chenjian
6727df8286
[Optimization] Optimize ttft for prefill pd ( #6680 )
...
* optimize ttft
* fix
* fix
* fix ci
* fix ci
* fix
* fix bug
* fix
* add comments
* fix ci
* fix
* fix ci
* fix format
* update according to review
* add comment
* fix
* fix format
2026-03-30 20:36:23 +08:00
jackyYang6
05f2d95729
[RL] Adapt async rollout checkpoint update flow ( #7042 )
...
* update checkpoint-transfer flow and control update_weights params
* test: add update_weights route validation
2026-03-30 19:19:34 +08:00
yzwu
8789329457
[Iluvatar] Support wi4a16 group_gemm ( #7078 )
2026-03-30 19:03:51 +08:00
kevin
18062c55bb
[BugFix][KVCache] Fix mm hash boundary comparison in get_block_hash_extra_keys ( #6929 )
...
* [BugFix][KVCache] Fix mm hash boundary comparison in get_block_hash_extra_keys
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [BugFix][KVCache] Fix test_get_block_hash_extra_keys_boundary_cases assertions
## Motivation
In the test case `test_get_block_hash_extra_keys_boundary_cases`, the call for
block [4,8) incorrectly passed `mm_idx=1`, skipping img0 [2,5); but img0 covers
token 4, and token 4 belongs to block [4,8), so it should be included in
hash_keys. In addition, every assertEqual only checked hash_keys and never
checked the returned mm_idx cursor.
## Modifications
- `test_get_block_hash_extra_keys_boundary_cases`:
- Switched to chained calls that feed the mm_idx returned by the previous call into the next one, simulating the real call loop
- The block [4,8) call now reuses the previously returned `mm_idx=0` instead of `mm_idx=1`, and the expected value changes from `[]` to `["hash-0"]`
- All assertions changed to `assertEqual((mm_idx, hash_keys), (...))` so the cursor is verified as well
- `test_get_block_hash_extra_keys_no_overlap_at_boundaries`:
- Case B input changed from `mm_idx=1` to `mm_idx=0` (traverse from the start; img-a takes the continue branch)
- All assertions now also verify mm_idx
- `test_get_block_hash_extra_keys_image_crosses_block_boundary`:
- All assertions now also verify mm_idx
- `test_get_block_hash_extra_keys_no_mm_inputs`:
- The assertion now also verifies mm_idx
- `test_get_block_hash_extra_keys_handles_multimodal_segments`:
- call2 and call3 assertions now also verify mm_idx
## Usage or Command
```bash
python -m pytest tests/cache_manager/test_prefix_cache_manager.py::TestPrefixCacheManagerCoverage -v -k "get_block_hash_extra_keys"
```
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: chengyanfu <chengyanfu@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 17:13:31 +08:00
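The chained-cursor pattern that PR #6929 describes can be sketched as below. This is a hypothetical, simplified stand-in: the real `get_block_hash_extra_keys` lives in FastDeploy's prefix cache manager, and its signature and data layout may differ; the span tuples and names here are illustrative only.

```python
# Hypothetical sketch of the cursor-chaining behavior described above.
# mm_inputs: list of (start, end, hash) multimodal token spans, sorted by start.
# mm_idx: cursor into mm_inputs left by the previous call, fed into the next.

def get_block_hash_extra_keys(block_start, block_end, mm_inputs, mm_idx):
    """Return (new_mm_idx, hash_keys) for spans overlapping [block_start, block_end)."""
    hash_keys = []
    while mm_idx < len(mm_inputs):
        start, end, h = mm_inputs[mm_idx]
        if end <= block_start:       # span entirely before this block: skip it
            mm_idx += 1
            continue
        if start >= block_end:       # span starts after this block: stop
            break
        hash_keys.append(h)          # span overlaps the block
        if end <= block_end:         # span fully consumed by this block
            mm_idx += 1
        else:                        # span crosses the block boundary: keep cursor
            break
    return mm_idx, hash_keys

# Chained calls, as in the fixed tests: img0 covers tokens [2, 5), so it
# overlaps both block [0, 4) and block [4, 8); the cursor from the first call
# is reused for the second instead of being hard-coded.
mm = [(2, 5, "hash-0")]
idx, keys = get_block_hash_extra_keys(0, 4, mm, 0)
idx, keys = get_block_hash_extra_keys(4, 8, mm, idx)
```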
周周周
76cf5e9496
[append attention] clean code ( #7062 )
2026-03-30 15:07:53 +08:00
luukunn
b9f8873367
[Optimization]Merge Text processor ( #7030 )
...
* merge text processor
* update
* fix unit test
* merge messages2ids
* fix unit test
* remove duplicate code
* remove redundant code
* delete code
* fix unit test
2026-03-30 15:02:35 +08:00
Jiang-Jia-Jun
1670b011a5
Revert "[BugFix] Add lock to avoid generating nan when using storage cache (#…" ( #7075 )
...
This reverts commit 6d2ab8f2c0.
2026-03-30 14:52:05 +08:00
jc
6d2ab8f2c0
[BugFix] Add lock to avoid generating nan when using storage cache ( #7046 )
...
* Add lock to avoid generating nan
* up
2026-03-30 14:50:32 +08:00
zhangbo9674
5c60e2fc6f
fix bug in cudagraph ( #7069 )
2026-03-30 14:24:23 +08:00
mpgemm
1a1d048774
[Feature] Support NVFP4 Flashinfer-cutedsl MoE on SM100 ( #6963 )
2026-03-30 11:37:04 +08:00
mouxin
61a9079c60
[Feature] Update logging ( #7072 )
2026-03-30 11:20:27 +08:00
Longzhi Wang
2eea6fa97a
[BugFix] Fix kv cache int8 dynamic quant on flash and flash_mask backend ( #7028 )
...
* [BugFix] Fix kv cache int8 dynamic quant on flash and flash_mask backend
* add constexpr and code style clean
* add test
* fix code style
* fix test
2026-03-30 11:17:15 +08:00
mpgemm
7a20eaebe8
[Feature] Support cute cpp Encoder FA4 ( #7016 )
...
* add cute cpp fa4
* remove comments
* fix merge errors
* move sm_version inside the function
* fix CI errors
2026-03-30 10:54:56 +08:00
kevin
9765fa7313
[Refactor] Replace --skip-mm-profiling with --deploy-modality text ( #7048 )
...
* [Feature] Support --skip-mm-profiling to skip multimodal token overhead in profiling
## Motivation
When deploying multimodal models (e.g. Qwen2.5-VL, ERNIE4.5-VL),
`get_max_chunk_tokens` adds the mm token count on top of the base token count
to reserve GPU memory during the profiling phase. In some scenarios (e.g. the
image token count is known to be small, or memory needs to be saved), users
want to skip this extra multimodal token overhead and profile with the text
token count alone.
## Modifications
- `fastdeploy/engine/args_utils.py`: add a `skip_mm_profiling: bool = False`
field to `EngineArgs` and a `--skip-mm-profiling` launch argument to the parser
- `fastdeploy/config.py`: add `self.skip_mm_profiling = False` to
`ModelConfig.__init__`; add a `not self.model_config.skip_mm_profiling` check
in `FDConfig.get_max_chunk_tokens` so that, when enabled, the mm token overhead
is skipped and the base `num_tokens` is returned directly
## Usage or Command
Add the argument when launching the service:
```bash
--skip-mm-profiling
```
## Checklist
- [x] Add at least a tag in the PR title.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. This feature is a simple config-parameter pass-through, already covered by existing config unit tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* [Refactor] Replace skip_mm_profiling with deploy_modality=text to skip mm profiling
## Motivation
The original `--skip-mm-profiling` argument semantically overlaps with the
existing `deploy_modality` argument: when deploying in text-only mode
(`deploy_modality=text`), there is no need to reserve memory for multimodal
tokens in the first place. A separate argument adds configuration complexity;
reusing `deploy_modality` is more intuitive and consistent.
## Modifications
- `fastdeploy/engine/args_utils.py`: remove the `EngineArgs.skip_mm_profiling`
field and the `--skip-mm-profiling` launch argument
- `fastdeploy/config.py`: remove `self.skip_mm_profiling = False` from
`ModelConfig.__init__`; change the condition in `FDConfig.get_max_chunk_tokens`
to `self.deploy_modality != DeployModality.TEXT`, so that when deploy_modality
is text it returns `max_num_batched_tokens` directly, skipping the mm token
overhead
## Usage or Command
```bash
# Deploy in text mode to skip the mm token profiling overhead (replaces --skip-mm-profiling)
python -m fastdeploy.entrypoints.openai.api_server \
--deploy-modality text \
--model /path/to/model \
...
```
## Checklist
- [x] Add at least a tag in the PR title.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. This change is a parameter refactor with logically equivalent behavior, already covered by existing config unit tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-29 19:40:27 -07:00
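The condition change that PR #7048 describes can be sketched as a small standalone function. This is an illustrative sketch, not FastDeploy's implementation: `DeployModality` and `get_max_chunk_tokens` mirror the names in the PR body, but the signature and the enum values here are assumptions.

```python
# Illustrative sketch of the refactored get_max_chunk_tokens condition:
# text-only deployments skip the multimodal token overhead entirely.
# Names mirror the PR description; the signature is an assumption.
from enum import Enum


class DeployModality(Enum):
    TEXT = "text"
    MULTIMODAL = "multimodal"


def get_max_chunk_tokens(max_num_batched_tokens, mm_tokens, deploy_modality):
    # Text mode: no memory is reserved for multimodal tokens during profiling.
    if deploy_modality == DeployModality.TEXT:
        return max_num_batched_tokens
    # Otherwise add the mm token overhead on top of the base token count.
    return max_num_batched_tokens + mm_tokens
```

Reusing `deploy_modality` instead of a separate `--skip-mm-profiling` flag keeps one knob for one concept: a text-only deployment implies there are no mm tokens to profile.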
YuBaoku
a7cbe3ff91
[CI] Adapt to codecov action changes for Node.js 24 ( #7064 )
2026-03-29 16:49:44 +08:00
YuBaoku
842c60809a
[CI] Align with Paddle layer_norm kernel update ( #7056 )
2026-03-27 22:58:01 +08:00
Zhang Yulong
f25760f4e6
[CI] Update docker run command in unit test coverage workflow ( #7050 )
...
Removed the --ipc=host option from the docker run command.
2026-03-27 19:53:09 +08:00
cmcamdy
bf8e9bf81d
[XPU] Fix speculate schedule ( #7049 )
...
* [BugFix] xpu fix speculate schedule cache kernel
* fix code style
2026-03-27 18:28:17 +08:00
cloudforge1
11ad95ba91
[CI][Hackathon 10th Spring No.43] Add ernie4_5_mtp unit tests ( #6738 )
...
* [CI][Hackathon 10th Spring No.43] Add ernie4_5_mtp unit tests
* [CI][Hackathon 10th Spring No.43] add mapping and forward branch coverage
---------
Co-authored-by: cloudforge1 <cloudforge1@users.noreply.github.com>
Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com>
Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
2026-03-27 17:15:53 +08:00
fxyfxy777
8ff8236a6f
[Optimization] optimize fused_swiglu_fp8_quant_kernel ( #7007 )
...
* use sharemem
* B card test
* fix acc error
2026-03-27 16:10:16 +08:00
GoldPancake
6693bcd0e4
[BugFix] fix clear_parameters in draft cudagraph ( #7035 )
2026-03-27 15:28:50 +08:00
mouxin
6c24f1955c
[Feature] Update error logging ( #7045 )
2026-03-27 15:13:12 +08:00
YuBaoku
10c59f78d6
[CI] disable tests/e2e/test_Qwen3VLMoe_serving.py in unit_test ( #7044 )
2026-03-27 14:15:14 +08:00
Jiaxin Sui
c3ed7db28d
[XPU] [CI] Fix xpu ci bug ( #7014 )
...
* fix xpu ci bug
* Remove unnecessary blank line in conftest.py
* Update upload-artifact action to version 6
* Update _xpu_8cards_case_test.yml
* fix ci bug
* Change exit code on test failure to 1
* fix ci bug
* fix ci bug
* fix ci bug
* fix ci bug
* Update conftest.py
2026-03-27 10:29:34 +08:00
Zhang Yulong
a31d4bfbdf
[CI] update mtp case ( #7031 )
2026-03-27 10:21:37 +08:00
luukunn
14b17c06af
add completion_tokens default ( #7032 )
2026-03-26 21:06:23 +08:00
xiegegege
209e5cf7f4
[CE]add 21b mooncake yaml ( #7033 )
...
* [CE]add 21b cpu cache ,glm mtp,glm for rl config
* [CE]add 21b tp2 yaml
* [CE]add 21b mooncake yaml
* add fastdeploy benchmark, paddletest-155
* [CE] adjust vl wint4 config
* [CE]add glm mtp with updatemodel config
* [CE]fix
* fix
* test
* test
* test
---------
Co-authored-by: xiegegege <>
2026-03-26 20:01:05 +08:00
Yonghua Li
442514252c
[fix] remove all gather ep group control requests in normal cases ( #7026 )
2026-03-26 18:41:29 +08:00
Dangweichong
3c9fd818e3
[BugFix] Fix RDMA initializes failed ( #7025 )
2026-03-26 17:45:39 +08:00
huicongyao
25d64efdc4
[Speculative Decoding] Refactor Eagle MTP hidden states copy ( #6812 )
...
* reformat eagle_get_hidden_states & eagle_get_self_hidden_states
* readability
* fix xpu bug
* fix coverage failure
* change launch params & parallelize position_map compute
* Fix MTP-related bugs in FastDeploy centralized inference
* fix
* refactor mtp hidden_states process
* fix
* add unittest & optimize kernel
* remove useless code
* fix
2026-03-25 22:54:31 -07:00
freeliuzc
4fd877ed43
[Speculative Decoding] Support mtp expert-parallel and support different modality deploy ( #7018 )
...
* support mtp ep and support different modality
* fix default arg
2026-03-26 13:52:16 +08:00
YuBaoku
61ebac49ef
[CI] Fix test_communication.py and add port cleanup ( #7021 )
2026-03-26 10:56:40 +08:00
luukunn
e6804ba97d
[Optimization]Streaming requests return complete special tokens. ( #6998 )
...
* return special token
* add completions
* update
* fix
* add prompt_token_ids& completion_token_ids=None,
* fix unit test
2026-03-26 09:49:43 +08:00
luukunn
d5cb2767d7
[Optimization] Deduplicate shared image/video utilities across VL processors ( #6988 )
...
* step1~3
* fix import path
* remove duplicate code
* remove duplicate code
* remove duplicate code
* fix import path
* update
* fix import path
* add unit test
* fix
* update
* fix unit test
2026-03-26 09:49:33 +08:00
chen
1502b6f43e
add instantiations for decoder rope enfore_fmul_rn=true ( #7009 )
2026-03-25 22:22:10 +08:00
Jiang-Jia-Jun
482f951ee9
Update copilot-instructions.md
2026-03-25 21:09:24 +08:00
YuBaoku
b8bb34c7dd
[CI] disable tests/distributed/test_communication.py in unit_test ( #7019 )
2026-03-25 20:54:55 +08:00
Yonghua Li
a7f52c300d
[Feature] support v1 update/clear api for RL ( #6761 )
...
* [Feature] support v1 update/clear api for RL
* [fix] fix execute_model and add sleep/wakeup api
* [fix] fix mtp and key_prefix
* [chore] move _update_key_prefix to resume method
* [fix] make the interface safe to call multiple times
* [fix] fix some tiny bugs
* [chore] make small changes against pr review
* [docs] add docs for weight update
* [test] add some tests and update docs
* [style] fix code style check
* [test] fix ci
* [fix] fix stale control responses when control method timed out
* [chore] remove unused code
* [chore] fix code style
* [chore] optimize tags and key_prefix
* [test] fix ci
* [chore] fix code style
* [test] fix ci
* [fix] fix ep control
* [fix] fix ep control for engine cache queue
2026-03-25 19:18:46 +08:00
gongweibao
48cfb608aa
[FDConfig] Reduce FD_CUSTOM_AR_MAX_SIZE_MB default from 64 to 8 ( #6997 )
...
Most single-GPU and small-model deployments do not need 64MB custom
all-reduce buffers. Lowering the default to 8MB reduces unnecessary
shared memory allocation. Tests that require larger buffers now
explicitly set the value.
Co-authored-by: gongweibao <gognweibao@baidu.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-25 17:40:01 +08:00
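The tunable described in the commit above follows the usual environment-variable-with-default pattern: a lowered default (8 MB) that deployments and tests needing larger buffers can override. A minimal sketch, assuming the variable is read as an integer megabyte count (the helper function name here is hypothetical):

```python
# Illustrative sketch: read FD_CUSTOM_AR_MAX_SIZE_MB with the new 8 MB
# default. The helper name is hypothetical; only the variable name and the
# default value come from the commit above.
import os


def custom_ar_max_size_mb():
    # Deployments or tests that need larger all-reduce buffers set the
    # variable explicitly; everyone else gets the smaller 8 MB default.
    return int(os.getenv("FD_CUSTOM_AR_MAX_SIZE_MB", "8"))
```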
freeliuzc
7a6c28781b
[Speculative Decoding] Optimize attn_mask_offset and fix mtp bug ( #7005 )
...
* optimize attn_mask_offset and optimize mtp usage
* delete useless branch
* fix kernel format
* fix kernel runner
2026-03-25 01:52:06 -07:00