FastDeploy

mirror of https://github.com/PaddlePaddle/FastDeploy.git synced 2026-04-23 00:17:25 +08:00

Author	SHA1	Message	Date
RAM	d8cdda86cb	[RL][Cherry-Pick] Fix the out-of-bounds issue caused by int32 in the R3 kernel (#7155 ) * [RL]Perf: Optimize batch delete prefix and fused put in R3 (#6604) * Optimizate delete batch and fused put * refine code * refine code * refine code * Support suspend r3 * [RL] Fix R3 Empty bug with TP=1 (#6777) * Fix int32 overflow * refine code --------- Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>	2026-04-21 01:51:09 -07:00
zhouchong	4c6364eb53	Unify num_experts_per_tok to moe_k in ModelConfig for MoE model compatibility (#7509 ) Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com>	2026-04-21 15:20:30 +08:00
kevin	7707be8384	[Feature][KVCache] Implement Cache Manager V1 with GPU + CPU Cache Support (1/n) (#7097 ) * [Feature][KVCache] Support cache manager v1 architecture Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Update cache manager and related modules Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: update cache_manager and related modules Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: add node to evictable set in complete_swap_to_device When a node transitions from SWAP_TO_DEVICE to DEVICE via complete_swap_to_device, it was not being added to the _evictable_device set. This caused nodes with ref_count=0 to become "orphaned" - not appearing in any evictable set despite having cache_status=DEVICE. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: update cache manager v1 and related modules - Add new cache_manager.py with cache management functionality - Add radix_tree.py for prefix caching - Update block_pool.py and metadata.py - Update request.py and resource_manager_v1.py for scheduling - Update gpu_model_runner.py for GPU model execution Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(cache): add cache controller v1 implementation - Add CacheController class for cache management - Update config.py with cache related configurations - Refactor gpu_model_runner.py for improved cache handling * feat(cache_manager): update cache manager v1 * fix(cache_manager): 修复 swap_cache H2D/D2H 方向的 block_ids 逻辑并清理 ForwardMeta ## Motivation 修复 swap_cache_optimized.cu 中 H2D 方向时 src/dst block_ids 使用错误的问题，并清理 ForwardMeta 中已废弃的 cache_controller 字段。 ## Modifications - fix: swap_cache_optimized.cu 中根据 D2H 模板参数正确选取 src/dst block_ids，修复 H2D 方向 src/dst 倒置 bug（同时修复 SwapCachePerLayerImpl 和 SwapCacheAllLayersBatchImpl） - refactor: cache_manager/v1/__init__.py 将 LayerSwapTimeoutError 导入从 cache_controller 改为 cache_utils（正确来源） - refactor: ForwardMeta 移除废弃的 cache_controller 字段 - refactor: gpu_model_runner.py 移除对应的 cache_controller 赋值语句 - test: 新增 tests/cache_manager/v1/test_swap_cache_ops.py 单元测试 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat(cache_manager): refactor cache manager v1 and optimize swap ops ## Motivation 对 cache manager v1 进行重构和优化，精简代码结构，提升可维护性。 ## Modifications - 重构 transfer_manager.py，大幅精简代码逻辑 - 优化 swap_cache_optimized.cu GPU 算子实现 - 调整 cache_manager.py、cache_controller.py 逻辑，修复 free_device_blocks 方法缺失问题 - 更新 block_pool.py、cache_utils.py、metadata.py、radix_tree.py - 精简 gpu_model_runner.py、forward_meta.py、attention.py 中相关调用 - 更新对应单元测试（test_cache_controller、test_swap_cache_ops、test_transfer_manager） - 调整 config.py 中相关配置项 * [KVCache][MTP] 支持 cache_manager_v1 下的 MTP KV Cache 初始化及多模态 hash ## Motivation 在 enable_cache_manager_v1 路径下，MTP（speculative decode）的 KV Cache 需要由 CacheController 统一管理，以复用 swap/transfer 能力，同时修复多模态场景下 block hash 未携带 multimodal extra_keys 的问题。 ## Modifications - `cache_controller.py` - 新增 `initialize_mtp_kv_cache`：通过 CacheController 初始化 MTP KV Cache，并将其注册到 cache_kvs_map，使 transfer_manager 自动覆盖 MTP 层 - `initialize_host_cache` 中的 num_layers 改为包含 MTP 额外 cache 层数，保证 Host Cache 也为 MTP 分配足够空间 - `_free_gpu_cache` 改名为 `free_gpu_cache`（对外可调用） - `cache_utils.py` - 新增 `get_block_hash_extra_keys`：提取单个 block 内的多模态 hash 信息，对齐 PrefixCacheManager 的 multimodal extra_keys 逻辑 - `get_request_block_hasher` 中在 hash_block_tokens 时携带 extra_keys，修复多模态场景 prefix cache 命中率不准的问题 - `spec_decode/mtp.py` - `update_mtp_block_num` 新增 `skip_cache_init` 参数，避免 v1 cache manager 路径下重复初始化 MTP KV Cache - `gpu_model_runner.py` - `initialize_kv_cache(v1)` 路径：在主模型 cache 初始化后，调用 `cache_controller.initialize_mtp_kv_cache` 完成 MTP cache 创建 - `clear_cache` / `wakeup` / `reset` 等路径：respect `enable_cache_manager_v1` 标志，跳过重复的 proposer.initialize_kv_cache 调用 ## Usage or Command ```bash # 启动支持 MTP + cache_manager_v1 的推理服务（示例） bash run.sh ``` * fix(cache_manager): multi-GPU fix, mm hash boundary fix, and remove batch ops 1. Fix CuPy stream/event creation for multi-GPU: wrap all stream operations with cp.cuda.Device(device_id) context to ensure streams/events are bound to the correct device, preventing cross-device errors in multi-GPU setups. 2. Remove cudaSetDevice from SwapCacheAllLayers (handled by cupy context now). 3. Remove swap_cache_all_layers_batch op: simplified the implementation by removing the batch upload variant; all-layer transfers now use the standard swap_cache_all_layers with cupy device context. 4. Fix mm hash boundary comparison in get_block_hash_extra_keys: change strict less-than (<) to less-than-or-equal (<=) so that multimodal items ending exactly at block start are correctly excluded. 5. Extract config fields to KVCacheBase: model_config, cache_config, quant_config, parallel_config are now set in the base class __init__ to avoid duplication in CacheController and CacheManager subclasses. 6. Translate metadata.py docstrings from Chinese to English for broader contributor accessibility. 7. Add test_cache_utils.py: comprehensive unit tests for get_block_hash_extra_keys covering all boundary and overlap scenarios. 8. Expand test suite: test_request.py cache fields tests, test_radix_tree.py backup candidate tests, test_transfer_manager.py and test_cache_manager.py multi-GPU and concurrent operation tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [BugFix][KVCache] fix List import and move write_policy normalization to CacheManager ## Motivation 修复两处问题： 1. `fastdeploy/engine/request.py` 中 `List` 未导入导致 pre-commit F821 报错 2. `write_policy` 归一化逻辑（`write_through` → `write_through_selective`）不应放在 `FDConfig`，移至 `CacheManager.__init__` 中，使其只影响 Cache Manager V1 的内部逻辑 ## Modifications - `fastdeploy/engine/request.py`: 在 `typing` 导入中补充 `List`，删除重复的 `CacheSwapMetadata` TYPE_CHECKING 导入，修复 F821/F811 - `fastdeploy/config.py`: 删除 `write_policy` 归一化逻辑 - `fastdeploy/cache_manager/v1/cache_manager.py`: 将归一化逻辑移入 `CacheManager.__init__` Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [BugFix][KVCache] fix pre-commit code style issues ## Motivation 修复 CI pre-commit 代码风格检查失败问题。 ## Modifications - `fastdeploy/engine/common_engine.py`: black 格式化 - `fastdeploy/worker/worker_process.py`: black 格式化 + isort 修复 - `fastdeploy/cache_manager/v1/storage/__init__.py`: isort 修复 - `fastdeploy/worker/gpu_worker.py`: isort 修复 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [Feature][KVCache] update cache_manager_v1 modules ## Motivation 更新 Cache Manager V1 相关模块，完善版权信息、改进模块结构与可维护性。 ## Modifications - `fastdeploy/cache_manager/v1/` 系列模块：补充版权 header，优化代码结构 - `fastdeploy/config.py`：配置项更新 - `fastdeploy/engine/sched/resource_manager_v1.py`：调度相关更新 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [Feature][KVCache] add BatchRequest.from_tasks and refactor worker task parsing ## Motivation 将 worker_process 中重复的 task 解析逻辑收敛到 BatchRequest，减少代码冗余，提升可维护性。 ## Modifications - `fastdeploy/engine/request.py`：新增 `BatchRequest.from_tasks()` 类方法，统一将 task_queue 任务分类为推理请求和控制请求 - `fastdeploy/worker/worker_process.py`：使用 `BatchRequest.from_tasks()` 替代内联解析逻辑，并修复重复的 control_reqs 处理块 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [Feature][KVCache] add NUMA affinity for host cache and skip swap cache tests ## Motivation 优化 Host cache 内存分配的 NUMA 亲和性，减少跨 NUMA 访问延迟；同时跳过 swap cache ops 测试（当前环境不支持）。 ## Modifications - `fastdeploy/cache_manager/v1/cache_controller.py`： - 新增 `_get_numa_node_for_gpu()` 方法，通过 nvidia-smi 或 sysfs 获取 GPU 对应的 NUMA 节点 - 新增 `_bind_to_closest_numa_node()` 方法，绑定当前线程到 GPU 最近的 NUMA 节点 - 在 `initialize_host_cache()` 中调用 NUMA 绑定，优化 H2D 传输性能 - `tests/cache_manager/v1/test_swap_cache_ops.py`：跳过所有测试类（`TestSwapCacheAllLayersCorrectness`、`TestSwapCacheAllLayersPerformance`、`TestSwapCacheRandomBlockIndices`） Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [BugFix][KVCache] fix unittest failures for cache_manager_v1 三个单测因接口变更或 Mock 方式问题导致失败，需修复。 - tests/distributed/chunked_moe.py：`setup_model_runner` 使用 `__new__` 跳过 `__init__`，补加 `enable_cache_manager_v1 = False`，修复 `AttributeError` - tests/engine/test_resource_manager.py：`PrefixCacheManager` 为局部导入，`patch` 路径改为定义位置 `fastdeploy.cache_manager.prefix_cache_manager.PrefixCacheManager` - tests/v1/test_resource_manager_v1.py：`_trigger_preempt` 第四参数已由 `list` 改为 `BatchRequest`，更新测试传参和断言 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [BugFix][KVCache] remove debug logging code ## Modifications - fastdeploy/engine/request.py：删除调试用 logger 及 prompt_hashes 中的 debug 日志 - fastdeploy/worker/worker_process.py：删除 __main__ 中的调试 import 和 print 语句 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [BugFix][KVCache] fix cupy device id caching and pickle for _match_result ## Motivation 修复两个 bug： 1. `transfer_manager.py` 中每次调用 `cp.cuda.runtime.getDevice()` 存在隐患，应在初始化时缓存为实例变量，保证后续操作使用一致的设备 ID。 2. `request.py` 的 `__getstate__` 未跳过 `_match_result`，该字段包含 BlockNode 树的父子循环引用，pickle 时会触发 `RecursionError`；同时补充 `__setstate__` 确保 unpickle 后字段恢复为安全默认值。 ## Modifications - `transfer_manager.py`：初始化时调用 `cp.cuda.runtime.getDevice()` 并缓存到 `self._cupy_device_id`，后续 `with cp.cuda.Device(...)` 和日志均使用该缓存值。 - `request.py`： - `__getstate__` 中将 `_match_result` 加入跳过集合 `_SKIP_KEYS`，避免循环引用导致 pickle 失败。 - 新增 `__setstate__`，unpickle 后将 `_block_hasher` 和 `_match_result` 恢复为 `None`。 ## Usage or Command * fix(test): fix unit test errors for _trigger_preempt and wakeup with MTP - Use BatchRequest instead of list in test_trigger_preempt_records_tasks - Add missing enable_cache_manager_v1 attr in TestSleepWakeupBehavior._make_runner Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [BugFix][KVCache] fix gpu_free_block_list returning wrong block IDs ## Motivation `gpu_free_block_list` 的兼容 property 中误用了 `list(range(N))`，将 `available_blocks()` 的返回值当作整数传给 `range()`，导致返回 `[0, 1, ..., N-1]` 的假列表，而非真实的空闲 block ID。 ## Modifications - `cache_manager/v1/cache_manager.py`：将 `list(range(self._device_pool.available_blocks()))` 改为 `list(self._device_pool.available_blocks())` Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [BugFix][KVCache] 修复 gpu_free_block_list 返回 int 导致 TypeError ## Motivation gpu_free_block_list 属性中调用 BlockPool.available_blocks()，该方法返回 int（空闲块数量），用 list() 包装 int 会触发 TypeError: 'int' object is not iterable。 ## Modifications 将 list(self._device_pool.available_blocks()) 改为 list(self._device_pool._free_blocks)，直接返回空闲块索引列表。 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [KVCache][CacheManager] 适配 V1 CacheManager 的 pause/sleep/free_cache 操作 ## Motivation V1 CacheManager 引入了新的 reset_cache() 接口，pause 和 sleep 操作需要适配，同时 free_cache 需要支持可选的 clear_storage 参数。 ## Modifications - cache_controller.py: free_cache 新增 clear_storage 参数（默认 False），仅当 clear_storage=True 时才调用 _clear_storage()，避免不必要的 storage 清空 - common_engine.py: pause 和 sleep 操作中，当 ENABLE_V1_KVCACHE_MANAGER 时使用 cache_manager.reset_cache() 替代旧的 reset() 和 pause_transfer 逻辑 - gpu_model_runner.py: sleep 时仅在非 V1 cache manager 下执行 MTP cache 清除 ## Usage or Command # 启动服务（V1 CacheManager） python -m fastdeploy.entrypoints.openai.api_server \ --enable-v1-kvcache-manager \ ... * [BugFix][KVCache] fix missing enable_cache_manager_v1 in test mocks and remove unused select_blocks_for_backup - Remove unused `select_blocks_for_backup` method from radix_tree.py - Fix `match_prefix` default param `skip_storage=True` and log order in cache_manager.py - Sync test_gpu_model_runner.py with upstream/develop (add TestInsertTasksV1SplitwiseSuffix) - Add `enable_cache_manager_v1=False` to all mock runners to fix AttributeError in CI Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [BugFix][KVCache] simplify _free_blocks in ResourceManagerV1 for non-v1 path Remove redundant prefix_caching branch in else path; always call recycle_gpu_blocks with full block_tables for non-cache-manager-v1 case. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [KVCache][Optimization][BugFix] fix and optimize block_pool, cache_manager, transfer_manager, request ## Motivation 修复 cache_manager v1 中若干代码质量问题，提升性能并消除潜在的类型不一致 Bug。 ## Modifications 1. block_pool.py：`BlockPool.allocate` 将逐个 pop 循环替换为切片 + 批量 set.update，消除 Python 循环开销，O(n) → O(k)（C 层批量操作） 2. cache_manager.py：`match_prefix` 在 prefix caching 关闭时提前 return 前写入空 `MatchResult()`，避免调用方解引用 `_match_result=None` 崩溃 3. transfer_manager.py：`_build_device_layer_indices` 在 `_cache_kvs_map` 为空时也重置四个层索引列表，防止残留旧 tensor 被 swap 算子使用 4. request.py：`BatchRequest.append_swap_metadata` / `append_evict_metadata` 构造 `CacheSwapMetadata` 时将 `src_type`/`dst_type` 从字符串改为 `CacheLevel` 枚举，与字段类型声明一致；补充 `CacheLevel` 导入；`match_result` 属性返回类型标注修正为 `Optional[MatchResult]` 5. resource_manager_v1.py：`_allocate_gpu_blocks` 日志从 `INFO` 降级为 `DEBUG`，消除高频调度路径的日志噪音 6. tests/engine/test_request.py：同步更新 `src_type`/`dst_type` 断言为 `CacheLevel` 枚举值，补充 `CacheLevel` 导入 ## Usage or Command 单元测试： ```bash source .venv/py310/bin/activate cd baidu/FastDeploy python -m pytest tests/cache_manager/v1/test_cache_manager.py -v python -m pytest tests/cache_manager/v1/test_transfer_manager.py -v python -m pytest tests/engine/test_request.py -v ``` * [BugFix][KVCache] Fix BlockPool.allocate returns all blocks when num_blocks=0 ## Motivation 当 `allocate(num_blocks=0)` 被调用时，Python 负索引陷阱导致严重错误： `-0 == 0`，所以 `self._free_blocks[-0:]` 等价于 `self._free_blocks[0:]`，会返回并清空整个空闲块列表，而非返回空列表。 ## Modifications 在 `BlockPool.allocate` 中增加对 `num_blocks == 0` 的提前判断，直接返回 `[]`，避免触发 Python 负索引陷阱。 ## Usage or Command ```bash # 运行相关单元测试验证修复 python -m pytest tests/cache_manager/v1/test_cache_manager.py -vv -s ``` * [KVCache][Test] add unit tests for cache_manager v1 modules ## Motivation 补全 cache_manager/v1 各模块的单测覆盖，确保核心方法有完整的测试保障。 ## Modifications 新增/补充以下测试文件，全部 326 个用例通过： - tests/cache_manager/v1/test_block_pool.py（新建）覆盖 BlockPool.get_metadata/set_metadata/resize、DeviceBlockPool/HostBlockPool - tests/cache_manager/v1/test_metadata.py（新建）覆盖 BlockNode、RadixTreeStats、MatchResult、CacheSwapMetadata、AsyncTaskHandler - tests/cache_manager/v1/test_cache_utils.py（补充）新增 hash_block_tokens、get_request_block_hasher、LayerDoneCounter 时间追踪及内部辅助方法 - tests/cache_manager/v1/test_radix_tree.py（补充）新增 TestCompleteSwapToDevice 专项测试类（6 个用例） - tests/cache_manager/v1/test_cache_manager.py（补充）新增 offload_to_host、load_from_host、pending backup 系列、prepare_prefetch_metadata - tests/cache_manager/v1/test_transfer_manager.py（补充）新增 _swap_single_layer 校验路径、sync_input/output_stream、record_input_stream_event ## Usage or Command ```bash # 运行所有新增单测 source .venv/py310/bin/activate python -m pytest tests/cache_manager/v1/test_block_pool.py \ tests/cache_manager/v1/test_metadata.py \ tests/cache_manager/v1/test_cache_utils.py \ tests/cache_manager/v1/test_radix_tree.py \ tests/cache_manager/v1/test_cache_manager.py \ tests/cache_manager/v1/test_transfer_manager.py -v # 期望结果：326 passed ``` --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>	2026-04-21 14:39:00 +08:00
freeliuzc	22a4f6019d	[Speculative Decoding][BugFix] Fix apply repeat times penalty kernel and change spec default verify strategy (#7467 ) * fix repeat_time kernel and change default spec verify strategy * fix unit_test	2026-04-18 00:38:01 +08:00
Bingoo	6b891da02b	[Optimization] enable trtllm_all_reduce fusion kernel in glm model (#6660 ) * enable trtllm_all_reduce fusion kernel in glm model * fix conflict * format update * fix a bug * modify test * modify test * support empty tensor and modify test * fix test_linear config issues * modify test name * add edge test case * modify format * fix conflict * modify default max token num in trtllm_allreduce_fusion * add max token num branch for trtllm_allreduce_fusion * fix format * fix rmsnorm config issue * modify 2025 to 2026 * using compat grard * Lazily import flashinfer.comm and fix test config issue * fix test issues * add flashinfer cache dir clean machine * fix some issues	2026-04-16 14:10:19 +08:00
jc	e53f5184ac	PD deployment support without router (#7412 )	2026-04-15 20:13:07 +08:00
RichardWooSJTU	dec0b060fc	[Optimization] Auto set num_max_dispatch_tokens_per_rank (#7237 ) * auto set num_max_dispatch_tokens_per_rank * fix ci * fix ci * fix ci	2026-04-15 19:13:38 +08:00
AIbin	8eebbcaf15	[BugFix][Scheduler]Fix FD_DISABLE_CHUNKED_PREFILL max_num_batched_tokens limit (#7407 ) * fix FD_DISABLE_CHUNKED_PREFILL max_num_batched_tokens=max_model_len * fix FD_DISABLE_CHUNKED_PREFILL max_num_batched_tokens=max_model_len	2026-04-15 15:55:11 +08:00
bukejiyu	14d46181b8	[Loader] add multi-thread model loading (#6877 ) * multi-thread-loader * fix ut	2026-04-09 23:40:15 -07:00
GoldPancake	c1fb3112f8	[FDConfig] Support CLI args for quantization params and add cudagraph validation (#7281 ) * refactor quant cli param	2026-04-10 14:13:42 +08:00
K11OntheBoat	bb48bcbaa2	Split enable_mm (#7183 ) Co-authored-by: liuruian <liuruian@MacBook-Pro.local>	2026-04-08 11:25:41 +08:00
GoldPancake	9d4fd19c3f	[Speculative Decoding] Auto-scale CUDA graph capture sizes for speculative decoding (#7215 )	2026-04-07 20:22:28 +08:00
kevin	9765fa7313	[Refactor] Replace --skip-mm-profiling with --deploy-modality text (#7048 ) * [Feature] Support --skip-mm-profiling to skip multimodal token overhead in profiling ## Motivation 在多模态模型（如 Qwen2.5-VL、ERNIE4.5-VL 等）部署时，`get_max_chunk_tokens` 会在基础 token 数之上额外叠加 mm token 数，用于 profiling 阶段预留显存。某些场景下（如已知图像 token 数较小，或希望节省显存），用户希望跳过该多模态 token 额外开销的计算，直接使用文本 token 数进行 profiling。 ## Modifications - `fastdeploy/engine/args_utils.py`：`EngineArgs` 新增 `skip_mm_profiling: bool = False` 字段，parser 新增 `--skip-mm-profiling` 启动参数 - `fastdeploy/config.py`：`ModelConfig.__init__` 新增 `self.skip_mm_profiling = False`； `FDConfig.get_max_chunk_tokens` 中增加 `not self.model_config.skip_mm_profiling` 判断，开启后跳过 mm token 叠加，直接返回基础 `num_tokens` ## Usage or Command 启动服务时添加参数： ```bash --skip-mm-profiling ``` ## Checklist - [x] Add at least a tag in the PR title. - [x] Format your code, run `pre-commit` before commit. - [ ] Add unit tests. 本功能为配置参数透传，逻辑简单，已有相关 config 单元测试覆盖。 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [Refactor] Replace skip_mm_profiling with deploy_modality=text to skip mm profiling ## Motivation 原 `--skip-mm-profiling` 参数与已有的 `deploy_modality` 参数功能存在语义重叠：当以纯文本模式（`deploy_modality=text`）部署时，本就不需要为多模态 token 预留显存。引入独立参数增加了配置复杂度，复用 `deploy_modality` 更加直观和一致。 ## Modifications - `fastdeploy/engine/args_utils.py`：删除 `EngineArgs.skip_mm_profiling` 字段及 `--skip-mm-profiling` 启动参数 - `fastdeploy/config.py`：删除 `ModelConfig.__init__` 中的 `self.skip_mm_profiling = False`； `FDConfig.get_max_chunk_tokens` 中将条件改为 `self.deploy_modality != DeployModality.TEXT`，当 deploy_modality 为 text 时直接返回 `max_num_batched_tokens`，跳过 mm token 叠加 ## Usage or Command ```bash # 以文本模式部署，跳过 mm token profiling 开销（替代原 --skip-mm-profiling） python -m fastdeploy.entrypoints.openai.api_server \ --deploy-modality text \ --model /path/to/model \ ... ``` ## Checklist - [x] Add at least a tag in the PR title. - [x] Format your code, run `pre-commit` before commit. - [ ] Add unit tests. 本次为参数重构，逻辑等价替换，已有 config 单元测试覆盖。 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-29 19:40:27 -07:00
freeliuzc	4fd877ed43	[Speculative Decoding] Support mtp expert-parallel and support different modality deploy (#7018 ) * support mtp ep and support different modality * fix default arg	2026-03-26 13:52:16 +08:00
Yonghua Li	a7f52c300d	[Feature] support v1 update/clear api for RL (#6761 ) * [Feature] support v1 update/clear api for RL * [fix] fix execute_model and add sleep/wakeup api * [fix] fix mtp and key_prefix * [chore] move _update_key_prefix to resume method * [fix] make the interface safe to call multiple times * [fix] fix some tiny bugs * [chore] make small changes against pr review * [docs] add docs for weight update * [test] add some tests and update docs * [style] fix code style check * [test] fix ci * [fix] fix stale control responses when control method timed out * [chore] remove unused code * [chore] fix code style * [chore] optimize tags and key_prefix * [test] fix ci * [chore] fix code style * [test] fix ci * [fix] fix ep control * [fix] fix ep control for engine cache queue	2026-03-25 19:18:46 +08:00
jc	950366e58d	[PD Disaggregation][RL] Register to router with version and support rdma eager connect for pd (#6718 ) * [Feature] Register to router with version info for PD disaggregation Add RegisterManager for PD (Prefill-Decode) disaggregated deployment: - All instances (Prefill/Decode) register to Router with heartbeat - Prefill instances fetch Decode instance list from Router - Prefill instances establish eager RDMA connections to Decode instances - Register info includes: host_ip, port, role, version, is_paused, connected_decodes Changes: - Add RegisterManager class for managing PD registration and RDMA connections - Add version field to ModelConfig for model version tracking - Add connected_decodes to register_info for tracking connected Decode instances - Add FD_ENABLE_PD_RDMA_EAGER_CONNECT environment variable Test fixes: - Add None checks for load_config in FDConfig.__init__ - Add version attribute to test mock model configs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * refine * remove test --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 14:43:35 +08:00
ming1753	bb925c605f	[Other] Adjust GPUModelRunner to enhance compatibility (#6851 )	2026-03-16 14:49:19 +08:00
RAM	cdaf6dd400	[RL][Cherry-Pick] Support Fully Async and PrefixCache (#6599 ) * cherry-pick Support Fully Async and PrefixCache step 1 * copy routing_indices_cache.py from 2.4 * cherry-pick [RL] R3 Fix the bug for determining the end of a request (#6388) * cherry-pick [RL] Clear Requests status of R3 (#6569) * delete code * fix rename bug * fix status shape bug * fix ci	2026-03-12 01:13:30 -07:00
RichardWooSJTU	9f0778f991	[Feature] Support EP prefill with num_worst_tokens (#6574 ) * support num worst tokens * support num worst tokens * fix build error * support num worst tokens: fix errors * support num worst tokens: fix feild * support num worst tokens: delete requiements * replace permute and depermute op by pure cuda * replace permute and depermute op by pure cuda * fix ci * fix op * fix nan * fix code style --------- Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>	2026-03-11 17:09:07 +08:00
freeliuzc	cf7934a4b2	[Speculative Decoding] Unify Spec and non-spec branch (#6685 ) * optimize spec-inference architecture * delete debug log * optimize spec_method usage && fix unit_test * add claude unit-test skill * fix some ugly bug * enhance robustness and bounds check * unify method & spec_method to method to avoid bug * activate CI * fix unit test * Unify logprobs computation for naive and speculative decoding, fix CUDA kernel * fix logprob bug && optimize verify kernel * fix exist_decode() judge	2026-03-10 23:58:44 -07:00
SunLei	5d9524fc3c	[Models][Feature] Support new ERNIE reward model and add return_token_ids to reward API (#6638 ) * reward model * Add support for pooling-based inference in the reward model * bugfix --------- Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>	2026-03-06 18:51:00 +08:00
yzwu	6674131b0b	[Iluvatar] Support CudaGraph and optimize flash_attn_unpadded and fused_neox_rope_embedding (#6553 )	2026-03-02 14:07:17 +08:00
Yonghua Li	ea4d10d174	[BugFix] fix cache int8 for pd disaggregated deployment (#6563 )	2026-03-01 13:43:55 +08:00
zccjjj	a2072fe20c	[XPU] support warmup with ep & remove apply_tp_fused_op (#6289 )	2026-02-28 15:40:36 +08:00
ming1753	97eee75677	[Feature] GPU Memory Optimization and Retirement of V0 Scheduler (#6407 ) * Optim GPU Mem Usage --------- Co-authored-by: huzesen <huzesen@baidu.com>	2026-02-28 15:07:43 +08:00
sunxin	53aaac69da	[Optimization] Enable BF16 gate computation for GLM and Qwen (#6457 ) * gate bf16 * add gate-fp32 * fix * update baseline * update * update * fix	2026-02-26 21:08:46 -08:00
GoldPancake	2178f2829b	[Speculative Decoding] Support suffix decoding (#6403 ) * support suffix decoding	2026-02-26 11:42:05 +08:00
Yuanle Liu	6d3fede240	[OP][Feature] 统一 limit_thinking_content_length CUDA 算子，支持回复长度限制与注入序列 (#6493 ) * Initial plan * Migrate PRs #6311, #6129, #6305 to develop and merge unit tests Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com> * fix * update * fix * fix ci * fix ci * Initial plan * test: add test_chat_with_response_max_tokens to test_EB_VL_Lite_serving.py Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com> * test: add disable-thinking case to test_chat_with_response_max_tokens Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com> * test: add both reasoning_max_tokens and response_max_tokens case Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com> * fix ci * fix ci * fix ci --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: yuanlehome <23653004+yuanlehome@users.noreply.github.com>	2026-02-25 21:36:50 +08:00
jackyYang6	a29ee57e15	[Feature] Support ThinkingBudget Logits processor to control thinking content length (#6367 ) * feat: add thinking budget logits processor * add unittest * fix pre-commit * add unittest * docs: clarify operator-level vs logits processor usage and conflict guidance --------- Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>	2026-02-25 14:17:09 +08:00
Yonghua Li	e2332a1112	[BugFix] fix num_cpu_blocks computation (#6438 ) * [BugFix] fix num_cpu_blocks computation * [fix] fix syntax and log * [fix] pre-commit * [fix] use getattr * [fix] ci test	2026-02-13 11:05:14 +08:00
kevin	d60daca4a8	[Feature] consider multimodal model when dummy run (#6045 ) * add mm do profile * updata code * update code * update code * update code * update test case * update code * update code * fix xpu bug * update code * add mm do profile * update test case * update code	2026-02-09 17:49:55 +08:00
bukejiyu	dc5917289d	[loader]supoort wint2 backend (#6139 ) * support wint2 * update	2026-02-08 22:42:36 -08:00
RAM	5b22e5dfe7	[RL] R3 Support Fused Put the Routing of All Layers (#6099 ) * fused put routing * fix bug * [draft commit]dynamic dtype * fix async put & numpy bug * fix unit8 test case	2026-02-03 04:13:16 -08:00
zccjjj	ee77ff9ebe	[config] fix assert message (#6310 )	2026-02-03 14:37:46 +08:00
CSWYF3634076	08c411518f	[Loader] support dummy load weight (#6169 ) * [Loader] support dummy load weight * [Loader] support dummy load weight v2 * [Loader] support dummy load weight unittest * [Loader] support dummy load weight unittest v2 * [Loader] support dummy load weight v3 docs and fp8	2026-01-26 13:58:53 +08:00
wangyifei	b7c5daa316	[RL] add pause, update_weights, resume interface for async RL (#6052 ) * support dynamic run_control_request through zmq from apiserver to common_engine * support pause/resume/is_paused/update_weights in apiserver->common_engine by common run_control_method * change /is_puased from HTTP POST method to GET method * add pause、resume、is_paused implementation * support engine <==> worker communication(request&response) * support sync weights through RDMA from checkpoint_transfer * support specified version, rsync_config in update_weights rpc call * add pause, update_weights, resume interface for async RL * bug fix: update_weights support using default arguments * fix typo * typo fix * typo fix * typo fix * add unitest for control request/response, localscheduler.get_inflight_requests, resource_manager_v1.preempted_all * add "rsync" to LoadConfig.load_strategy Literal type hints Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * typo fix * typo fix * Apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * check version/rsync params * add error log when version.txt not exists Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * raise specified ValueError when paramters check failed Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * tp barrier after run_control_method * encode 'engine_worker_queue_port' to unique name of worker2engine fmq queue * typo fix * typo fix --------- Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-01-23 10:18:07 +08:00
Ryan	31c219d483	[Graph Optimization] Add `max_capture_shape_prefill` && `cudagraph_capture_sizes_prefill` (#6148 ) * Add max_capture_shape_dy2st parameter to YAML config * split cudagraph capture size between decode and prefill * rm if * add default value	2026-01-22 21:37:18 +08:00
yinwei	1e3c35496c	[XPU][Graph Optimization] XPU Support CUDAGraph (#6152 ) * support cuda graph	2026-01-22 14:41:56 +08:00
Haonan Luo	82057cb71f	Support MXFP4 for GPT-OSS (#5435 ) * support mxfp4 in gpt-oss * support mxfp4 in gpt-oss * add scope for flashinfer * remove torch code * update envs.FD_MXFP4_BACKEND * update process_weights_after_loading * update env name * support tp in gpt-oss, add e2e test * add flashinfer-python-paddle in requirements * fix import error * add test * add test * add test * add test	2026-01-22 14:21:01 +08:00
Ryan	dda27e50f5	[Graph Optimization] remove static_op_get_block_shape_and_split_kv_block from cudagraph (#6081 ) * rm static_op_get_block_shape_and_split_kv_block from cudagraph * update max_capture_shape * fallback: zeros -> empty to avoid coverage check * check graph_opt_config exists * add max_capture_shape_dy2st && full_cuda_graph: false -> true in 28B vl test * add use_cudagraph flag to control step_use_cudagraph	2026-01-20 14:05:18 +08:00
jackyYang6	988e0bc338	[Feature] Add PaddleFormers fallback backend (#5999 ) * feat(paddleformers): add dense text model fallback backend * docs(paddleformers): add user guide and fix code review issues * add fallback unit test * precommit format * fix pre-commit * fix: address code review feedback * docs: add PaddleFormers backend documentation (EN) and simplify installation --------- Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com> Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>	2026-01-19 21:50:50 +08:00
Jingfeng Wu	7d44009f39	[FDConfig] transfer metrics_port (#6056 ) * transfer metrics_port * transfer metrics_port	2026-01-19 19:58:57 +08:00
ddchenhao66	3685474799	[XPU] xpu support mm prefill batch (#6072 ) Co-authored-by: ddchenhao66 <dhaochen163.com>	2026-01-19 14:36:35 +08:00
freeliuzc	49617d9832	[Feature]Support tag phase token enforce generation (#6034 ) * support tag phase token enforce generation * optimize note and some feature * fix sampler unit test --------- Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>	2026-01-15 03:59:55 -08:00
RAM	b3f59fd9b5	[RL][CI] Support Async R3 And Add Accuracy Test (#5937 ) * add bs1 r3 test case * async put * r3 test case 1.0 * success run eb5 * refine test case * pre-commit * add eb45 & glm testcase * format code * add p2pstore requirements * support only last turn * R3 use worker log * refine code &fix ci bug * refine error mesg * fix empty input bug * Success set acc ci of eb45 and glm45 * refine code * fix bug	2026-01-14 04:25:06 -08:00
chenjian	74d0f1c01f	[Optim] Robust sync status when preempted happens (#5796 ) * [Bug fix] Sync status for caching output cache * fix * fix * fix bug * fix * fix * support xpu * fix * fix * fix * fix * fix * fix ci * fix ci * fix xpu --------- Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>	2026-01-14 12:07:33 +08:00
Yonghua Li	456637002d	[BugFix] fix cache transfer manager updating/clearing (#5930 ) * [fix] fix cache transfer manager updating/clearing * [fix] fix code style * [fix] fix config * [fix] fix engine client * [fix] let worker update kv cache status signal * [fix] update worker process * [fix] fix clear/update for case if comm group is shutdown * [fix] update dynamic weight manager * [fix] fix port * [fix] add num_cpu_blocks arg for async_llm, and remove unnecessary waiting	2026-01-13 05:09:29 -08:00
lzy	223b2f5d86	Support setting communication groups in custom_allreduce and the all-to-all\transpose fused operator during the decoding phase. (#5917 )	2026-01-12 14:09:39 +08:00
GoldPancake	4e10ae5d99	[Speculative Decoding] Optimize draft logprob (#5842 ) * optimize draft logprob * fix ut	2025-12-31 13:35:56 +08:00
CSWYF3634076	9286403570	[Models] Add Qwen3-VL Model Support (#5763 ) * support v1 loader * remove useless code * remove useless * [Model] support Qwen3VL images success * [Model] support Qwen3VL rope_3d * [Model] support Qwen3VL remove log * [Model] support Qwen3VL RL * [Model] support Qwen3VL tp * [Model] support Qwen3VL video * [Model] support Qwen3VL fix ernievl * [Model] support Qwen3VL fix get_image_boundaries.cc array out of bounds * [Model] support Qwen3VL fix multi card * [Model] support Qwen3VL file close * [Model] support Qwen3VL fix ce * [Model] support Qwen3VL fix unittest * [Model] support Qwen3VL add unittest --------- Co-authored-by: Ayakouji <yuhongh@qq.com>	2025-12-29 17:39:33 +08:00

1 2 3 4 5

224 Commits