FastDeploy

mirror of https://github.com/PaddlePaddle/FastDeploy.git synced 2026-04-23 00:17:25 +08:00

Author	SHA1	Message	Date
RAM	d8cdda86cb	[RL][Cherry-Pick] Fix the out-of-bounds issue caused by int32 in the R3 kernel (#7155 ) * [RL]Perf: Optimize batch delete prefix and fused put in R3 (#6604) * Optimizate delete batch and fused put * refine code * refine code * refine code * Support suspend r3 * [RL] Fix R3 Empty bug with TP=1 (#6777) * Fix int32 overflow * refine code --------- Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>	2026-04-21 01:51:09 -07:00
kevin	7707be8384	[Feature][KVCache] Implement Cache Manager V1 with GPU + CPU Cache Support (1/n) (#7097 ) * [Feature][KVCache] Support cache manager v1 architecture Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Update cache manager and related modules Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: update cache_manager and related modules Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: add node to evictable set in complete_swap_to_device When a node transitions from SWAP_TO_DEVICE to DEVICE via complete_swap_to_device, it was not being added to the _evictable_device set. This caused nodes with ref_count=0 to become "orphaned" - not appearing in any evictable set despite having cache_status=DEVICE. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat: update cache manager v1 and related modules - Add new cache_manager.py with cache management functionality - Add radix_tree.py for prefix caching - Update block_pool.py and metadata.py - Update request.py and resource_manager_v1.py for scheduling - Update gpu_model_runner.py for GPU model execution Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(cache): add cache controller v1 implementation - Add CacheController class for cache management - Update config.py with cache related configurations - Refactor gpu_model_runner.py for improved cache handling * feat(cache_manager): update cache manager v1 * fix(cache_manager): 修复 swap_cache H2D/D2H 方向的 block_ids 逻辑并清理 ForwardMeta ## Motivation 修复 swap_cache_optimized.cu 中 H2D 方向时 src/dst block_ids 使用错误的问题，并清理 ForwardMeta 中已废弃的 cache_controller 字段。 ## Modifications - fix: swap_cache_optimized.cu 中根据 D2H 模板参数正确选取 src/dst block_ids，修复 H2D 方向 src/dst 倒置 bug（同时修复 SwapCachePerLayerImpl 和 SwapCacheAllLayersBatchImpl） - refactor: cache_manager/v1/__init__.py 将 LayerSwapTimeoutError 导入从 cache_controller 改为 cache_utils（正确来源） - refactor: ForwardMeta 移除废弃的 cache_controller 字段 - refactor: gpu_model_runner.py 移除对应的 cache_controller 赋值语句 - test: 新增 tests/cache_manager/v1/test_swap_cache_ops.py 单元测试 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat(cache_manager): refactor cache manager v1 and optimize swap ops ## Motivation 对 cache manager v1 进行重构和优化，精简代码结构，提升可维护性。 ## Modifications - 重构 transfer_manager.py，大幅精简代码逻辑 - 优化 swap_cache_optimized.cu GPU 算子实现 - 调整 cache_manager.py、cache_controller.py 逻辑，修复 free_device_blocks 方法缺失问题 - 更新 block_pool.py、cache_utils.py、metadata.py、radix_tree.py - 精简 gpu_model_runner.py、forward_meta.py、attention.py 中相关调用 - 更新对应单元测试（test_cache_controller、test_swap_cache_ops、test_transfer_manager） - 调整 config.py 中相关配置项 * [KVCache][MTP] 支持 cache_manager_v1 下的 MTP KV Cache 初始化及多模态 hash ## Motivation 在 enable_cache_manager_v1 路径下，MTP（speculative decode）的 KV Cache 需要由 CacheController 统一管理，以复用 swap/transfer 能力，同时修复多模态场景下 block hash 未携带 multimodal extra_keys 的问题。 ## Modifications - `cache_controller.py` - 新增 `initialize_mtp_kv_cache`：通过 CacheController 初始化 MTP KV Cache，并将其注册到 cache_kvs_map，使 transfer_manager 自动覆盖 MTP 层 - `initialize_host_cache` 中的 num_layers 改为包含 MTP 额外 cache 层数，保证 Host Cache 也为 MTP 分配足够空间 - `_free_gpu_cache` 改名为 `free_gpu_cache`（对外可调用） - `cache_utils.py` - 新增 `get_block_hash_extra_keys`：提取单个 block 内的多模态 hash 信息，对齐 PrefixCacheManager 的 multimodal extra_keys 逻辑 - `get_request_block_hasher` 中在 hash_block_tokens 时携带 extra_keys，修复多模态场景 prefix cache 命中率不准的问题 - `spec_decode/mtp.py` - `update_mtp_block_num` 新增 `skip_cache_init` 参数，避免 v1 cache manager 路径下重复初始化 MTP KV Cache - `gpu_model_runner.py` - `initialize_kv_cache(v1)` 路径：在主模型 cache 初始化后，调用 `cache_controller.initialize_mtp_kv_cache` 完成 MTP cache 创建 - `clear_cache` / `wakeup` / `reset` 等路径：respect `enable_cache_manager_v1` 标志，跳过重复的 proposer.initialize_kv_cache 调用 ## Usage or Command ```bash # 启动支持 MTP + cache_manager_v1 的推理服务（示例） bash run.sh ``` * fix(cache_manager): multi-GPU fix, mm hash boundary fix, and remove batch ops 1. Fix CuPy stream/event creation for multi-GPU: wrap all stream operations with cp.cuda.Device(device_id) context to ensure streams/events are bound to the correct device, preventing cross-device errors in multi-GPU setups. 2. Remove cudaSetDevice from SwapCacheAllLayers (handled by cupy context now). 3. Remove swap_cache_all_layers_batch op: simplified the implementation by removing the batch upload variant; all-layer transfers now use the standard swap_cache_all_layers with cupy device context. 4. Fix mm hash boundary comparison in get_block_hash_extra_keys: change strict less-than (<) to less-than-or-equal (<=) so that multimodal items ending exactly at block start are correctly excluded. 5. Extract config fields to KVCacheBase: model_config, cache_config, quant_config, parallel_config are now set in the base class __init__ to avoid duplication in CacheController and CacheManager subclasses. 6. Translate metadata.py docstrings from Chinese to English for broader contributor accessibility. 7. Add test_cache_utils.py: comprehensive unit tests for get_block_hash_extra_keys covering all boundary and overlap scenarios. 8. Expand test suite: test_request.py cache fields tests, test_radix_tree.py backup candidate tests, test_transfer_manager.py and test_cache_manager.py multi-GPU and concurrent operation tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [BugFix][KVCache] fix List import and move write_policy normalization to CacheManager ## Motivation 修复两处问题： 1. `fastdeploy/engine/request.py` 中 `List` 未导入导致 pre-commit F821 报错 2. `write_policy` 归一化逻辑（`write_through` → `write_through_selective`）不应放在 `FDConfig`，移至 `CacheManager.__init__` 中，使其只影响 Cache Manager V1 的内部逻辑 ## Modifications - `fastdeploy/engine/request.py`: 在 `typing` 导入中补充 `List`，删除重复的 `CacheSwapMetadata` TYPE_CHECKING 导入，修复 F821/F811 - `fastdeploy/config.py`: 删除 `write_policy` 归一化逻辑 - `fastdeploy/cache_manager/v1/cache_manager.py`: 将归一化逻辑移入 `CacheManager.__init__` Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [BugFix][KVCache] fix pre-commit code style issues ## Motivation 修复 CI pre-commit 代码风格检查失败问题。 ## Modifications - `fastdeploy/engine/common_engine.py`: black 格式化 - `fastdeploy/worker/worker_process.py`: black 格式化 + isort 修复 - `fastdeploy/cache_manager/v1/storage/__init__.py`: isort 修复 - `fastdeploy/worker/gpu_worker.py`: isort 修复 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [Feature][KVCache] update cache_manager_v1 modules ## Motivation 更新 Cache Manager V1 相关模块，完善版权信息、改进模块结构与可维护性。 ## Modifications - `fastdeploy/cache_manager/v1/` 系列模块：补充版权 header，优化代码结构 - `fastdeploy/config.py`：配置项更新 - `fastdeploy/engine/sched/resource_manager_v1.py`：调度相关更新 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [Feature][KVCache] add BatchRequest.from_tasks and refactor worker task parsing ## Motivation 将 worker_process 中重复的 task 解析逻辑收敛到 BatchRequest，减少代码冗余，提升可维护性。 ## Modifications - `fastdeploy/engine/request.py`：新增 `BatchRequest.from_tasks()` 类方法，统一将 task_queue 任务分类为推理请求和控制请求 - `fastdeploy/worker/worker_process.py`：使用 `BatchRequest.from_tasks()` 替代内联解析逻辑，并修复重复的 control_reqs 处理块 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [Feature][KVCache] add NUMA affinity for host cache and skip swap cache tests ## Motivation 优化 Host cache 内存分配的 NUMA 亲和性，减少跨 NUMA 访问延迟；同时跳过 swap cache ops 测试（当前环境不支持）。 ## Modifications - `fastdeploy/cache_manager/v1/cache_controller.py`： - 新增 `_get_numa_node_for_gpu()` 方法，通过 nvidia-smi 或 sysfs 获取 GPU 对应的 NUMA 节点 - 新增 `_bind_to_closest_numa_node()` 方法，绑定当前线程到 GPU 最近的 NUMA 节点 - 在 `initialize_host_cache()` 中调用 NUMA 绑定，优化 H2D 传输性能 - `tests/cache_manager/v1/test_swap_cache_ops.py`：跳过所有测试类（`TestSwapCacheAllLayersCorrectness`、`TestSwapCacheAllLayersPerformance`、`TestSwapCacheRandomBlockIndices`） Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [BugFix][KVCache] fix unittest failures for cache_manager_v1 三个单测因接口变更或 Mock 方式问题导致失败，需修复。 - tests/distributed/chunked_moe.py：`setup_model_runner` 使用 `__new__` 跳过 `__init__`，补加 `enable_cache_manager_v1 = False`，修复 `AttributeError` - tests/engine/test_resource_manager.py：`PrefixCacheManager` 为局部导入，`patch` 路径改为定义位置 `fastdeploy.cache_manager.prefix_cache_manager.PrefixCacheManager` - tests/v1/test_resource_manager_v1.py：`_trigger_preempt` 第四参数已由 `list` 改为 `BatchRequest`，更新测试传参和断言 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [BugFix][KVCache] remove debug logging code ## Modifications - fastdeploy/engine/request.py：删除调试用 logger 及 prompt_hashes 中的 debug 日志 - fastdeploy/worker/worker_process.py：删除 __main__ 中的调试 import 和 print 语句 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [BugFix][KVCache] fix cupy device id caching and pickle for _match_result ## Motivation 修复两个 bug： 1. `transfer_manager.py` 中每次调用 `cp.cuda.runtime.getDevice()` 存在隐患，应在初始化时缓存为实例变量，保证后续操作使用一致的设备 ID。 2. `request.py` 的 `__getstate__` 未跳过 `_match_result`，该字段包含 BlockNode 树的父子循环引用，pickle 时会触发 `RecursionError`；同时补充 `__setstate__` 确保 unpickle 后字段恢复为安全默认值。 ## Modifications - `transfer_manager.py`：初始化时调用 `cp.cuda.runtime.getDevice()` 并缓存到 `self._cupy_device_id`，后续 `with cp.cuda.Device(...)` 和日志均使用该缓存值。 - `request.py`： - `__getstate__` 中将 `_match_result` 加入跳过集合 `_SKIP_KEYS`，避免循环引用导致 pickle 失败。 - 新增 `__setstate__`，unpickle 后将 `_block_hasher` 和 `_match_result` 恢复为 `None`。 ## Usage or Command * fix(test): fix unit test errors for _trigger_preempt and wakeup with MTP - Use BatchRequest instead of list in test_trigger_preempt_records_tasks - Add missing enable_cache_manager_v1 attr in TestSleepWakeupBehavior._make_runner Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [BugFix][KVCache] fix gpu_free_block_list returning wrong block IDs ## Motivation `gpu_free_block_list` 的兼容 property 中误用了 `list(range(N))`，将 `available_blocks()` 的返回值当作整数传给 `range()`，导致返回 `[0, 1, ..., N-1]` 的假列表，而非真实的空闲 block ID。 ## Modifications - `cache_manager/v1/cache_manager.py`：将 `list(range(self._device_pool.available_blocks()))` 改为 `list(self._device_pool.available_blocks())` Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [BugFix][KVCache] 修复 gpu_free_block_list 返回 int 导致 TypeError ## Motivation gpu_free_block_list 属性中调用 BlockPool.available_blocks()，该方法返回 int（空闲块数量），用 list() 包装 int 会触发 TypeError: 'int' object is not iterable。 ## Modifications 将 list(self._device_pool.available_blocks()) 改为 list(self._device_pool._free_blocks)，直接返回空闲块索引列表。 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [KVCache][CacheManager] 适配 V1 CacheManager 的 pause/sleep/free_cache 操作 ## Motivation V1 CacheManager 引入了新的 reset_cache() 接口，pause 和 sleep 操作需要适配，同时 free_cache 需要支持可选的 clear_storage 参数。 ## Modifications - cache_controller.py: free_cache 新增 clear_storage 参数（默认 False），仅当 clear_storage=True 时才调用 _clear_storage()，避免不必要的 storage 清空 - common_engine.py: pause 和 sleep 操作中，当 ENABLE_V1_KVCACHE_MANAGER 时使用 cache_manager.reset_cache() 替代旧的 reset() 和 pause_transfer 逻辑 - gpu_model_runner.py: sleep 时仅在非 V1 cache manager 下执行 MTP cache 清除 ## Usage or Command # 启动服务（V1 CacheManager） python -m fastdeploy.entrypoints.openai.api_server \ --enable-v1-kvcache-manager \ ... * [BugFix][KVCache] fix missing enable_cache_manager_v1 in test mocks and remove unused select_blocks_for_backup - Remove unused `select_blocks_for_backup` method from radix_tree.py - Fix `match_prefix` default param `skip_storage=True` and log order in cache_manager.py - Sync test_gpu_model_runner.py with upstream/develop (add TestInsertTasksV1SplitwiseSuffix) - Add `enable_cache_manager_v1=False` to all mock runners to fix AttributeError in CI Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [BugFix][KVCache] simplify _free_blocks in ResourceManagerV1 for non-v1 path Remove redundant prefix_caching branch in else path; always call recycle_gpu_blocks with full block_tables for non-cache-manager-v1 case. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * [KVCache][Optimization][BugFix] fix and optimize block_pool, cache_manager, transfer_manager, request ## Motivation 修复 cache_manager v1 中若干代码质量问题，提升性能并消除潜在的类型不一致 Bug。 ## Modifications 1. block_pool.py：`BlockPool.allocate` 将逐个 pop 循环替换为切片 + 批量 set.update，消除 Python 循环开销，O(n) → O(k)（C 层批量操作） 2. cache_manager.py：`match_prefix` 在 prefix caching 关闭时提前 return 前写入空 `MatchResult()`，避免调用方解引用 `_match_result=None` 崩溃 3. transfer_manager.py：`_build_device_layer_indices` 在 `_cache_kvs_map` 为空时也重置四个层索引列表，防止残留旧 tensor 被 swap 算子使用 4. request.py：`BatchRequest.append_swap_metadata` / `append_evict_metadata` 构造 `CacheSwapMetadata` 时将 `src_type`/`dst_type` 从字符串改为 `CacheLevel` 枚举，与字段类型声明一致；补充 `CacheLevel` 导入；`match_result` 属性返回类型标注修正为 `Optional[MatchResult]` 5. resource_manager_v1.py：`_allocate_gpu_blocks` 日志从 `INFO` 降级为 `DEBUG`，消除高频调度路径的日志噪音 6. tests/engine/test_request.py：同步更新 `src_type`/`dst_type` 断言为 `CacheLevel` 枚举值，补充 `CacheLevel` 导入 ## Usage or Command 单元测试： ```bash source .venv/py310/bin/activate cd baidu/FastDeploy python -m pytest tests/cache_manager/v1/test_cache_manager.py -v python -m pytest tests/cache_manager/v1/test_transfer_manager.py -v python -m pytest tests/engine/test_request.py -v ``` * [BugFix][KVCache] Fix BlockPool.allocate returns all blocks when num_blocks=0 ## Motivation 当 `allocate(num_blocks=0)` 被调用时，Python 负索引陷阱导致严重错误： `-0 == 0`，所以 `self._free_blocks[-0:]` 等价于 `self._free_blocks[0:]`，会返回并清空整个空闲块列表，而非返回空列表。 ## Modifications 在 `BlockPool.allocate` 中增加对 `num_blocks == 0` 的提前判断，直接返回 `[]`，避免触发 Python 负索引陷阱。 ## Usage or Command ```bash # 运行相关单元测试验证修复 python -m pytest tests/cache_manager/v1/test_cache_manager.py -vv -s ``` * [KVCache][Test] add unit tests for cache_manager v1 modules ## Motivation 补全 cache_manager/v1 各模块的单测覆盖，确保核心方法有完整的测试保障。 ## Modifications 新增/补充以下测试文件，全部 326 个用例通过： - tests/cache_manager/v1/test_block_pool.py（新建）覆盖 BlockPool.get_metadata/set_metadata/resize、DeviceBlockPool/HostBlockPool - tests/cache_manager/v1/test_metadata.py（新建）覆盖 BlockNode、RadixTreeStats、MatchResult、CacheSwapMetadata、AsyncTaskHandler - tests/cache_manager/v1/test_cache_utils.py（补充）新增 hash_block_tokens、get_request_block_hasher、LayerDoneCounter 时间追踪及内部辅助方法 - tests/cache_manager/v1/test_radix_tree.py（补充）新增 TestCompleteSwapToDevice 专项测试类（6 个用例） - tests/cache_manager/v1/test_cache_manager.py（补充）新增 offload_to_host、load_from_host、pending backup 系列、prepare_prefetch_metadata - tests/cache_manager/v1/test_transfer_manager.py（补充）新增 _swap_single_layer 校验路径、sync_input/output_stream、record_input_stream_event ## Usage or Command ```bash # 运行所有新增单测 source .venv/py310/bin/activate python -m pytest tests/cache_manager/v1/test_block_pool.py \ tests/cache_manager/v1/test_metadata.py \ tests/cache_manager/v1/test_cache_utils.py \ tests/cache_manager/v1/test_radix_tree.py \ tests/cache_manager/v1/test_cache_manager.py \ tests/cache_manager/v1/test_transfer_manager.py -v # 期望结果：326 passed ``` --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>	2026-04-21 14:39:00 +08:00
yzwu	e4a4573080	[Iluvatar] Fix cannot import name mtp_save_first_token (#7495 ) Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>	2026-04-21 14:09:08 +08:00
周周周	609f649dd7	[OP] Add flashmla baseline implementation and precision test (#7477 )	2026-04-21 13:37:52 +08:00
ZhijunLStudio	a0c39cc9af	[Typo] Fix parameter name typo in slice_fn: paramter -> parameter (#7462 ) Fix the typo in the internal parameter name of slice_fn(): weight_or_paramter -> weight_or_parameter. No functional changes. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-20 10:06:02 +08:00
GoldPancake	df3b4e12f4	[Speculative Decoding] Add MTP logprob support for PD disaggregation (#7442 ) * support mtp logprob in pd * fix * fix * fix * fix xpu bugs	2026-04-17 21:37:38 +08:00
yzwu	3b9d6c60d3	[Iiluvatar] fix ci error and update readme (#7453 )	2026-04-17 20:42:56 +08:00
AIbin	6ce4854714	[Feature] Support MOE Cutlass backend for latent MOE (#7428 ) * support moe cutlass backend latent moe	2026-04-16 22:11:49 +08:00
ShaneGZhu	2d8338f9e4	[Optimization][DeepSeekV3.2]Reducing slot_mapping compute frequency from twice per layer to a single pre-processing step. (#7367 )	2026-04-16 19:54:12 +08:00
RichardWooSJTU	420a8c1af5	fix deep gemm import (#7425 )	2026-04-16 17:56:56 +08:00
Jiajun Ji	29495b2cf1	[XPU] Unify Spec and non-spec branch.(#6947 ) (#7180 ) * [XPU] cherry-pick PR-6947 * [XPU] use unified_update_model_status. * refactor xpu_model_runner. * refactor sampler. * fix codestyle. * Fix XPU speculative decoding: rename output tensors to cu_seqlens_q_output/batch_id_per_token_output, correct WRAPPER_CHECK_PTR types, and fix dynamic gather shape in verify_draft_tokens path. * fix codestyle. * replace output_padding_offset with is_speculative flag in gather_next_token. * rename hiddden_states. * unify cu_seqlens_q_output and batch_id_per_token_output init. --------- Co-authored-by: cmcamdy <1027740945@qq.com>	2026-04-16 14:58:38 +08:00
RuohengMa	de0c5e68fb	[XPU] Split the block_attn operator into smaller operators (#6798 ) * spliced block_attn * adapt to latest vllm * fix unit tests * delete mtp+cudagraph 4 cards test * fix vl model * fix mtp * fix slot mapping	2026-04-16 14:28:40 +08:00
Bingoo	6b891da02b	[Optimization] enable trtllm_all_reduce fusion kernel in glm model (#6660 ) * enable trtllm_all_reduce fusion kernel in glm model * fix conflict * format update * fix a bug * modify test * modify test * support empty tensor and modify test * fix test_linear config issues * modify test name * add edge test case * modify format * fix conflict * modify default max token num in trtllm_allreduce_fusion * add max token num branch for trtllm_allreduce_fusion * fix format * fix rmsnorm config issue * modify 2025 to 2026 * using compat grard * Lazily import flashinfer.comm and fix test config issue * fix test issues * add flashinfer cache dir clean machine * fix some issues	2026-04-16 14:10:19 +08:00
GoldPancake	a498720a75	[RL] Add clear_graph_opt_backend for glm4_mtp (#7378 ) * add clear_grpah func * fix spell	2026-04-15 19:44:15 +08:00
周周周	5e54770b2e	[Feature] 添加 MoE 层 latent mode 支持 (#7382 )	2026-04-15 13:57:07 +08:00
lonelygsh	f7a2418ce2	[Speculate Decoding] Fix reasoning_phase_token_constraint call args in SpeculativeSampler (#7402 )	2026-04-15 12:45:23 +08:00
AIbin	8995a38fa4	fix dsa indexer norm to layernorm (#7398 )	2026-04-15 11:42:45 +08:00
AIbin	bb30f88f1a	[Models] support MLA gate attention (#7404 ) * support mla gate attn * support mla gate attn	2026-04-15 11:42:34 +08:00
chen	616b29ce08	check init_flash_attn_version log (#7399 )	2026-04-15 11:05:10 +08:00
Echo-Nie	8819a039c9	[Others] Fix typo (#7280 ) * typo * typo * typo * typo	2026-04-14 17:28:22 +08:00
xiaoxiaohehe001	abba29b348	[BugFix] fix mm rope (#7274 )	2026-04-14 11:36:08 +08:00
zhupengyang	27b00cf385	[XPU] glm-4.5-air (#7071 )	2026-04-14 11:31:49 +08:00
周周周	73bd4ab318	[Feature] 为 FusedMoE 添加 hidden_size 显式参数支持 (#7361 ) [Feature] 为 FusedMoE 添加 hidden_size 显式参数支持	2026-04-13 20:24:58 +08:00
freeliuzc	31e2a8bbad	[Speculative Decoding] Support mtp super ultra overlap in pd-split mode with insert_task overlap (#7323 ) * support mtp overlap in pd-split mode with insert_task overlap	2026-04-13 19:41:17 +08:00
AIbin	1fb8194191	[OP][Models][Optimization] 优化 RoPE CUDA kernel 并更新 DeepSeek V3 配置 (#7359 ) * dsk del prefill mask * dsk support 1M+ seq_len rope * update rope tests * Replace max_position_embeddings with max_model_len * 1D grid: gridDim.x has a maximum size of 2^31-1, far exceeding the actual number of tokens.	2026-04-13 19:12:36 +08:00
liuruyan	b34708604c	[TI-consistent] support quant use pow2scale (#7308 ) * support quant use pow2scale * fix * fix	2026-04-13 00:01:53 -07:00
AIbin	6213ad5340	[Docs][BugFix] fix mla log (#7243 ) * [Docs] Fix Chinese punctuation issues	2026-04-13 12:15:43 +08:00
Nyako Shigure	d659099415	[Cleanup] Replace torch proxy alias with public compat API (#7348 )	2026-04-13 11:43:26 +08:00
Jiajun Ji	cb03958b52	[XPU] Refactor get_padding_offset to single kernel. (#7029 ) * [XPU] Refactor get_padding_offset to single kernel. * add unittest. * fix codestyle. * remove cum_offsets_now. * remove max_len.	2026-04-13 11:04:50 +08:00
周周周	225fc8d222	use self.hidden_size not use self.fd_config.model_config.hidden_size (#7340 )	2026-04-11 22:39:43 +08:00
chen	4982aa000e	[RL]moe bf16 ep support paddle batch_gemm (#7337 ) * moe bf16 ep support paddle batch_gemm	2026-04-11 21:51:12 +08:00
AIbin	ba01d7a823	[Optimization] [OP] [Models] dsk del prefill mask (#7313 ) * dsk del prefill mask * dsk support 1M+ seq_len rope * update rope tests	2026-04-11 19:32:27 +08:00
JYChen	076ab07528	[RL] change glm rope_emb calculation (#7316 ) * change glm rope_emb calculation * glm without EnforceFmulRN * fix ci	2026-04-11 18:36:28 +08:00
zhangbo9674	627f0d9cc8	[RL] change rms norm for glm (#7269 ) * change rms norm for glm * refine code * refine code * refine code	2026-04-10 01:02:37 -07:00
K11OntheBoat	870dbac370	Use triton qk_norm both in Prefill and Decode (#7213 ) Co-authored-by: “liuruian” <liuruian@baidu.com>	2026-04-10 15:44:01 +08:00
bukejiyu	14d46181b8	[Loader] add multi-thread model loading (#6877 ) * multi-thread-loader * fix ut	2026-04-09 23:40:15 -07:00
GoldPancake	c1fb3112f8	[FDConfig] Support CLI args for quantization params and add cudagraph validation (#7281 ) * refactor quant cli param	2026-04-10 14:13:42 +08:00
lizexu123	613f92ee8f	[Feature] support nvfp4 tbo (#7259 )	2026-04-09 17:29:39 +08:00
fxyfxy777	39ff38aba1	[OP]Unify MoE op with moe_permute path for bf16 GLM (#7164 )	2026-04-09 16:17:56 +08:00
xiaoxiaohehe001	51efe27d76	[BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn (#7210 ) * [BugFix] fix_flash_mask_attn_sm90 * [BugFix] fix_flash_mask_attn_sm90 * [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn * [BugFix] Fix batch_size derivation and relax shape checks in SM90 flash_mask_attn	2026-04-09 11:05:10 +08:00
JYChen	43ace7af25	[RL] support moe-topk use topk_reduce_func (#7218 ) * support moe-topk use topk_reduce_func * fix ep error * fix ut * fix ut	2026-04-09 11:01:03 +08:00
ShaneGZhu	7005404ce3	[DeepSeekV3.2][Graph Optimization]Remove synchronous operation to avoid capture fail and unnecessary contiguous in DSA Backend (#7253 ) * Delete contiguous ops. * fix scale * Delete unnecessary comments * fix style	2026-04-09 11:00:13 +08:00
AIbin	48d2bbeb74	fix dsa (#7252 )	2026-04-08 20:21:38 +08:00
K11OntheBoat	bb48bcbaa2	Split enable_mm (#7183 ) Co-authored-by: liuruian <liuruian@MacBook-Pro.local>	2026-04-08 11:25:41 +08:00
lizhenyun01	446b26bbc0	[Feature] support blackwell gemm in ht (#7053 ) * [Feature] support blackwell gemm in ht * [Feature] support ops for convert * fix cuda error 716 * fix cuda error * opt memory * remove unused code	2026-04-07 19:52:51 +08:00
sunxin	ae2f9f4d22	[BugFix] Enable moe_gate_fp32 using FD_ENABLE_RL (#7130 ) * rl gate fp32 * clean	2026-04-06 21:07:38 -07:00
周周周	18f012457d	[OP][Optimization] Remove ENABLE_PREFILL template parameter in multi_query_append_attention_warp1_4_kernel (#7201 )	2026-04-07 11:21:57 +08:00
Bingoo	2068656a85	[Optimization] merge matmul and add (#6986 ) * merge matmul and add * modify format * using paddle.nn.functional.linear * using _C_ops.linear * using paddle.nn.functional.linear * add FLAGS_use_legacy_linear env var in test case * fix format * add assert and remove env * modify format * using matmul for no bias * modify accurate baseline	2026-04-03 18:02:03 +08:00
AIbin	1090f8b123	[Models]support GLM4.7 Flash && Ernie_MLA (#7139 ) * support GLM4.7 Flash && Ernie_MLA	2026-04-03 17:41:33 +08:00
lizexu123	5f612a348d	[BugFix] fix flashinfer-cutedsl moe nvfp4 (#7120 ) * fix nvfp4 * fix * add document * fix nvfp4 * support eb5 * support bka * support eb5 * support xpu * fix * fix * add import cutedsl * fix * fix * fix test * fix H卡 * update document * fix * update document * update document * fix	2026-04-03 15:43:19 +08:00

1 2 3 4 5 ...

757 Commits