mirror of
https://github.com/PaddlePaddle/FastDeploy.git
synced 2026-04-23 00:17:25 +08:00
7707be8384
* [Feature][KVCache] Support cache manager v1 architecture

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update cache manager and related modules

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: update cache_manager and related modules

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add node to evictable set in complete_swap_to_device

When a node transitions from SWAP_TO_DEVICE to DEVICE via complete_swap_to_device, it was not being added to the _evictable_device set. This caused nodes with ref_count=0 to become "orphaned" - not appearing in any evictable set despite having cache_status=DEVICE.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: update cache manager v1 and related modules

- Add new cache_manager.py with cache management functionality
- Add radix_tree.py for prefix caching
- Update block_pool.py and metadata.py
- Update request.py and resource_manager_v1.py for scheduling
- Update gpu_model_runner.py for GPU model execution

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(cache): add cache controller v1 implementation

- Add CacheController class for cache management
- Update config.py with cache related configurations
- Refactor gpu_model_runner.py for improved cache handling

* feat(cache_manager): update cache manager v1

* fix(cache_manager): fix the block_ids logic for the swap_cache H2D/D2H directions and clean up ForwardMeta

## Motivation

Fix the incorrect use of src/dst block_ids in the H2D direction in swap_cache_optimized.cu, and remove the deprecated cache_controller field from ForwardMeta.

## Modifications

- fix: in swap_cache_optimized.cu, select src/dst block_ids according to the D2H template parameter, fixing the swapped src/dst bug in the H2D direction (fixed in both SwapCachePerLayerImpl and SwapCacheAllLayersBatchImpl)
- refactor: cache_manager/v1/__init__.py now imports LayerSwapTimeoutError from cache_utils (the correct source) instead of cache_controller
- refactor: remove the deprecated cache_controller field from ForwardMeta
- refactor: remove the corresponding cache_controller assignment in gpu_model_runner.py
- test: add unit tests in tests/cache_manager/v1/test_swap_cache_ops.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(cache_manager): refactor cache manager v1 and optimize swap ops

## Motivation

Refactor and optimize cache manager v1, streamlining the code structure and improving maintainability.

## Modifications

- Refactor transfer_manager.py, substantially simplifying its logic
- Optimize the swap_cache_optimized.cu GPU operator implementation
- Adjust the logic in cache_manager.py and cache_controller.py, fixing the missing free_device_blocks method
- Update block_pool.py, cache_utils.py, metadata.py, radix_tree.py
- Simplify the related calls in gpu_model_runner.py, forward_meta.py, attention.py
- Update the corresponding unit tests (test_cache_controller, test_swap_cache_ops, test_transfer_manager)
- Adjust the related configuration options in config.py

* [KVCache][MTP] Support MTP KV Cache initialization and multimodal hashing under cache_manager_v1

## Motivation

On the enable_cache_manager_v1 path, the KV Cache for MTP (speculative decode) needs to be managed by CacheController so that it can reuse the swap/transfer capabilities. This change also fixes block hashes not carrying multimodal extra_keys in multimodal scenarios.

## Modifications

- `cache_controller.py`
  - Add `initialize_mtp_kv_cache`: initialize the MTP KV Cache through CacheController and register it in cache_kvs_map, so that transfer_manager automatically covers the MTP layers
  - `initialize_host_cache`: num_layers now includes the extra MTP cache layers, so the Host Cache also allocates enough space for MTP
  - Rename `_free_gpu_cache` to `free_gpu_cache` (callable externally)
- `cache_utils.py`
  - Add `get_block_hash_extra_keys`: extract the multimodal hash information within a single block, aligned with the multimodal extra_keys logic of PrefixCacheManager
  - `get_request_block_hasher` now passes extra_keys to hash_block_tokens, fixing inaccurate prefix cache hit rates in multimodal scenarios
- `spec_decode/mtp.py`
  - `update_mtp_block_num` gains a `skip_cache_init` parameter to avoid initializing the MTP KV Cache twice on the v1 cache manager path
- `gpu_model_runner.py`
  - `initialize_kv_cache(v1)` path: after the main model cache is initialized, call `cache_controller.initialize_mtp_kv_cache` to create the MTP cache
  - `clear_cache` / `wakeup` / `reset` and similar paths: respect the `enable_cache_manager_v1` flag and skip the redundant proposer.initialize_kv_cache call

## Usage or Command

```bash
# Start an inference service with MTP + cache_manager_v1 (example)
bash run.sh
```

* fix(cache_manager): multi-GPU fix, mm hash boundary fix, and remove batch ops

1. Fix CuPy stream/event creation for multi-GPU: wrap all stream operations with cp.cuda.Device(device_id) context to ensure streams/events are bound to the correct device, preventing cross-device errors in multi-GPU setups.
2. Remove cudaSetDevice from SwapCacheAllLayers (handled by cupy context now).
3. Remove swap_cache_all_layers_batch op: simplified the implementation by removing the batch upload variant; all-layer transfers now use the standard swap_cache_all_layers with cupy device context.
4. Fix mm hash boundary comparison in get_block_hash_extra_keys: change strict less-than (<) to less-than-or-equal (<=) so that multimodal items ending exactly at block start are correctly excluded.
5. Extract config fields to KVCacheBase: model_config, cache_config, quant_config, parallel_config are now set in the base class __init__ to avoid duplication in CacheController and CacheManager subclasses.
6. Translate metadata.py docstrings from Chinese to English for broader contributor accessibility.
7. Add test_cache_utils.py: comprehensive unit tests for get_block_hash_extra_keys covering all boundary and overlap scenarios.
8. Expand test suite: test_request.py cache fields tests, test_radix_tree.py backup candidate tests, test_transfer_manager.py and test_cache_manager.py multi-GPU and concurrent operation tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* [BugFix][KVCache] fix List import and move write_policy normalization to CacheManager

## Motivation

Fix two issues:

1. `List` was not imported in `fastdeploy/engine/request.py`, causing a pre-commit F821 error
2. The `write_policy` normalization logic (`write_through` → `write_through_selective`) does not belong in `FDConfig`; move it into `CacheManager.__init__` so that it only affects the internal logic of Cache Manager V1

## Modifications

- `fastdeploy/engine/request.py`: add `List` to the `typing` import and remove the duplicate `CacheSwapMetadata` TYPE_CHECKING import, fixing F821/F811
- `fastdeploy/config.py`: remove the `write_policy` normalization logic
- `fastdeploy/cache_manager/v1/cache_manager.py`: move the normalization logic into `CacheManager.__init__`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* [BugFix][KVCache] fix pre-commit code style issues

## Motivation

Fix the failing CI pre-commit code style checks.

## Modifications

- `fastdeploy/engine/common_engine.py`: black formatting
- `fastdeploy/worker/worker_process.py`: black formatting + isort fix
- `fastdeploy/cache_manager/v1/storage/__init__.py`: isort fix
- `fastdeploy/worker/gpu_worker.py`: isort fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* [Feature][KVCache] update cache_manager_v1 modules

## Motivation

Update the Cache Manager V1 modules: complete the copyright information and improve module structure and maintainability.

## Modifications

- `fastdeploy/cache_manager/v1/` modules: add copyright headers and improve code structure
- `fastdeploy/config.py`: update configuration options
- `fastdeploy/engine/sched/resource_manager_v1.py`: scheduling-related updates

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* [Feature][KVCache] add BatchRequest.from_tasks and refactor worker task parsing

## Motivation

Consolidate the duplicated task parsing logic in worker_process into BatchRequest, reducing redundancy and improving maintainability.

## Modifications

- `fastdeploy/engine/request.py`: add the `BatchRequest.from_tasks()` class method, which classifies task_queue tasks into inference requests and control requests
- `fastdeploy/worker/worker_process.py`: use `BatchRequest.from_tasks()` in place of the inline parsing logic, and fix the duplicated control_reqs handling block

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* [Feature][KVCache] add NUMA affinity for host cache and skip swap cache tests

## Motivation

Improve the NUMA affinity of Host cache memory allocation to reduce cross-NUMA access latency; also skip the swap cache ops tests (not supported in the current environment).

## Modifications

- `fastdeploy/cache_manager/v1/cache_controller.py`:
  - Add `_get_numa_node_for_gpu()`, which obtains the NUMA node for a GPU via nvidia-smi or sysfs
  - Add `_bind_to_closest_numa_node()`, which binds the current thread to the NUMA node closest to the GPU
  - Call the NUMA binding in `initialize_host_cache()` to improve H2D transfer performance
- `tests/cache_manager/v1/test_swap_cache_ops.py`: skip all test classes (`TestSwapCacheAllLayersCorrectness`, `TestSwapCacheAllLayersPerformance`, `TestSwapCacheRandomBlockIndices`)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* [BugFix][KVCache] fix unittest failures for cache_manager_v1

Three unit tests failed because of interface changes or mocking issues and needed fixing.

- tests/distributed/chunked_moe.py: `setup_model_runner` uses `__new__` to skip `__init__`; add `enable_cache_manager_v1 = False` to fix the `AttributeError`
- tests/engine/test_resource_manager.py: `PrefixCacheManager` is imported locally, so the `patch` path is changed to where it is defined, `fastdeploy.cache_manager.prefix_cache_manager.PrefixCacheManager`
- tests/v1/test_resource_manager_v1.py: the fourth parameter of `_trigger_preempt` changed from `list` to `BatchRequest`; update the test arguments and assertions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* [BugFix][KVCache] remove debug logging code

## Modifications

- fastdeploy/engine/request.py: remove the debugging logger and the debug log in prompt_hashes
- fastdeploy/worker/worker_process.py: remove the debugging import and print statements in __main__

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* [BugFix][KVCache] fix cupy device id caching and pickle for _match_result

## Motivation

Fix two bugs:

1. Calling `cp.cuda.runtime.getDevice()` on every call in `transfer_manager.py` is fragile; the value should be cached as an instance variable at initialization so that later operations use a consistent device ID.
2. `__getstate__` in `request.py` did not skip `_match_result`. That field contains parent-child circular references in the BlockNode tree, which trigger `RecursionError` on pickling. Also add `__setstate__` so that the fields are restored to safe defaults after unpickling.

## Modifications

- `transfer_manager.py`: call `cp.cuda.runtime.getDevice()` at initialization and cache it in `self._cupy_device_id`; subsequent `with cp.cuda.Device(...)` blocks and logs use the cached value.
- `request.py`:
  - Add `_match_result` to the skip set `_SKIP_KEYS` in `__getstate__`, avoiding pickle failures caused by circular references.
  - Add `__setstate__`, which restores `_block_hasher` and `_match_result` to `None` after unpickling.

## Usage or Command

* fix(test): fix unit test errors for _trigger_preempt and wakeup with MTP

- Use BatchRequest instead of list in test_trigger_preempt_records_tasks
- Add missing enable_cache_manager_v1 attr in TestSleepWakeupBehavior._make_runner

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* [BugFix][KVCache] fix gpu_free_block_list returning wrong block IDs

## Motivation

The compatibility property `gpu_free_block_list` mistakenly used `list(range(N))`, passing the return value of `available_blocks()` to `range()` as an integer. It therefore returned a fake list `[0, 1, ..., N-1]` rather than the real free block IDs.

## Modifications

- `cache_manager/v1/cache_manager.py`: change `list(range(self._device_pool.available_blocks()))` to `list(self._device_pool.available_blocks())`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* [BugFix][KVCache] fix TypeError caused by gpu_free_block_list returning int

## Motivation

The gpu_free_block_list property calls BlockPool.available_blocks(), which returns an int (the number of free blocks). Wrapping an int in list() raises TypeError: 'int' object is not iterable.

## Modifications

Change list(self._device_pool.available_blocks()) to list(self._device_pool._free_blocks), returning the list of free block indices directly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* [KVCache][CacheManager] Adapt pause/sleep/free_cache operations to the V1 CacheManager

## Motivation

The V1 CacheManager introduces a new reset_cache() interface, so the pause and sleep operations need to be adapted to it, and free_cache needs to support an optional clear_storage parameter.

## Modifications

- cache_controller.py: free_cache gains a clear_storage parameter (default False); _clear_storage() is called only when clear_storage=True, avoiding unnecessary storage clearing
- common_engine.py: in the pause and sleep operations, when ENABLE_V1_KVCACHE_MANAGER is set, use cache_manager.reset_cache() in place of the old reset() and pause_transfer logic
- gpu_model_runner.py: on sleep, clear the MTP cache only when the V1 cache manager is not in use

## Usage or Command

    # Start the service (V1 CacheManager)
    python -m fastdeploy.entrypoints.openai.api_server \
        --enable-v1-kvcache-manager \
        ...

* [BugFix][KVCache] fix missing enable_cache_manager_v1 in test mocks and remove unused select_blocks_for_backup

- Remove unused `select_blocks_for_backup` method from radix_tree.py
- Fix `match_prefix` default param `skip_storage=True` and log order in cache_manager.py
- Sync test_gpu_model_runner.py with upstream/develop (add TestInsertTasksV1SplitwiseSuffix)
- Add `enable_cache_manager_v1=False` to all mock runners to fix AttributeError in CI

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* [BugFix][KVCache] simplify _free_blocks in ResourceManagerV1 for non-v1 path

Remove redundant prefix_caching branch in else path; always call recycle_gpu_blocks with full block_tables for non-cache-manager-v1 case.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* [KVCache][Optimization][BugFix] fix and optimize block_pool, cache_manager, transfer_manager, request

## Motivation

Fix several code quality issues in cache_manager v1, improve performance, and eliminate potential type-inconsistency bugs.

## Modifications

1. **block_pool.py**: `BlockPool.allocate` replaces the per-item pop loop with a slice plus a batched set.update, removing Python loop overhead, O(n) → O(k) (batched operations at the C level)
2. **cache_manager.py**: when prefix caching is disabled, `match_prefix` writes an empty `MatchResult()` before returning early, so that callers do not crash dereferencing `_match_result=None`
3. **transfer_manager.py**: `_build_device_layer_indices` also resets the four layer index lists when `_cache_kvs_map` is empty, preventing stale tensors from being used by the swap operator
4. **request.py**: `BatchRequest.append_swap_metadata` / `append_evict_metadata` now pass `src_type`/`dst_type` as `CacheLevel` enums instead of strings when constructing `CacheSwapMetadata`, consistent with the declared field types; add the `CacheLevel` import; correct the return type annotation of the `match_result` property to `Optional[MatchResult]`
5. **resource_manager_v1.py**: downgrade the `_allocate_gpu_blocks` log from `INFO` to `DEBUG`, eliminating log noise on the high-frequency scheduling path
6. **tests/engine/test_request.py**: update the `src_type`/`dst_type` assertions to `CacheLevel` enum values and add the `CacheLevel` import

## Usage or Command

Unit tests:

```bash
source .venv/py310/bin/activate
cd baidu/FastDeploy
python -m pytest tests/cache_manager/v1/test_cache_manager.py -v
python -m pytest tests/cache_manager/v1/test_transfer_manager.py -v
python -m pytest tests/engine/test_request.py -v
```

* [BugFix][KVCache] Fix BlockPool.allocate returns all blocks when num_blocks=0

## Motivation

When `allocate(num_blocks=0)` is called, Python's negative-index pitfall causes a serious error: `-0 == 0`, so `self._free_blocks[-0:]` is equivalent to `self._free_blocks[0:]`, which returns and empties the entire free block list instead of returning an empty list.

## Modifications

Add an early check for `num_blocks == 0` in `BlockPool.allocate` that returns `[]` directly, avoiding the negative-index pitfall.

## Usage or Command

```bash
# Run the related unit tests to verify the fix
python -m pytest tests/cache_manager/v1/test_cache_manager.py -vv -s
```

* [KVCache][Test] add unit tests for cache_manager v1 modules

## Motivation

Complete the unit test coverage of the cache_manager/v1 modules so that the core methods are fully tested.

## Modifications

Add or extend the following test files; all 326 cases pass:

- tests/cache_manager/v1/test_block_pool.py (new): covers BlockPool.get_metadata/set_metadata/resize, DeviceBlockPool/HostBlockPool
- tests/cache_manager/v1/test_metadata.py (new): covers BlockNode, RadixTreeStats, MatchResult, CacheSwapMetadata, AsyncTaskHandler
- tests/cache_manager/v1/test_cache_utils.py (extended): adds hash_block_tokens, get_request_block_hasher, LayerDoneCounter time tracking, and internal helper methods
- tests/cache_manager/v1/test_radix_tree.py (extended): adds the dedicated TestCompleteSwapToDevice test class (6 cases)
- tests/cache_manager/v1/test_cache_manager.py (extended): adds offload_to_host, load_from_host, the pending backup series, prepare_prefetch_metadata
- tests/cache_manager/v1/test_transfer_manager.py (extended): adds the _swap_single_layer validation path, sync_input/output_stream, record_input_stream_event

## Usage or Command

```bash
# Run all newly added unit tests
source .venv/py310/bin/activate
python -m pytest tests/cache_manager/v1/test_block_pool.py \
    tests/cache_manager/v1/test_metadata.py \
    tests/cache_manager/v1/test_cache_utils.py \
    tests/cache_manager/v1/test_radix_tree.py \
    tests/cache_manager/v1/test_cache_manager.py \
    tests/cache_manager/v1/test_transfer_manager.py -v
# Expected result: 326 passed
```

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
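The `-0 == 0` slicing pitfall behind the `BlockPool.allocate` fix can be reproduced in a few lines. The sketch below is not the FastDeploy implementation: `allocate_tail` and `allocate_tail_buggy` are hypothetical stand-ins that assume free blocks are taken from the tail of a plain Python list.

```python
def allocate_tail_buggy(free_blocks, num_blocks):
    """Hypothetical allocator without the guard: slices from the tail."""
    # For num_blocks == 0 this is free_blocks[-0:], i.e. free_blocks[0:],
    # which selects every element instead of none.
    taken = free_blocks[-num_blocks:]
    del free_blocks[-num_blocks:]
    return taken


def allocate_tail(free_blocks, num_blocks):
    """Hypothetical allocator with the early return described in the fix."""
    if num_blocks == 0:
        return []
    taken = free_blocks[-num_blocks:]
    del free_blocks[-num_blocks:]
    return taken


buggy_pool = [0, 1, 2, 3]
buggy_result = allocate_tail_buggy(buggy_pool, 0)  # drains the whole pool

fixed_pool = [0, 1, 2, 3]
fixed_result = allocate_tail(fixed_pool, 0)  # leaves the pool untouched
```

With the guard in place, a zero-sized request leaves `fixed_pool` intact, while the unguarded version hands out every block and empties `buggy_pool`.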
833 lines
33 KiB
Python
# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""setup for FastDeploy custom ops"""
import importlib.util  # load_module_from_path uses importlib.util explicitly
import json
import os
import shutil
import subprocess
import sys
import tarfile
from pathlib import Path

import paddle
from paddle.utils.cpp_extension import CppExtension, CUDAExtension, setup
from setuptools import find_namespace_packages, find_packages


def load_module_from_path(module_name, path):
    """
    load python module from path
    """
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module


def update_git_repo():
    try:
        print("update third party repo...", flush=True)
        original_dir = os.getcwd()
        submodule_dir = os.path.dirname(os.path.abspath(__file__))
        third_party_path = os.path.join(submodule_dir, "third_party")
        root_path = Path(third_party_path)

        # check if third_party is empty
        update_third_party = False
        for dirpath in root_path.iterdir():
            if dirpath.is_dir():
                has_content = any(dirpath.iterdir())
                if not has_content:
                    update_third_party = True

        if update_third_party:
            os.chdir(submodule_dir)
            subprocess.run(
                "git submodule sync --recursive && git submodule update --init --recursive",
                shell=True,
                check=True,
                text=True,
            )
        else:
            print(
                "\033[33m[===WARNING===]third_party directory already exists, skip clone and update.\033[0m",
                flush=True,
            )

        # apply deep gemm patch
        deep_gemm_dir = "third_party/DeepGEMM"
        dst_path = os.path.join(submodule_dir, deep_gemm_dir)
        patch = "0001-DeepGEMM-95e81b3.patch"
        patch_source = os.path.join(submodule_dir, patch)
        patch_destination = os.path.join(dst_path, patch)
        if not os.path.exists(patch_destination):
            shutil.copy(patch_source, patch_destination)
            apply_cmd = ["git", "apply", patch]
            os.chdir(dst_path)
            subprocess.run(apply_cmd, check=True)
        os.chdir(original_dir)
    except subprocess.CalledProcessError:
        raise Exception("Git submodule update and apply patch failed. Maybe network connection is poor.")


ROOT_DIR = Path(__file__).parent.parent

# cannot import envs directly because it depends on fastdeploy,
# which is not installed yet
envs = load_module_from_path("envs", os.path.join(ROOT_DIR, "fastdeploy", "envs.py"))

archs = json.loads(envs.FD_BUILDING_ARCS)
use_bf16 = envs.FD_CPU_USE_BF16 == "True"

update_git_repo()


def download_and_extract(url, destination_directory):
    """
    Download a .tar.gz file using wget to the destination directory
    and extract its contents without renaming the downloaded file.

    :param url: The URL of the .tar.gz file to download.
    :param destination_directory: The directory where the file should be downloaded and extracted.
    """
    os.makedirs(destination_directory, exist_ok=True)

    filename = os.path.basename(url)
    file_path = os.path.join(destination_directory, filename)

    try:
        subprocess.run(
            ["wget", "-O", file_path, url],
            check=True,
        )
        print(f"Downloaded: {file_path}")

        with tarfile.open(file_path, "r:gz") as tar:
            tar.extractall(path=destination_directory)
            print(f"Extracted: {file_path} to {destination_directory}")
        os.remove(file_path)
        print(f"Deleted downloaded file: {file_path}")
    except subprocess.CalledProcessError as e:
        print(f"Error downloading file: {e}")
    except Exception as e:
        print(f"Error extracting file: {e}")


def get_sm_version(archs):
    """
    Get sm version of paddle.
    """
    arch_set = set(archs)
    if len(arch_set) == 0:
        try:
            prop = paddle.device.cuda.get_device_properties()
            cc = prop.major * 10 + prop.minor
            arch_set.add(cc)
        except ValueError:
            pass
    return list(arch_set)


def get_nvcc_version():
    """
    Get cuda version of nvcc.
    """
    nvcc_output = subprocess.check_output(["nvcc", "--version"], universal_newlines=True)
    output = nvcc_output.split()
    release_idx = output.index("release") + 1
    nvcc_cuda_version = float(output[release_idx].split(",")[0])
    return nvcc_cuda_version


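`get_nvcc_version` returns the release as a float, which orders correctly as long as CUDA minor versions stay single-digit. As a hedged alternative, the sketch below parses the release into an integer tuple. `parse_nvcc_release` and the sample banner text are illustrative assumptions and are not part of this file.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Return (major, minor) from `nvcc --version` text (hypothetical helper)."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' token found in nvcc output")
    return int(match.group(1)), int(match.group(2))


# Made-up banner in the general shape of nvcc's output, used only for illustration.
sample_banner = "Cuda compilation tools, release 12.9, V12.9.0\n"
release = parse_nvcc_release(sample_banner)

# Tuples compare component-wise, so a two-digit minor version sorts after 9,
# whereas the float form of "12.10" is 12.1 and sorts before 12.9.
tuple_order_ok = (12, 10) > (12, 9)
float_order_ok = float("12.10") > float("12.9")
```

A caller could then write `release >= (12, 9)` where the script currently compares against `12.9`.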
def get_gencode_flags(archs):
    """
    Get gencode flags for current device or input.
    """
    cc_s = get_sm_version(archs)
    flags = []
    for cc_val in cc_s:
        if cc_val == 90:
            arch_code = "90a"
            flags += [
                "-gencode",
                f"arch=compute_{arch_code},code=sm_{arch_code}",
            ]
        elif cc_val == 100:  # Assuming 100 is the code for Blackwell SM10.x
            # Per NVIDIA dev blog, for CUTLASS and architecture-specific features on CC 10.0, use '100a'
            # https://developer.nvidia.com/blog/nvidia-blackwell-and-nvidia-cuda-12-9-introduce-family-specific-architecture-features/
            # "The CUTLASS build instructions specify using the a flag when building for devices of CC 9.0 and 10.0"
            arch_code = "100a"
            flags += [
                "-gencode",
                f"arch=compute_{arch_code},code=sm_{arch_code}",
            ]
        else:
            flags += ["-gencode", f"arch=compute_{cc_val},code=sm_{cc_val}"]
    return flags


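A quick way to see what `get_gencode_flags` produces is to restate its mapping without the Paddle device query. `gencode_flags` below is a hypothetical, dependency-free restatement that assumes the caller already has a list of integer compute capabilities.

```python
def gencode_flags(compute_capabilities):
    """Map integer compute capabilities to nvcc -gencode flags (illustrative sketch)."""
    flags = []
    for cc in compute_capabilities:
        # 90 and 100 get the architecture-specific "a" suffix, as in the script above.
        arch = f"{cc}a" if cc in (90, 100) else str(cc)
        flags += ["-gencode", f"arch=compute_{arch},code=sm_{arch}"]
    return flags


flags_for_ampere_and_hopper = gencode_flags([80, 90])
```

For `[80, 90]` this yields one plain `sm_80` pair and one `sm_90a` pair, in input order.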
def get_compile_parallelism():
    """
    Decide safe compile parallelism for both build workers and nvcc threads.
    """
    cpu_count = os.cpu_count() or 1

    max_jobs_env = os.getenv("MAX_JOBS")
    if max_jobs_env is not None:
        try:
            max_jobs = int(max_jobs_env)
            if max_jobs < 1:
                raise ValueError
        except ValueError as exc:
            raise ValueError(f"Invalid MAX_JOBS={max_jobs_env!r}, expected a positive integer.") from exc
    else:
        # Cap default build workers to avoid OOM in high-core CI runners.
        max_jobs = min(cpu_count, 32)
        os.environ["MAX_JOBS"] = str(max_jobs)

    # Limit nvcc internal threads to avoid resource exhaustion when Paddle's
    # ThreadPoolExecutor also launches many parallel compilations.
    # Total threads ~= (number of parallel compile jobs) * nvcc_threads.
    nvcc_threads = min(max_jobs, 4)
    return max_jobs, nvcc_threads


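The precedence rules in `get_compile_parallelism` are easier to check when the environment lookups are passed in as arguments. The following sketch assumes a hypothetical pure function, `resolve_parallelism`; unlike the function above, it does not write `MAX_JOBS` back to the environment, and the cap parameters are named here only for illustration.

```python
def resolve_parallelism(max_jobs_env, cpu_count, job_cap=32, nvcc_thread_cap=4):
    """Return (max_jobs, nvcc_threads) from an optional MAX_JOBS string (sketch)."""
    if max_jobs_env is not None:
        try:
            max_jobs = int(max_jobs_env)
        except ValueError as exc:
            raise ValueError(f"Invalid MAX_JOBS={max_jobs_env!r}") from exc
        if max_jobs < 1:
            raise ValueError(f"Invalid MAX_JOBS={max_jobs_env!r}")
    else:
        # No explicit setting: use every core, up to the cap.
        max_jobs = min(cpu_count, job_cap)
    # nvcc threads never exceed the number of build workers or the thread cap.
    return max_jobs, min(max_jobs, nvcc_thread_cap)


default_on_big_host = resolve_parallelism(None, cpu_count=128)
explicit_small = resolve_parallelism("2", cpu_count=128)
```

An explicit `MAX_JOBS` is honored as given, even above the default cap, which matches the behavior of the original function.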
def find_end_files(directory, end_str):
    """
    Find files with end str in directory.
    """
    gen_files = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(end_str):
                gen_files.append(os.path.join(root, file))
    return gen_files


if paddle.is_compiled_with_rocm():
    # NOTE(@duanyanhui): paddle.is_compiled_with_cuda() returns True when paddle compiled with rocm.
    # so we need to check if paddle compiled with rocm at first.
    sources = [
        "gpu_ops/save_with_output_msg.cc",
        "gpu_ops/get_output.cc",
        "gpu_ops/get_output_msg_with_topk.cc",
        "gpu_ops/save_output_msg_with_topk.cc",
        "gpu_ops/transfer_output.cc",
        "gpu_ops/set_value_by_flags_and_idx.cu",
        "gpu_ops/token_penalty_multi_scores.cu",
        "gpu_ops/stop_generation.cu",
        "gpu_ops/stop_generation_multi_ends.cu",
        "gpu_ops/get_padding_offset.cu",
        "gpu_ops/update_inputs.cu",
        "gpu_ops/rebuild_padding.cu",
        "gpu_ops/step.cu",
        "gpu_ops/set_data_ipc.cu",
        "gpu_ops/unset_data_ipc.cu",
        "gpu_ops/moe/tritonmoe_preprocess.cu",
        "gpu_ops/step_system_cache.cu",
        "gpu_ops/get_output_ep.cc",
        "gpu_ops/speculate_decoding/speculate_get_padding_offset.cu",
        "gpu_ops/speculate_decoding/speculate_get_output.cc",
        "gpu_ops/share_external_data.cu",
        "gpu_ops/speculate_decoding/speculate_clear_accept_nums.cu",
        "gpu_ops/speculate_decoding/speculate_get_output_padding_offset.cu",
        "gpu_ops/speculate_decoding/speculate_get_seq_lens_output.cu",
        "gpu_ops/speculate_decoding/speculate_save_output.cc",
        "gpu_ops/speculate_decoding/speculate_set_value_by_flags_and_idx.cu",
        "gpu_ops/speculate_decoding/speculate_step.cu",
        "gpu_ops/speculate_decoding/speculate_step_system_cache.cu",
        "gpu_ops/speculate_decoding/speculate_update_v3.cu",
        "gpu_ops/get_position_ids_and_mask_encoder_batch.cu",
        "gpu_ops/fused_rotary_position_encoding.cu",
        "gpu_ops/step_reschedule.cu",
    ]
    setup(
        name="fastdeploy_ops",
        ext_modules=CUDAExtension(
            sources=sources,
            extra_compile_args={
                "cxx": ["-O3"],
                "hipcc": [
                    "-O3",
                    "--gpu-max-threads-per-block=1024",
                    "-U__HIP_NO_HALF_OPERATORS__",
                    "-U__HIP_NO_HALF_CONVERSIONS__",
                    "-U__HIP_NO_BFLOAT16_OPERATORS__",
                    "-U__HIP_NO_BFLOAT16_CONVERSIONS__",
                    "-U__HIP_NO_BFLOAT162_OPERATORS__",
                    "-U__HIP_NO_BFLOAT162_CONVERSIONS__",
                    "-DPADDLE_DEV",
                    "-Ithird_party/nlohmann_json/include",
                    "-Igpu_ops",
                ],
            },
        ),
    )
elif paddle.is_compiled_with_cuda():
    sources = [
        "gpu_ops/helper.cu",
        "gpu_ops/save_with_output_msg.cc",
        "gpu_ops/get_output.cc",
        "gpu_ops/get_output_msg_with_topk.cc",
        "gpu_ops/save_output_msg_with_topk.cc",
        "gpu_ops/transfer_output.cc",
        "gpu_ops/set_mask_value.cu",
        "gpu_ops/set_value_by_flags_and_idx.cu",
        "gpu_ops/ngram_mask.cu",
        "gpu_ops/gather_idx.cu",
        "gpu_ops/get_output_ep.cc",
        "gpu_ops/get_mm_split_fuse.cc",
        "gpu_ops/get_img_boundaries.cc",
        "gpu_ops/token_penalty_multi_scores.cu",
        "gpu_ops/token_penalty_only_once.cu",
        "gpu_ops/stop_generation.cu",
        "gpu_ops/stop_generation_multi_ends.cu",
        "gpu_ops/set_flags.cu",
        "gpu_ops/update_inputs_v1.cu",
        "gpu_ops/recover_decode_task.cu",
        "gpu_ops/step.cu",
        "gpu_ops/step_reschedule.cu",
        "gpu_ops/fused_get_rotary_embedding.cu",
        "gpu_ops/get_padding_offset.cu",
        "gpu_ops/update_inputs.cu",
        "gpu_ops/update_inputs_beam.cu",
        "gpu_ops/beam_search_softmax.cu",
        "gpu_ops/rebuild_padding.cu",
        "gpu_ops/set_data_ipc.cu",
        "gpu_ops/unset_data_ipc.cu",
        "gpu_ops/read_data_ipc.cu",
        "gpu_ops/enforce_generation.cu",
        "gpu_ops/dequant_int8.cu",
        "gpu_ops/tune_cublaslt_gemm.cu",
        "gpu_ops/swap_cache_batch.cu",
        "gpu_ops/swap_cache.cu",
        "gpu_ops/swap_cache_layout.cu",
        "gpu_ops/swap_cache_optimized.cu",  # New: optimized KV cache swap-in operator
        "gpu_ops/step_system_cache.cu",
        "gpu_ops/cpp_extensions.cc",
        "gpu_ops/share_external_data.cu",
        "gpu_ops/fused_mask_swiglu_fp8_quant_kernel.cu",
        "gpu_ops/per_token_quant_fp8.cu",
        "gpu_ops/update_split_fuse_input.cu",
        "gpu_ops/text_image_index_out.cu",
        "gpu_ops/text_image_gather_scatter.cu",
        "gpu_ops/sample_kernels/rejection_top_p_sampling.cu",
        "gpu_ops/sample_kernels/top_k_renorm_probs.cu",
        "gpu_ops/sample_kernels/min_p_sampling_from_probs.cu",
        "gpu_ops/get_position_ids_and_mask_encoder_batch.cu",
        "gpu_ops/fused_rotary_position_encoding.cu",
        "gpu_ops/noaux_tc.cu",
        "gpu_ops/noaux_tc_redundant.cu",
        "gpu_ops/custom_all_reduce/all_reduce.cu",
        "gpu_ops/merge_prefill_decode_output.cu",
        "gpu_ops/limit_thinking_content_length.cu",
        "gpu_ops/update_attn_mask_offsets.cu",
        "gpu_ops/fused_neox_rope_embedding.cu",
        "gpu_ops/gelu_tanh.cu",
        "gpu_ops/reasoning_phase_token_constraint.cu",
        "gpu_ops/get_attn_mask_q.cu",
    ]
    sm_versions = get_sm_version(archs)
    # Some kernels in this file require SM75+ instructions. Exclude them when building SM70 (V100).
    disable_gelu_tanh = 70 in sm_versions
    if disable_gelu_tanh:
        sources = [s for s in sources if s != "gpu_ops/gelu_tanh.cu"]

    # pd_disaggregation
    sources += [
        "gpu_ops/remote_cache_kv_ipc.cc",
        "gpu_ops/open_shm_and_get_meta_signal.cc",
        "gpu_ops/init_signal_layerwise.cc",
        "gpu_ops/get_data_ptr_ipc.cu",
        "gpu_ops/ipc_sent_key_value_cache_by_remote_ptr.cu",
    ]

    dg_third_party_include_dirs = (
        "third_party/cutlass/include/cute",
        "third_party/cutlass/include/cutlass",
    )

    dg_include_dir = "third_party/DeepGEMM/deep_gemm/include"
    os.makedirs(dg_include_dir, exist_ok=True)

    for d in dg_third_party_include_dirs:
        dirname = d.split("/")[-1]
        src_dir = d
        dst_dir = os.path.join(dg_include_dir, dirname)

        # Remove existing directory if it exists
        if os.path.exists(dst_dir):
            if os.path.islink(dst_dir):
                os.unlink(dst_dir)
            else:
                shutil.rmtree(dst_dir)
        print(f"Copying {src_dir} to {dst_dir}")

        # Copy the directory
        try:
            shutil.copytree(src_dir, dst_dir)
        except Exception as e:
            raise RuntimeError(f"Failed to copy from {src_dir} to {dst_dir}: {e}")

    cc_compile_args = []
    nvcc_compile_args = get_gencode_flags(archs)
    if disable_gelu_tanh:
        cc_compile_args += ["-DDISABLE_GELU_TANH_OP"]
        nvcc_compile_args += ["-DDISABLE_GELU_TANH_OP"]
    nvcc_compile_args += ["-DPADDLE_DEV"]
    nvcc_compile_args += ["-DPADDLE_ON_INFERENCE"]
    nvcc_compile_args += ["-DPy_LIMITED_API=0x03090000"]
    nvcc_compile_args += [
        "-Igpu_ops/cutlass_kernels",
        "-Ithird_party/cutlass/include",
        "-Ithird_party/cutlass/tools/util/include",
        "-Igpu_ops/fp8_gemm_with_cutlass",
        "-Igpu_ops",
        "-Ithird_party/nlohmann_json/include",
    ]
    max_jobs, nvcc_threads = get_compile_parallelism()
    print(f"MAX_JOBS = {max_jobs}, nvcc -t = {nvcc_threads}")
    nvcc_compile_args += ["-t", str(nvcc_threads)]

    nvcc_version = get_nvcc_version()
    print(f"nvcc_version = {nvcc_version}")

    # CUDA 13.0+ (CCCL 3.0) changes the default -static-global-template-stub behavior
    # Restore old linking behavior to allow kernel symbols to be visible in shared libraries
    if nvcc_version >= 13.0:
        nvcc_compile_args += ["-static-global-template-stub=false"]

    if nvcc_version >= 12.0:
        sources += ["gpu_ops/sample_kernels/air_top_p_sampling.cu"]
    cc = max(sm_versions)
    print(f"cc = {cc}")
    fp8_auto_gen_directory = "gpu_ops/cutlass_kernels/fp8_gemm_fused/autogen"
    if os.path.isdir(fp8_auto_gen_directory):
        shutil.rmtree(fp8_auto_gen_directory)

    if cc >= 75:
        cc_compile_args += ["-DENABLE_SM75_EXT_OPS"]
        nvcc_compile_args += [
            "-DENABLE_SM75_EXT_OPS",
            "-DENABLE_SCALED_MM_C2X=1",
            "-Igpu_ops/cutlass_kernels/w8a8",
        ]
        sources += [
            "gpu_ops/cutlass_kernels/w8a8/scaled_mm_entry.cu",
            "gpu_ops/cutlass_kernels/w8a8/scaled_mm_c2x.cu",
            "gpu_ops/quantization/common.cu",
            # cpp_extensions.cc always registers these two ops; include their kernels on SM75 as well.
            "gpu_ops/moe/moe_deepgemm_permute.cu",
            "gpu_ops/moe/moe_deepgemm_depermute.cu",
        ]

    if cc >= 80:
        cc_compile_args += ["-DENABLE_SM80_EXT_OPS"]
        nvcc_compile_args += ["-DENABLE_SM80_EXT_OPS"]
        # append_attention
        os.system(
            "python utils/auto_gen_template_instantiation.py --config gpu_ops/append_attn/template_config.json --output gpu_ops/append_attn/template_instantiation/autogen"
        )
        sources += ["gpu_ops/append_attention.cu"]
        sources += find_end_files("gpu_ops/append_attn", ".cu")
        # sparse indexer
        sources += find_end_files("gpu_ops/sparse_indexer", ".cu")
        # mla
        sources += ["gpu_ops/multi_head_latent_attention.cu"]
        # gemm_dequant
        sources += ["gpu_ops/int8_gemm_with_cutlass/gemm_dequant.cu"]
        # speculate_decoding
        sources += find_end_files("gpu_ops/speculate_decoding", ".cu")
        sources += find_end_files("gpu_ops/speculate_decoding", ".cc")
        nvcc_compile_args += ["-DENABLE_BF16"]
        # moe
        os.system("python gpu_ops/moe/moe_wna16_marlin_utils/generate_kernels.py")
        os.system(
            "python utils/auto_gen_template_instantiation.py --config gpu_ops/moe/template_config.json --output gpu_ops/moe/template_instantiation/autogen"
        )
        sources += find_end_files("gpu_ops/cutlass_kernels/moe_gemm/", ".cu")
        sources += find_end_files("gpu_ops/cutlass_kernels/w4a8_moe/", ".cu")
        sources += find_end_files("gpu_ops/moe/", ".cu")
        nvcc_compile_args += ["-Igpu_ops/moe"]

if cc >= 89:
|
|
# Running generate fp8 gemm codes.
|
|
# Common for SM89, SM90, SM100 (Blackwell)
|
|
nvcc_compile_args += ["-DENABLE_FP8"]
|
|
nvcc_compile_args += ["-Igpu_ops/cutlass_kernels/fp8_gemm_fused/autogen"]
|
|
# This script seems general enough for different SM versions, specific templates are chosen by CUTLASS.
|
|
os.system("python utils/auto_gen_visitor_fp8_gemm_fused_kernels.py")
|
|
|
|
        if cc >= 90:  # Hopper and newer
            # SM90 (Hopper) specific auto-generation and flags
            if cc == 90:  # Only for SM90
                nvcc_compile_args += [
                    # The gencode for 90a is added in get_gencode_flags now
                    # "-gencode",
                    # "arch=compute_90a,code=compute_90a",
                    "-O3",
                    "-DNDEBUG",  # NDEBUG is common, consider moving if not specific to 90a
                ]
                print("SM90: Running SM90-specific FP8 kernel auto-generation.")
                os.system("python utils/auto_gen_fp8_fp8_gemm_fused_kernels_sm90.py")
                os.system("python utils/auto_gen_fp8_fp8_dual_gemm_fused_kernels_sm90.py")
                os.system("python utils/auto_gen_fp8_fp8_block_gemm_fused_kernels_sm90.py")

                nvcc_compile_args += [
                    "-DENABLE_SCALED_MM_SM90=1",
                ]
                sources += [
                    "gpu_ops/fp8_gemm_with_cutlass/fp8_fp8_half_block_gemm.cu",
                    "gpu_ops/cutlass_kernels/w8a8/scaled_mm_c3x_sm90.cu",
                    "gpu_ops/cutlass_kernels/w8a8/c3x/scaled_mm_sm90_fp8.cu",
                    "gpu_ops/cutlass_kernels/w8a8/c3x/scaled_mm_sm90_int8.cu",
                    "gpu_ops/cutlass_kernels/w8a8/c3x/scaled_mm_azp_sm90_int8.cu",
                ]
            elif cc == 100 and nvcc_version >= 12.9:  # Blackwell SM100 specifics
                print("SM100 (Blackwell): Applying SM100 configurations.")
                nvcc_compile_args += [
                    # The gencode for 100a is added in get_gencode_flags
                    # "-gencode",
                    # "arch=compute_100a,code=compute_100a",
                    "-O3",  # Common optimization flag
                    "-DNDEBUG",  # Common debug flag
                    # Potentially add -DENABLE_SM100_FEATURES if specific macros are identified
                ]
                # Placeholder for SM100-specific kernel auto-generation scripts
                # These might be needed if Blackwell has new FP8 hardware features
                # not covered by existing generic CUTLASS templates or SM90 scripts.
                # print("SM100: Running SM100-specific FP8 kernel auto-generation (if any).")
                # os.system("python utils/auto_gen_fp8_fp8_gemm_fused_kernels_sm100.py") # Example
                # os.system("python utils/auto_gen_fp8_fp8_dual_gemm_fused_kernels_sm100.py") # Example

                # Add SM100 specific sources if any, e.g., for new hardware intrinsics
                # sources += ["gpu_ops/cutlass_kernels/w8a8/c4x_sm100.cu"] # Example
                pass  # No SM100 specific sources identified yet beyond what CUTLASS handles
            else:  # cc >= 90 but neither SM90 nor SM100 with nvcc >= 12.9
                print(f"SM{cc}: Running generic FP8 kernel auto-generation.")
                os.system("python utils/auto_gen_fp8_fp8_gemm_fused_kernels.py")
                os.system("python utils/auto_gen_fp8_fp8_dual_gemm_fused_kernels.py")

        else:  # For cc == 89 (Ada)
            print("SM89: Running generic FP8 kernel auto-generation.")
            os.system("python utils/auto_gen_fp8_fp8_gemm_fused_kernels.py")
            os.system("python utils/auto_gen_fp8_fp8_dual_gemm_fused_kernels.py")

        # Common FP8 sources for SM89+
        sources += [
            "gpu_ops/fp8_gemm_with_cutlass/fp8_fp8_half_gemm.cu",
            "gpu_ops/fp8_gemm_with_cutlass/fp8_fp8_fp8_dual_gemm.cu",
            "gpu_ops/fp8_gemm_with_cutlass/fp8_fp8_half_cuda_core_gemm.cu",
            "gpu_ops/fp8_gemm_with_cutlass/per_channel_fp8_fp8_half_gemm.cu",
            "gpu_ops/cutlass_kernels/fp8_gemm_fused/visitor_fp8_gemm_fused.cu",
            "gpu_ops/scaled_gemm_f8_i4_f16_gemm.cu",
            "gpu_ops/scaled_gemm_f8_i4_f16_weight_quantize.cu",
            "gpu_ops/cutlass_kernels/cutlass_heuristic.cu",
            "gpu_ops/cutlass_kernels/cutlass_preprocessors.cu",
            "gpu_ops/fused_hadamard_quant_fp8.cu",
        ]

        sources += find_end_files(fp8_auto_gen_directory, ".cu")

    if cc >= 90 and nvcc_version >= 12.0:
        # Hopper optimized mla
        sources += find_end_files("gpu_ops/mla_attn", ".cu")
        sources += ["gpu_ops/flash_mask_attn/flash_mask_attn.cu"]
        cc_compile_args += ["-DENABLE_FLASH_MASK_ATTENTION"]
        sources += find_end_files("gpu_ops/moba_attn/moba_decoder_attn/", ".cu")
        sources += find_end_files("gpu_ops/moba_attn/moba_encoder_attn/", ".cu")
        sources += find_end_files("gpu_ops/moba_attn/moba_process/", ".cu")
        sources += ["gpu_ops/moba_attn/moba_attn.cu"]
        os.system("python utils/auto_gen_w4afp8_gemm_kernel.py")
        sources += find_end_files("gpu_ops/w4afp8_gemm", ".cu")
        os.system("python utils/auto_gen_wfp8afp8_sparse_gemm_kernel.py")
        sources += find_end_files("gpu_ops/wfp8afp8_sparse_gemm", ".cu")
        os.system("python gpu_ops/machete/generate.py")
        sources += find_end_files("gpu_ops/machete", ".cu")
        cc_compile_args += ["-DENABLE_MACHETE"]

    # Deduplicate translation units while preserving order. Some files are
    # appended explicitly for SM75 and also discovered by later directory globs.
    sources = list(dict.fromkeys(sources))

    setup(
        name="fastdeploy_ops",
        ext_modules=CUDAExtension(
            sources=sources,
            extra_compile_args={"cxx": cc_compile_args, "nvcc": nvcc_compile_args},
            libraries=["cublasLt"],
            extra_link_args=["-lcuda", "-lnvidia-ml"],
        ),
        packages=find_packages(where="third_party/DeepGEMM"),
        package_dir={"": "third_party/DeepGEMM"},
        package_data={
            "deep_gemm": [
                "include/deep_gemm/**/*",
                "include/cute/**/*",
                "include/cutlass/**/*",
            ]
        },
        include_package_data=True,
    )
elif paddle.is_compiled_with_xpu():
    assert False, "For XPU, please use setup_ops.py in the xpu_ops directory to compile custom ops."
elif paddle.is_compiled_with_custom_device("iluvatar_gpu"):
    _iluvatar_clang_cuda_flags = ["-Wno-non-pod-varargs", "-DPADDLE_DEV", "-DPADDLE_WITH_CUSTOM_DEVICE"]
    setup(
        name="fastdeploy_ops",
        ext_modules=CUDAExtension(
            extra_compile_args={
                "cxx": _iluvatar_clang_cuda_flags,
                "nvcc": _iluvatar_clang_cuda_flags,
            },
            sources=[
                "gpu_ops/save_with_output_msg.cc",
                "gpu_ops/get_output.cc",
                "gpu_ops/get_output_msg_with_topk.cc",
                "gpu_ops/save_output_msg_with_topk.cc",
                "gpu_ops/transfer_output.cc",
                "gpu_ops/get_padding_offset.cu",
                "gpu_ops/set_value_by_flags_and_idx.cu",
                "gpu_ops/rebuild_padding.cu",
                "gpu_ops/update_inputs.cu",
                "gpu_ops/stop_generation_multi_ends.cu",
                "gpu_ops/step.cu",
                "gpu_ops/token_penalty_multi_scores.cu",
                "gpu_ops/sample_kernels/rejection_top_p_sampling.cu",
                "gpu_ops/sample_kernels/top_k_renorm_probs.cu",
                "gpu_ops/text_image_index_out.cu",
                "gpu_ops/text_image_gather_scatter.cu",
                "gpu_ops/set_data_ipc.cu",
                "gpu_ops/limit_thinking_content_length.cu",
                "gpu_ops/recover_decode_task.cu",
                "gpu_ops/update_inputs_v1.cu",
                "gpu_ops/get_img_boundaries.cc",
                "gpu_ops/fused_neox_rope_embedding.cu",
                "gpu_ops/get_output_ep.cc",
                "iluvatar_ops/moe_dispatch.cu",
                "iluvatar_ops/moe_reduce.cu",
                "iluvatar_ops/flash_attn_unpadded.cu",
                "iluvatar_ops/paged_attn.cu",
                "iluvatar_ops/prefill_fused_attn.cu",
                "iluvatar_ops/mixed_fused_attn.cu",
                "iluvatar_ops/w8a16_group_gemm.cu",
                "iluvatar_ops/w8a16_group_gemv.cu",
                "iluvatar_ops/wi4a16_group_gemm.cu",
                "iluvatar_ops/wi4a16_weight_quantize.cu",
                "iluvatar_ops/restore_tokens_per_expert.cu",
                "iluvatar_ops/runtime/iluvatar_context.cc",
                "iluvatar_ops/cpp_extensions.cc",
            ],
            include_dirs=["iluvatar_ops/runtime", "gpu_ops"],
            extra_link_args=[
                "-lcuinfer",
            ],
        ),
    )
elif paddle.is_compiled_with_custom_device("gcu"):
    setup(
        name="fastdeploy_ops",
        ext_modules=CppExtension(
            sources=[
                "gpu_ops/save_with_output_msg.cc",
                "gpu_ops/get_output.cc",
                "gpu_ops/get_output_msg_with_topk.cc",
            ]
        ),
    )
elif paddle.device.is_compiled_with_custom_device("metax_gpu"):
    maca_path = os.getenv("MACA_PATH", "/opt/maca")
    sources = [
        "gpu_ops/update_inputs_v1.cu",
        "gpu_ops/save_with_output_msg.cc",
        "gpu_ops/get_output.cc",
        "gpu_ops/get_output_msg_with_topk.cc",
        "gpu_ops/save_output_msg_with_topk.cc",
        "gpu_ops/transfer_output.cc",
        "gpu_ops/save_with_output.cc",
        "gpu_ops/set_mask_value.cu",
        "gpu_ops/set_value_by_flags_and_idx.cu",
        "gpu_ops/ngram_mask.cu",
        "gpu_ops/gather_idx.cu",
        "gpu_ops/get_output_ep.cc",
        "gpu_ops/token_penalty_multi_scores.cu",
        "gpu_ops/token_penalty_only_once.cu",
        "gpu_ops/stop_generation.cu",
        "gpu_ops/stop_generation_multi_ends.cu",
        "gpu_ops/set_flags.cu",
        "gpu_ops/fused_get_rotary_embedding.cu",
        "gpu_ops/get_padding_offset.cu",
        "gpu_ops/update_inputs.cu",
        "gpu_ops/update_inputs_beam.cu",
        "gpu_ops/beam_search_softmax.cu",
        "gpu_ops/rebuild_padding.cu",
        "gpu_ops/step.cu",
        "gpu_ops/step_reschedule.cu",
        "gpu_ops/step_system_cache.cu",
        "gpu_ops/set_data_ipc.cu",
        "gpu_ops/read_data_ipc.cu",
        "gpu_ops/dequant_int8.cu",
        "gpu_ops/share_external_data.cu",
        "gpu_ops/recover_decode_task.cu",
        "gpu_ops/noaux_tc.cu",
        "gpu_ops/noaux_tc_redundant.cu",
        "gpu_ops/fused_rotary_position_encoding.cu",
        "gpu_ops/text_image_gather_scatter.cu",
        "gpu_ops/text_image_index_out.cu",
        "gpu_ops/get_position_ids_and_mask_encoder_batch.cu",
        "gpu_ops/limit_thinking_content_length.cu",
        "gpu_ops/update_attn_mask_offsets.cu",
        "gpu_ops/append_attn/mla_cache_kernel.cu",
        "gpu_ops/append_attn/get_block_shape_and_split_kv_block.cu",
        "gpu_ops/moe/tritonmoe_preprocess.cu",
        "gpu_ops/moe/moe_topk_select.cu",
        "gpu_ops/get_img_boundaries.cc",
        "gpu_ops/remote_cache_kv_ipc.cc",
        "gpu_ops/sample_kernels/rejection_top_p_sampling.cu",
        "gpu_ops/sample_kernels/top_k_renorm_probs.cu",
        "gpu_ops/sample_kernels/min_p_sampling_from_probs.cu",
        "gpu_ops/get_data_ptr_ipc.cu",
        "gpu_ops/ipc_sent_key_value_cache_by_remote_ptr.cu",
        "gpu_ops/unset_data_ipc.cu",
        "gpu_ops/swap_cache_batch.cu",
        "gpu_ops/gelu_tanh.cu",
        "metax_ops/moe_dispatch.cu",
        "metax_ops/moe_ffn.cu",
        "metax_ops/moe_reduce.cu",
        "metax_ops/fused_moe.cu",
        "metax_ops/cache_kv_with_rope.cu",
        "metax_ops/cpp_extensions.cc",
        "metax_ops/split_merge_qkv.cu",
    ]

    sources += find_end_files("gpu_ops/speculate_decoding", ".cu")
    sources += find_end_files("gpu_ops/speculate_decoding", ".cc")

    metax_extra_compile_args = {
        "cxx": ["-O3"],
        "nvcc": [
            "-O3",
            "-Ithird_party/nlohmann_json/include",
            "-Igpu_ops",
            "-DPADDLE_DEV",
            "-DPADDLE_WITH_CUSTOM_DEVICE_METAX_GPU",
            "-Xcompiler",
            "-Wno-non-pod-varargs",
        ],
    }

    def get_maca_version(version_file: str = "/opt/maca/Version.txt") -> list[int]:
        try:
            with open(version_file, "r", encoding="utf-8") as f:
                version_str = f.readline().strip()
            target_version = [int(part) for part in version_str.split(":")[1].split(".")]
        except Exception as e:
            print(f"Trigger exception: {type(e).__name__} - {e}")
            raise
        return target_version

    maca_version = get_maca_version(f"{maca_path}/Version.txt")
    if len(maca_version) == 4:
        major_version = maca_version[0]
        minor_version = maca_version[1]
        patch_version = maca_version[2]
        build_version = maca_version[3]

        cur_maca_version = (
            ((major_version & 0xFF) << 24)
            | ((minor_version & 0xFF) << 16)
            | ((patch_version & 0xFF) << 8)
            | ((build_version & 0xFF) << 0)
        )
        metax_extra_compile_args["nvcc"].append(f"-DMACA_VERSION={cur_maca_version}")
    else:
        raise ValueError(f"MACA version invalid - {maca_version}")

    setup(
        name="fastdeploy_ops",
        ext_modules=CUDAExtension(
            sources=sources,
            extra_compile_args=metax_extra_compile_args,
            library_dirs=[os.path.join(maca_path, "lib")],
            extra_link_args=["-lruntime_cu", "-lmctlassEx"],
            include_dirs=[
                os.path.join(maca_path, "include"),
                os.path.join(maca_path, "include/mcr"),
                os.path.join(maca_path, "include/common"),
                os.path.join(maca_path, "include/mcfft"),
                os.path.join(maca_path, "include/mcrand"),
                os.path.join(maca_path, "include/mcsparse"),
                os.path.join(maca_path, "include/mcblas"),
                os.path.join(maca_path, "include/mcsolver"),
            ],
        ),
    )
elif paddle.is_compiled_with_custom_device("intel_hpu"):
    setup(
        name="fastdeploy_ops",
        ext_modules=CppExtension(
            sources=[
                "gpu_ops/get_output.cc",
            ]
        ),
    )
else:
    use_bf16 = envs.FD_CPU_USE_BF16 == "True"

    # cc flags
    paddle_extra_compile_args = [
        "-std=c++17",
        "-shared",
        "-fPIC",
        "-Wno-parentheses",
        "-DPADDLE_WITH_CUSTOM_KERNEL",
        "-DPADDLE_ON_INFERENCE",
        "-Wall",
        "-O3",
        "-g",
        "-lstdc++fs",
        "-D_GLIBCXX_USE_CXX11_ABI=1",
        "-DPy_LIMITED_API=0x03090000",
    ]

    setup(
        name="fastdeploy_cpu_ops",
        ext_modules=CppExtension(
            sources=[
                "gpu_ops/save_with_output_msg.cc",
                "gpu_ops/get_output.cc",
                "gpu_ops/get_output_msg_with_topk.cc",
                "gpu_ops/save_output_msg_with_topk.cc",
                "gpu_ops/transfer_output.cc",
                "cpu_ops/rebuild_padding.cc",
                "cpu_ops/simd_sort.cc",
                "cpu_ops/set_value_by_flags.cc",
                "cpu_ops/token_penalty_multi_scores.cc",
                "cpu_ops/stop_generation_multi_ends.cc",
                "cpu_ops/update_inputs.cc",
                "cpu_ops/get_padding_offset.cc",
            ],
            extra_link_args=[
                "-Wl,-rpath,$ORIGIN/x86-simd-sort/builddir",
                "-Wl,-rpath,$ORIGIN/xFasterTransformer/build",
            ],
            extra_compile_args=paddle_extra_compile_args,
        ),
        packages=find_namespace_packages(where="third_party"),
        package_dir={"": "third_party"},
        include_package_data=True,
    )