[Docs] Update code overview documentation (#6568)

* [Docs] Update code overview documentation

- Add comprehensive FastDeploy code structure overview
- Include detailed module descriptions and development guides
- Add quick development guide for common tasks
- Update both English and Chinese versions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [Docs] Update code overview documentation format

- Convert file path links from [file](path) to `file` inline code format
- Add proper spacing for better readability in markdown tables
- Maintain consistent formatting across English and Chinese docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
kevin
2026-02-28 16:37:01 +08:00
committed by GitHub
parent 5d42f19e0a
commit fa21fd95c4
2 changed files with 891 additions and 46 deletions
+445 -23
@@ -1,27 +1,449 @@
[简体中文](../zh/usage/code_overview.md)
# FastDeploy Code Structure Overview
This document provides a detailed overview of the FastDeploy codebase structure, helping developers quickly understand each module's functionality for development and feature extension.
---
## Directory Overview
```
FastDeploy/
├── fastdeploy/ # Core code directory
├── custom_ops/ # C++/CUDA custom operators
├── tests/ # Unit tests
├── scripts/ # Utility scripts
├── tools/ # Development tools
├── docs/ # Documentation
├── examples/ # Example code
├── benchmarks/ # Performance benchmarks
├── dockerfiles/ # Docker image build files
└── setup.py # Python package installation script
```
---
## I. Core Code Directory (fastdeploy/)
The main entry file `fastdeploy/__init__.py` exports core classes:
- `LLM` - Main entry class, offline inference interface
- `SamplingParams` - Sampling parameter configuration
- `ModelRegistry` - Model registry
- `version` - Version information
### 1. engine/ - Core Engine Module
**Function**: Manages LLM inference lifecycle and coordinates components.
| File | Function | Development Guide |
|------|----------|-------------------|
| `engine.py` | `LLMEngine` core engine class, manages scheduler, preprocessor, resource manager | Entry point for modifying engine behavior, adding new components |
| `async_llm.py` | Async LLM interface, `AsyncRequestQueue` request queue management | Async inference, streaming output development |
| `request.py` | Core request data structures: `Request`, `RequestOutput`, `RequestStatus` | Adding request fields, modifying request processing logic |
| `sampling_params.py` | `SamplingParams` sampling parameter configuration | Adding new sampling strategy parameters |
| `args_utils.py` | `EngineArgs` engine argument parsing | Adding new engine configuration parameters |
| `resource_manager.py` | GPU/CPU resource management | Resource allocation optimization |
**Subdirectory**:
- `sched/` - Core scheduling implementation, contains `resource_manager_v1.py` (**core scheduling logic**)
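As a sketch of how the request lifecycle pieces in `engine/request.py` fit together, the following uses the `Request` and `RequestStatus` names from the table above, but the field names and transition rules here are illustrative, not the actual definitions:

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class RequestStatus(Enum):
    """Lifecycle states a request moves through inside the engine."""
    WAITING = auto()    # queued, not yet scheduled
    RUNNING = auto()    # tokens being generated
    PREEMPTED = auto()  # evicted by the scheduler, to be resumed later
    FINISHED = auto()   # hit EOS or the length limit


@dataclass
class Request:
    request_id: str
    prompt_token_ids: list[int]
    max_tokens: int = 128
    output_token_ids: list[int] = field(default_factory=list)
    status: RequestStatus = RequestStatus.WAITING

    def append_token(self, token_id: int, eos_id: int) -> None:
        """Record one generated token and finish on EOS or length limit."""
        self.output_token_ids.append(token_id)
        if token_id == eos_id or len(self.output_token_ids) >= self.max_tokens:
            self.status = RequestStatus.FINISHED


req = Request(request_id="req-0", prompt_token_ids=[1, 2, 3], max_tokens=8)
req.status = RequestStatus.RUNNING
for tok in (10, 11, 2):          # assume 2 is the EOS id
    req.append_token(tok, eos_id=2)
print(req.status.name)           # FINISHED
```

Adding a request field (a common task per the table) then means extending this dataclass and threading the field through the scheduler and output processing.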
---
### 2. model_executor/ - Model Executor
**Function**: Core execution module for model inference, containing model definitions, layers, operators.
#### 2.1 models/ - Model Implementations
| File/Directory | Function | Development Guide |
|----------------|----------|-------------------|
| `model_base.py` | `ModelRegistry` model registration base class | **Must read for adding new models** |
| `deepseek_v3.py` | DeepSeek V3 model | MoE large model reference |
| `ernie4_5_moe.py` | ERNIE 4.5 MoE model | Baidu's flagship model |
| `ernie4_5_mtp.py` | ERNIE 4.5 MTP multi-token prediction | Speculative decoding model |
| `qwen2.py` | Qwen2 model | General model reference |
| `qwen3.py` | Qwen3 model | Latest model reference |
| `ernie4_5_vl/` | ERNIE 4.5 vision-language model | Multimodal model development reference |
| `qwen2_5_vl/` | Qwen2.5 VL multimodal model | VL model reference |
| `paddleocr_vl/` | PaddleOCR VL model | OCR multimodal reference |
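The registration pattern behind `model_base.py` can be sketched as follows. The real `ModelRegistry` API may differ in its details; the core idea is a name-to-class mapping that each new model file populates at import time:

```python
class ModelRegistry:
    """Minimal registry sketch: maps an architecture name to a model class."""
    _models: dict[str, type] = {}

    @classmethod
    def register(cls, arch: str):
        """Class decorator that records a model under its architecture name."""
        def deco(model_cls: type) -> type:
            cls._models[arch] = model_cls
            return model_cls
        return deco

    @classmethod
    def resolve(cls, arch: str) -> type:
        try:
            return cls._models[arch]
        except KeyError:
            raise ValueError(f"unknown architecture: {arch}") from None


@ModelRegistry.register("Qwen2ForCausalLM")
class Qwen2Model:
    pass


print(ModelRegistry.resolve("Qwen2ForCausalLM").__name__)  # Qwen2Model
```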
#### 2.2 layers/ - Network Layer Implementations
| Subdirectory/File | Function | Development Guide |
|-------------------|----------|-------------------|
| `attention/` | Attention mechanism implementations (flash_attn, append_attn, mla_attn) | **First choice for attention performance optimization** |
| `moe/` | MoE layer implementations (Cutlass, Triton, DeepGEMM backends) | MoE performance optimization |
| `quantization/` | Quantization layers (FP8, W4A8, WINT2, Weight-only) | Quantization scheme development |
| `linear.py` | Linear layer implementation | Matrix multiplication optimization |
| `embeddings.py` | Embedding layer implementation | Word embedding modification |
| `normalization.py` | Normalization layers (RMSNorm, LayerNorm) | Normalization optimization |
| `rotary_embedding.py` | Rotary Position Encoding ROPE | Position encoding modification |
| `sample/` | Sampler implementation | Sampling strategy development |
| `backends/` | Hardware backend implementations (cuda, xpu, dcu, hpu, metax, gcu, npu) | **Entry point for new hardware adaptation** |
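For intuition on what `normalization.py` implements, here is RMSNorm in plain scalar Python (the real layer operates on Paddle tensors and fused kernels):

```python
import math


def rms_norm(x: list[float], weight: list[float], eps: float = 1e-6) -> list[float]:
    """RMSNorm: divide x by the root-mean-square of its elements, then scale."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]


out = rms_norm([1.0, 2.0, 2.0], [1.0, 1.0, 1.0])
print([round(v, 4) for v in out])  # [0.5774, 1.1547, 1.1547]
```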
#### 2.3 Other Submodules
| Directory | Function | Development Guide |
|-----------|----------|-------------------|
| `model_loader/` | Model weight loader | New model format support |
| `guided_decoding/` | Guided decoding (JSON/regex constrained output) | Structured output development |
| `graph_optimization/` | Graph optimization (CUDA Graph) | Inference performance optimization |
| `logits_processor/` | Logits processor | Output control logic |
| `ops/` | Python-callable operators (organized by hardware platform) | Operator call entry point |
**Key Files**:
- `model_base.py` - Model base class, registry definition
- `pre_and_post_process.py` - Pre/post processing utilities
---
### 3. scheduler/ - Scheduler Module
**Function**: Request scheduling, supporting single-node, distributed, PD disaggregation scenarios.
> **Note**:
> - Core scheduling logic is mainly implemented in `engine/sched/resource_manager_v1.py`
> - Schedulers in this directory are being **gradually deprecated**. For PD disaggregation scheduling, use `router/` or `golang_router/` instead
| File | Function | Development Guide |
|------|----------|-------------------|
| `global_scheduler.py` | `GlobalScheduler` distributed scheduler (Redis) | (Being deprecated) |
| `local_scheduler.py` | `LocalScheduler` local scheduler | (Being deprecated) |
| `splitwise_scheduler.py` | `SplitwiseScheduler` PD disaggregation scheduling | (Being deprecated, use router) |
| `dp_scheduler.py` | Data parallel scheduler | (Being deprecated) |
| `config.py` | `SchedulerConfig` scheduling configuration | Scheduling parameter adjustment |
| `storage.py` | Storage adapter, wraps Redis connection | Storage layer modification |
**Core Scheduling Implementation** (`engine/sched/`):
| File | Function | Development Guide |
|------|----------|-------------------|
| `resource_manager_v1.py` | Core scheduling logic, contains `ScheduledDecodeTask`, `ScheduledPreemptTask` task classes | **First choice for scheduling strategy modification** |
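The kind of budget-bounded FCFS admission performed in `resource_manager_v1.py` can be sketched as below; the budget and limit values are illustrative, and the real implementation additionally handles preemption and KV block accounting:

```python
def schedule_step(waiting, running, token_budget: int, max_running: int):
    """One scheduling tick: admit waiting requests FCFS while the
    per-step token budget and concurrency limit allow it."""
    admitted = []
    used = sum(tokens for _, tokens in running)
    for req_id, tokens in list(waiting):
        if len(running) + len(admitted) >= max_running:
            break  # concurrency limit reached
        if used + tokens > token_budget:
            break  # this request would overflow the token budget
        admitted.append((req_id, tokens))
        used += tokens
        waiting.remove((req_id, tokens))
    return admitted


waiting = [("a", 600), ("b", 300), ("c", 500)]
admitted = schedule_step(waiting, running=[], token_budget=1000, max_running=8)
print([r for r, _ in admitted])  # ['a', 'b'] — 'c' exceeds the budget
```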
---
### 4. entrypoints/ - API Entry Points
**Function**: External service interfaces, including offline inference and online API services.
| File | Function | Development Guide |
|------|----------|-------------------|
| `llm.py` | `LLM` main entry class, offline inference interface | **Entry point for using FastDeploy** |
| `engine_client.py` | Engine client | Request forwarding logic modification |
#### 4.1 openai/ - OpenAI Compatible API
| File | Function | Development Guide |
|------|----------|-------------------|
| `api_server.py` | FastAPI server | **Deployment service entry point** |
| `protocol.py` | OpenAI protocol definition | API format modification |
| `serving_chat.py` | Chat Completion API | Chat interface development |
| `serving_completion.py` | Completion API | Completion interface development |
| `serving_embedding.py` | Embedding API | Vectorization interface |
| `tool_parsers/` | Tool call parsers | Function Calling development |
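A request to the Chat Completion service defined in `serving_chat.py` follows the standard OpenAI schema. The body below is a minimal example; the model name, host, and port are placeholders for whatever you deploy, and it can be sent with any HTTP client to `/v1/chat/completions`:

```python
import json

# Standard OpenAI-style chat request body; "default" is a placeholder model name.
payload = {
    "model": "default",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Introduce FastDeploy in one sentence."},
    ],
    "stream": False,
    "temperature": 0.7,
    "max_tokens": 128,
}
body = json.dumps(payload)
print(len(json.loads(body)["messages"]))  # 2
```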
---
### 5. worker/ - Worker Process Module
**Function**: Actual execution process for model inference.
| File | Function | Development Guide |
|------|----------|-------------------|
| `gpu_model_runner.py` | **GPU model runner** (core inference loop) | **First choice for inference flow modification** |
| `gpu_worker.py` | GPU Worker process management | Worker lifecycle management |
| `xpu_model_runner.py` | XPU model runner | Kunlun chip adaptation |
| `hpu_model_runner.py` | HPU model runner | Intel HPU adaptation |
| `worker_process.py` | Worker process base class | Process management logic |
---
### 6. input/ - Input Processing Module
**Function**: Input data preprocessing, including tokenization, multimodal input processing.
| File | Function | Development Guide |
|------|----------|-------------------|
| `text_processor.py` | `BaseDataProcessor` text processor base class | Input processing extension |
| `ernie4_5_processor.py` | ERNIE 4.5 input processor | Baidu model input processing |
| `ernie4_5_tokenizer.py` | ERNIE 4.5 tokenizer | Tokenization logic modification |
| `preprocess.py` | Input preprocessing utilities | Preprocessing flow |
**Multimodal Processing Subdirectories**:
| Directory | Function |
|-----------|----------|
| `ernie4_5_vl_processor/` | ERNIE 4.5 VL image/video processing |
| `qwen_vl_processor/` | Qwen VL multimodal processing |
| `paddleocr_vl_processor/` | PaddleOCR VL processing |
---
### 7. output/ - Output Processing Module
**Function**: Inference result post-processing, streaming output management.
| File | Function | Development Guide |
|------|----------|-------------------|
| `token_processor.py` | `TokenProcessor` token output processing | Streaming output, speculative decoding |
| `pooler.py` | Pooling output processing | Embedding output |
| `stream_transfer_data.py` | Streaming transfer data structure | Data transfer format |
---
### 8. cache_manager/ - Cache Management Module
**Function**: KV Cache management, supporting prefix caching, cross-device transfer.
| File | Function | Development Guide |
|------|----------|-------------------|
| `prefix_cache_manager.py` | `PrefixCacheManager` prefix tree cache | **First choice for KV Cache optimization** |
| `cache_transfer_manager.py` | KV Cache cross-device transfer | PD disaggregation cache transfer |
| `cache_data.py` | `BlockNode`, `CacheStatus` data structures | Cache data definition |
| `multimodal_cache_manager.py` | Multimodal cache management | Multimodal caching |
**Subdirectory**:
- `transfer_factory/` - Cache transfer factory (IPC, RDMA)
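Prefix caching as in `prefix_cache_manager.py` is essentially a trie of token blocks: a new request walks the tree to find how many leading KV blocks it can reuse. A minimal sketch (the `BlockNode` fields here are illustrative, not the ones in `cache_data.py`):

```python
class BlockNode:
    """One node per cached token block; children keyed by block content."""
    def __init__(self):
        self.children: dict[tuple, "BlockNode"] = {}


def insert(root: BlockNode, blocks: list[tuple]) -> None:
    """Add a request's token blocks to the prefix tree."""
    node = root
    for blk in blocks:
        node = node.children.setdefault(blk, BlockNode())


def match_prefix(root: BlockNode, blocks: list[tuple]) -> int:
    """Return how many leading blocks are already cached."""
    node, hits = root, 0
    for blk in blocks:
        if blk not in node.children:
            break
        node = node.children[blk]
        hits += 1
    return hits


root = BlockNode()
insert(root, [(1, 2), (3, 4), (5, 6)])               # cache request A's blocks
print(match_prefix(root, [(1, 2), (3, 4), (9, 9)]))  # 2 blocks reusable
```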
---
### 9. platforms/ - Hardware Platform Support
**Function**: Multi-hardware platform adaptation, defining operators and features for each platform.
| File | Function | Development Guide |
|------|----------|-------------------|
| `base.py` | `Platform` base class, `_Backend` enum | **Entry point for new hardware adaptation** |
| `cuda.py` | NVIDIA CUDA platform | GPU optimization |
| `xpu.py` | Baidu Kunlun XPU platform | Kunlun chip adaptation |
| `dcu.py` | AMD DCU (ROCm) platform | AMD GPU adaptation |
| `maca.py` | MetaX GPU (MACA) platform | MetaX GPU adaptation |
| `intel_hpu.py` | Intel HPU platform | Intel Gaudi adaptation |
| `iluvatar.py` | Iluvatar GPU platform | Iluvatar adaptation |
---
### 10. metrics/ - Monitoring Metrics Module
**Function**: Prometheus metric collection, performance monitoring.
| File | Function | Development Guide |
|------|----------|-------------------|
| `metrics.py` | Prometheus metric definition | Adding new monitoring metrics |
| `stats.py` | ZMQ metric statistics | Distributed monitoring |
| `trace_util.py` | OpenTelemetry distributed tracing | Request tracing |
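To illustrate the exposition side of `metrics.py`, here is a toy labeled counter that renders the Prometheus text format. This is a stand-in for intuition only, not the actual implementation, which would build on a Prometheus client library:

```python
class Counter:
    """Tiny stand-in for a Prometheus counter with one label dimension."""
    def __init__(self, name: str, label: str):
        self.name, self.label = name, label
        self.values: dict[str, float] = {}

    def inc(self, label_value: str, amount: float = 1.0) -> None:
        self.values[label_value] = self.values.get(label_value, 0.0) + amount

    def expose(self) -> str:
        """Render samples in the Prometheus text exposition format."""
        return "\n".join(
            f'{self.name}{{{self.label}="{v}"}} {n}'
            for v, n in sorted(self.values.items())
        )


requests_total = Counter("fd_requests_total", "status")  # hypothetical metric name
requests_total.inc("success")
requests_total.inc("success")
requests_total.inc("error")
print(requests_total.expose())
```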
---
### 11. Other Important Modules
| Directory | Function | Development Guide |
|-----------|----------|-------------------|
| `inter_communicator/` | Inter-process communication (ZMQ) | Engine-Worker communication modification |
| `spec_decode/` | Speculative decoding (MTP, N-gram) | Speculative decoding strategy development |
| `distributed/` | Distributed communication (AllReduce) | Distributed inference development |
| `multimodal/` | Multimodal data processing | Multimodal feature extension |
| `reasoning/` | Reasoning mode parsing (DeepSeek R1 style) | Chain-of-thought parsing |
| `router/` | Request router, **recommended for PD disaggregation** | **First choice for PD disaggregation deployment** |
| `golang_router/` | Router implemented in Go, with better scheduling performance between prefill and decode instances | **High-performance PD disaggregation scenarios** |
| `eplb/` | Expert Parallel load balancing | MoE load balancing |
| `rl/` | Reinforcement learning Rollout | RLHF scenarios |
| `plugins/` | Plugin system | Custom extensions |
| `logger/` | Logging module | Log format modification |
| `trace/` | Tracing module | Performance analysis |
---
### 12. Configuration Files
| File | Function | Development Guide |
|------|----------|-------------------|
| `config.py` | `FDConfig` main configuration class | **Entry point for configuration parameter modification** |
| `envs.py` | Environment variable configuration | Adding new environment variables |
| `utils.py` | General utility functions | Utility function reuse |
---
## II. Custom Operators Directory (custom_ops/)
**Function**: C++/CUDA high-performance operator implementations, organized by hardware platform.
```
custom_ops/
├── gpu_ops/ # NVIDIA GPU operators (main)
├── cpu_ops/ # CPU operators
├── xpu_ops/ # Baidu Kunlun XPU operators
├── iluvatar_ops/ # Iluvatar GPU operators
├── metax_ops/ # MetaX GPU operators
├── utils/ # Common utilities
└── third_party/ # Third-party libraries (cutlass, DeepGEMM)
```
### gpu_ops/ - GPU Operator Details
| Directory/File | Function | Development Guide |
|----------------|----------|-------------------|
| `append_attn/` | Append Attention implementation | **First choice for attention optimization** |
| `moe/` | MoE operators (fused_moe, expert_dispatch) | MoE performance optimization |
| `flash_mask_attn/` | Flash Mask Attention | Attention mask optimization |
| `mla_attn/` | Multi-Head Latent Attention | MLA model support |
| `machete/` | Machete GEMM | Matrix multiplication optimization |
| `quantization/` | Quantization operators | Quantization performance optimization |
| `sample_kernels/` | Sampling operators | Sampling performance optimization |
| `speculate_decoding/` | Speculative decoding operators | Speculative decoding optimization |
| `cutlass_kernels/` | CUTLASS kernels | High-performance GEMM |
| `cpp_extensions.cc` | C++ extension entry | **Entry point for new operator registration** |
| `append_attention.cu` | Append Attention core | Attention core implementation |
**Key Operator Files**:
- `fused_rotary_position_encoding.cu` - Fused rotary position encoding
- `multi_head_latent_attention.cu` - MLA attention
- `per_token_quant_fp8.cu` - FP8 quantization
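The math behind `per_token_quant_fp8.cu` — one scale per token row so that each row's maximum magnitude maps onto the FP8 E4M3 range — looks like this in scalar Python (the CUDA kernel additionally rounds values to the e4m3 grid and processes rows in parallel):

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3


def per_token_quant_fp8(row: list[float]):
    """Scale one token's activations so its absolute max maps to the FP8 range.
    Returns (quantized_row, scale); dequantize with q * scale."""
    amax = max(abs(v) for v in row) or 1e-4  # avoid a zero scale
    scale = amax / FP8_E4M3_MAX
    return [round(v / scale) for v in row], scale


q, scale = per_token_quant_fp8([0.5, -2.0, 1.0])
print(q)                       # [112, -448, 224]
print(round(q[1] * scale, 4))  # ≈ -2.0 after dequantization
```

Per-token (rather than per-tensor) scales keep outlier tokens from crushing the precision of everything else in the batch.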
---
## III. Test Directory (tests/)
**Function**: Unit tests and end-to-end tests, organized by module.
```
tests/
├── e2e/ # End-to-end service tests
├── operators/ # Operator unit tests
├── model_executor/ # Model executor tests
├── model_loader/ # Model loading tests
├── layers/ # Network layer tests
├── scheduler/ # Scheduler tests
├── cache_manager/ # Cache management tests
├── entrypoints/ # API entry tests
├── input/ # Input processing tests
├── output/ # Output processing tests
├── metrics/ # Metric tests
├── distributed/ # Distributed tests
├── graph_optimization/ # Graph optimization tests
├── quantization/ # Quantization tests
├── multimodal/ # Multimodal tests
├── xpu_ci/ # XPU CI tests
├── ce/ # CE environment tests
├── ci_use/ # CI utility tests
└── conftest.py # pytest configuration
```
### Test Directory Details
| Directory | Content | Development Guide |
|-----------|---------|-------------------|
| `e2e/` | Complete service tests for each model (ERNIE, Qwen, DeepSeek, etc.) | **Service integration testing** |
| `operators/` | Operator unit tests (`test_fused_moe.py`, `test_flash_mask_attn.py`, etc.) | **Required tests for operator development** |
| `layers/` | Network layer tests (attention, moe, quantization) | Network layer testing |
| `model_executor/` | Model execution flow tests | Model execution testing |
| `scheduler/` | Scheduler function tests | Scheduling logic verification |
| `cache_manager/` | Cache management tests | Cache logic verification |
---
## IV. Scripts Directory (scripts/)
**Function**: CI/CD, performance tuning, utility scripts.
| File | Function | Usage Scenario |
|------|----------|----------------|
| `run_unittest.sh` | Unit test runner | Local testing |
| `run_ci_xpu.sh` | XPU CI runner | Kunlun CI |
| `run_ci_hpu.sh` | HPU CI runner | Intel HPU CI |
| `run_ci_dcu.sh` | DCU CI runner | AMD DCU CI |
| `coverage_run.sh` | Code coverage statistics | Code quality |
| `tune_cublaslt_int8_gemm.py` | cuBLASLt INT8 GEMM tuning | Performance tuning |
| `tune_cutlass_fp8_gemm.py` | CUTLASS FP8 GEMM tuning | Performance tuning |
| `offline_w4a8.py` | Offline W4A8 quantization tool | Model quantization |
| `extract_mtp_weight_from_safetensor.py` | MTP weight extraction | Model processing |
---
## V. Other Directories
### docs/ - Documentation
- Usage documentation, API documentation, architecture design documents
### examples/ - Example Code
- Model usage examples, deployment examples
### benchmarks/ - Performance Benchmarks
- Performance test scripts, benchmark data
### tools/ - Development Tools
- `codestyle/` - Code style checking tools
- `dockerfile/` - Docker build tools
### dockerfiles/ - Docker Images
- Dockerfiles for each platform runtime environment
---
## VI. Quick Development Guide
### Adding a New Model
1. Reference `models/model_base.py` to understand model registration mechanism
2. Create new model file under `models/`
3. Add corresponding input processor under `input/`
4. Add tests under `tests/model_executor/`
### Adding a New Operator
1. Implement CUDA operator under `custom_ops/gpu_ops/`
2. Register operator in `cpp_extensions.cc`
3. Add Python wrapper under `model_executor/ops/gpu/`
4. Add tests under `tests/operators/`
### New Hardware Platform Adaptation
1. Reference `platforms/base.py` to create new platform class
2. Create hardware operator directory under `custom_ops/`
3. Create backend implementation under `model_executor/layers/backends/`
4. Create model runner under `worker/`
### Optimizing Inference Performance
1. Attention optimization: `custom_ops/gpu_ops/append_attn/`
2. MoE optimization: `custom_ops/gpu_ops/moe/`
3. Graph optimization: `fastdeploy/model_executor/graph_optimization/`
### PD Disaggregation Deployment
1. Router: `router/router.py` (Python implementation, recommended)
2. High-performance router: `golang_router/` (Go implementation, better scheduling performance between prefill and decode instances)
3. Cache transfer: `cache_manager/cache_transfer_manager.py`
---
## VII. Configuration System
```
FDConfig (config.py)
├── ModelConfig # Model configuration
├── CacheConfig # Cache configuration
├── ParallelConfig # Parallel configuration
├── SchedulerConfig # Scheduler configuration
├── LoRAConfig # LoRA configuration
└── ...
Environment Variable Configuration (envs.py)
├── FD_* series environment variables
└── Runtime behavior control
```
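The composition above can be sketched as nested dataclasses, with `envs.py`-style `FD_*` flags read from the environment. All field names and the `FD_DEBUG` variable here are illustrative; the real `FDConfig` defines many more options:

```python
import os
from dataclasses import dataclass, field


@dataclass
class ModelConfig:
    model: str = "default"
    max_model_len: int = 8192


@dataclass
class CacheConfig:
    block_size: int = 64
    gpu_memory_utilization: float = 0.9


@dataclass
class FDConfig:
    """Top-level config composed of per-concern sub-configs."""
    model_config: ModelConfig = field(default_factory=ModelConfig)
    cache_config: CacheConfig = field(default_factory=CacheConfig)


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean FD_* environment variable."""
    return os.environ.get(name, str(int(default))) in ("1", "true", "True")


os.environ["FD_DEBUG"] = "1"  # hypothetical flag, for illustration
cfg = FDConfig()
print(cfg.cache_config.block_size, env_flag("FD_DEBUG"))  # 64 True
```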
---
This document covers the main modules and key files of the FastDeploy codebase. It can be used as a code navigation and development reference. For questions, please refer to detailed documentation of each module or source code comments.
+446 -23
@@ -1,26 +1,449 @@
[English](../../usage/code_overview.md)
# FastDeploy 代码结构详解
本文档详细介绍 FastDeploy 代码库的结构,帮助开发者快速了解各模块功能,便于开发和新功能扩展
---
## 目录结构总览
```
FastDeploy/
├── fastdeploy/ # 核心代码目录
├── custom_ops/ # C++/CUDA 自定义算子
├── tests/ # 单元测试
├── scripts/ # 工具脚本
├── tools/ # 开发工具
├── docs/ # 文档
├── examples/ # 示例代码
├── benchmarks/ # 性能基准测试
├── dockerfiles/ # Docker 镜像构建文件
└── setup.py # Python 包安装脚本
```
---
## 一、核心代码目录 (fastdeploy/)
主入口文件 `fastdeploy/__init__.py` 导出核心类:
- `LLM` - 主入口类,离线推理接口
- `SamplingParams` - 采样参数配置
- `ModelRegistry` - 模型注册器
- `version` - 版本信息
### 1. engine/ - 核心引擎模块
**功能**: 管理 LLM 推理的生命周期,协调各组件工作。
| 文件 | 功能 | 开发指引 |
|------|------|----------|
| `engine.py` | `LLMEngine` 核心引擎类,管理调度器、预处理器、资源管理器 | 修改引擎行为、添加新组件的入口 |
| `async_llm.py` | 异步 LLM 接口,`AsyncRequestQueue` 请求队列管理 | 异步推理、流式输出相关开发 |
| `request.py` | 请求核心数据结构:`Request`, `RequestOutput`, `RequestStatus` | 新增请求字段、修改请求处理逻辑 |
| `sampling_params.py` | `SamplingParams` 采样参数配置 | 添加新采样策略参数 |
| `args_utils.py` | `EngineArgs` 引擎参数解析 | 新增引擎配置参数 |
| `resource_manager.py` | GPU/CPU 资源管理 | 资源分配优化 |
**子目录**:
- `sched/` - 调度核心实现,包含 `resource_manager_v1.py` (**核心调度逻辑所在**)
---
### 2. model_executor/ - 模型执行器
**功能**: 模型推理的核心执行模块,包含模型定义、网络层、算子等。
#### 2.1 models/ - 模型实现
| 文件/目录 | 功能 | 开发指引 |
|-----------|------|----------|
| `model_base.py` | `ModelRegistry` 模型注册基类 | **添加新模型必看** |
| `deepseek_v3.py` | DeepSeek V3 模型 | MoE 大模型参考 |
| `ernie4_5_moe.py` | ERNIE 4.5 MoE 模型 | 百度主力模型 |
| `ernie4_5_mtp.py` | ERNIE 4.5 MTP 多token预测 | 推测解码模型 |
| `qwen2.py` | Qwen2 模型 | 通用模型参考 |
| `qwen3.py` | Qwen3 模型 | 最新模型参考 |
| `ernie4_5_vl/` | ERNIE 4.5 视觉语言模型 | 多模态模型开发参考 |
| `qwen2_5_vl/` | Qwen2.5 VL 多模态模型 | VL 模型参考 |
| `paddleocr_vl/` | PaddleOCR VL 模型 | OCR 多模态参考 |
#### 2.2 layers/ - 网络层实现
| 子目录/文件 | 功能 | 开发指引 |
|-------------|------|----------|
| `attention/` | 注意力机制实现 (flash_attn, append_attn, mla_attn) | **优化注意力性能首选** |
| `moe/` | MoE 层实现 (Cutlass, Triton, DeepGEMM 后端) | MoE 性能优化 |
| `quantization/` | 量化层 (FP8, W4A8, WINT2, Weight-only) | 量化方案开发 |
| `linear.py` | 线性层实现 | 矩阵乘法优化 |
| `embeddings.py` | 嵌入层实现 | 词嵌入修改 |
| `normalization.py` | 归一化层 (RMSNorm, LayerNorm) | 归一化优化 |
| `rotary_embedding.py` | 旋转位置编码 ROPE | 位置编码修改 |
| `sample/` | 采样器实现 | 采样策略开发 |
| `backends/` | 硬件后端实现 (cuda, xpu, dcu, hpu, metax, gcu, npu) | **新硬件适配入口** |
#### 2.3 其他子模块
| 目录 | 功能 | 开发指引 |
|------|------|----------|
| `model_loader/` | 模型权重加载器 | 新模型格式支持 |
| `guided_decoding/` | 引导解码 (JSON/regex 约束输出) | 结构化输出开发 |
| `graph_optimization/` | 图优化 (CUDA Graph) | 推理性能优化 |
| `logits_processor/` | Logits 处理器 | 输出控制逻辑 |
| `ops/` | Python 可调用算子 (按硬件平台组织) | 算子调用入口 |
**关键文件**:
- `model_base.py` - 模型基类、注册器定义
- `pre_and_post_process.py` - 前后处理工具
---
### 3. scheduler/ - 调度器模块
**功能**: 请求调度,支持单机、分布式、PD 分离等场景。
> **注意**:
> - 核心调度逻辑主要在 `engine/sched/resource_manager_v1.py` 中实现
> - 本目录下的 scheduler 正在**逐步废弃**，PD 分离调度请使用 `router/` 或 `golang_router/`
| 文件 | 功能 | 开发指引 |
|------|------|----------|
| `global_scheduler.py` | `GlobalScheduler` 分布式调度器 (Redis) | (逐步废弃) |
| `local_scheduler.py` | `LocalScheduler` 本地调度器 | (逐步废弃) |
| `splitwise_scheduler.py` | `SplitwiseScheduler` PD 分离调度 | (逐步废弃,请使用 router) |
| `dp_scheduler.py` | 数据并行调度器 | (逐步废弃) |
| `config.py` | `SchedulerConfig` 调度配置 | 调度参数调整 |
| `storage.py` | 存储适配器,封装 Redis 连接 | 存储层修改 |
**调度核心实现** (`engine/sched/`):
| 文件 | 功能 | 开发指引 |
|------|------|----------|
| `resource_manager_v1.py` | 核心调度逻辑，包含 `ScheduledDecodeTask`、`ScheduledPreemptTask` 等任务类 | **调度策略修改首选** |
---
### 4. entrypoints/ - API 入口
**功能**: 对外服务接口,包括离线推理和在线 API 服务。
| 文件 | 功能 | 开发指引 |
|------|------|----------|
| `llm.py` | `LLM` 主入口类,离线推理接口 | **使用 FastDeploy 入口** |
| `engine_client.py` | 引擎客户端 | 请求转发逻辑修改 |
#### 4.1 openai/ - OpenAI 兼容 API
| 文件 | 功能 | 开发指引 |
|------|------|----------|
| `api_server.py` | FastAPI 服务器 | **部署服务入口** |
| `protocol.py` | OpenAI 协议定义 | API 格式修改 |
| `serving_chat.py` | Chat Completion API | 聊天接口开发 |
| `serving_completion.py` | Completion API | 补全接口开发 |
| `serving_embedding.py` | Embedding API | 向量化接口 |
| `tool_parsers/` | 工具调用解析器 | Function Calling 开发 |
---
### 5. worker/ - Worker 进程模块
**功能**: 模型推理的实际执行进程。
| 文件 | 功能 | 开发指引 |
|------|------|----------|
| `gpu_model_runner.py` | **GPU 模型运行器** (核心推理循环) | **推理流程修改首选** |
| `gpu_worker.py` | GPU Worker 进程管理 | Worker 生命周期管理 |
| `xpu_model_runner.py` | XPU 模型运行器 | 昆仑芯片适配 |
| `hpu_model_runner.py` | HPU 模型运行器 | Intel HPU 适配 |
| `worker_process.py` | Worker 进程基类 | 进程管理逻辑 |
---
### 6. input/ - 输入处理模块
**功能**: 输入数据预处理,包括分词、多模态输入处理。
| 文件 | 功能 | 开发指引 |
|------|------|----------|
| `text_processor.py` | `BaseDataProcessor` 文本处理器基类 | 输入处理扩展 |
| `ernie4_5_processor.py` | ERNIE 4.5 输入处理器 | 百度模型输入处理 |
| `ernie4_5_tokenizer.py` | ERNIE 4.5 分词器 | 分词逻辑修改 |
| `preprocess.py` | 输入预处理工具 | 预处理流程 |
**多模态处理子目录**:
| 目录 | 功能 |
|------|------|
| `ernie4_5_vl_processor/` | ERNIE 4.5 VL 图像/视频处理 |
| `qwen_vl_processor/` | Qwen VL 多模态处理 |
| `paddleocr_vl_processor/` | PaddleOCR VL 处理 |
---
### 7. output/ - 输出处理模块
**功能**: 推理结果后处理,流式输出管理。
| 文件 | 功能 | 开发指引 |
|------|------|----------|
| `token_processor.py` | `TokenProcessor` Token 输出处理 | 流式输出、推测解码 |
| `pooler.py` | 池化输出处理 | Embedding 输出 |
| `stream_transfer_data.py` | 流式传输数据结构 | 数据传输格式 |
---
### 8. cache_manager/ - 缓存管理模块
**功能**: KV Cache 管理,支持前缀缓存、跨设备传输。
| 文件 | 功能 | 开发指引 |
|------|------|----------|
| `prefix_cache_manager.py` | `PrefixCacheManager` 前缀树缓存 | **KV Cache 优化首选** |
| `cache_transfer_manager.py` | KV Cache 跨设备传输 | PD 分离缓存传输 |
| `cache_data.py` | `BlockNode`, `CacheStatus` 数据结构 | 缓存数据定义 |
| `multimodal_cache_manager.py` | 多模态缓存管理 | 多模态缓存 |
**子目录**:
- `transfer_factory/` - 缓存传输工厂 (IPC, RDMA)
---
### 9. platforms/ - 硬件平台支持
**功能**: 多硬件平台适配,定义各平台的算子和特性。
| 文件 | 功能 | 开发指引 |
|------|------|----------|
| `base.py` | `Platform` 基类,`_Backend` 枚举 | **新硬件适配入口** |
| `cuda.py` | NVIDIA CUDA 平台 | GPU 优化 |
| `xpu.py` | 百度昆仑 XPU 平台 | 昆仑芯片适配 |
| `dcu.py` | AMD DCU (ROCm) 平台 | AMD GPU 适配 |
| `maca.py` | MetaX GPU (MACA) 平台 | MetaX GPU 适配 |
| `intel_hpu.py` | Intel HPU 平台 | Intel Gaudi 适配 |
| `iluvatar.py` | 天数智芯 GPU 平台 | 天数智芯适配 |
---
### 10. metrics/ - 监控指标模块
**功能**: Prometheus 指标收集,性能监控。
| 文件 | 功能 | 开发指引 |
|------|------|----------|
| `metrics.py` | Prometheus 指标定义 | 新增监控指标 |
| `stats.py` | ZMQ 指标统计 | 分布式监控 |
| `trace_util.py` | OpenTelemetry 分布式追踪 | 链路追踪 |
---
### 11. 其他重要模块
| 目录 | 功能 | 开发指引 |
|------|------|----------|
| `inter_communicator/` | 进程间通信 (ZMQ) | Engine-Worker 通信修改 |
| `spec_decode/` | 推测解码 (MTP, N-gram) | 推测解码策略开发 |
| `distributed/` | 分布式通信 (AllReduce) | 分布式推理开发 |
| `multimodal/` | 多模态数据处理 | 多模态功能扩展 |
| `reasoning/` | 推理模式解析 (DeepSeek R1 风格) | 思考链解析 |
| `router/` | 请求路由器,**PD 分离调度推荐使用** | **PD 分离部署首选** |
| `golang_router/` | Go 语言实现的路由器，PD 间调度性能更优 | **高性能 PD 分离场景** |
| `eplb/` | Expert Parallel 负载均衡 | MoE 负载均衡 |
| `rl/` | 强化学习 Rollout | RLHF 场景 |
| `plugins/` | 插件系统 | 自定义扩展 |
| `logger/` | 日志模块 | 日志格式修改 |
| `trace/` | 追踪模块 | 性能分析 |
---
### 12. 配置文件
| 文件 | 功能 | 开发指引 |
|------|------|----------|
| `config.py` | `FDConfig` 总配置类 | **配置参数修改入口** |
| `envs.py` | 环境变量配置 | 新增环境变量 |
| `utils.py` | 通用工具函数 | 工具函数复用 |
---
## 二、自定义算子目录 (custom_ops/)
**功能**: C++/CUDA 高性能算子实现,按硬件平台组织。
```
custom_ops/
├── gpu_ops/ # NVIDIA GPU 算子 (主要)
├── cpu_ops/ # CPU 算子
├── xpu_ops/ # 百度昆仑 XPU 算子
├── iluvatar_ops/ # 天数智芯 GPU 算子
├── metax_ops/ # MetaX GPU 算子
├── utils/ # 公共工具
└── third_party/ # 第三方库 (cutlass, DeepGEMM)
```
### gpu_ops/ - GPU 算子详解
| 目录/文件 | 功能 | 开发指引 |
|-----------|------|----------|
| `append_attn/` | Append Attention 实现 | **注意力优化首选** |
| `moe/` | MoE 算子 (fused_moe, expert_dispatch) | MoE 性能优化 |
| `flash_mask_attn/` | Flash Mask Attention | 注意力掩码优化 |
| `mla_attn/` | Multi-Head Latent Attention | MLA 模型支持 |
| `machete/` | Machete GEMM | 矩阵乘法优化 |
| `quantization/` | 量化算子 | 量化性能优化 |
| `sample_kernels/` | 采样算子 | 采样性能优化 |
| `speculate_decoding/` | 推测解码算子 | 推测解码优化 |
| `cutlass_kernels/` | CUTLASS 内核 | 高性能 GEMM |
| `cpp_extensions.cc` | C++ 扩展入口 | **新增算子注册入口** |
| `append_attention.cu` | Append Attention 核心 | 注意力核心实现 |
**关键算子文件**:
- `fused_rotary_position_encoding.cu` - 融合旋转位置编码
- `multi_head_latent_attention.cu` - MLA 注意力
- `per_token_quant_fp8.cu` - FP8 量化
---
## 三、测试目录 (tests/)
**功能**: 单元测试和端到端测试,按模块组织。
```
tests/
├── e2e/ # 端到端服务测试
├── operators/ # 算子单元测试
├── model_executor/ # 模型执行器测试
├── model_loader/ # 模型加载测试
├── layers/ # 网络层测试
├── scheduler/ # 调度器测试
├── cache_manager/ # 缓存管理测试
├── entrypoints/ # API 入口测试
├── input/ # 输入处理测试
├── output/ # 输出处理测试
├── metrics/ # 指标测试
├── distributed/ # 分布式测试
├── graph_optimization/ # 图优化测试
├── quantization/ # 量化测试
├── multimodal/ # 多模态测试
├── xpu_ci/ # XPU CI 测试
├── ce/ # CE 环境测试
├── ci_use/ # CI 工具测试
└── conftest.py # pytest 配置
```
### 测试目录详解
| 目录 | 内容 | 开发指引 |
|------|------|----------|
| `e2e/` | 各模型服务完整测试 (ERNIE, Qwen, DeepSeek 等) | **服务集成测试** |
| `operators/` | 算子单元测试 (`test_fused_moe.py`, `test_flash_mask_attn.py` 等) | **算子开发必写测试** |
| `layers/` | 网络层测试 (attention, moe, quantization) | 网络层测试 |
| `model_executor/` | 模型执行流程测试 | 模型执行测试 |
| `scheduler/` | 调度器功能测试 | 调度逻辑验证 |
| `cache_manager/` | 缓存管理测试 | 缓存逻辑验证 |
---
## 四、脚本工具目录 (scripts/)
**功能**: CI/CD、性能调优、工具脚本。
| 文件 | 功能 | 使用场景 |
|------|------|----------|
| `run_unittest.sh` | 单元测试运行 | 本地测试 |
| `run_ci_xpu.sh` | XPU CI 运行 | 昆仑 CI |
| `run_ci_hpu.sh` | HPU CI 运行 | Intel HPU CI |
| `run_ci_dcu.sh` | DCU CI 运行 | AMD DCU CI |
| `coverage_run.sh` | 代码覆盖率统计 | 代码质量 |
| `tune_cublaslt_int8_gemm.py` | cuBLASLt INT8 GEMM 调优 | 性能调优 |
| `tune_cutlass_fp8_gemm.py` | CUTLASS FP8 GEMM 调优 | 性能调优 |
| `offline_w4a8.py` | 离线 W4A8 量化工具 | 模型量化 |
| `extract_mtp_weight_from_safetensor.py` | MTP 权重提取 | 模型处理 |
---
## 五、其他目录
### docs/ - 文档
- 使用文档、API 文档、架构设计文档
### examples/ - 示例代码
- 各模型使用示例、部署示例
### benchmarks/ - 性能基准
- 性能测试脚本、基准数据
### tools/ - 开发工具
- `codestyle/` - 代码风格检查工具
- `dockerfile/` - Docker 构建工具
### dockerfiles/ - Docker 镜像
- 各平台运行环境 Dockerfile
---
## 六、开发指引速查
### 添加新模型
1. 参考 `models/model_base.py` 了解模型注册机制
2.`models/` 下创建新模型文件
3.`input/` 下添加对应的输入处理器
4.`tests/model_executor/` 下添加测试
### 添加新算子
1.`custom_ops/gpu_ops/` 下实现 CUDA 算子
2.`cpp_extensions.cc` 中注册算子
3.`model_executor/ops/gpu/` 下添加 Python 封装
4.`tests/operators/` 下添加测试
### 新硬件平台适配
1. 参考 `platforms/base.py` 创建新平台类
2.`custom_ops/` 下创建硬件算子目录
3.`model_executor/layers/backends/` 下创建后端实现
4.`worker/` 下创建模型运行器
### 优化推理性能
1. 注意力优化:`custom_ops/gpu_ops/append_attn/`
2. MoE 优化:`custom_ops/gpu_ops/moe/`
3. 图优化:`fastdeploy/model_executor/graph_optimization/`
### PD 分离部署
1. 路由器:`router/router.py` (Python 实现,推荐)
2. 高性能路由:`golang_router/` (Go 实现,PD 间调度性能更优)
3. 缓存传输:`cache_manager/cache_transfer_manager.py`
---
## 七、配置体系
```
FDConfig (config.py)
├── ModelConfig # 模型配置
├── CacheConfig # 缓存配置
├── ParallelConfig # 并行配置
├── SchedulerConfig # 调度配置
├── LoRAConfig # LoRA 配置
└── ...
环境变量配置 (envs.py)
├── FD_* 系列环境变量
└── 运行时行为控制
```
---
本文档涵盖了 FastDeploy 代码库的主要模块和关键文件,可作为代码导航和开发参考使用。如有疑问,请参考各模块的详细文档或源码注释。