[简体中文](../zh/usage/code_overview.md)
# FastDeploy Code Structure Overview
This document provides a detailed overview of the FastDeploy codebase structure, helping developers quickly understand each module's functionality for development and feature extension.
---
## Directory Overview
```
FastDeploy/
├── fastdeploy/ # Core code directory
├── custom_ops/ # C++/CUDA custom operators
├── tests/ # Unit tests
├── scripts/ # Utility scripts
├── tools/ # Development tools
├── docs/ # Documentation
├── examples/ # Example code
├── benchmarks/ # Performance benchmarks
├── dockerfiles/ # Docker image build files
└── setup.py # Python package installation script
```
---
## I. Core Code Directory (fastdeploy/)
The main entry file `fastdeploy/__init__.py` exports core classes:
- `LLM` - Main entry class, offline inference interface
- `SamplingParams` - Sampling parameter configuration
- `ModelRegistry` - Model registry
- `version` - Version information
### 1. engine/ - Core Engine Module
**Function**: Manages LLM inference lifecycle and coordinates components.
| File | Function | Development Guide |
|------|----------|-------------------|
| `engine.py` | `LLMEngine` core engine class, manages scheduler, preprocessor, resource manager | Entry point for modifying engine behavior, adding new components |
| `async_llm.py` | Async LLM interface, `AsyncRequestQueue` request queue management | Async inference, streaming output development |
| `request.py` | Core request data structures: `Request`, `RequestOutput`, `RequestStatus` | Adding request fields, modifying request processing logic |
| `sampling_params.py` | `SamplingParams` sampling parameter configuration | Adding new sampling strategy parameters |
| `args_utils.py` | `EngineArgs` engine argument parsing | Adding new engine configuration parameters |
| `resource_manager.py` | GPU/CPU resource management | Resource allocation optimization |
**Subdirectory**:
- `sched/` - Core scheduling implementation, contains `resource_manager_v1.py` (**core scheduling logic**)
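To illustrate the kind of validated parameter container that `sampling_params.py` provides, here is a minimal stdlib-only sketch. The field names and defaults below are simplified placeholders, not the real `SamplingParams` API; consult the source for the actual fields.

```python
from dataclasses import dataclass

# Illustrative sketch only -- the real SamplingParams in
# fastdeploy/engine/sampling_params.py has more fields and different defaults.
@dataclass
class SamplingParamsSketch:
    temperature: float = 1.0   # softmax temperature for sampling
    top_p: float = 1.0         # nucleus sampling cutoff
    max_tokens: int = 16       # generation length limit

    def __post_init__(self):
        # Validate at construction time so bad requests fail early.
        if self.temperature < 0:
            raise ValueError("temperature must be non-negative")
        if not 0 < self.top_p <= 1:
            raise ValueError("top_p must be in (0, 1]")
        if self.max_tokens < 1:
            raise ValueError("max_tokens must be positive")

params = SamplingParamsSketch(temperature=0.8, top_p=0.95, max_tokens=128)
```

Adding a new sampling strategy parameter typically means adding a field here plus the corresponding handling in the sampler under `model_executor/layers/sample/`.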
---
### 2. model_executor/ - Model Executor
**Function**: Core execution module for model inference, containing model definitions, layers, operators.
#### 2.1 models/ - Model Implementations
| File/Directory | Function | Development Guide |
|----------------|----------|-------------------|
| `model_base.py` | `ModelRegistry` model registration base class | **Must read for adding new models** |
| `deepseek_v3.py` | DeepSeek V3 model | MoE large model reference |
| `ernie4_5_moe.py` | ERNIE 4.5 MoE model | Baidu's flagship model |
| `ernie4_5_mtp.py` | ERNIE 4.5 MTP multi-token prediction | Speculative decoding model |
| `qwen2.py` | Qwen2 model | General model reference |
| `qwen3.py` | Qwen3 model | Latest model reference |
| `ernie4_5_vl/` | ERNIE 4.5 vision-language model | Multimodal model development reference |
| `qwen2_5_vl/` | Qwen2.5 VL multimodal model | VL model reference |
| `paddleocr_vl/` | PaddleOCR VL model | OCR multimodal reference |
#### 2.2 layers/ - Network Layer Implementations
| Subdirectory/File | Function | Development Guide |
|-------------------|----------|-------------------|
| `attention/` | Attention mechanism implementations (flash_attn, append_attn, mla_attn) | **First choice for attention performance optimization** |
| `moe/` | MoE layer implementations (Cutlass, Triton, DeepGEMM backends) | MoE performance optimization |
| `quantization/` | Quantization layers (FP8, W4A8, WINT2, Weight-only) | Quantization scheme development |
| `linear.py` | Linear layer implementation | Matrix multiplication optimization |
| `embeddings.py` | Embedding layer implementation | Word embedding modification |
| `normalization.py` | Normalization layers (RMSNorm, LayerNorm) | Normalization optimization |
| `rotary_embedding.py` | Rotary position embedding (RoPE) | Position encoding modification |
| `sample/` | Sampler implementation | Sampling strategy development |
| `backends/` | Hardware backend implementations (cuda, xpu, dcu, hpu, metax, gcu, npu) | **Entry point for new hardware adaptation** |
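As a concrete reference for what `rotary_embedding.py` computes, the core RoPE idea is a pairwise 2-D rotation of each head vector, with a per-pair angle that scales with position. The sketch below is a pure-Python illustration of the math, not the batched tensor implementation in the source:

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Apply rotary position embedding to one head vector.

    Illustrative only: fastdeploy/model_executor/layers/rotary_embedding.py
    operates on batched paddle tensors and supports multiple RoPE variants.
    """
    d = len(vec)
    assert d % 2 == 0, "RoPE rotates pairs of dimensions"
    out = [0.0] * d
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)   # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s             # 2-D rotation of the pair
        out[2 * i + 1] = x * s + y * c
    return out

q = [1.0, 0.0, 0.5, 0.5]
q0 = rope_rotate(q, pos=0)   # position 0 leaves the vector unchanged
q7 = rope_rotate(q, pos=7)   # rotation preserves the vector's norm
```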
#### 2.3 Other Submodules
| Directory | Function | Development Guide |
|-----------|----------|-------------------|
| `model_loader/` | Model weight loader | New model format support |
| `guided_decoding/` | Guided decoding (JSON/regex constrained output) | Structured output development |
| `graph_optimization/` | Graph optimization (CUDA Graph) | Inference performance optimization |
| `logits_processor/` | Logits processor | Output control logic |
| `ops/` | Python-callable operators (organized by hardware platform) | Operator call entry point |
**Key Files**:
- `model_base.py` - Model base class, registry definition
- `pre_and_post_process.py` - Pre/post processing utilities
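The essence of guided decoding (`guided_decoding/`) is masking logits each step so only grammar-legal tokens can be sampled. The real backends compile JSON schemas or regexes into token-level automata; the sketch below shows just the per-step masking idea with hypothetical token ids:

```python
import math

def mask_logits(logits, allowed_token_ids):
    """Keep only tokens permitted by the current grammar state.

    Illustrative of what guided_decoding/ does per decode step; the real
    backends derive allowed_token_ids from a compiled schema or regex.
    """
    return [x if i in allowed_token_ids else -math.inf
            for i, x in enumerate(logits)]

logits = [2.0, 0.5, 1.0, 3.0]
masked = mask_logits(logits, allowed_token_ids={1, 2})
best = max(range(len(masked)), key=masked.__getitem__)  # argmax among allowed tokens
```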
---
### 3. scheduler/ - Scheduler Module
**Function**: Request scheduling, supporting single-node, distributed, PD disaggregation scenarios.
> **Note**:
> - Core scheduling logic is mainly implemented in `engine/sched/resource_manager_v1.py`
> - Schedulers in this directory are being **gradually deprecated**. For PD disaggregation scheduling, use `router/` or `golang_router/`
| File | Function | Development Guide |
|------|----------|-------------------|
| `global_scheduler.py` | `GlobalScheduler` distributed scheduler (Redis) | (Being deprecated) |
| `local_scheduler.py` | `LocalScheduler` local scheduler | (Being deprecated) |
| `splitwise_scheduler.py` | `SplitwiseScheduler` PD disaggregation scheduling | (Being deprecated, use router) |
| `dp_scheduler.py` | Data parallel scheduler | (Being deprecated) |
| `config.py` | `SchedulerConfig` scheduling configuration | Scheduling parameter adjustment |
| `storage.py` | Storage adapter, wraps Redis connection | Storage layer modification |
**Core Scheduling Implementation** (`engine/sched/`):
| File | Function | Development Guide |
|------|----------|-------------------|
| `resource_manager_v1.py` | Core scheduling logic, contains `ScheduledDecodeTask`, `ScheduledPreemptTask` task classes | **First choice for scheduling strategy modification** |
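The core admission decision in a scheduler like `resource_manager_v1.py` is, at its simplest, "admit waiting requests while KV-cache blocks remain". The following sketch shows only that FCFS skeleton; the real implementation additionally handles preemption, chunked prefill, and prefix-cache reuse, and the block count per request depends on sequence length:

```python
from collections import deque

def schedule_step(waiting, running, free_blocks, blocks_per_request=4):
    """One FCFS admission step (illustrative sketch, not the real algorithm)."""
    while waiting and free_blocks >= blocks_per_request:
        req = waiting.popleft()            # admit the oldest waiting request
        free_blocks -= blocks_per_request  # reserve its KV-cache blocks
        running.append(req)
    return free_blocks

waiting = deque(["r1", "r2", "r3"])
running = []
remaining = schedule_step(waiting, running, free_blocks=9)
```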
---
### 4. entrypoints/ - API Entry Points
**Function**: External service interfaces, including offline inference and online API services.
| File | Function | Development Guide |
|------|----------|-------------------|
| `llm.py` | `LLM` main entry class, offline inference interface | **Entry point for using FastDeploy** |
| `engine_client.py` | Engine client | Request forwarding logic modification |
#### 4.1 openai/ - OpenAI Compatible API
| File | Function | Development Guide |
|------|----------|-------------------|
| `api_server.py` | FastAPI server | **Deployment service entry point** |
| `protocol.py` | OpenAI protocol definition | API format modification |
| `serving_chat.py` | Chat Completion API | Chat interface development |
| `serving_completion.py` | Completion API | Completion interface development |
| `serving_embedding.py` | Embedding API | Vectorization interface |
| `tool_parsers/` | Tool call parsers | Function Calling development |
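Since `api_server.py` exposes the standard OpenAI Chat Completion protocol, a client request is an ordinary JSON POST. The sketch below builds such a payload; the model name, host, and port are placeholders for your deployment:

```python
import json

# Standard OpenAI-style Chat Completion payload; "default" and the URL in the
# comment below are placeholders, not fixed FastDeploy values.
payload = {
    "model": "default",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Introduce FastDeploy in one sentence."},
    ],
    "stream": False,
    "max_tokens": 64,
}
body = json.dumps(payload).encode()
# POST `body` to http://<host>:<port>/v1/chat/completions with
# Content-Type: application/json, e.g. via urllib.request or curl.
```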
---
### 5. worker/ - Worker Process Module
**Function**: Actual execution process for model inference.
| File | Function | Development Guide |
|------|----------|-------------------|
| `gpu_model_runner.py` | **GPU model runner** (core inference loop) | **First choice for inference flow modification** |
| `gpu_worker.py` | GPU Worker process management | Worker lifecycle management |
| `xpu_model_runner.py` | XPU model runner | Kunlun chip adaptation |
| `hpu_model_runner.py` | HPU model runner | Intel HPU adaptation |
| `worker_process.py` | Worker process base class | Process management logic |
---
### 6. input/ - Input Processing Module
**Function**: Input data preprocessing, including tokenization, multimodal input processing.
| File | Function | Development Guide |
|------|----------|-------------------|
| `text_processor.py` | `BaseDataProcessor` text processor base class | Input processing extension |
| `ernie4_5_processor.py` | ERNIE 4.5 input processor | Baidu model input processing |
| `ernie4_5_tokenizer.py` | ERNIE 4.5 tokenizer | Tokenization logic modification |
| `preprocess.py` | Input preprocessing utilities | Preprocessing flow |
**Multimodal Processing Subdirectories**:
| Directory | Function |
|-----------|----------|
| `ernie4_5_vl_processor/` | ERNIE 4.5 VL image/video processing |
| `qwen_vl_processor/` | Qwen VL multimodal processing |
| `paddleocr_vl_processor/` | PaddleOCR VL processing |
---
### 7. output/ - Output Processing Module
**Function**: Inference result post-processing, streaming output management.
| File | Function | Development Guide |
|------|----------|-------------------|
| `token_processor.py` | `TokenProcessor` token output processing | Streaming output, speculative decoding |
| `pooler.py` | Pooling output processing | Embedding output |
| `stream_transfer_data.py` | Streaming transfer data structure | Data transfer format |
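Conceptually, streaming output is a generator that turns incoming token ids into incremental text pieces and stops at end-of-sequence. This is a toy sketch of that flow; the real `TokenProcessor` in `token_processor.py` also tracks finish reasons, speculative-decoding acceptance, and metrics:

```python
def stream_tokens(token_ids, detokenize, eos_id):
    """Yield one text delta per generated token (illustrative sketch)."""
    for tid in token_ids:
        if tid == eos_id:
            return                # stop streaming at end-of-sequence
        yield detokenize(tid)     # incremental piece for the client

# Toy vocabulary and token stream for demonstration.
vocab = {0: "Fast", 1: "Deploy", 2: "!"}
pieces = list(stream_tokens([0, 1, 2, 99], vocab.get, eos_id=99))
```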
---
### 8. cache_manager/ - Cache Management Module
**Function**: KV Cache management, supporting prefix caching, cross-device transfer.
| File | Function | Development Guide |
|------|----------|-------------------|
| `prefix_cache_manager.py` | `PrefixCacheManager` prefix tree cache | **First choice for KV Cache optimization** |
| `cache_transfer_manager.py` | KV Cache cross-device transfer | PD disaggregation cache transfer |
| `cache_data.py` | `BlockNode`, `CacheStatus` data structures | Cache data definition |
| `multimodal_cache_manager.py` | Multimodal cache management | Multimodal caching |
**Subdirectory**:
- `transfer_factory/` - Cache transfer factory (IPC, RDMA)
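The prefix cache works as a radix tree keyed by fixed-size blocks of tokens: a new request walks the tree and reuses the KV-cache blocks of its longest cached prefix. A minimal sketch of that lookup, with simplified node fields (the real `BlockNode` in `cache_data.py` also carries reference counts and eviction state):

```python
class BlockNodeSketch:
    """One node per KV-cache block of tokens (illustrative sketch)."""
    def __init__(self, block_id=None):
        self.block_id = block_id
        self.children = {}  # token-tuple of the next block -> child node

def match_prefix(root, token_blocks):
    """Return cached block ids for the longest shared prefix."""
    hits, node = [], root
    for block in token_blocks:
        node = node.children.get(block)
        if node is None:
            break               # first uncached block ends the match
        hits.append(node.block_id)
    return hits

# Cache contains the sequence (1, 2) -> (3, 4) as blocks 10 and 11.
root = BlockNodeSketch()
root.children[(1, 2)] = BlockNodeSketch(block_id=10)
root.children[(1, 2)].children[(3, 4)] = BlockNodeSketch(block_id=11)

hits = match_prefix(root, [(1, 2), (3, 4), (5, 6)])  # first two blocks hit
```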
---
### 9. platforms/ - Hardware Platform Support
**Function**: Multi-hardware platform adaptation, defining operators and features for each platform.
| File | Function | Development Guide |
|------|----------|-------------------|
| `base.py` | `Platform` base class, `_Backend` enum | **Entry point for new hardware adaptation** |
| `cuda.py` | NVIDIA CUDA platform | GPU optimization |
| `xpu.py` | Baidu Kunlun XPU platform | Kunlun chip adaptation |
| `dcu.py` | AMD DCU (ROCm) platform | AMD GPU adaptation |
| `maca.py` | MetaX GPU (MACA) platform | MetaX GPU adaptation |
| `intel_hpu.py` | Intel HPU platform | Intel Gaudi adaptation |
| `iluvatar.py` | Iluvatar GPU platform | Iluvatar adaptation |
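The platform layer follows a familiar base-class-plus-dispatch pattern: each platform subclasses a common `Platform` base and the runtime picks the one that is available. The sketch below shows only that pattern with invented names; the real base in `platforms/base.py` also exposes attention-backend selection and device capability checks:

```python
class PlatformSketch:
    """Minimal illustrative stand-in for platforms/base.py's Platform class."""
    device_name = "unknown"

    @classmethod
    def is_available(cls):
        return False

class CudaPlatformSketch(PlatformSketch):
    device_name = "gpu"

    @classmethod
    def is_available(cls):
        # Real code would probe the driver; assume present for the sketch.
        return True

def current_platform(candidates):
    """Pick the first usable platform, the way a dispatcher might."""
    for p in candidates:
        if p.is_available():
            return p
    raise RuntimeError("no supported platform found")

plat = current_platform([CudaPlatformSketch, PlatformSketch])
```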
---
### 10. metrics/ - Monitoring Metrics Module
**Function**: Prometheus metric collection, performance monitoring.
| File | Function | Development Guide |
|------|----------|-------------------|
| `metrics.py` | Prometheus metric definition | Adding new monitoring metrics |
| `stats.py` | ZMQ metric statistics | Distributed monitoring |
| `trace_util.py` | OpenTelemetry distributed tracing | Request tracing across services |
---
### 11. Other Important Modules
| Directory | Function | Development Guide |
|-----------|----------|-------------------|
| `inter_communicator/` | Inter-process communication (ZMQ) | Engine-Worker communication modification |
| `spec_decode/` | Speculative decoding (MTP, N-gram) | Speculative decoding strategy development |
| `distributed/` | Distributed communication (AllReduce) | Distributed inference development |
| `multimodal/` | Multimodal data processing | Multimodal feature extension |
| `reasoning/` | Reasoning mode parsing (DeepSeek R1 style) | Chain-of-thought parsing |
| `router/` | Request router, **recommended for PD disaggregation** | **First choice for PD disaggregation deployment** |
| `golang_router/` | Go-implemented router with higher scheduling throughput | **High-performance PD disaggregation scenarios** |
| `eplb/` | Expert Parallel load balancing | MoE load balancing |
| `rl/` | Reinforcement learning Rollout | RLHF scenarios |
| `plugins/` | Plugin system | Custom extensions |
| `logger/` | Logging module | Log format modification |
| `trace/` | Tracing module | Performance analysis |
---
### 12. Configuration Files
| File | Function | Development Guide |
|------|----------|-------------------|
| `config.py` | `FDConfig` main configuration class | **Entry point for configuration parameter modification** |
| `envs.py` | Environment variable configuration | Adding new environment variables |
| `utils.py` | General utility functions | Utility function reuse |
---
## II. Custom Operators Directory (custom_ops/)
**Function**: C++/CUDA high-performance operator implementations, organized by hardware platform.
```
custom_ops/
├── gpu_ops/ # NVIDIA GPU operators (main)
├── cpu_ops/ # CPU operators
├── xpu_ops/ # Baidu Kunlun XPU operators
├── iluvatar_ops/ # Iluvatar GPU operators
├── metax_ops/ # MetaX GPU operators
├── utils/ # Common utilities
└── third_party/ # Third-party libraries (cutlass, DeepGEMM)
```
### gpu_ops/ - GPU Operator Details
| Directory/File | Function | Development Guide |
|----------------|----------|-------------------|
| `append_attn/` | Append Attention implementation | **First choice for attention optimization** |
| `moe/` | MoE operators (fused_moe, expert_dispatch) | MoE performance optimization |
| `flash_mask_attn/` | Flash Mask Attention | Attention mask optimization |
| `mla_attn/` | Multi-Head Latent Attention | MLA model support |
| `machete/` | Machete GEMM | Matrix multiplication optimization |
| `quantization/` | Quantization operators | Quantization performance optimization |
| `sample_kernels/` | Sampling operators | Sampling performance optimization |
| `speculate_decoding/` | Speculative decoding operators | Speculative decoding optimization |
| `cutlass_kernels/` | CUTLASS kernels | High-performance GEMM |
| `cpp_extensions.cc` | C++ extension entry | **Entry point for new operator registration** |
| `append_attention.cu` | Append Attention core | Attention core implementation |
**Key Operator Files**:
- `fused_rotary_position_encoding.cu` - Fused rotary position encoding
- `multi_head_latent_attention.cu` - MLA attention
- `per_token_quant_fp8.cu` - FP8 quantization
---
## III. Test Directory (tests/)
**Function**: Unit tests and end-to-end tests, organized by module.
```
tests/
├── e2e/ # End-to-end service tests
├── operators/ # Operator unit tests
├── model_executor/ # Model executor tests
├── model_loader/ # Model loading tests
├── layers/ # Network layer tests
├── scheduler/ # Scheduler tests
├── cache_manager/ # Cache management tests
├── entrypoints/ # API entry tests
├── input/ # Input processing tests
├── output/ # Output processing tests
├── metrics/ # Metric tests
├── distributed/ # Distributed tests
├── graph_optimization/ # Graph optimization tests
├── quantization/ # Quantization tests
├── multimodal/ # Multimodal tests
├── xpu_ci/ # XPU CI tests
├── ce/ # CE environment tests
├── ci_use/ # CI utility tests
└── conftest.py # pytest configuration
```
### Test Directory Details
| Directory | Content | Development Guide |
|-----------|---------|-------------------|
| `e2e/` | Complete service tests for each model (ERNIE, Qwen, DeepSeek, etc.) | **Service integration testing** |
| `operators/` | Operator unit tests (`test_fused_moe.py`, `test_flash_mask_attn.py`, etc.) | **Required tests for operator development** |
| `layers/` | Network layer tests (attention, moe, quantization) | Network layer testing |
| `model_executor/` | Model execution flow tests | Model execution testing |
| `scheduler/` | Scheduler function tests | Scheduling logic verification |
| `cache_manager/` | Cache management tests | Cache logic verification |
---
## IV. Scripts Directory (scripts/)
**Function**: CI/CD, performance tuning, utility scripts.
| File | Function | Usage Scenario |
|------|----------|----------------|
| `run_unittest.sh` | Unit test runner | Local testing |
| `run_ci_xpu.sh` | XPU CI runner | Kunlun CI |
| `run_ci_hpu.sh` | HPU CI runner | Intel HPU CI |
| `run_ci_dcu.sh` | DCU CI runner | AMD DCU CI |
| `coverage_run.sh` | Code coverage statistics | Code quality |
| `tune_cublaslt_int8_gemm.py` | cuBLASLt INT8 GEMM tuning | Performance tuning |
| `tune_cutlass_fp8_gemm.py` | CUTLASS FP8 GEMM tuning | Performance tuning |
| `offline_w4a8.py` | Offline W4A8 quantization tool | Model quantization |
| `extract_mtp_weight_from_safetensor.py` | MTP weight extraction | Model processing |
---
## V. Other Directories
### docs/ - Documentation
- Usage documentation, API documentation, architecture design documents
### examples/ - Example Code
- Model usage examples, deployment examples
### benchmarks/ - Performance Benchmarks
- Performance test scripts, benchmark data
### tools/ - Development Tools
- `codestyle/` - Code style checking tools
- `dockerfile/` - Docker build tools
### dockerfiles/ - Docker Images
- Dockerfiles for each platform runtime environment
---
## VI. Quick Development Guide
### Adding a New Model
1. Reference `models/model_base.py` to understand model registration mechanism
2. Create new model file under `models/`
3. Add corresponding input processor under `input/`
4. Add tests under `tests/model_executor/`
### Adding a New Operator
1. Implement CUDA operator under `custom_ops/gpu_ops/`
2. Register operator in `cpp_extensions.cc`
3. Add Python wrapper under `model_executor/ops/gpu/`
4. Add tests under `tests/operators/`
### New Hardware Platform Adaptation
1. Reference `platforms/base.py` to create new platform class
2. Create hardware operator directory under `custom_ops/`
3. Create backend implementation under `model_executor/layers/backends/`
4. Create model runner under `worker/`
### Optimizing Inference Performance
1. Attention optimization: `custom_ops/gpu_ops/append_attn/`
2. MoE optimization: `custom_ops/gpu_ops/moe/`
3. Graph optimization: `fastdeploy/model_executor/graph_optimization/`
### PD Disaggregation Deployment
1. Router: `router/router.py` (Python implementation, recommended)
2. High-performance router: `golang_router/` (Go implementation, higher scheduling throughput)
3. Cache transfer: `cache_manager/cache_transfer_manager.py`
---
## VII. Configuration System
```
FDConfig (config.py)
├── ModelConfig # Model configuration
├── CacheConfig # Cache configuration
├── ParallelConfig # Parallel configuration
├── SchedulerConfig # Scheduler configuration
├── LoRAConfig # LoRA configuration
└── ...
Environment Variable Configuration (envs.py)
├── FD_* series environment variables
└── Runtime behavior control
```
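The tree above maps naturally onto composed dataclasses, with `envs.py`-style `FD_*` environment variables layered on top. The sketch below shows that shape with simplified field names; the real attributes in `config.py` and the real variable set in `envs.py` differ:

```python
import os
from dataclasses import dataclass, field

# Illustrative shape of the FDConfig hierarchy; field names are simplified
# placeholders, not the real fastdeploy/config.py attributes.
@dataclass
class CacheConfigSketch:
    block_size: int = 64

@dataclass
class ParallelConfigSketch:
    tensor_parallel_size: int = 1

@dataclass
class FDConfigSketch:
    cache: CacheConfigSketch = field(default_factory=CacheConfigSketch)
    parallel: ParallelConfigSketch = field(default_factory=ParallelConfigSketch)
    # envs.py-style override: FD_* environment variables tune runtime behavior
    # ("FD_DEBUG" here is used as an example variable name).
    debug: bool = os.environ.get("FD_DEBUG", "0") == "1"

cfg = FDConfigSketch(parallel=ParallelConfigSketch(tensor_parallel_size=2))
```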
---
This document covers the main modules and key files of the FastDeploy codebase and can serve as a reference for code navigation and development. For further detail, consult each module's documentation or the source code comments.