[简体中文](../zh/usage/code_overview.md)
# FastDeploy Code Structure Overview
This document provides a detailed overview of the FastDeploy codebase structure, helping developers quickly understand each module's functionality for development and feature extension.
---
## Directory Overview
```
FastDeploy/
├── fastdeploy/ # Core code directory
├── custom_ops/ # C++/CUDA custom operators
├── tests/ # Unit tests
├── scripts/ # Utility scripts
├── tools/ # Development tools
├── docs/ # Documentation
├── examples/ # Example code
├── benchmarks/ # Performance benchmarks
├── dockerfiles/ # Docker image build files
└── setup.py # Python package installation script
```
---
## I. Core Code Directory (fastdeploy/)
The main entry file `fastdeploy/__init__.py` exports core classes:
- `LLM` - Main entry class, offline inference interface
- `SamplingParams` - Sampling parameter configuration
- `ModelRegistry` - Model registry
- `version` - Version information
### 1. engine/ - Core Engine Module
**Function**: Manages LLM inference lifecycle and coordinates components.
| File | Function | Development Guide |
|------|----------|-------------------|
| `engine.py` | `LLMEngine` core engine class, manages scheduler, preprocessor, resource manager | Entry point for modifying engine behavior, adding new components |
| `async_llm.py` | Async LLM interface, `AsyncRequestQueue` request queue management | Async inference, streaming output development |
| `request.py` | Core request data structures: `Request`, `RequestOutput`, `RequestStatus` | Adding request fields, modifying request processing logic |
| `sampling_params.py` | `SamplingParams` sampling parameter configuration | Adding new sampling strategy parameters |
| `args_utils.py` | `EngineArgs` engine argument parsing | Adding new engine configuration parameters |
| `resource_manager.py` | GPU/CPU resource management | Resource allocation optimization |
**Subdirectory**:
- `sched/` - Core scheduling implementation, contains `resource_manager_v1.py` (**core scheduling logic**)
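To illustrate the kind of validated parameter container that `sampling_params.py` provides, here is a minimal stdlib-only sketch. The field names and defaults below are simplified placeholders, not the real `SamplingParams` API; consult the source for the actual fields.

```python
from dataclasses import dataclass

# Illustrative sketch only -- the real SamplingParams in
# fastdeploy/engine/sampling_params.py has more fields and different defaults.
@dataclass
class SamplingParamsSketch:
    temperature: float = 1.0   # softmax temperature for sampling
    top_p: float = 1.0         # nucleus sampling cutoff
    max_tokens: int = 16       # generation length limit

    def __post_init__(self):
        # Validate at construction time so bad requests fail early.
        if self.temperature < 0:
            raise ValueError("temperature must be non-negative")
        if not 0 < self.top_p <= 1:
            raise ValueError("top_p must be in (0, 1]")
        if self.max_tokens < 1:
            raise ValueError("max_tokens must be positive")

params = SamplingParamsSketch(temperature=0.8, top_p=0.95, max_tokens=128)
```

Adding a new sampling strategy parameter typically means adding a field here plus the corresponding handling in the sampler under `model_executor/layers/sample/`.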
---
### 2. model_executor/ - Model Executor
**Function**: Core execution module for model inference, containing model definitions, layers, operators.
#### 2.1 models/ - Model Implementations
| File/Directory | Function | Development Guide |
|----------------|----------|-------------------|
| `model_base.py` | `ModelRegistry` model registration base class | **Must read for adding new models** |
| `deepseek_v3.py` | DeepSeek V3 model | MoE large model reference |
| `ernie4_5_moe.py` | ERNIE 4.5 MoE model | Baidu's flagship model |
| `ernie4_5_mtp.py` | ERNIE 4.5 MTP multi-token prediction | Speculative decoding model |
| `qwen2.py` | Qwen2 model | General model reference |
| `qwen3.py` | Qwen3 model | Latest model reference |
| `ernie4_5_vl/` | ERNIE 4.5 vision-language model | Multimodal model development reference |
| `qwen2_5_vl/` | Qwen2.5 VL multimodal model | VL model reference |
| `paddleocr_vl/` | PaddleOCR VL model | OCR multimodal reference |
#### 2.2 layers/ - Network Layer Implementations
| Subdirectory/File | Function | Development Guide |
|-------------------|----------|-------------------|
| `attention/` | Attention mechanism implementations (flash_attn, append_attn, mla_attn) | **First choice for attention performance optimization** |
| `moe/` | MoE layer implementations (Cutlass, Triton, DeepGEMM backends) | MoE performance optimization |
| `quantization/` | Quantization layers (FP8, W4A8, WINT2, Weight-only) | Quantization scheme development |
| `linear.py` | Linear layer implementation | Matrix multiplication optimization |
| `embeddings.py` | Embedding layer implementation | Word embedding modification |
| `normalization.py` | Normalization layers (RMSNorm, LayerNorm) | Normalization optimization |
| `rotary_embedding.py` | Rotary position embedding (RoPE) | Position encoding modification |
| `sample/` | Sampler implementation | Sampling strategy development |
| `backends/` | Hardware backend implementations (cuda, xpu, dcu, hpu, metax, gcu, npu) | **Entry point for new hardware adaptation** |
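As a concrete reference for what `rotary_embedding.py` computes, the core RoPE idea is a pairwise 2-D rotation of each head vector, with a per-pair angle that scales with position. The sketch below is a pure-Python illustration of the math, not the batched tensor implementation in the source:

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Apply rotary position embedding to one head vector.

    Illustrative only: fastdeploy/model_executor/layers/rotary_embedding.py
    operates on batched paddle tensors and supports multiple RoPE variants.
    """
    d = len(vec)
    assert d % 2 == 0, "RoPE rotates pairs of dimensions"
    out = [0.0] * d
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)   # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s             # 2-D rotation of the pair
        out[2 * i + 1] = x * s + y * c
    return out

q = [1.0, 0.0, 0.5, 0.5]
q0 = rope_rotate(q, pos=0)   # position 0 leaves the vector unchanged
q7 = rope_rotate(q, pos=7)   # rotation preserves the vector's norm
```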
#### 2.3 Other Submodules
| Directory | Function | Development Guide |
|-----------|----------|-------------------|
| `model_loader/` | Model weight loader | New model format support |
| `guided_decoding/` | Guided decoding (JSON/regex constrained output) | Structured output development |
| `graph_optimization/` | Graph optimization (CUDA Graph) | Inference performance optimization |
| `logits_processor/` | Logits processor | Output control logic |
| `ops/` | Python-callable operators (organized by hardware platform) | Operator call entry point |
**Key Files**:
- `model_base.py` - Model base class, registry definition
- `pre_and_post_process.py` - Pre/post processing utilities
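The essence of guided decoding (`guided_decoding/`) is masking logits each step so only grammar-legal tokens can be sampled. The real backends compile JSON schemas or regexes into token-level automata; the sketch below shows just the per-step masking idea with hypothetical token ids:

```python
import math

def mask_logits(logits, allowed_token_ids):
    """Keep only tokens permitted by the current grammar state.

    Illustrative of what guided_decoding/ does per decode step; the real
    backends derive allowed_token_ids from a compiled schema or regex.
    """
    return [x if i in allowed_token_ids else -math.inf
            for i, x in enumerate(logits)]

logits = [2.0, 0.5, 1.0, 3.0]
masked = mask_logits(logits, allowed_token_ids={1, 2})
best = max(range(len(masked)), key=masked.__getitem__)  # argmax among allowed tokens
```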
---
### 3. scheduler/ - Scheduler Module
**Function**: Request scheduling, supporting single-node, distributed, PD disaggregation scenarios.
> **Note**:
> - Core scheduling logic is mainly implemented in `engine/sched/resource_manager_v1.py`
> - Schedulers in this directory are being **gradually deprecated**. For PD disaggregation scheduling, use `router/` or `golang_router/`
| File | Function | Development Guide |
|------|----------|-------------------|
| `global_scheduler.py` | `GlobalScheduler` distributed scheduler (Redis) | (Being deprecated) |
| `local_scheduler.py` | `LocalScheduler` local scheduler | (Being deprecated) |
| `splitwise_scheduler.py` | `SplitwiseScheduler` PD disaggregation scheduling | (Being deprecated, use router) |
| `dp_scheduler.py` | Data parallel scheduler | (Being deprecated) |
| `config.py` | `SchedulerConfig` scheduling configuration | Scheduling parameter adjustment |
| `storage.py` | Storage adapter, wraps Redis connection | Storage layer modification |
**Core Scheduling Implementation** (`engine/sched/`):
| File | Function | Development Guide |
|------|----------|-------------------|
| `resource_manager_v1.py` | Core scheduling logic, contains `ScheduledDecodeTask`, `ScheduledPreemptTask` task classes | **First choice for scheduling strategy modification** |
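The core admission decision in a scheduler like `resource_manager_v1.py` is, at its simplest, "admit waiting requests while KV-cache blocks remain". The following sketch shows only that FCFS skeleton; the real implementation additionally handles preemption, chunked prefill, and prefix-cache reuse, and the block count per request depends on sequence length:

```python
from collections import deque

def schedule_step(waiting, running, free_blocks, blocks_per_request=4):
    """One FCFS admission step (illustrative sketch, not the real algorithm)."""
    while waiting and free_blocks >= blocks_per_request:
        req = waiting.popleft()            # admit the oldest waiting request
        free_blocks -= blocks_per_request  # reserve its KV-cache blocks
        running.append(req)
    return free_blocks

waiting = deque(["r1", "r2", "r3"])
running = []
remaining = schedule_step(waiting, running, free_blocks=9)
```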
---
### 4. entrypoints/ - API Entry Points
**Function**: External service interfaces, including offline inference and online API services.
| File | Function | Development Guide |
|------|----------|-------------------|
| `llm.py` | `LLM` main entry class, offline inference interface | **Entry point for using FastDeploy** |
| `engine_client.py` | Engine client | Request forwarding logic modification |
#### 4.1 openai/ - OpenAI Compatible API
| File | Function | Development Guide |
|------|----------|-------------------|
| `api_server.py` | FastAPI server | **Deployment service entry point** |
| `protocol.py` | OpenAI protocol definition | API format modification |
| `serving_chat.py` | Chat Completion API | Chat interface development |
| `serving_completion.py` | Completion API | Completion interface development |
| `serving_embedding.py` | Embedding API | Vectorization interface |
| `tool_parsers/` | Tool call parsers | Function Calling development |
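Since `api_server.py` exposes the standard OpenAI Chat Completion protocol, a client request is an ordinary JSON POST. The sketch below builds such a payload; the model name, host, and port are placeholders for your deployment:

```python
import json

# Standard OpenAI-style Chat Completion payload; "default" and the URL in the
# comment below are placeholders, not fixed FastDeploy values.
payload = {
    "model": "default",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Introduce FastDeploy in one sentence."},
    ],
    "stream": False,
    "max_tokens": 64,
}
body = json.dumps(payload).encode()
# POST `body` to http://<host>:<port>/v1/chat/completions with
# Content-Type: application/json, e.g. via urllib.request or curl.
```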
---
### 5. worker/ - Worker Process Module
**Function**: Actual execution process for model inference.
| File | Function | Development Guide |
|------|----------|-------------------|
| `gpu_model_runner.py` | **GPU model runner** (core inference loop) | **First choice for inference flow modification** |
| `gpu_worker.py` | GPU Worker process management | Worker lifecycle management |
| `xpu_model_runner.py` | XPU model runner | Kunlun chip adaptation |
| `hpu_model_runner.py` | HPU model runner | Intel HPU adaptation |
| `worker_process.py` | Worker process base class | Process management logic |
---
### 6. input/ - Input Processing Module
**Function**: Input data preprocessing, including tokenization, multimodal input processing.
| File | Function | Development Guide |
|------|----------|-------------------|
| `text_processor.py` | `BaseDataProcessor` text processor base class | Input processing extension |
| `ernie4_5_processor.py` | ERNIE 4.5 input processor | Baidu model input processing |
| `ernie4_5_tokenizer.py` | ERNIE 4.5 tokenizer | Tokenization logic modification |
| `preprocess.py` | Input preprocessing utilities | Preprocessing flow |
**Multimodal Processing Subdirectories**:
| Directory | Function |
|-----------|----------|
| `ernie4_5_vl_processor/` | ERNIE 4.5 VL image/video processing |
| `qwen_vl_processor/` | Qwen VL multimodal processing |
| `paddleocr_vl_processor/` | PaddleOCR VL processing |
---
### 7. output/ - Output Processing Module
**Function**: Inference result post-processing, streaming output management.
| File | Function | Development Guide |
|------|----------|-------------------|
| `token_processor.py` | `TokenProcessor` token output processing | Streaming output, speculative decoding |
| `pooler.py` | Pooling output processing | Embedding output |
| `stream_transfer_data.py` | Streaming transfer data structure | Data transfer format |
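Conceptually, streaming output is a generator that turns incoming token ids into incremental text pieces and stops at end-of-sequence. This is a toy sketch of that flow; the real `TokenProcessor` in `token_processor.py` also tracks finish reasons, speculative-decoding acceptance, and metrics:

```python
def stream_tokens(token_ids, detokenize, eos_id):
    """Yield one text delta per generated token (illustrative sketch)."""
    for tid in token_ids:
        if tid == eos_id:
            return                # stop streaming at end-of-sequence
        yield detokenize(tid)     # incremental piece for the client

# Toy vocabulary and token stream for demonstration.
vocab = {0: "Fast", 1: "Deploy", 2: "!"}
pieces = list(stream_tokens([0, 1, 2, 99], vocab.get, eos_id=99))
```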
---
### 8. cache_manager/ - Cache Management Module
**Function**: KV Cache management, supporting prefix caching, cross-device transfer.
| File | Function | Development Guide |
|------|----------|-------------------|
| `prefix_cache_manager.py` | `PrefixCacheManager` prefix tree cache | **First choice for KV Cache optimization** |
| `cache_transfer_manager.py` | KV Cache cross-device transfer | PD disaggregation cache transfer |
| `cache_data.py` | `BlockNode`, `CacheStatus` data structures | Cache data definition |
| `multimodal_cache_manager.py` | Multimodal cache management | Multimodal caching |
**Subdirectory**:
- `transfer_factory/` - Cache transfer factory (IPC, RDMA)
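The prefix cache works as a radix tree keyed by fixed-size blocks of tokens: a new request walks the tree and reuses the KV-cache blocks of its longest cached prefix. A minimal sketch of that lookup, with simplified node fields (the real `BlockNode` in `cache_data.py` also carries reference counts and eviction state):

```python
class BlockNodeSketch:
    """One node per KV-cache block of tokens (illustrative sketch)."""
    def __init__(self, block_id=None):
        self.block_id = block_id
        self.children = {}  # token-tuple of the next block -> child node

def match_prefix(root, token_blocks):
    """Return cached block ids for the longest shared prefix."""
    hits, node = [], root
    for block in token_blocks:
        node = node.children.get(block)
        if node is None:
            break               # first uncached block ends the match
        hits.append(node.block_id)
    return hits

# Cache contains the sequence (1, 2) -> (3, 4) as blocks 10 and 11.
root = BlockNodeSketch()
root.children[(1, 2)] = BlockNodeSketch(block_id=10)
root.children[(1, 2)].children[(3, 4)] = BlockNodeSketch(block_id=11)

hits = match_prefix(root, [(1, 2), (3, 4), (5, 6)])  # first two blocks hit
```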
---
### 9. platforms/ - Hardware Platform Support
**Function**: Multi-hardware platform adaptation, defining operators and features for each platform.
| File | Function | Development Guide |
|------|----------|-------------------|
| `base.py` | `Platform` base class, `_Backend` enum | **Entry point for new hardware adaptation** |
| `cuda.py` | NVIDIA CUDA platform | GPU optimization |
| `xpu.py` | Baidu Kunlun XPU platform | Kunlun chip adaptation |
| `dcu.py` | AMD DCU (ROCm) platform | AMD GPU adaptation |
| `maca.py` | MetaX GPU (MACA) platform | MetaX GPU adaptation |
| `intel_hpu.py` | Intel HPU platform | Intel Gaudi adaptation |
| `iluvatar.py` | Iluvatar GPU platform | Iluvatar adaptation |
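The platform layer follows a familiar base-class-plus-dispatch pattern: each platform subclasses a common `Platform` base and the runtime picks the one that is available. The sketch below shows only that pattern with invented names; the real base in `platforms/base.py` also exposes attention-backend selection and device capability checks:

```python
class PlatformSketch:
    """Minimal illustrative stand-in for platforms/base.py's Platform class."""
    device_name = "unknown"

    @classmethod
    def is_available(cls):
        return False

class CudaPlatformSketch(PlatformSketch):
    device_name = "gpu"

    @classmethod
    def is_available(cls):
        # Real code would probe the driver; assume present for the sketch.
        return True

def current_platform(candidates):
    """Pick the first usable platform, the way a dispatcher might."""
    for p in candidates:
        if p.is_available():
            return p
    raise RuntimeError("no supported platform found")

plat = current_platform([CudaPlatformSketch, PlatformSketch])
```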
---
### 10. metrics/ - Monitoring Metrics Module
**Function**: Prometheus metric collection, performance monitoring.
| File | Function | Development Guide |
|------|----------|-------------------|
| `metrics.py` | Prometheus metric definition | Adding new monitoring metrics |
| `stats.py` | ZMQ metric statistics | Distributed monitoring |
| `trace_util.py` | OpenTelemetry distributed tracing | Request tracing across services |
---
### 11. Other Important Modules
| Directory | Function | Development Guide |
|-----------|----------|-------------------|
| `inter_communicator/` | Inter-process communication (ZMQ) | Engine-Worker communication modification |
| `spec_decode/` | Speculative decoding (MTP, N-gram) | Speculative decoding strategy development |
| `distributed/` | Distributed communication (AllReduce) | Distributed inference development |
| `multimodal/` | Multimodal data processing | Multimodal feature extension |
| `reasoning/` | Reasoning mode parsing (DeepSeek R1 style) | Chain-of-thought parsing |
| `router/` | Request router, **recommended for PD disaggregation** | **First choice for PD disaggregation deployment** |
| `golang_router/` | Go-implemented router with higher scheduling throughput | **High-performance PD disaggregation scenarios** |
| `eplb/` | Expert Parallel load balancing | MoE load balancing |
| `rl/` | Reinforcement learning Rollout | RLHF scenarios |
| `plugins/` | Plugin system | Custom extensions |
| `logger/` | Logging module | Log format modification |
| `trace/` | Tracing module | Performance analysis |
---
### 12. Configuration Files
| File | Function | Development Guide |
|------|----------|-------------------|
| `config.py` | `FDConfig` main configuration class | **Entry point for configuration parameter modification** |
| `envs.py` | Environment variable configuration | Adding new environment variables |
| `utils.py` | General utility functions | Utility function reuse |
---
## II. Custom Operators Directory (custom_ops/)
**Function**: C++/CUDA high-performance operator implementations, organized by hardware platform.
```
custom_ops/
├── gpu_ops/ # NVIDIA GPU operators (main)
├── cpu_ops/ # CPU operators
├── xpu_ops/ # Baidu Kunlun XPU operators
├── iluvatar_ops/ # Iluvatar GPU operators
├── metax_ops/ # MetaX GPU operators
├── utils/ # Common utilities
└── third_party/ # Third-party libraries (cutlass, DeepGEMM)
```
### gpu_ops/ - GPU Operator Details
| Directory/File | Function | Development Guide |
|----------------|----------|-------------------|
| `append_attn/` | Append Attention implementation | **First choice for attention optimization** |
| `moe/` | MoE operators (fused_moe, expert_dispatch) | MoE performance optimization |
| `flash_mask_attn/` | Flash Mask Attention | Attention mask optimization |
| `mla_attn/` | Multi-Head Latent Attention | MLA model support |
| `machete/` | Machete GEMM | Matrix multiplication optimization |
| `quantization/` | Quantization operators | Quantization performance optimization |
| `sample_kernels/` | Sampling operators | Sampling performance optimization |
| `speculate_decoding/` | Speculative decoding operators | Speculative decoding optimization |
| `cutlass_kernels/` | CUTLASS kernels | High-performance GEMM |
| `cpp_extensions.cc` | C++ extension entry | **Entry point for new operator registration** |
| `append_attention.cu` | Append Attention core | Attention core implementation |
**Key Operator Files**:
- `fused_rotary_position_encoding.cu` - Fused rotary position encoding
- `multi_head_latent_attention.cu` - MLA attention
- `per_token_quant_fp8.cu` - FP8 quantization
---
## III. Test Directory (tests/)
**Function**: Unit tests and end-to-end tests, organized by module.
```
tests/
├── e2e/ # End-to-end service tests
├── operators/ # Operator unit tests
├── model_executor/ # Model executor tests
├── model_loader/ # Model loading tests
├── layers/ # Network layer tests
├── scheduler/ # Scheduler tests
├── cache_manager/ # Cache management tests
├── entrypoints/ # API entry tests
├── input/ # Input processing tests
├── output/ # Output processing tests
├── metrics/ # Metric tests
├── distributed/ # Distributed tests
├── graph_optimization/ # Graph optimization tests
├── quantization/ # Quantization tests
├── multimodal/ # Multimodal tests
├── xpu_ci/ # XPU CI tests
├── ce/ # CE environment tests
├── ci_use/ # CI utility tests
└── conftest.py # pytest configuration
```
### Test Directory Details
| Directory | Content | Development Guide |
|-----------|---------|-------------------|
| `e2e/` | Complete service tests for each model (ERNIE, Qwen, DeepSeek, etc.) | **Service integration testing** |
| `operators/` | Operator unit tests (`test_fused_moe.py`, `test_flash_mask_attn.py`, etc.) | **Required tests for operator development** |
| `layers/` | Network layer tests (attention, moe, quantization) | Network layer testing |
| `model_executor/` | Model execution flow tests | Model execution testing |
| `scheduler/` | Scheduler function tests | Scheduling logic verification |
| `cache_manager/` | Cache management tests | Cache logic verification |
---
## IV. Scripts Directory (scripts/)
**Function**: CI/CD, performance tuning, utility scripts.
| File | Function | Usage Scenario |
|------|----------|----------------|
| `run_unittest.sh` | Unit test runner | Local testing |
| `run_ci_xpu.sh` | XPU CI runner | Kunlun CI |
| `run_ci_hpu.sh` | HPU CI runner | Intel HPU CI |
| `run_ci_dcu.sh` | DCU CI runner | AMD DCU CI |
| `coverage_run.sh` | Code coverage statistics | Code quality |
| `tune_cublaslt_int8_gemm.py` | cuBLASLt INT8 GEMM tuning | Performance tuning |
| `tune_cutlass_fp8_gemm.py` | CUTLASS FP8 GEMM tuning | Performance tuning |
| `offline_w4a8.py` | Offline W4A8 quantization tool | Model quantization |
| `extract_mtp_weight_from_safetensor.py` | MTP weight extraction | Model processing |
---
## V. Other Directories
### docs/ - Documentation
- Usage documentation, API documentation, architecture design documents
### examples/ - Example Code
- Model usage examples, deployment examples
### benchmarks/ - Performance Benchmarks
- Performance test scripts, benchmark data
### tools/ - Development Tools
- `codestyle/` - Code style checking tools
- `dockerfile/` - Docker build tools
### dockerfiles/ - Docker Images
- Dockerfiles for each platform runtime environment
---
## VI. Quick Development Guide
### Adding a New Model
1. Reference `models/model_base.py` to understand model registration mechanism
2. Create new model file under `models/`
3. Add corresponding input processor under `input/`
4. Add tests under `tests/model_executor/`
### Adding a New Operator
1. Implement CUDA operator under `custom_ops/gpu_ops/`
2. Register operator in `cpp_extensions.cc`
3. Add Python wrapper under `model_executor/ops/gpu/`
4. Add tests under `tests/operators/`
### New Hardware Platform Adaptation
1. Reference `platforms/base.py` to create new platform class
2. Create hardware operator directory under `custom_ops/`
3. Create backend implementation under `model_executor/layers/backends/`
4. Create model runner under `worker/`
### Optimizing Inference Performance
1. Attention optimization: `custom_ops/gpu_ops/append_attn/`
2. MoE optimization: `custom_ops/gpu_ops/moe/`
3. Graph optimization: `fastdeploy/model_executor/graph_optimization/`
### PD Disaggregation Deployment
1. Router: `router/router.py` (Python implementation, recommended)
2. High-performance router: `golang_router/` (Go implementation, higher scheduling throughput)
3. Cache transfer: `cache_manager/cache_transfer_manager.py`
---
## VII. Configuration System
```
FDConfig (config.py)
├── ModelConfig # Model configuration
├── CacheConfig # Cache configuration
├── ParallelConfig # Parallel configuration
├── SchedulerConfig # Scheduler configuration
├── LoRAConfig # LoRA configuration
└── ...
Environment Variable Configuration (envs.py)
├── FD_* series environment variables
└── Runtime behavior control
```
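The tree above maps naturally onto composed dataclasses, with `envs.py`-style `FD_*` environment variables layered on top. The sketch below shows that shape with simplified field names; the real attributes in `config.py` and the real variable set in `envs.py` differ:

```python
import os
from dataclasses import dataclass, field

# Illustrative shape of the FDConfig hierarchy; field names are simplified
# placeholders, not the real fastdeploy/config.py attributes.
@dataclass
class CacheConfigSketch:
    block_size: int = 64

@dataclass
class ParallelConfigSketch:
    tensor_parallel_size: int = 1

@dataclass
class FDConfigSketch:
    cache: CacheConfigSketch = field(default_factory=CacheConfigSketch)
    parallel: ParallelConfigSketch = field(default_factory=ParallelConfigSketch)
    # envs.py-style override: FD_* environment variables tune runtime behavior
    # ("FD_DEBUG" here is used as an example variable name).
    debug: bool = os.environ.get("FD_DEBUG", "0") == "1"

cfg = FDConfigSketch(parallel=ParallelConfigSketch(tensor_parallel_size=2))
```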
---
This document covers the main modules and key files of the FastDeploy codebase and can serve as a reference for code navigation and development. For further detail, consult each module's documentation or the source code comments.