[简体中文](../zh/usage/code_overview.md)

# FastDeploy Code Structure Overview

This document provides a detailed overview of the FastDeploy codebase structure, helping developers quickly understand each module's functionality for development and feature extension.

---

## Directory Overview
```
FastDeploy/
├── fastdeploy/      # Core code directory
├── custom_ops/      # C++/CUDA custom operators
├── tests/           # Unit tests
├── scripts/         # Utility scripts
├── tools/           # Development tools
├── docs/            # Documentation
├── examples/        # Example code
├── benchmarks/      # Performance benchmarks
├── dockerfiles/     # Docker image build files
└── setup.py         # Python package installation script
```
---

## I. Core Code Directory (fastdeploy/)

The main entry file `fastdeploy/__init__.py` exports the core classes:

- `LLM` - Main entry class, offline inference interface
- `SamplingParams` - Sampling parameter configuration
- `ModelRegistry` - Model registry
- `version` - Version information
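For orientation, these exported classes combine into the offline inference flow roughly as follows. This is a hedged sketch, not a verbatim recipe: the model path is a placeholder, and argument names such as `tensor_parallel_size` should be checked against the current `LLM` and `SamplingParams` signatures.

```python
from fastdeploy import LLM, SamplingParams

# Hypothetical local model path; substitute a real checkpoint directory.
llm = LLM(model="./models/ERNIE-4.5-example", tensor_parallel_size=1)
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# generate() takes prompts plus sampling parameters and returns request outputs.
outputs = llm.generate(["Summarize what FastDeploy does."], sampling)
for output in outputs:
    print(output.outputs.text)
```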
### 1. engine/ - Core Engine Module

**Function**: Manages the LLM inference lifecycle and coordinates components.

| File | Function | Development Guide |
|------|----------|-------------------|
| `engine.py` | `LLMEngine` core engine class; manages the scheduler, preprocessor, and resource manager | Entry point for modifying engine behavior or adding new components |
| `async_llm.py` | Async LLM interface, `AsyncRequestQueue` request queue management | Async inference, streaming output development |
| `request.py` | Core request data structures: `Request`, `RequestOutput`, `RequestStatus` | Adding request fields, modifying request processing logic |
| `sampling_params.py` | `SamplingParams` sampling parameter configuration | Adding new sampling strategy parameters |
| `args_utils.py` | `EngineArgs` engine argument parsing | Adding new engine configuration parameters |
| `resource_manager.py` | GPU/CPU resource management | Resource allocation optimization |

**Subdirectory**:

- `sched/` - Core scheduling implementation, contains `resource_manager_v1.py` (**core scheduling logic**)
---

### 2. model_executor/ - Model Executor

**Function**: Core execution module for model inference, containing model definitions, layers, and operators.

#### 2.1 models/ - Model Implementations

| File/Directory | Function | Development Guide |
|----------------|----------|-------------------|
| `model_base.py` | `ModelRegistry` model registration base class | **Must read for adding new models** |
| `deepseek_v3.py` | DeepSeek V3 model | MoE large model reference |
| `ernie4_5_moe.py` | ERNIE 4.5 MoE model | Baidu's flagship model |
| `ernie4_5_mtp.py` | ERNIE 4.5 MTP multi-token prediction | Speculative decoding model |
| `qwen2.py` | Qwen2 model | General model reference |
| `qwen3.py` | Qwen3 model | Latest model reference |
| `ernie4_5_vl/` | ERNIE 4.5 vision-language model | Multimodal model development reference |
| `qwen2_5_vl/` | Qwen2.5 VL multimodal model | VL model reference |
| `paddleocr_vl/` | PaddleOCR VL model | OCR multimodal reference |
#### 2.2 layers/ - Network Layer Implementations

| Subdirectory/File | Function | Development Guide |
|-------------------|----------|-------------------|
| `attention/` | Attention mechanism implementations (flash_attn, append_attn, mla_attn) | **First choice for attention performance optimization** |
| `moe/` | MoE layer implementations (Cutlass, Triton, DeepGEMM backends) | MoE performance optimization |
| `quantization/` | Quantization layers (FP8, W4A8, WINT2, weight-only) | Quantization scheme development |
| `linear.py` | Linear layer implementation | Matrix multiplication optimization |
| `embeddings.py` | Embedding layer implementation | Word embedding modification |
| `normalization.py` | Normalization layers (RMSNorm, LayerNorm) | Normalization optimization |
| `rotary_embedding.py` | Rotary position embedding (RoPE) | Position encoding modification |
| `sample/` | Sampler implementation | Sampling strategy development |
| `backends/` | Hardware backend implementations (cuda, xpu, dcu, hpu, metax, gcu, npu) | **Entry point for new hardware adaptation** |

#### 2.3 Other Submodules

| Directory | Function | Development Guide |
|-----------|----------|-------------------|
| `model_loader/` | Model weight loader | New model format support |
| `guided_decoding/` | Guided decoding (JSON/regex constrained output) | Structured output development |
| `graph_optimization/` | Graph optimization (CUDA Graph) | Inference performance optimization |
| `logits_processor/` | Logits processor | Output control logic |
| `ops/` | Python-callable operators (organized by hardware platform) | Operator call entry point |

**Key Files**:

- `model_base.py` - Model base class, registry definition
- `pre_and_post_process.py` - Pre/post processing utilities
---

### 3. scheduler/ - Scheduler Module

**Function**: Request scheduling, supporting single-node, distributed, and PD-disaggregated scenarios.

> **Note**:
> - Core scheduling logic is mainly implemented in `engine/sched/resource_manager_v1.py`
> - Schedulers in this directory are being **gradually deprecated**. For PD disaggregation scheduling, use `router/` or `golang_router/`

| File | Function | Development Guide |
|------|----------|-------------------|
| `global_scheduler.py` | `GlobalScheduler` distributed scheduler (Redis) | (Being deprecated) |
| `local_scheduler.py` | `LocalScheduler` local scheduler | (Being deprecated) |
| `splitwise_scheduler.py` | `SplitwiseScheduler` PD disaggregation scheduling | (Being deprecated; use the router) |
| `dp_scheduler.py` | Data parallel scheduler | (Being deprecated) |
| `config.py` | `SchedulerConfig` scheduling configuration | Scheduling parameter adjustment |
| `storage.py` | Storage adapter, wraps the Redis connection | Storage layer modification |

**Core Scheduling Implementation** (`engine/sched/`):

| File | Function | Development Guide |
|------|----------|-------------------|
| `resource_manager_v1.py` | Core scheduling logic, contains the `ScheduledDecodeTask` and `ScheduledPreemptTask` task classes | **First choice for scheduling strategy modification** |
---

### 4. entrypoints/ - API Entry Points

**Function**: External service interfaces, including offline inference and online API services.

| File | Function | Development Guide |
|------|----------|-------------------|
| `llm.py` | `LLM` main entry class, offline inference interface | **Entry point for using FastDeploy** |
| `engine_client.py` | Engine client | Request forwarding logic modification |

#### 4.1 openai/ - OpenAI-Compatible API

| File | Function | Development Guide |
|------|----------|-------------------|
| `api_server.py` | FastAPI server | **Deployment service entry point** |
| `protocol.py` | OpenAI protocol definitions | API format modification |
| `serving_chat.py` | Chat Completion API | Chat interface development |
| `serving_completion.py` | Completion API | Completion interface development |
| `serving_embedding.py` | Embedding API | Vectorization interface |
| `tool_parsers/` | Tool call parsers | Function calling development |
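Once `api_server.py` is running, the service can be exercised with any OpenAI-compatible client. A hedged sketch using the official `openai` Python SDK; the host, port, and model name below are placeholders and should be replaced with the values the server was launched with.

```python
import openai

# Placeholder endpoint; check the host/port api_server.py was started with.
client = openai.OpenAI(base_url="http://127.0.0.1:8188/v1", api_key="none")

response = client.chat.completions.create(
    model="default",  # model name as registered by the server (placeholder)
    messages=[{"role": "user", "content": "Hello!"}],
    stream=False,
)
print(response.choices[0].message.content)
```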
---

### 5. worker/ - Worker Process Module

**Function**: The process that actually executes model inference.

| File | Function | Development Guide |
|------|----------|-------------------|
| `gpu_model_runner.py` | **GPU model runner** (core inference loop) | **First choice for inference flow modification** |
| `gpu_worker.py` | GPU worker process management | Worker lifecycle management |
| `xpu_model_runner.py` | XPU model runner | Kunlun chip adaptation |
| `hpu_model_runner.py` | HPU model runner | Intel HPU adaptation |
| `worker_process.py` | Worker process base class | Process management logic |
---

### 6. input/ - Input Processing Module

**Function**: Input data preprocessing, including tokenization and multimodal input processing.

| File | Function | Development Guide |
|------|----------|-------------------|
| `text_processor.py` | `BaseDataProcessor` text processor base class | Input processing extension |
| `ernie4_5_processor.py` | ERNIE 4.5 input processor | Baidu model input processing |
| `ernie4_5_tokenizer.py` | ERNIE 4.5 tokenizer | Tokenization logic modification |
| `preprocess.py` | Input preprocessing utilities | Preprocessing flow |

**Multimodal Processing Subdirectories**:

| Directory | Function |
|-----------|----------|
| `ernie4_5_vl_processor/` | ERNIE 4.5 VL image/video processing |
| `qwen_vl_processor/` | Qwen VL multimodal processing |
| `paddleocr_vl_processor/` | PaddleOCR VL processing |
---

### 7. output/ - Output Processing Module

**Function**: Inference result post-processing and streaming output management.

| File | Function | Development Guide |
|------|----------|-------------------|
| `token_processor.py` | `TokenProcessor` token output processing | Streaming output, speculative decoding |
| `pooler.py` | Pooling output processing | Embedding output |
| `stream_transfer_data.py` | Streaming transfer data structure | Data transfer format |
---

### 8. cache_manager/ - Cache Management Module

**Function**: KV Cache management, supporting prefix caching and cross-device transfer.

| File | Function | Development Guide |
|------|----------|-------------------|
| `prefix_cache_manager.py` | `PrefixCacheManager` prefix-tree cache | **First choice for KV Cache optimization** |
| `cache_transfer_manager.py` | KV Cache cross-device transfer | PD disaggregation cache transfer |
| `cache_data.py` | `BlockNode`, `CacheStatus` data structures | Cache data definitions |
| `multimodal_cache_manager.py` | Multimodal cache management | Multimodal caching |

**Subdirectory**:

- `transfer_factory/` - Cache transfer factory (IPC, RDMA)
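To make the role of a prefix cache concrete, here is a small self-contained sketch of the underlying idea: prompts that share a leading run of token blocks can reuse the cached KV blocks for that run. This is conceptual only; FastDeploy's actual `PrefixCacheManager` is far more elaborate (tree structure, eviction, device/host block pools).

```python
# Conceptual sketch of block-level prefix caching; not FastDeploy's implementation.
BLOCK_SIZE = 4

def to_blocks(token_ids):
    """Split a token sequence into full blocks of BLOCK_SIZE tokens."""
    n = len(token_ids) // BLOCK_SIZE
    return [tuple(token_ids[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]) for i in range(n)]

class PrefixCache:
    def __init__(self):
        self.cache = {}  # maps a block-path prefix -> cached KV block id

    def match_and_insert(self, token_ids):
        """Return how many leading tokens hit the cache, caching misses as we go."""
        hit = 0
        path = ()
        for blk in to_blocks(token_ids):
            path = path + (blk,)
            if path in self.cache:
                hit += BLOCK_SIZE  # this block's KV can be reused
            else:
                self.cache[path] = len(self.cache)  # allocate a new block id
        return hit

cache = PrefixCache()
first = cache.match_and_insert([1, 2, 3, 4, 5, 6, 7, 8])      # cold start
second = cache.match_and_insert([1, 2, 3, 4, 9, 10, 11, 12])  # shares one block
print(first, second)  # 0 4
```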
---

### 9. platforms/ - Hardware Platform Support

**Function**: Multi-hardware platform adaptation, defining the operators and features available on each platform.

| File | Function | Development Guide |
|------|----------|-------------------|
| `base.py` | `Platform` base class, `_Backend` enum | **Entry point for new hardware adaptation** |
| `cuda.py` | NVIDIA CUDA platform | GPU optimization |
| `xpu.py` | Baidu Kunlun XPU platform | Kunlun chip adaptation |
| `dcu.py` | AMD DCU (ROCm) platform | AMD GPU adaptation |
| `maca.py` | MetaX GPU (MACA) platform | MetaX GPU adaptation |
| `intel_hpu.py` | Intel HPU platform | Intel Gaudi adaptation |
| `iluvatar.py` | Iluvatar GPU platform | Iluvatar adaptation |
---

### 10. metrics/ - Monitoring Metrics Module

**Function**: Prometheus metric collection and performance monitoring.

| File | Function | Development Guide |
|------|----------|-------------------|
| `metrics.py` | Prometheus metric definitions | Adding new monitoring metrics |
| `stats.py` | ZMQ metric statistics | Distributed monitoring |
| `trace_util.py` | OpenTelemetry distributed tracing | Request-path tracing |
---

### 11. Other Important Modules

| Directory | Function | Development Guide |
|-----------|----------|-------------------|
| `inter_communicator/` | Inter-process communication (ZMQ) | Engine-worker communication modification |
| `spec_decode/` | Speculative decoding (MTP, N-gram) | Speculative decoding strategy development |
| `distributed/` | Distributed communication (AllReduce) | Distributed inference development |
| `multimodal/` | Multimodal data processing | Multimodal feature extension |
| `reasoning/` | Reasoning mode parsing (DeepSeek R1 style) | Chain-of-thought parsing |
| `router/` | Request router, **recommended for PD disaggregation** | **First choice for PD disaggregation deployment** |
| `golang_router/` | Go-implemented router with better PD inter-node scheduling performance | **High-performance PD disaggregation scenarios** |
| `eplb/` | Expert-parallel load balancing | MoE load balancing |
| `rl/` | Reinforcement learning rollout | RLHF scenarios |
| `plugins/` | Plugin system | Custom extensions |
| `logger/` | Logging module | Log format modification |
| `trace/` | Tracing module | Performance analysis |
---

### 12. Configuration Files

| File | Function | Development Guide |
|------|----------|-------------------|
| `config.py` | `FDConfig` main configuration class | **Entry point for configuration parameter modification** |
| `envs.py` | Environment variable configuration | Adding new environment variables |
| `utils.py` | General utility functions | Utility function reuse |

---

## II. Custom Operators Directory (custom_ops/)

**Function**: C++/CUDA high-performance operator implementations, organized by hardware platform.
```
custom_ops/
├── gpu_ops/       # NVIDIA GPU operators (main)
├── cpu_ops/       # CPU operators
├── xpu_ops/       # Baidu Kunlun XPU operators
├── iluvatar_ops/  # Iluvatar GPU operators
├── metax_ops/     # MetaX GPU operators
├── utils/         # Common utilities
└── third_party/   # Third-party libraries (cutlass, DeepGEMM)
```
### gpu_ops/ - GPU Operator Details

| Directory/File | Function | Development Guide |
|----------------|----------|-------------------|
| `append_attn/` | Append Attention implementation | **First choice for attention optimization** |
| `moe/` | MoE operators (fused_moe, expert_dispatch) | MoE performance optimization |
| `flash_mask_attn/` | Flash Mask Attention | Attention mask optimization |
| `mla_attn/` | Multi-head Latent Attention | MLA model support |
| `machete/` | Machete GEMM | Matrix multiplication optimization |
| `quantization/` | Quantization operators | Quantization performance optimization |
| `sample_kernels/` | Sampling operators | Sampling performance optimization |
| `speculate_decoding/` | Speculative decoding operators | Speculative decoding optimization |
| `cutlass_kernels/` | CUTLASS kernels | High-performance GEMM |
| `cpp_extensions.cc` | C++ extension entry | **Entry point for new operator registration** |
| `append_attention.cu` | Append Attention core | Attention core implementation |

**Key Operator Files**:

- `fused_rotary_position_encoding.cu` - Fused rotary position encoding
- `multi_head_latent_attention.cu` - MLA attention
- `per_token_quant_fp8.cu` - FP8 quantization
---

## III. Test Directory (tests/)

**Function**: Unit tests and end-to-end tests, organized by module.

```
tests/
├── e2e/                 # End-to-end service tests
├── operators/           # Operator unit tests
├── model_executor/      # Model executor tests
├── model_loader/        # Model loading tests
├── layers/              # Network layer tests
├── scheduler/           # Scheduler tests
├── cache_manager/       # Cache management tests
├── entrypoints/         # API entry tests
├── input/               # Input processing tests
├── output/              # Output processing tests
├── metrics/             # Metric tests
├── distributed/         # Distributed tests
├── graph_optimization/  # Graph optimization tests
├── quantization/        # Quantization tests
├── multimodal/          # Multimodal tests
├── xpu_ci/              # XPU CI tests
├── ce/                  # CE environment tests
├── ci_use/              # CI utility tests
└── conftest.py          # pytest configuration
```
### Test Directory Details

| Directory | Content | Development Guide |
|-----------|---------|-------------------|
| `e2e/` | Complete service tests for each model (ERNIE, Qwen, DeepSeek, etc.) | **Service integration testing** |
| `operators/` | Operator unit tests (`test_fused_moe.py`, `test_flash_mask_attn.py`, etc.) | **Required tests for operator development** |
| `layers/` | Network layer tests (attention, moe, quantization) | Network layer testing |
| `model_executor/` | Model execution flow tests | Model execution testing |
| `scheduler/` | Scheduler function tests | Scheduling logic verification |
| `cache_manager/` | Cache management tests | Cache logic verification |

---
## IV. Scripts Directory (scripts/)

**Function**: CI/CD, performance tuning, and utility scripts.

| File | Function | Usage Scenario |
|------|----------|----------------|
| `run_unittest.sh` | Unit test runner | Local testing |
| `run_ci_xpu.sh` | XPU CI runner | Kunlun CI |
| `run_ci_hpu.sh` | HPU CI runner | Intel HPU CI |
| `run_ci_dcu.sh` | DCU CI runner | AMD DCU CI |
| `coverage_run.sh` | Code coverage statistics | Code quality |
| `tune_cublaslt_int8_gemm.py` | cuBLASLt INT8 GEMM tuning | Performance tuning |
| `tune_cutlass_fp8_gemm.py` | CUTLASS FP8 GEMM tuning | Performance tuning |
| `offline_w4a8.py` | Offline W4A8 quantization tool | Model quantization |
| `extract_mtp_weight_from_safetensor.py` | MTP weight extraction | Model processing |

---
## V. Other Directories

### docs/ - Documentation

- Usage documentation, API documentation, architecture design documents

### examples/ - Example Code

- Model usage examples, deployment examples

### benchmarks/ - Performance Benchmarks

- Performance test scripts, benchmark data

### tools/ - Development Tools

- `codestyle/` - Code style checking tools
- `dockerfile/` - Docker build tools

### dockerfiles/ - Docker Images

- Dockerfiles for each platform's runtime environment

---
## VI. Quick Development Guide

### Adding a New Model

1. Read `models/model_base.py` to understand the model registration mechanism
2. Create the new model file under `models/`
3. Add a corresponding input processor under `input/`
4. Add tests under `tests/model_executor/`
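The registration mechanism behind step 1 follows the common registry-decorator pattern. A self-contained sketch of the idea only; the real `ModelRegistry` API in `fastdeploy/model_executor/models/model_base.py` may differ, and the class and architecture names below are hypothetical.

```python
# Conceptual registry sketch; not FastDeploy's actual ModelRegistry API.
class ModelRegistry:
    _models = {}

    @classmethod
    def register(cls, arch_name):
        """Class decorator mapping an architecture name to a model class."""
        def wrap(model_cls):
            cls._models[arch_name] = model_cls
            return model_cls
        return wrap

    @classmethod
    def resolve(cls, arch_name):
        """Look up the model class registered for an architecture name."""
        return cls._models[arch_name]

@ModelRegistry.register("MyNewForCausalLM")  # hypothetical architecture name
class MyNewModel:
    def __init__(self, config):
        self.config = config

print(ModelRegistry.resolve("MyNewForCausalLM").__name__)  # MyNewModel
```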
### Adding a New Operator

1. Implement the CUDA operator under `custom_ops/gpu_ops/`
2. Register the operator in `cpp_extensions.cc`
3. Add a Python wrapper under `model_executor/ops/gpu/`
4. Add tests under `tests/operators/`

### New Hardware Platform Adaptation

1. Create a new platform class, referencing `platforms/base.py`
2. Create a hardware operator directory under `custom_ops/`
3. Create a backend implementation under `model_executor/layers/backends/`
4. Create a model runner under `worker/`
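Step 1 of platform adaptation amounts to subclassing the platform base class and overriding its capability hooks. The following is a conceptual sketch only: the real `Platform` base class and `_Backend` enum in `platforms/base.py` have more hooks, and every name below (`Backend`, `MyNPUPlatform`, `my_npu`) is illustrative.

```python
import enum

# Conceptual sketch of a platform plug-in; not FastDeploy's actual base class.
class Backend(enum.Enum):
    NATIVE_ATTN = "native_attn"
    APPEND_ATTN = "append_attn"

class Platform:
    device_name: str = "unknown"

    def is_available(self) -> bool:
        """Probe whether this hardware's driver/runtime is usable."""
        raise NotImplementedError

    def attention_backend(self) -> Backend:
        """Default attention backend for this hardware."""
        return Backend.NATIVE_ATTN

class MyNPUPlatform(Platform):
    device_name = "my_npu"  # hypothetical device

    def is_available(self) -> bool:
        return False  # a real platform would probe the runtime here

    def attention_backend(self) -> Backend:
        return Backend.APPEND_ATTN

plat = MyNPUPlatform()
print(plat.device_name, plat.attention_backend().value)  # my_npu append_attn
```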
### Optimizing Inference Performance

1. Attention optimization: `custom_ops/gpu_ops/append_attn/`
2. MoE optimization: `custom_ops/gpu_ops/moe/`
3. Graph optimization: `fastdeploy/model_executor/graph_optimization/`

### PD Disaggregation Deployment

1. Router: `router/router.py` (Python implementation, recommended)
2. High-performance router: `golang_router/` (Go implementation, better PD inter-node scheduling performance)
3. Cache transfer: `cache_manager/cache_transfer_manager.py`

---

## VII. Configuration System
```
FDConfig (config.py)
├── ModelConfig      # Model configuration
├── CacheConfig      # Cache configuration
├── ParallelConfig   # Parallel configuration
├── SchedulerConfig  # Scheduler configuration
├── LoRAConfig       # LoRA configuration
└── ...

Environment Variable Configuration (envs.py)
├── FD_* series environment variables
└── Runtime behavior control
```
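The `FD_*` environment variables generally follow the usual read-with-default pattern. A hedged sketch of that pattern; the variable names and defaults below are illustrative, so consult `fastdeploy/envs.py` for the real list.

```python
import os

# Illustrative FD_* environment-variable pattern; variable names are made up.
def _env_int(name: str, default: int) -> int:
    """Read an integer environment variable, falling back to a default."""
    value = os.environ.get(name)
    return int(value) if value is not None else default

os.environ["FD_EXAMPLE_LOG_LEVEL"] = "2"            # simulate an exported variable
log_level = _env_int("FD_EXAMPLE_LOG_LEVEL", 0)     # set -> 2
block_size = _env_int("FD_EXAMPLE_BLOCK_SIZE", 64)  # unset -> default 64
print(log_level, block_size)  # 2 64
```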
---

This document covers the main modules and key files of the FastDeploy codebase and can be used as a code navigation and development reference. For questions, please refer to each module's detailed documentation or to source code comments.