[简体中文](../zh/usage/code_overview.md)

# FastDeploy Code Structure Overview

This document provides a detailed overview of the FastDeploy codebase structure, helping developers quickly understand each module's functionality for development and feature extension.

---

## Directory Overview

```
FastDeploy/
├── fastdeploy/     # Core code directory
├── custom_ops/     # C++/CUDA custom operators
├── tests/          # Unit tests
├── scripts/        # Utility scripts
├── tools/          # Development tools
├── docs/           # Documentation
├── examples/       # Example code
├── benchmarks/     # Performance benchmarks
├── dockerfiles/    # Docker image build files
└── setup.py        # Python package installation script
```

---

## I. Core Code Directory (fastdeploy/)

The main entry file `fastdeploy/__init__.py` exports the core classes:

- `LLM` - Main entry class, offline inference interface
- `SamplingParams` - Sampling parameter configuration
- `ModelRegistry` - Model registry
- `version` - Version information

### 1. engine/ - Core Engine Module

**Function**: Manages the LLM inference lifecycle and coordinates the other components.
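The engine's role — admit requests, schedule a batch, run one decode step, retire finished requests — can be pictured with a toy continuous-batching loop. This is an illustration only; none of the names below are FastDeploy's actual classes.

```python
# Toy continuous-batching loop sketching what an engine like LLMEngine
# coordinates each step. All names here are illustrative, not FastDeploy APIs.
from collections import deque


class ToyEngine:
    def __init__(self, max_batch: int = 2):
        self.waiting = deque()   # requests not yet scheduled
        self.running = []        # requests currently decoding
        self.max_batch = max_batch

    def add_request(self, req_id: str, budget: int):
        # `budget` stands in for a max_tokens-style sampling parameter
        self.waiting.append({"id": req_id, "left": budget, "tokens": []})

    def step(self):
        # 1) schedule: fill the running batch from the waiting queue
        while self.waiting and len(self.running) < self.max_batch:
            self.running.append(self.waiting.popleft())
        # 2) execute: a worker would run one forward pass per step
        for req in self.running:
            req["tokens"].append("<tok>")
            req["left"] -= 1
        # 3) retire and return finished requests
        done = [r for r in self.running if r["left"] == 0]
        self.running = [r for r in self.running if r["left"] > 0]
        return done


engine = ToyEngine()
engine.add_request("a", budget=2)
engine.add_request("b", budget=3)
finished = []
while engine.waiting or engine.running:
    finished.extend(engine.step())
print([r["id"] for r in finished])  # ['a', 'b']
```

The real engine additionally handles preemption, KV-cache block allocation, and streaming output, but the schedule/execute/retire shape of the loop is the same.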
| File | Function | Development Guide |
|------|----------|-------------------|
| `engine.py` | `LLMEngine` core engine class; manages the scheduler, preprocessor, and resource manager | Entry point for modifying engine behavior or adding new components |
| `async_llm.py` | Async LLM interface, `AsyncRequestQueue` request queue management | Async inference, streaming output development |
| `request.py` | Core request data structures: `Request`, `RequestOutput`, `RequestStatus` | Adding request fields, modifying request processing logic |
| `sampling_params.py` | `SamplingParams` sampling parameter configuration | Adding new sampling strategy parameters |
| `args_utils.py` | `EngineArgs` engine argument parsing | Adding new engine configuration parameters |
| `resource_manager.py` | GPU/CPU resource management | Resource allocation optimization |

**Subdirectory**:

- `sched/` - Core scheduling implementation, contains `resource_manager_v1.py` (**core scheduling logic**)

---

### 2. model_executor/ - Model Executor

**Function**: Core execution module for model inference, containing model definitions, network layers, and operators.
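The `ModelRegistry` mechanism that `model_base.py` provides (see the table below) typically follows a decorator-based registration pattern: a model class registers itself under an architecture name, and the loader resolves that name at runtime. A generic sketch with hypothetical names:

```python
# Generic sketch of decorator-based model registration; the class and method
# names are hypothetical, not FastDeploy's actual ModelRegistry API.
class ToyModelRegistry:
    _models = {}

    @classmethod
    def register(cls, name: str):
        def deco(model_cls):
            cls._models[name] = model_cls
            return model_cls
        return deco

    @classmethod
    def resolve(cls, name: str):
        return cls._models[name]


@ToyModelRegistry.register("qwen2")
class ToyQwen2Model:
    def __init__(self, config: dict):
        self.config = config


# The loader only needs the architecture name from the model config
model_cls = ToyModelRegistry.resolve("qwen2")
model = model_cls({"hidden_size": 1024})
print(type(model).__name__)  # ToyQwen2Model
```

This is why adding a new model usually reduces to writing one file under `models/` and registering it: no other module needs to know the class exists.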
#### 2.1 models/ - Model Implementations

| File/Directory | Function | Development Guide |
|----------------|----------|-------------------|
| `model_base.py` | `ModelRegistry` model registration base class | **Must read before adding a new model** |
| `deepseek_v3.py` | DeepSeek V3 model | MoE large-model reference |
| `ernie4_5_moe.py` | ERNIE 4.5 MoE model | Baidu's flagship model |
| `ernie4_5_mtp.py` | ERNIE 4.5 MTP multi-token prediction | Speculative decoding model |
| `qwen2.py` | Qwen2 model | General model reference |
| `qwen3.py` | Qwen3 model | Latest model reference |
| `ernie4_5_vl/` | ERNIE 4.5 vision-language model | Multimodal model development reference |
| `qwen2_5_vl/` | Qwen2.5 VL multimodal model | VL model reference |
| `paddleocr_vl/` | PaddleOCR VL model | OCR multimodal reference |

#### 2.2 layers/ - Network Layer Implementations

| Subdirectory/File | Function | Development Guide |
|-------------------|----------|-------------------|
| `attention/` | Attention mechanism implementations (flash_attn, append_attn, mla_attn) | **First choice for attention performance optimization** |
| `moe/` | MoE layer implementations (Cutlass, Triton, DeepGEMM backends) | MoE performance optimization |
| `quantization/` | Quantization layers (FP8, W4A8, WINT2, weight-only) | Quantization scheme development |
| `linear.py` | Linear layer implementation | Matrix multiplication optimization |
| `embeddings.py` | Embedding layer implementation | Word embedding modification |
| `normalization.py` | Normalization layers (RMSNorm, LayerNorm) | Normalization optimization |
| `rotary_embedding.py` | Rotary position embedding (RoPE) | Position encoding modification |
| `sample/` | Sampler implementation | Sampling strategy development |
| `backends/` | Hardware backend implementations (cuda, xpu, dcu, hpu, metax, gcu, npu) | **Entry point for new hardware adaptation** |

#### 2.3 Other Submodules

| Directory | Function | Development Guide |
|-----------|----------|-------------------|
| `model_loader/` | Model weight loader | New model format support |
| `guided_decoding/` | Guided decoding (JSON/regex constrained output) | Structured output development |
| `graph_optimization/` | Graph optimization (CUDA Graph) | Inference performance optimization |
| `logits_processor/` | Logits processor | Output control logic |
| `ops/` | Python-callable operators (organized by hardware platform) | Operator call entry point |

**Key Files**:

- `model_base.py` - Model base class and registry definition
- `pre_and_post_process.py` - Pre/post-processing utilities

---

### 3. scheduler/ - Scheduler Module

**Function**: Request scheduling, supporting single-node, distributed, and PD-disaggregation scenarios.

> **Note**:
> - Core scheduling logic is mainly implemented in `engine/sched/resource_manager_v1.py`.
> - Schedulers in this directory are being **gradually deprecated**. For PD-disaggregation scheduling, use `router/` or `golang_router/` instead.

| File | Function | Development Guide |
|------|----------|-------------------|
| `global_scheduler.py` | `GlobalScheduler` distributed scheduler (Redis) | (Being deprecated) |
| `local_scheduler.py` | `LocalScheduler` local scheduler | (Being deprecated) |
| `splitwise_scheduler.py` | `SplitwiseScheduler` PD-disaggregation scheduling | (Being deprecated; use router) |
| `dp_scheduler.py` | Data-parallel scheduler | (Being deprecated) |
| `config.py` | `SchedulerConfig` scheduling configuration | Scheduling parameter adjustment |
| `storage.py` | Storage adapter, wraps the Redis connection | Storage layer modification |

**Core Scheduling Implementation** (`engine/sched/`):

| File | Function | Development Guide |
|------|----------|-------------------|
| `resource_manager_v1.py` | Core scheduling logic, contains the `ScheduledDecodeTask` and `ScheduledPreemptTask` task classes | **First choice for scheduling strategy modification** |

---

### 4. entrypoints/ - API Entry Points

**Function**: External service interfaces, including offline inference and online API services.

| File | Function | Development Guide |
|------|----------|-------------------|
| `llm.py` | `LLM` main entry class, offline inference interface | **Entry point for using FastDeploy** |
| `engine_client.py` | Engine client | Request forwarding logic modification |

#### 4.1 openai/ - OpenAI-Compatible API

| File | Function | Development Guide |
|------|----------|-------------------|
| `api_server.py` | FastAPI server | **Deployment service entry point** |
| `protocol.py` | OpenAI protocol definitions | API format modification |
| `serving_chat.py` | Chat Completion API | Chat interface development |
| `serving_completion.py` | Completion API | Completion interface development |
| `serving_embedding.py` | Embedding API | Vectorization interface |
| `tool_parsers/` | Tool call parsers | Function calling development |

---

### 5. worker/ - Worker Process Module

**Function**: The process that actually executes model inference.

| File | Function | Development Guide |
|------|----------|-------------------|
| `gpu_model_runner.py` | **GPU model runner** (core inference loop) | **First choice for inference flow modification** |
| `gpu_worker.py` | GPU worker process management | Worker lifecycle management |
| `xpu_model_runner.py` | XPU model runner | Kunlun chip adaptation |
| `hpu_model_runner.py` | HPU model runner | Intel HPU adaptation |
| `worker_process.py` | Worker process base class | Process management logic |

---

### 6. input/ - Input Processing Module

**Function**: Input data preprocessing, including tokenization and multimodal input processing.
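What such a processor does can be pictured with a minimal base-class sketch: one abstract interface, one subclass per model family that turns a raw request into model-ready token IDs. The names below are hypothetical, not the actual `BaseDataProcessor` API.

```python
# Hypothetical sketch of the processor pattern this module follows:
# a shared base interface plus a per-model-family subclass.
from abc import ABC, abstractmethod


class ToyDataProcessor(ABC):
    @abstractmethod
    def process_request(self, request: dict) -> dict:
        """Turn a raw request into model-ready inputs."""


class ToyTextProcessor(ToyDataProcessor):
    def __init__(self, vocab: dict):
        self.vocab = vocab  # toy "tokenizer": word -> id

    def process_request(self, request: dict) -> dict:
        ids = [self.vocab.get(w, 0) for w in request["prompt"].split()]
        # keep the original fields, attach the tokenized prompt
        return {**request, "prompt_token_ids": ids}


proc = ToyTextProcessor({"hello": 1, "world": 2})
out = proc.process_request({"prompt": "hello world", "max_tokens": 8})
print(out["prompt_token_ids"])  # [1, 2]
```

The multimodal processors in the subdirectories below follow the same shape, with image/video preprocessing added before tokenization.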
| File | Function | Development Guide |
|------|----------|-------------------|
| `text_processor.py` | `BaseDataProcessor` text processor base class | Input processing extension |
| `ernie4_5_processor.py` | ERNIE 4.5 input processor | Baidu model input processing |
| `ernie4_5_tokenizer.py` | ERNIE 4.5 tokenizer | Tokenization logic modification |
| `preprocess.py` | Input preprocessing utilities | Preprocessing flow |

**Multimodal Processing Subdirectories**:

| Directory | Function |
|-----------|----------|
| `ernie4_5_vl_processor/` | ERNIE 4.5 VL image/video processing |
| `qwen_vl_processor/` | Qwen VL multimodal processing |
| `paddleocr_vl_processor/` | PaddleOCR VL processing |

---

### 7. output/ - Output Processing Module

**Function**: Inference result post-processing and streaming output management.

| File | Function | Development Guide |
|------|----------|-------------------|
| `token_processor.py` | `TokenProcessor` token output processing | Streaming output, speculative decoding |
| `pooler.py` | Pooling output processing | Embedding output |
| `stream_transfer_data.py` | Streaming transfer data structure | Data transfer format |

---

### 8. cache_manager/ - Cache Management Module

**Function**: KV Cache management, supporting prefix caching and cross-device transfer.

| File | Function | Development Guide |
|------|----------|-------------------|
| `prefix_cache_manager.py` | `PrefixCacheManager` prefix-tree cache | **First choice for KV Cache optimization** |
| `cache_transfer_manager.py` | KV Cache cross-device transfer | PD-disaggregation cache transfer |
| `cache_data.py` | `BlockNode`, `CacheStatus` data structures | Cache data definitions |
| `multimodal_cache_manager.py` | Multimodal cache management | Multimodal caching |

**Subdirectory**:

- `transfer_factory/` - Cache transfer factory (IPC, RDMA)

---

### 9. platforms/ - Hardware Platform Support

**Function**: Multi-hardware platform adaptation, defining the operators and features available on each platform.
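The pattern behind `base.py` is a platform base class plus one subclass per hardware target, each answering questions like "is this device available?" and "which attention backend should be used by default?". An illustrative sketch (class and enum names are hypothetical, not the real `Platform`/`_Backend` definitions):

```python
# Illustrative sketch of the platform-adaptation pattern: a base class with
# per-hardware subclasses. Names are hypothetical, not FastDeploy's.
from enum import Enum


class ToyBackend(Enum):
    APPEND_ATTN = "append_attn"
    FLASH_ATTN = "flash_attn"


class ToyPlatform:
    device_name = "unknown"

    def is_available(self) -> bool:
        return False

    def default_attention_backend(self) -> ToyBackend:
        raise NotImplementedError


class ToyCudaPlatform(ToyPlatform):
    device_name = "gpu"

    def is_available(self) -> bool:
        return True  # a real platform would probe the device runtime here

    def default_attention_backend(self) -> ToyBackend:
        return ToyBackend.APPEND_ATTN


platform = ToyCudaPlatform()
print(platform.device_name, platform.default_attention_backend().value)
```

Adapting a new chip then means adding one subclass here, plus its operators and backend implementations elsewhere (see Section VI).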
| File | Function | Development Guide |
|------|----------|-------------------|
| `base.py` | `Platform` base class, `_Backend` enum | **Entry point for new hardware adaptation** |
| `cuda.py` | NVIDIA CUDA platform | GPU optimization |
| `xpu.py` | Baidu Kunlun XPU platform | Kunlun chip adaptation |
| `dcu.py` | AMD DCU (ROCm) platform | AMD GPU adaptation |
| `maca.py` | MetaX GPU (MACA) platform | MetaX GPU adaptation |
| `intel_hpu.py` | Intel HPU platform | Intel Gaudi adaptation |
| `iluvatar.py` | Iluvatar GPU platform | Iluvatar adaptation |

---

### 10. metrics/ - Monitoring Metrics Module

**Function**: Prometheus metric collection and performance monitoring.

| File | Function | Development Guide |
|------|----------|-------------------|
| `metrics.py` | Prometheus metric definitions | Adding new monitoring metrics |
| `stats.py` | ZMQ metric statistics | Distributed monitoring |
| `trace_util.py` | OpenTelemetry distributed tracing | Request trace correlation |

---

### 11. Other Important Modules

| Directory | Function | Development Guide |
|-----------|----------|-------------------|
| `inter_communicator/` | Inter-process communication (ZMQ) | Engine-Worker communication modification |
| `spec_decode/` | Speculative decoding (MTP, N-gram) | Speculative decoding strategy development |
| `distributed/` | Distributed communication (AllReduce) | Distributed inference development |
| `multimodal/` | Multimodal data processing | Multimodal feature extension |
| `reasoning/` | Reasoning-mode parsing (DeepSeek R1 style) | Chain-of-thought parsing |
| `router/` | Request router, **recommended for PD disaggregation** | **First choice for PD-disaggregation deployment** |
| `golang_router/` | Go-implemented router with better PD inter-scheduling performance | **High-performance PD-disaggregation scenarios** |
| `eplb/` | Expert-parallel load balancing | MoE load balancing |
| `rl/` | Reinforcement learning rollout | RLHF scenarios |
| `plugins/` | Plugin system | Custom extensions |
| `logger/` | Logging module | Log format modification |
| `trace/` | Tracing module | Performance analysis |

---

### 12. Configuration Files

| File | Function | Development Guide |
|------|----------|-------------------|
| `config.py` | `FDConfig` main configuration class | **Entry point for configuration parameter modification** |
| `envs.py` | Environment variable configuration | Adding new environment variables |
| `utils.py` | General utility functions | Utility function reuse |

---

## II. Custom Operators Directory (custom_ops/)

**Function**: C++/CUDA high-performance operator implementations, organized by hardware platform.

```
custom_ops/
├── gpu_ops/         # NVIDIA GPU operators (main)
├── cpu_ops/         # CPU operators
├── xpu_ops/         # Baidu Kunlun XPU operators
├── iluvatar_ops/    # Iluvatar GPU operators
├── metax_ops/       # MetaX GPU operators
├── utils/           # Common utilities
└── third_party/     # Third-party libraries (cutlass, DeepGEMM)
```

### gpu_ops/ - GPU Operator Details

| Directory/File | Function | Development Guide |
|----------------|----------|-------------------|
| `append_attn/` | Append Attention implementation | **First choice for attention optimization** |
| `moe/` | MoE operators (fused_moe, expert_dispatch) | MoE performance optimization |
| `flash_mask_attn/` | Flash Mask Attention | Attention mask optimization |
| `mla_attn/` | Multi-Head Latent Attention | MLA model support |
| `machete/` | Machete GEMM | Matrix multiplication optimization |
| `quantization/` | Quantization operators | Quantization performance optimization |
| `sample_kernels/` | Sampling operators | Sampling performance optimization |
| `speculate_decoding/` | Speculative decoding operators | Speculative decoding optimization |
| `cutlass_kernels/` | CUTLASS kernels | High-performance GEMM |
| `cpp_extensions.cc` | C++ extension entry | **Entry point for new operator registration** |
| `append_attention.cu` | Append Attention core | Attention core implementation |

**Key Operator Files**:

- `fused_rotary_position_encoding.cu` - Fused rotary position encoding
- `multi_head_latent_attention.cu` - MLA attention
- `per_token_quant_fp8.cu` - FP8 quantization

---

## III. Test Directory (tests/)

**Function**: Unit tests and end-to-end tests, organized by module.

```
tests/
├── e2e/                # End-to-end service tests
├── operators/          # Operator unit tests
├── model_executor/     # Model executor tests
├── model_loader/       # Model loading tests
├── layers/             # Network layer tests
├── scheduler/          # Scheduler tests
├── cache_manager/      # Cache management tests
├── entrypoints/        # API entry tests
├── input/              # Input processing tests
├── output/             # Output processing tests
├── metrics/            # Metric tests
├── distributed/        # Distributed tests
├── graph_optimization/ # Graph optimization tests
├── quantization/       # Quantization tests
├── multimodal/         # Multimodal tests
├── xpu_ci/             # XPU CI tests
├── ce/                 # CE environment tests
├── ci_use/             # CI utility tests
└── conftest.py         # pytest configuration
```

### Test Directory Details

| Directory | Content | Development Guide |
|-----------|---------|-------------------|
| `e2e/` | Complete service tests for each model (ERNIE, Qwen, DeepSeek, etc.) | **Service integration testing** |
| `operators/` | Operator unit tests (`test_fused_moe.py`, `test_flash_mask_attn.py`, etc.) | **Required tests for operator development** |
| `layers/` | Network layer tests (attention, moe, quantization) | Network layer testing |
| `model_executor/` | Model execution flow tests | Model execution testing |
| `scheduler/` | Scheduler function tests | Scheduling logic verification |
| `cache_manager/` | Cache management tests | Cache logic verification |

---

## IV. Scripts Directory (scripts/)

**Function**: CI/CD, performance tuning, and utility scripts.
| File | Function | Usage Scenario |
|------|----------|----------------|
| `run_unittest.sh` | Unit test runner | Local testing |
| `run_ci_xpu.sh` | XPU CI runner | Kunlun CI |
| `run_ci_hpu.sh` | HPU CI runner | Intel HPU CI |
| `run_ci_dcu.sh` | DCU CI runner | AMD DCU CI |
| `coverage_run.sh` | Code coverage statistics | Code quality |
| `tune_cublaslt_int8_gemm.py` | cuBLASLt INT8 GEMM tuning | Performance tuning |
| `tune_cutlass_fp8_gemm.py` | CUTLASS FP8 GEMM tuning | Performance tuning |
| `offline_w4a8.py` | Offline W4A8 quantization tool | Model quantization |
| `extract_mtp_weight_from_safetensor.py` | MTP weight extraction | Model processing |

---

## V. Other Directories

### docs/ - Documentation

- Usage documentation, API documentation, architecture design documents

### examples/ - Example Code

- Model usage examples, deployment examples

### benchmarks/ - Performance Benchmarks

- Performance test scripts, benchmark data

### tools/ - Development Tools

- `codestyle/` - Code style checking tools
- `dockerfile/` - Docker build tools

### dockerfiles/ - Docker Images

- Dockerfiles for each platform's runtime environment

---

## VI. Quick Development Guide

### Adding a New Model

1. Read `models/model_base.py` to understand the model registration mechanism
2. Create a new model file under `models/`
3. Add a corresponding input processor under `input/`
4. Add tests under `tests/model_executor/`

### Adding a New Operator

1. Implement the CUDA operator under `custom_ops/gpu_ops/`
2. Register the operator in `cpp_extensions.cc`
3. Add a Python wrapper under `model_executor/ops/gpu/`
4. Add tests under `tests/operators/`

### New Hardware Platform Adaptation

1. Reference `platforms/base.py` to create a new platform class
2. Create a hardware operator directory under `custom_ops/`
3. Create a backend implementation under `model_executor/layers/backends/`
4. Create a model runner under `worker/`

### Optimizing Inference Performance

1. Attention optimization: `custom_ops/gpu_ops/append_attn/`
2. MoE optimization: `custom_ops/gpu_ops/moe/`
3. Graph optimization: `fastdeploy/model_executor/graph_optimization/`

### PD Disaggregation Deployment

1. Router: `router/router.py` (Python implementation, recommended)
2. High-performance router: `golang_router/` (Go implementation, better PD inter-scheduling performance)
3. Cache transfer: `cache_manager/cache_transfer_manager.py`

---

## VII. Configuration System

```
FDConfig (config.py)
├── ModelConfig       # Model configuration
├── CacheConfig       # Cache configuration
├── ParallelConfig    # Parallel configuration
├── SchedulerConfig   # Scheduler configuration
├── LoRAConfig        # LoRA configuration
└── ...

Environment Variable Configuration (envs.py)
├── FD_* series environment variables
└── Runtime behavior control
```

---

This document covers the main modules and key files of the FastDeploy codebase and can serve as a code-navigation and development reference. For questions, refer to each module's detailed documentation or the source code comments.
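As a closing illustration, the `FDConfig` hierarchy shown in Section VII maps naturally onto nested dataclasses, with `FD_*` environment variables overriding defaults the way `envs.py` does. All field and variable names below are illustrative, not FastDeploy's actual ones.

```python
# Illustrative sketch of a nested configuration hierarchy like Section VII's;
# field names and the FD_TOY_* variable are hypothetical.
import os
from dataclasses import dataclass, field


@dataclass
class ToyModelConfig:
    model: str = ""
    max_model_len: int = 2048


@dataclass
class ToyCacheConfig:
    block_size: int = 64
    enable_prefix_caching: bool = False


@dataclass
class ToyFDConfig:
    model_config: ToyModelConfig = field(default_factory=ToyModelConfig)
    cache_config: ToyCacheConfig = field(default_factory=ToyCacheConfig)

    @classmethod
    def from_env(cls) -> "ToyFDConfig":
        cfg = cls()
        # environment variables override dataclass defaults
        cfg.cache_config.block_size = int(
            os.getenv("FD_TOY_BLOCK_SIZE", cfg.cache_config.block_size)
        )
        return cfg


os.environ["FD_TOY_BLOCK_SIZE"] = "128"
cfg = ToyFDConfig.from_env()
print(cfg.cache_config.block_size)  # 128
```

Grouping settings this way keeps each subsystem's parameters (model, cache, parallelism, scheduling) independently testable while `FDConfig` remains the single object threaded through the engine.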