Archive Package
High-performance, streaming-first archive and compression library for Go with zero-copy streaming, thread-safe operations, and intelligent format detection.
Table of Contents
- Overview
- Architecture
- Performance
- Use Cases
- Quick Start
- Subpackages
- Best Practices
- API Reference
- Contributing
- Improvements & Security
- Resources
- AI Transparency
- License
Overview
This library provides production-ready archive and compression management for Go applications. It emphasizes streaming operations, memory efficiency, and thread safety while supporting multiple archive formats (TAR, ZIP) and compression algorithms (GZIP, BZIP2, LZ4, XZ).
Design Philosophy
- Stream-First: All operations use io.Reader/io.Writer for continuous data flow
- Memory Efficient: Constant memory usage regardless of file size
- Thread-Safe: Proper synchronization primitives for concurrent operations
- Format Agnostic: Automatic detection of archive and compression formats
- Composable: Independent subpackages that work together seamlessly
Key Features
- Stream Processing: Handle files of any size with constant memory usage (~10MB for 10GB archives)
- Thread-Safe Operations: Atomic operations (atomic.Bool), mutex protection (sync.Mutex), and goroutine synchronization (sync.WaitGroup)
- Auto-Detection: Magic-number-based format identification for both archives and compression
- Multiple Formats:
- Archives: TAR (streaming), ZIP (random access)
- Compression: GZIP, BZIP2, LZ4, XZ, uncompressed
- Path Security: Automatic sanitization against directory traversal attacks
- Metadata Preservation: File permissions, timestamps, symlinks, and hard links
- Standard Interfaces: Implements io.Reader, io.Writer, io.Closer
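The path security feature listed above can be illustrated with a minimal sketch. This is not the library's implementation: safeJoin is a hypothetical helper written only with the standard library, and it assumes that rejecting any entry whose cleaned path leaves the destination directory is the intended behavior. It does not resolve symlinks.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// safeJoin is a hypothetical helper (not part of this library's API).
// It joins an archive entry name onto dest and rejects any result that
// would land outside dest.
func safeJoin(dest, name string) (string, error) {
	// Join cleans the result, so "." and ".." segments are resolved here.
	target := filepath.Join(dest, name)

	// A relative path that starts with ".." points outside dest.
	rel, err := filepath.Rel(dest, target)
	if err != nil {
		return "", err
	}
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("entry %q escapes destination %q", name, dest)
	}
	return target, nil
}

func main() {
	for _, name := range []string{"docs/readme.txt", "../../etc/passwd"} {
		p, err := safeJoin("/output", name)
		fmt.Println(p, err)
	}
}
```

A check of this kind would run once per archive entry, before any file is created on disk.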
Architecture
Component Diagram
archive/
├── archive/ # Archive format handling (TAR, ZIP)
│ ├── tar/ # TAR streaming implementation
│ ├── zip/ # ZIP random-access implementation
│ └── types/ # Common interfaces (Reader, Writer)
├── compress/ # Compression algorithms (GZIP, BZIP2, LZ4, XZ)
├── helper/ # High-level compression/decompression pipelines
├── extract.go # Universal extraction with auto-detection
└── interface.go # Top-level convenience wrappers
┌──────────────────────────────────────────────────────┐
│                     Root Package                     │
│  ExtractAll(), DetectArchive(), DetectCompression()  │
└───────────┬─────────────┬─────────────┬──────────────┘
            │             │             │
   ┌────────▼─────┐  ┌────▼─────┐  ┌────▼────────┐
   │   archive    │  │ compress │  │   helper    │
   │              │  │          │  │             │
   │   TAR, ZIP   │  │ GZIP, XZ │  │  Pipelines  │
   │ Reader/Writer│  │ BZIP2,LZ4│  │ Thread-safe │
   └──────────────┘  └──────────┘  └─────────────┘
Package Structure
The package is organized into three main subpackages, each with specific responsibilities:
| Component | Purpose | Memory | Thread-Safe |
|---|---|---|---|
| archive | Multi-file containers (TAR/ZIP) | O(1) TAR, O(n) ZIP | ✅ |
| compress | Single-file compression (GZIP/BZIP2/LZ4/XZ) | O(1) | ✅ |
| helper | Compression/decompression pipelines | O(1) | ✅ |
| Root | Convenience wrappers and auto-detection | O(1) | ✅ |
Format Differences
TAR (Tape Archive)
- Sequential file-by-file processing
- No random access (forward-only)
- Minimal memory overhead
- Best for: Backups, streaming, large datasets
ZIP
- Central directory with file catalog
- Random access via io.ReaderAt
- Requires seekable source
- Best for: Software distribution, Windows compatibility
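The difference between the two access models can be seen with the standard library alone. The sketch below is illustrative and does not use this package's Reader/Writer types: listTar, readZipEntry, sampleTar, sampleZip, and must are hypothetical names. TAR needs only an io.Reader and is walked front to back, while ZIP needs an io.ReaderAt plus the total size so it can locate the central directory.

```go
package main

import (
	"archive/tar"
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

// must panics on error; acceptable here because everything is in memory.
func must(err error) {
	if err != nil {
		panic(err)
	}
}

// listTar walks a TAR stream entry by entry; a plain io.Reader is enough.
func listTar(r io.Reader) ([]string, error) {
	var names []string
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return names, nil // end of archive
		}
		if err != nil {
			return nil, err
		}
		names = append(names, hdr.Name)
	}
}

// readZipEntry opens a single entry by name without reading the others.
// zip.NewReader requires random access (io.ReaderAt) and the archive size.
func readZipEntry(r io.ReaderAt, size int64, name string) (string, error) {
	zr, err := zip.NewReader(r, size)
	if err != nil {
		return "", err
	}
	f, err := zr.Open(name)
	if err != nil {
		return "", err
	}
	defer f.Close()
	data, err := io.ReadAll(f)
	return string(data), err
}

// sampleTar builds a small in-memory TAR archive for the demo.
func sampleTar(names ...string) *bytes.Buffer {
	buf := new(bytes.Buffer)
	tw := tar.NewWriter(buf)
	for _, n := range names {
		body := "content of " + n
		must(tw.WriteHeader(&tar.Header{Name: n, Mode: 0o644, Size: int64(len(body))}))
		_, err := io.WriteString(tw, body)
		must(err)
	}
	must(tw.Close())
	return buf
}

// sampleZip builds a small in-memory ZIP archive for the demo.
func sampleZip(names ...string) *bytes.Reader {
	buf := new(bytes.Buffer)
	zw := zip.NewWriter(buf)
	for _, n := range names {
		w, err := zw.Create(n)
		must(err)
		_, err = io.WriteString(w, "content of "+n)
		must(err)
	}
	must(zw.Close())
	return bytes.NewReader(buf.Bytes())
}

func main() {
	names, err := listTar(sampleTar("a.txt", "b.txt"))
	must(err)
	fmt.Println("tar entries in stream order:", names)

	zr := sampleZip("a.txt", "b.txt")
	body, err := readZipEntry(zr, zr.Size(), "b.txt")
	must(err)
	fmt.Println("zip entry read directly:", body)
}
```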
Performance
Memory Efficiency
This library maintains constant memory usage regardless of file size:
- Streaming Architecture: Data flows in 512-byte chunks
- Zero-Copy Operations: Direct passthrough for uncompressed data
- Controlled Buffers: bufio.Reader/Writer with optimal buffer sizes
- Example: Extract a 10GB TAR.GZ archive using only ~10MB RAM
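To make the constant-memory idea concrete, here is a minimal standard-library sketch. It is an assumption about the general technique, not this library's code: gzipStream and roundTrip are hypothetical names, and the 512-byte buffer is chosen only to mirror the chunk size mentioned above. The input passes through one reusable buffer, so the working memory of the copy loop does not grow with the input size.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"strings"
)

// gzipStream compresses src into dst through a single reusable buffer of
// bufSize bytes and returns the number of uncompressed bytes consumed.
func gzipStream(dst io.Writer, src io.Reader, bufSize int) (int64, error) {
	zw := gzip.NewWriter(dst)
	buf := make([]byte, bufSize)
	var total int64
	for {
		n, rerr := src.Read(buf)
		if n > 0 {
			if _, werr := zw.Write(buf[:n]); werr != nil {
				return total, werr
			}
			total += int64(n)
		}
		if rerr == io.EOF {
			break
		}
		if rerr != nil {
			return total, rerr
		}
	}
	// Close flushes the remaining compressed data and writes the gzip footer.
	return total, zw.Close()
}

// roundTrip compresses s in 512-byte chunks and decompresses it again.
func roundTrip(s string) (string, error) {
	var packed bytes.Buffer
	if _, err := gzipStream(&packed, strings.NewReader(s), 512); err != nil {
		return "", err
	}
	zr, err := gzip.NewReader(&packed)
	if err != nil {
		return "", err
	}
	defer zr.Close()
	out, err := io.ReadAll(zr)
	return string(out), err
}

func main() {
	in := strings.Repeat("archive data ", 1000) // larger than one chunk
	out, err := roundTrip(in)
	if err != nil {
		panic(err)
	}
	fmt.Println("round trip ok:", out == in)
}
```

In this sketch only roundTrip buffers whole payloads, and only to keep the demo in memory; with files or network streams on both ends, nothing larger than the chunk buffer and the compressor's internal state would be held.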
Thread Safety
All operations are thread-safe through:
- Atomic Operations: atomic.Bool for state flags
- Mutex Protection: sync.Mutex for shared buffer access
- Goroutine Sync: sync.WaitGroup for lifecycle management
- Concurrent Safe: Multiple goroutines can operate independently
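The three primitives listed above can be combined as in the following sketch. safeBuffer, errClosed, and writeConcurrently are hypothetical and exist only for illustration; the library's internal types may be organized differently. The atomic flag records the closed state, the mutex guards the shared buffer, and the WaitGroup waits for every writer goroutine to finish.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

var errClosed = errors.New("buffer closed")

// safeBuffer is a hypothetical buffer that is safe for concurrent writers.
type safeBuffer struct {
	mu     sync.Mutex   // guards buf
	buf    bytes.Buffer // shared data
	closed atomic.Bool  // state flag, readable without taking the lock
}

// Write appends p unless the buffer has been closed.
func (b *safeBuffer) Write(p []byte) (int, error) {
	if b.closed.Load() {
		return 0, errClosed
	}
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.Write(p)
}

// Close marks the buffer as closed; later writes fail with errClosed.
func (b *safeBuffer) Close() error {
	b.closed.Store(true)
	return nil
}

// Len reports the number of buffered bytes.
func (b *safeBuffer) Len() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.Len()
}

// writeConcurrently starts one goroutine per worker, each writing chunk
// once, waits for all of them, and returns the total buffered length.
func writeConcurrently(workers int, chunk string) int {
	var b safeBuffer
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _ = b.Write([]byte(chunk)) // cannot fail before Close
		}()
	}
	wg.Wait()
	_ = b.Close()
	return b.Len()
}

func main() {
	fmt.Println("bytes written:", writeConcurrently(8, "abcd"))
}
```

Running such code under go test -race is the usual way to confirm that the locking is sufficient.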
Benchmarks
Based on actual benchmark results (AMD64, Go 1.25, 20 samples per test):
Compression Performance
Small Data (1KB):
| Algorithm | Median | CPU Time | Memory | Allocations | Compression Ratio |
|---|---|---|---|---|---|
| LZ4 | <1µs | 0.032ms | 4.5 KB | 16 | 93.1% |
| Gzip | <1µs | 0.073ms | 795 KB | 24 | 94.2% |
| Bzip2 | 100µs | 0.186ms | 650 KB | 34 | 90.4% |
| XZ | 300µs | 0.513ms | 8,226 KB | 144 | 89.8% |
Medium Data (10KB):
| Algorithm | Median | CPU Time | Memory | Allocations | Compression Ratio |
|---|---|---|---|---|---|
| LZ4 | <1µs | 0.019ms | 4.5 KB | 17 | 99.0% |
| Gzip | <1µs | 0.089ms | 795 KB | 25 | 99.1% |
| Bzip2 | 200µs | 0.339ms | 822 KB | 37 | 98.8% |
| XZ | 300µs | 0.378ms | 8,226 KB | 147 | 98.7% |
Large Data (100KB):
| Algorithm | Median | CPU Time | Memory | Allocations | Compression Ratio |
|---|---|---|---|---|---|
| LZ4 | <1µs | 0.044ms | 1.2 KB | 11 | 99.5% |
| Gzip | 300µs | 0.351ms | 796 KB | 26 | 99.7% |
| Bzip2 | 2.7ms | 2.753ms | 2,544 KB | 38 | 99.9% |
| XZ | 6.9ms | 6.994ms | 8,228 KB | 327 | 99.8% |
Archive Format Performance
TAR vs ZIP - Creation (Single 1KB file, uncompressed):
| Format | Median | CPU Time | Memory | Allocations | Archive Size | Overhead |
|---|---|---|---|---|---|---|
| TAR | <1µs | 0.019ms | 5.2 KB | 19 | 2,560 bytes | 1,536 bytes (150%) |
| ZIP | <1µs | 0.006ms | 5.2 KB | 19 | ~200 bytes | ~176 bytes |
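The TAR figures in this table follow from the format's 512-byte block layout: one header block, the file data padded to a multiple of 512 bytes, and two zero-filled end-of-archive blocks. For a 1KB file that is 512 + 1,024 + 1,024 = 2,560 bytes. The sketch below checks this arithmetic against the standard library's archive/tar writer; expectedTarSize and actualTarSize are hypothetical helper names, and the formula assumes a single regular file with a short name (no extended headers).

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
)

const blockSize = 512

// expectedTarSize returns the archive size for one regular file:
// header block + data padded to 512 bytes + two end-of-archive blocks.
func expectedTarSize(fileSize int64) int64 {
	padded := (fileSize + blockSize - 1) / blockSize * blockSize
	return blockSize + padded + 2*blockSize
}

// actualTarSize writes one zero-filled file of the given size with
// archive/tar and returns the resulting archive length in bytes.
func actualTarSize(fileSize int64) (int64, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	hdr := &tar.Header{Name: "file.bin", Mode: 0o644, Size: fileSize}
	if err := tw.WriteHeader(hdr); err != nil {
		return 0, err
	}
	if _, err := tw.Write(make([]byte, fileSize)); err != nil {
		return 0, err
	}
	if err := tw.Close(); err != nil {
		return 0, err
	}
	return int64(buf.Len()), nil
}

func main() {
	for _, size := range []int64{1024, 100 * 1024} {
		got, err := actualTarSize(size)
		if err != nil {
			panic(err)
		}
		overhead := got - size
		fmt.Printf("file=%d archive=%d expected=%d overhead=%d (%.1f%%)\n",
			size, got, expectedTarSize(size), overhead,
			100*float64(overhead)/float64(size))
	}
}
```

Because the fixed cost is 1,536 bytes per single-file archive, the relative overhead shrinks quickly as files grow, which is why TAR suits large files better than many tiny ones.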
TAR vs ZIP - Extraction (Single 1KB file):
| Format | Median | CPU Time | Memory | Allocations |
|---|---|---|---|---|
| TAR | <1µs | 0.008ms | 1.7 KB | 22 |
| ZIP | <1µs | 0.006ms | 0.2 KB | 4 |
Important Notes on TAR vs ZIP:
- Compression: TAR is an archive format only (no compression). ZIP integrates compression. Compression ratios are NOT comparable between formats.
- Robustness: TAR is sequential, so entries located before a corrupted region can usually still be read. ZIP keeps its central directory at the end of the archive, so a truncated or damaged tail can make the whole archive unreadable with standard readers.
- Use TAR + Compression: Combine TAR with Gzip/Bzip2/LZ4/XZ for compressed archives (e.g., .tar.gz, .tar.xz).
Algorithm Selection Guide
By Speed (Compression):
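LZ4 (~0.04ms) >> Gzip (~0.35ms) > Bzip2 (~2.75ms) > XZ (~7ms)
Step by step (100KB data): LZ4 is ~8x faster than Gzip, Gzip is ~8x faster than Bzip2, and Bzip2 is ~2.5x faster than XZ; end to end, LZ4 is ~175x faster than XZ.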
By Compression Ratio (100KB data):
Bzip2 (99.9%) ≈ XZ (99.8%) ≈ Gzip (99.7%) > LZ4 (99.5%)
└─────────── Best compression ─────────────┘ └─ Fastest ┘
By Memory Efficiency:
LZ4 (1.2-4.5 KB) << Gzip (~800 KB) ≈ Bzip2 (~650 KB-2.5 MB) << XZ (~8.2 MB)
└─── 200x less ──┘ └────── Moderate ───────────────────┘ └─ Highest ┘
Recommended Use Cases:
- Real-time/Logs → LZ4 (fastest, minimal memory)
- Web/API → Gzip (excellent ratio, moderate speed)
- Archival/Cold Storage → Bzip2 or XZ (best compression)
- Balanced → Gzip (good speed + ratio + memory)
Archive Format Selection:
- TAR: Best for large files (1.5% overhead at 100KB), streaming, backups
- ZIP: Best for small files, random access, Windows compatibility (minimal overhead)
Use Cases
This library is designed for scenarios requiring efficient archive and compression handling:
Backup Systems
- Stream large directories to TAR.GZ without memory exhaustion
- Incremental backups with selective file extraction
- Parallel compression across multiple backup jobs
Log Management
- Real-time compression of rotated logs (LZ4 for speed, XZ for storage)
- Extract specific log files without full decompression
- High-volume logging with minimal CPU overhead
CI/CD Pipelines
- Package build artifacts into versioned archives
- Extract dependencies from compressed packages
- Automated compression before artifact upload
Data Processing
- Stream-process large datasets from compressed archives
- Convert between compression formats (e.g., GZIP → XZ)
- Transform data without intermediate files
Web Services
- On-the-fly compression of API responses
- Dynamic archive generation for downloads
- Streaming extraction of uploaded archives
Quick Start
Installation
go get github.com/nabbar/golib/archive
Extract Archive (Auto-Detection)
Automatically detect and extract any archive format:
package main

import (
	"os"

	"github.com/nabbar/golib/archive"
)

func main() {
	in, err := os.Open("archive.tar.gz")
	if err != nil {
		panic(err)
	}
	defer in.Close()

	// Automatic format detection and extraction
	if err := archive.ExtractAll(in, "archive.tar.gz", "/output"); err != nil {
		panic(err)
	}
}
Create TAR.GZ Archive
Compress files into a TAR.GZ archive:
package main

import (
	"os"
	"path/filepath"

	"github.com/nabbar/golib/archive/archive"
	"github.com/nabbar/golib/archive/compress"
	"github.com/nabbar/golib/archive/helper"
)

func main() {
	out, err := os.Create("backup.tar.gz")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	// Create GZIP -> TAR pipeline. Deferred Close calls run in reverse
	// order, so the TAR writer is finalized before the GZIP stream.
	gzWriter, err := helper.NewWriter(compress.Gzip, helper.Compress, out)
	if err != nil {
		panic(err)
	}
	defer gzWriter.Close()

	tarWriter, err := archive.Tar.Writer(gzWriter)
	if err != nil {
		panic(err)
	}
	defer tarWriter.Close()

	// Add regular files; directories are skipped
	err = filepath.Walk("/source", func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		file, err := os.Open(path)
		if err != nil {
			return err
		}
		defer file.Close()
		return tarWriter.Add(info, file, path, "")
	})
	if err != nil {
		panic(err)
	}
}
Stream Compression
Compress and decompress data in memory:
package main

import (
	"bytes"
	"io"

	"github.com/nabbar/golib/archive/compress"
	"github.com/nabbar/golib/archive/helper"
)

func main() {
	data := []byte("Data to compress...")

	// Compress
	var compressed bytes.Buffer
	compressor, err := helper.NewWriter(compress.Gzip, helper.Compress, &compressed)
	if err != nil {
		panic(err)
	}
	if _, err := io.Copy(compressor, bytes.NewReader(data)); err != nil {
		panic(err)
	}
	// Close flushes the remaining compressed bytes into the buffer
	if err := compressor.Close(); err != nil {
		panic(err)
	}

	// Decompress
	var decompressed bytes.Buffer
	decompressor, err := helper.NewReader(compress.Gzip, helper.Decompress, &compressed)
	if err != nil {
		panic(err)
	}
	defer decompressor.Close()
	if _, err := io.Copy(&decompressed, decompressor); err != nil {
		panic(err)
	}
}
Parallel Compression
Compress multiple files concurrently:
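package main

import (
	"io"
	"log"
	"os"
	"sync"

	"github.com/nabbar/golib/archive/compress"
	"github.com/nabbar/golib/archive/helper"
)

// compressFile gzips src into dst and logs any failure.
func compressFile(src, dst string, wg *sync.WaitGroup) {
	defer wg.Done()

	in, err := os.Open(src)
	if err != nil {
		log.Printf("open %s: %v", src, err)
		return
	}
	defer in.Close()

	out, err := os.Create(dst)
	if err != nil {
		log.Printf("create %s: %v", dst, err)
		return
	}
	defer out.Close()

	compressor, err := helper.NewWriter(compress.Gzip, helper.Compress, out)
	if err != nil {
		log.Printf("compressor for %s: %v", dst, err)
		return
	}
	defer compressor.Close()

	if _, err := io.Copy(compressor, in); err != nil {
		log.Printf("compress %s: %v", src, err)
	}
}

func main() {
	files := []string{"file1.log", "file2.log", "file3.log"}

	var wg sync.WaitGroup
	for _, file := range files {
		wg.Add(1)
		go compressFile(file, file+".gz", &wg)
	}
	wg.Wait()
}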
Subpackages
archive Subpackage
Multi-file container management with TAR and ZIP support.
Features
- TAR: Sequential streaming (O(1) memory)
- ZIP: Random access (requires io.ReaderAt)
- Auto-detection via header analysis
- Unified Reader/Writer interfaces
- Path security (directory traversal protection)
- Symlink and metadata preservation
API Example
import "github.com/nabbar/golib/archive/archive"
// Detect format
file, _ := os.Open("archive.tar")
alg, reader, closer, _ := archive.Detect(file)
defer closer.Close()
// List contents
files, _ := reader.List()
// Create archive
out, _ := os.Create("output.tar")
writer, _ := archive.Tar.Writer(out)
defer writer.Close()
Format Selection
- TAR: Backups, streaming, large datasets (no seeking)
- ZIP: Software distribution, selective extraction (needs seeking)
See GoDoc for complete API.
compress Subpackage
Single-file compression with multiple algorithms.
Supported Algorithms
| Algorithm | Ext | Ratio | Speed | Use Case |
|---|---|---|---|---|
| GZIP | .gz | ~3:1 | Fast | Web, general purpose |
| BZIP2 | .bz2 | ~4:1 | Medium | Text files, archival |
| LZ4 | .lz4 | ~2.5:1 | Very Fast | Logs, real-time |
| XZ | .xz | ~5:1 | Slow | Maximum compression |
Magic Number Detection
| Algorithm | Header Bytes | Hex |
|---|---|---|
| GZIP | \x1f\x8b | 1F 8B |
| BZIP2 | BZ | 42 5A |
| LZ4 | \x04\x22\x4d\x18 | 04 22 4D 18 |
| XZ | \xfd7zXZ\x00 | FD 37 7A 58 5A 00 |
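The table above maps directly to a prefix comparison. The sketch below shows one way such detection could be written with the standard library; detectCompression and sniff are hypothetical names, not this package's API. It uses bufio.Reader.Peek so that the inspected bytes remain available to whatever decompressor is attached afterwards.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// magics lists the header bytes from the table above.
var magics = []struct {
	name   string
	prefix []byte
}{
	{"gzip", []byte{0x1F, 0x8B}},
	{"bzip2", []byte{0x42, 0x5A}},
	{"lz4", []byte{0x04, 0x22, 0x4D, 0x18}},
	{"xz", []byte{0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00}},
}

// detectCompression returns the algorithm whose magic number prefixes
// header, or "none" when nothing matches.
func detectCompression(header []byte) string {
	for _, m := range magics {
		if bytes.HasPrefix(header, m.prefix) {
			return m.name
		}
	}
	return "none"
}

// sniff inspects the first bytes of r without consuming them and returns
// the detected algorithm together with a reader that still yields the
// complete stream.
func sniff(r io.Reader) (string, io.Reader) {
	br := bufio.NewReader(r)
	// Peek returns the bytes it could read even when the stream is shorter
	// than 6 bytes; the accompanying error is not needed for detection.
	head, _ := br.Peek(6)
	return detectCompression(head), br
}

func main() {
	kind, rest := sniff(strings.NewReader("\x1f\x8b\x08 remaining bytes"))
	data, err := io.ReadAll(rest)
	if err != nil {
		panic(err)
	}
	fmt.Println("detected:", kind, "bytes still readable:", len(data))
}
```

Two-byte signatures such as BZ can also match ordinary text, so a production detector would typically validate more of the header before committing to a format.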
API Example
import "github.com/nabbar/golib/archive/compress"
// Auto-detect and decompress
file, _ := os.Open("file.txt.gz")
alg, reader, _ := compress.Detect(file)
defer reader.Close()
// Compress
writer, _ := compress.Gzip.Writer(output)
defer writer.Close()
io.Copy(writer, input)
See GoDoc for complete API.
helper Subpackage
High-level compression pipelines with thread-safe buffering.
Features
- Unified interface for all operations
- Thread-safe with atomic operations and mutexes
- Async I/O via goroutines
- Implements io.ReadWriteCloser
- Custom buffer preventing premature EOF
Operation Modes
| Mode | Source | Direction | Example |
|---|---|---|---|
| Compress + Reader | io.Reader | Read compressed | Compress stream on-the-fly |
| Compress + Writer | io.Writer | Write to compressed | Create .gz file |
| Decompress + Reader | io.Reader | Read decompressed | Read from .gz file |
| Decompress + Writer | io.Writer | Write to decompressed | Extract to file |
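The "Compress + Reader" mode is the least obvious of the four, because compressors are normally writers. One common way to obtain a reader that yields compressed bytes is an io.Pipe fed by a goroutine, sketched below with the standard library. This is an assumption about the general technique, not a description of the helper's internals; newGzipReader and compressThenRead are hypothetical names.

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"strings"
)

// newGzipReader returns a reader that yields the gzip-compressed form of
// src. Compression runs in a goroutine and is paced by the consumer,
// because writes to an io.Pipe block until the other side reads.
func newGzipReader(src io.Reader) io.ReadCloser {
	pr, pw := io.Pipe()
	go func() {
		zw := gzip.NewWriter(pw)
		_, err := io.Copy(zw, src)
		if cerr := zw.Close(); err == nil {
			err = cerr
		}
		// A nil error is delivered to the reader as io.EOF.
		pw.CloseWithError(err)
	}()
	return pr
}

// compressThenRead pushes s through newGzipReader and decompresses the
// result again to show that the stream is a valid gzip stream.
func compressThenRead(s string) (string, error) {
	cr := newGzipReader(strings.NewReader(s))
	defer cr.Close() // unblocks the goroutine if we stop reading early

	zr, err := gzip.NewReader(cr)
	if err != nil {
		return "", err
	}
	defer zr.Close()

	out, err := io.ReadAll(zr)
	return string(out), err
}

func main() {
	out, err := compressThenRead("compressed on the fly")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

A reader of this shape can be passed to anything that accepts an io.Reader, such as an HTTP request body, without staging the compressed data in a file.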
API Example
import "github.com/nabbar/golib/archive/helper"
// Compress to file
out, _ := os.Create("file.gz")
w, _ := helper.NewWriter(compress.Gzip, helper.Compress, out)
defer w.Close()
io.Copy(w, input)

// Decompress from file
in, _ := os.Open("file.gz")
r, _ := helper.NewReader(compress.Gzip, helper.Decompress, in)
defer r.Close()
io.Copy(output, r)
Thread Safety: Verified with go test -race (zero data races)
See GoDoc for complete API.
Best Practices
Testing
The package includes a comprehensive test suite with 79.0% code coverage and 535 test specifications using BDD methodology (Ginkgo v2 + Gomega).
Key test coverage:
- ✅ All compression algorithms (GZIP, BZIP2, LZ4, XZ)
- ✅ Archive operations (TAR, ZIP creation and extraction)
- ✅ Helper pipelines and thread safety
- ✅ Auto-detection and extraction workflows
- ✅ Concurrent access with race detector (zero races detected)
- ✅ Error handling and edge cases
Run tests:
# Standard tests
go test ./...
# With coverage
go test -cover ./...
# With race detection (recommended)
CGO_ENABLED=1 go test -race ./...
For detailed test documentation, see TESTING.md.
Stream Large Files (Constant Memory)
// ✅ Good: Streaming
func process(path string) error {
	in, err := os.Open(path)
	if err != nil {
		return err
	}
	defer in.Close()

	decompressor, err := helper.NewReader(compress.Gzip, helper.Decompress, in)
	if err != nil {
		return err
	}
	defer decompressor.Close()

	return processStream(decompressor) // Constant memory
}

// ❌ Bad: Load entire file
func processBad(path string) error {
	data, _ := os.ReadFile(path) // Full file in RAM
	return processStream(bytes.NewReader(data))
}
Always Handle Errors
// ✅ Good
func extract(path, dest string) error {
	f, err := os.Open(path)
	if err != nil {
		return fmt.Errorf("open: %w", err)
	}
	defer f.Close()

	return archive.ExtractAll(f, path, dest)
}

// ❌ Bad: Silent failures
func extractBad(path, dest string) {
	f, _ := os.Open(path)
	archive.ExtractAll(f, path, dest)
}
Close All Resources
// ✅ Always use defer for cleanup
func compress(src, dst string) error {
in, err := os.Open(src)
if err != nil {
return err
}
defer in.Close()
out, err := os.Create(dst)
if err != nil {
return err
}
defer out.Close()
compressor, err := helper.NewWriter(compress.Gzip, helper.Compress, out)
if err != nil {
return err
}
defer compressor.Close() // Flushes buffers
_, err = io.Copy(compressor, in)
return err
}
Safe Concurrency
// ✅ Proper synchronization
func compressMany(files []string) error {
	var wg sync.WaitGroup
	errs := make(chan error, len(files))

	for _, file := range files {
		wg.Add(1)
		go func(f string) {
			defer wg.Done()
			if err := compressFile(f); err != nil {
				errs <- err
			}
		}(file)
	}

	wg.Wait()
	close(errs)

	for err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}
API Reference
Root Package Functions
// Parse compression algorithm from string
func ParseCompression(s string) compress.Algorithm
// Detect compression format and return decompressor
func DetectCompression(r io.Reader) (compress.Algorithm, io.ReadCloser, error)
// Parse archive algorithm from string
func ParseArchive(s string) archive.Algorithm
// Detect archive format and return reader
func DetectArchive(r io.ReadCloser) (archive.Algorithm, types.Reader, io.ReadCloser, error)
// Extract archive with auto-detection
func ExtractAll(r io.ReadCloser, archiveName, destination string) error
Compression Algorithms
| Algorithm | Extension | Magic Bytes | Speed | Ratio |
|---|---|---|---|---|
| Gzip | .gz | 1F 8B | Fast | ~3:1 |
| Bzip2 | .bz2 | 42 5A | Medium | ~4:1 |
| LZ4 | .lz4 | 04 22 4D 18 | Very Fast | ~2.5:1 |
| XZ | .xz | FD 37 7A 58 5A 00 | Slow | ~5:1 |
| None | (none) | - | Instant | 1:1 |
Archive Formats
| Format | Extension | Access | Seekable | Best For |
|---|---|---|---|---|
| TAR | .tar | Sequential | No | Backups, streaming |
| ZIP | .zip | Random | Yes | Distribution, Windows |
Contributing
Contributions are welcome! Please follow these guidelines:
- Code Quality
  - Follow Go best practices and idioms
  - Maintain or improve code coverage (target: >79%)
  - Pass all tests including race detector
  - Use gofmt and golint
- AI Usage Policy
  - ❌ AI must NEVER be used to generate package code or core functionality
  - ✅ AI assistance is limited to:
    - Testing (writing and improving tests)
    - Debugging (troubleshooting and bug resolution)
    - Documentation (comments, README, TESTING.md)
  - All AI-assisted work must be reviewed and validated by humans
- Testing
  - Add tests for new features
  - Use Ginkgo v2 / Gomega for test framework
  - Use gmeasure for performance benchmarks
  - Ensure zero race conditions with go test -race
- Documentation
  - Update GoDoc comments for public APIs
  - Add examples for new features
  - Update README.md and TESTING.md if needed
- Pull Request Process
  - Fork the repository
  - Create a feature branch
  - Write clear commit messages
  - Ensure all tests pass
  - Update documentation
  - Submit PR with description of changes
See CONTRIBUTING.md for detailed guidelines.
Improvements & Security
Current Status
The package is production-ready with no urgent improvements or security vulnerabilities identified.
Code Quality Metrics
- ✅ 79.0% test coverage (target: >79%)
- ✅ Zero race conditions detected with -race flag
- ✅ Thread-safe implementation using atomic operations and mutexes
- ✅ Memory-safe with proper resource cleanup
- ✅ Streaming architecture for constant memory usage
Future Enhancements (Non-urgent)
Potential improvements for future versions:
Archive Formats
- 7-Zip support
- RAR extraction
- ISO image handling
Compression Algorithms
- Zstandard (modern, high-performance)
- Brotli (web-optimized)
- Compression level configuration
Features
- Progress callbacks for long operations
- Streaming TAR.GZ (single-pass)
- Selective extraction by pattern
- Archive encryption (AES-256)
- Parallel compression (pgzip/pbzip2)
- Cloud storage integration (S3, GCS, Azure)
- Format conversion without recompression
Performance
- Memory-mapped ZIP for ultra-fast access
- Archive indexing with SQLite
- Content deduplication
These are optional improvements and not required for production use. The current implementation is stable and performant.
Suggestions and contributions are welcome via GitHub issues.
Resources
Package Documentation
- GoDoc - Complete API reference with function signatures, method descriptions, and runnable examples. Essential for understanding the public interface and usage patterns.
- doc.go - In-depth package documentation including design philosophy, format detection mechanisms, extraction process, performance characteristics, and best practices for production use.
- TESTING.md - Comprehensive test suite documentation covering test architecture, BDD methodology with Ginkgo v2, 79.0% coverage analysis, performance benchmarks, and guidelines for writing new tests. Includes troubleshooting and CI integration examples.
Related golib Packages
- github.com/nabbar/golib/size - Size constants and utilities (KiB, MiB, GiB, etc.) used for buffer sizing and file size calculations.
Standard Library References
- archive/tar - Standard library TAR format handling. The archive package uses this internally for TAR operations.
- archive/zip - Standard library ZIP format handling. The archive package uses this internally for ZIP operations.
- compress/gzip - Standard library GZIP compression. Used internally by the compress subpackage.
- compress/bzip2 - Standard library BZIP2 decompression. Used internally by the compress subpackage.
External References
- Effective Go - Official Go programming guide covering best practices for interfaces, error handling, and I/O patterns. The archive package follows these conventions for idiomatic Go code.
- Go I/O Patterns - Official Go blog article explaining pipeline patterns and streaming I/O. Relevant for understanding how the archive package implements stream processing.
External Dependencies
- github.com/pierrec/lz4/v4 - LZ4 compression library providing fast compression/decompression.
- github.com/ulikunitz/xz - XZ compression library providing high-ratio compression.
Community & Support
- GitHub Issues - Report bugs, request features, or ask questions about the archive package. Check existing issues before creating new ones.
- Contributing Guide - Detailed guidelines for contributing code, tests, and documentation to the project. Includes code style requirements, testing procedures, and pull request process.
AI Transparency
In compliance with EU AI Act Article 50.4: AI assistance was used for testing, documentation, and bug resolution under human supervision. All core functionality is human-designed and validated.
License
MIT License - See LICENSE file for details.
Copyright (c) 2025 Nicolas JUHEL
Maintained by: Nicolas JUHEL
Package: github.com/nabbar/golib/archive
Version: See releases for versioning