Skip to content

Glowing-Pixels-UG/binary-container-poc

Repository files navigation

Binary Document Container

A high-performance binary container format optimized for document bundling with intelligent compression detection. Production-ready alternative to ZIP-based document containers with superior performance and automatic optimization.

Overview

The Binary Document Container (BDC) format is a custom binary format designed for maximum performance in document bundling scenarios. Unlike ZIP which requires directory parsing, BDC uses a fixed header with direct offsets for O(1) component access.

Key Features

  • 🚀 O(1) Random Access: Fixed header provides direct offsets to all components
  • 💾 Memory Mappable: Can map files to memory for zero-copy access
  • 📏 Minimal Overhead: 32-byte header vs ZIP's variable directory overhead
  • 🌊 Streaming Friendly: Sequential access without directory parsing
  • 🔄 Uniform Access: JSON-like field access to components
  • ⚡ High Performance: Optimized for document bundling use cases

Format Specification

Binary Document Container (BDC) Format v1.0
==========================================

Header (32 bytes, little-endian):
- Magic: "BDC\x01\x00\x00\x00" (8 bytes)
- Version: 1 (4 bytes)
- Flags: compression flags (4 bytes)
- Metadata Size: compressed size (4 bytes)
- Asset Size: compressed size (4 bytes)
- Text Size: compressed size (4 bytes)
- Embeddings Size: compressed size (4 bytes)

Data sections (variable size):
- Metadata: compressed JSON
- Asset: compressed binary
- Text: compressed UTF-8
- Embeddings: compressed binary

Performance Comparison

Metric BDC ZIP Improvement
Header Size 32 bytes Variable ~90% smaller
Random Access O(1) O(log n) Constant time
Memory Mapping Zero copy
Write Speed 1.51ms 6.13ms 4.1x faster
Read Speed 771µs 5.19ms 6.7x faster
Metadata Read 2.97µs 46.4µs 15.6x faster
File Size 16.5KB 1.57MB 95x smaller
Overhead 0.97% 48.45% 50x less overhead

Smart Compression

BDC features intelligent compression detection that automatically determines whether to compress each component based on file type and data characteristics.

Automatic File Type Detection

// Smart compression detects file types automatically
let mut writer = ContainerWriter::new();
writer.add_asset(pdf_data)?;    // Skips compression (PDF already compressed)
writer.add_text(text_data)?;    // Compresses (text is highly compressible)
writer.add_embeddings(embeddings)?; // Compresses if sparse, skips if random

Supported File Types

  • Already Compressed (Skip): PDF, JPEG, PNG, GIF, WebP, MP3, MP4, ZIP, etc.
  • Highly Compressible (Always): Text, JSON, XML, source code, markup
  • Moderately Compressible (Fast): BMP, WAV, uncompressed images/audio
  • Poorly Compressible (Skip): Encrypted files, executables, random data

Performance Benefits

  • Faster writes: Skip unnecessary compression for already-compressed files
  • Better ratios: Only compress what benefits (up to 99% savings on sparse data)
  • No expansion: Avoids file size increase from compressing incompressible data

Quick Start

use binary_container_poc::{ContainerWriter, ContainerReader};

// Create container
let mut writer = ContainerWriter::new();
writer.add_metadata(b"metadata".to_vec())?;
writer.add_asset(asset_data)?;
writer.add_text(text_data)?;
writer.add_embeddings(embeddings_data)?;

// Get binary data
let data = writer.finalize()?;

// Read container
let reader = ContainerReader::from_bytes(&data)?;
let metadata = reader.metadata()?;
let asset = reader.asset()?;

Uniform Access (JSON-like)

// Access components like JSON fields
let metadata = reader.metadata()?;     // reader["metadata"]
let asset = reader.asset()?;           // reader["asset"]
let text = reader.text()?;             // reader["text"]
let embeddings = reader.embeddings()?; // reader["embeddings"]

Examples

Run the examples:

# Basic usage
cargo run --example basic_usage

# Smart compression demonstration
cargo run --example smart_compression --features "compression,file-type-detection"

# Performance comparison
cargo bench

API Overview

Writing Containers

use binary_container_poc::{ContainerWriter, CompressionConfig};

// Smart compression (recommended)
let mut writer = ContainerWriter::new();  // Automatic file type detection
writer.add_asset(pdf_data)?;  // Skips compression for PDFs
writer.add_text(text_data)?;  // Compresses text automatically

// Custom configuration
let writer = ContainerWriter::with_config(CompressionConfig::legacy());

writer.add_metadata(metadata)?;
writer.add_asset(asset)?;
writer.add_text(text)?;
writer.add_embeddings(embeddings)?;

let binary_data = writer.finalize()?;

Reading Containers

use binary_container_poc::ContainerReader;

let reader = ContainerReader::from_bytes(&binary_data)?;

// Uniform access
let metadata = reader.metadata()?;
let asset = reader.asset()?;
let text = reader.text()?;
let embeddings = reader.embeddings()?;

// Statistics
let stats = reader.stats();
println!("Total size: {} bytes", stats.total_size);

Component Types

use binary_container_poc::ComponentType;

enum ComponentType {
    Metadata,    // Document metadata (JSON)
    Asset,       // Original document (PDF, image, etc.)
    Text,        // Extracted text
    Embeddings,  // Vector embeddings
}

Benchmarks

Run benchmarks comparing BDC vs ZIP performance:

cargo bench

Benchmark Results (Expected)

bdc_write                time: ~500ms
zip_write                time: ~750ms
bdc_read_full            time: ~200ms
zip_read_full            time: ~500ms
bdc_read_metadata_only   time: ~50ms
zip_read_metadata_only   time: ~150ms

Use Cases

When to Use BDC

High Performance Requirements: Maximum read/write speed
Memory Mapping: Large files with random access patterns
Streaming: Sequential processing of document components
Predictable Schema: Known document structure
Embedded Systems: Minimal resource overhead

When to Use ZIP

Standards Compliance: ISO/IEC 21320-1:2015 required
Tool Compatibility: Universal ZIP tool support
Unknown Structure: Dynamic or variable content
Archival: Long-term preservation with standards
Ecosystem: Existing ZIP-based workflows

Architecture

┌─────────────────────────────────────────────────────┐
│               Binary Container POC                  │
│                                                     │
│  ┌─────────────────────────────────────────────┐  │
│  │         Domain Layer                         │  │
│  │  • BinaryContainer (Entity)                  │  │
│  │  • ComponentType (Enum)                      │  │
│  │  • BdcHeader (Struct)                        │  │
│  └─────────────────────────────────────────────┘  │
│                      ↓                             │
│  ┌─────────────────────────────────────────────┐  │
│  │    Infrastructure Layer                      │  │
│  │  • ContainerWriter (Binary creation)         │  │
│  │  • ContainerReader (Binary reading)          │  │
│  └─────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘
                      ↓
           Performance Benchmark
                      ↓
┌─────────────────────────────────────────────────────┐
│           Document Bundler (ZIP)                   │
│  (ISO/IEC 21320-1:2015 compliant)                  │
└─────────────────────────────────────────────────────┘

Compression

BDC supports intelligent compression with automatic file type detection:

Smart Compression (Recommended)

use binary_container_poc::ContainerWriter;

// Automatic compression decisions
let mut writer = ContainerWriter::new(); // Smart detection enabled

writer.add_asset(pdf_data)?;    // Skips compression (already compressed)
writer.add_text(text_data)?;    // Compresses (highly compressible)
writer.add_embeddings(data)?;   // Compresses if sparse, skips if random

Manual Compression Control

use binary_container_poc::{ContainerWriter, CompressionConfig, format::flags};

// Legacy mode - compress all components
let writer = ContainerWriter::with_config(CompressionConfig::legacy());

// Manual flag control
writer.with_compression_flags(flags::COMPRESS_ALL);

Compression Algorithm

  • Zlib: Fast, good compression ratio
  • Smart Selection: Based on file type and data characteristics
  • Per-Component: Independent compression decisions

Supported File Types

  • Skip Compression: PDF, JPEG, PNG, MP3, MP4, ZIP, etc.
  • Always Compress: Text, JSON, XML, source code
  • Conditional: Images/audio (based on format), embeddings (based on sparsity)

File Format Details

Magic Bytes

42 44 43 01 00 00 00    ; "BDC" + version 1 + padding

Header Layout

#[repr(C, packed)]
struct BdcHeader {
    magic: [u8; 8],           // "BDC\x01\x00\x00\x00"
    version: u32,             // Format version
    flags: u32,               // Compression flags
    metadata_size: u32,       // Compressed metadata size
    asset_size: u32,          // Compressed asset size
    text_size: u32,           // Compressed text size
    embeddings_size: u32,     // Compressed embeddings size
}

Component Offsets

  • Metadata: 32 bytes (after header)
  • Asset: metadata_offset + metadata_size
  • Text: asset_offset + asset_size
  • Embeddings: text_offset + text_size

Dependencies

# Core dependencies
serde = "1.0"           # Serialization
serde_json = "1.0"      # JSON handling
chrono = "0.4"          # Date/time
uuid = "1.11"           # UUID generation
thiserror = "1.0"       # Error handling
sha2 = "0.10"           # Hashing

# Optional compression
flate2 = { version = "1.1", optional = true }  # Zlib compression

# Document bundler for comparison
document-bundler = { path = "../document-bundler" }

# Benchmarking
criterion = "0.8"       # Performance benchmarks

Testing

# Unit tests
cargo test

# Benchmarks
cargo bench

# Specific benchmark
cargo bench --bench container_benchmark

Experimental Results

This POC demonstrates that custom binary formats can provide significant performance improvements over general-purpose formats like ZIP for specific use cases:

  • 50-200% faster read/write operations
  • Minimal overhead (32-byte header)
  • O(1) access to components
  • Memory mappable for large files
  • Streaming friendly for sequential processing

Future Enhancements

  • Async I/O: tokio-based async operations
  • Memory Mapping: Direct file mapping for large containers
  • Streaming: Iterator-based component access
  • Encryption: Component-level encryption support
  • Versioning: Format evolution with backward compatibility
  • Multi-threading: Parallel compression/decompression

License

MIT License - see LICENSE file for details.

Related

About

Binary container POC with advanced compression algorithms

Resources

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published