Architecture¶
Kreuzberg is built with a Rust-first, multi-language architecture designed for maximum performance while maintaining accessibility across different programming ecosystems. The core extraction logic is implemented in Rust, with thin bindings for Python, TypeScript, and Ruby.
Design Philosophy¶
The architecture follows three key principles:
- Rust Core: All performance-critical operations (PDF parsing, OCR, text processing) are implemented in Rust for speed, safety, and memory efficiency
- Language-Agnostic Plugins: The plugin system works across language boundaries, allowing Python OCR backends to integrate seamlessly with the Rust core
- Zero-Copy Boundaries: Data passes across FFI boundaries efficiently using zero-copy techniques where possible
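The zero-copy principle can be illustrated in plain Rust, independent of any FFI layer. The sketch below is not taken from the Kreuzberg codebase: `first_line` and `normalize_newlines` are hypothetical helpers that show how a function can hand back a borrowed view of its input and allocate only when the data actually has to change.

```rust
use std::borrow::Cow;

/// Hypothetical helper: returns a trimmed view of the first non-empty line.
/// The returned `&str` points into `content`; nothing is copied.
fn first_line(content: &str) -> Option<&str> {
    content.lines().map(str::trim).find(|line| !line.is_empty())
}

/// Hypothetical helper: converts CRLF line endings to LF.
/// Borrows the input unchanged when there is nothing to rewrite.
fn normalize_newlines(content: &str) -> Cow<'_, str> {
    if content.contains("\r\n") {
        Cow::Owned(content.replace("\r\n", "\n"))
    } else {
        Cow::Borrowed(content)
    }
}

fn main() {
    let raw = "\r\n  Quarterly Report \r\nRevenue rose.\r\n";
    let normalized = normalize_newlines(raw);
    println!("{:?}", first_line(&normalized));
}
```

Returning `Cow` keeps the common case (already-normalized input) allocation-free while still allowing an owned result when a rewrite is needed.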
Multi-Language Architecture¶
graph TB
subgraph "Language Bindings"
Python["Python Package<br/>(packages/python)"]
TypeScript["TypeScript Package<br/>(packages/typescript)"]
Ruby["Ruby Gem<br/>(packages/ruby)"]
end
subgraph "FFI Bridges"
PyO3["PyO3 Bridge<br/>(crates/kreuzberg-py)"]
NAPIRS["NAPI-RS Bridge<br/>(crates/kreuzberg-node)"]
Magnus["Magnus Bridge<br/>(crates/kreuzberg-ruby)"]
end
subgraph "Rust Core"
Core["Kreuzberg Core<br/>(crates/kreuzberg)"]
end
Python --> PyO3
TypeScript --> NAPIRS
Ruby --> Magnus
PyO3 --> Core
NAPIRS --> Core
Magnus --> Core
style Core fill:#e1f5ff
style PyO3 fill:#ffe1e1
style NAPIRS fill:#ffe1e1
style Magnus fill:#ffe1e1
Rust Core Structure¶
The Rust core (crates/kreuzberg) is organized into distinct modules, each responsible for a specific aspect of document processing:
graph LR
subgraph "Kreuzberg Core Crate"
Core["core/<br/>Extraction orchestration<br/>MIME detection<br/>Configuration"]
Plugins["plugins/<br/>Plugin system<br/>Registry pattern<br/>Trait definitions"]
Extraction["extraction/<br/>Format implementations<br/>PDF, Excel, Email<br/>XML, Text, HTML"]
Extractors["extractors/<br/>Plugin wrappers<br/>MIME type mapping<br/>Registry registration"]
OCR["ocr/<br/>OCR processing<br/>Tesseract backend<br/>Table extraction"]
Text["text/<br/>Token reduction<br/>Quality scoring<br/>String utilities"]
Types["types/<br/>Core data structures<br/>ExtractionResult<br/>Metadata"]
Error["error/<br/>Error types<br/>Result aliases"]
end
Core --> Plugins
Core --> Extractors
Extractors --> Extraction
Extractors --> Plugins
Extraction --> OCR
Extraction --> Text
Core --> Types
Core --> Error
style Core fill:#bbdefb
style Plugins fill:#c8e6c9
style Extraction fill:#fff9c4
style Extractors fill:#ffccbc
Module Responsibilities¶
- core/: Main extraction entry points (extract_file, extract_bytes), MIME detection, configuration loading, pipeline orchestration
- plugins/: Plugin trait definitions (DocumentExtractor, OcrBackend, PostProcessor, Validator), registry implementation
- extraction/: Core extraction implementations for different formats (PDF via pdfium, Excel via calamine, etc.)
- extractors/: DocumentExtractor trait wrappers that register format handlers with the plugin system
- ocr/: OCR processing orchestration, Tesseract integration, hOCR parsing, table detection
- text/: Text processing utilities including token reduction, quality scoring, string manipulation
- types/: Shared type definitions (ExtractionResult, Metadata, Chunk, etc.)
- error/: Centralized error handling with the KreuzbergError enum
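How the plugins/ and extractors/ modules might fit together can be sketched with a small registry keyed by MIME type. This is an illustration only: the trait below borrows the name `DocumentExtractor` from the list above, but its methods, the `ExtractorRegistry` type, the `PlainTextExtractor` handler, and the `String` error type are all assumptions made for the example, not Kreuzberg's actual definitions.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Simplified stand-in for a document extractor plugin (assumed shape).
trait DocumentExtractor: Send + Sync {
    fn name(&self) -> &str;
    fn supported_mime_types(&self) -> &[&str];
    fn extract(&self, bytes: &[u8]) -> Result<String, String>;
}

/// Hypothetical handler for plain-text formats.
struct PlainTextExtractor;

impl DocumentExtractor for PlainTextExtractor {
    fn name(&self) -> &str {
        "plain-text"
    }

    fn supported_mime_types(&self) -> &[&str] {
        &["text/plain", "text/markdown"]
    }

    fn extract(&self, bytes: &[u8]) -> Result<String, String> {
        String::from_utf8(bytes.to_vec()).map_err(|e| e.to_string())
    }
}

/// Hypothetical registry: maps each MIME type to the extractor that handles it.
#[derive(Default)]
struct ExtractorRegistry {
    by_mime: HashMap<String, Arc<dyn DocumentExtractor>>,
}

impl ExtractorRegistry {
    /// Registers one extractor under every MIME type it declares.
    fn register(&mut self, extractor: Arc<dyn DocumentExtractor>) {
        for mime in extractor.supported_mime_types() {
            self.by_mime.insert(mime.to_string(), Arc::clone(&extractor));
        }
    }

    /// Looks up the extractor for `mime` and delegates to it.
    fn extract(&self, mime: &str, bytes: &[u8]) -> Result<String, String> {
        let extractor = self
            .by_mime
            .get(mime)
            .ok_or_else(|| format!("no extractor registered for {mime}"))?;
        extractor.extract(bytes)
    }
}

fn main() {
    let mut registry = ExtractorRegistry::default();
    let plain = Arc::new(PlainTextExtractor);
    println!("registering {}", plain.name());
    registry.register(plain);
    println!("{:?}", registry.extract("text/plain", b"hello"));
    println!("{:?}", registry.extract("application/pdf", b"%PDF"));
}
```

Storing extractors as `Arc<dyn DocumentExtractor>` lets a single handler serve several MIME types and be shared across threads, which is one plausible reason to require `Send + Sync` on the trait.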
Why Rust?¶
Rust was chosen for the core implementation due to several compelling advantages:
Performance¶
Benchmarks show 10-50x performance improvements over Python implementations:
- PDF parsing: Native pdfium bindings eliminate Python overhead
- Text processing: SIMD-accelerated string operations for token reduction
- Concurrent extraction: True parallelism with Tokio's async runtime
- Memory efficiency: Zero-copy operations and streaming parsers for large files
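The memory-efficiency point about streaming can be made concrete with a small sketch. The function below is a hypothetical example, not Kreuzberg code: it counts words from any buffered reader one line at a time and reuses a single `String` buffer, so memory use is bounded by the longest line rather than by the size of the input.

```rust
use std::io::{self, BufRead};

/// Hypothetical helper: counts whitespace-separated words from a reader
/// without loading the whole input into memory.
fn count_words<R: BufRead>(mut reader: R) -> io::Result<usize> {
    let mut line = String::new(); // reused for every line
    let mut total = 0;
    loop {
        line.clear();
        // `read_line` returns Ok(0) at end of input.
        if reader.read_line(&mut line)? == 0 {
            break;
        }
        total += line.split_whitespace().count();
    }
    Ok(total)
}

fn main() -> io::Result<()> {
    let data = io::Cursor::new("alpha beta\ngamma\n");
    println!("{} words", count_words(data)?);
    Ok(())
}
```

The same function works on a `BufReader<File>` for large files, since it depends only on the `BufRead` trait.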
Safety¶
Rust's type system and ownership model prevent entire classes of bugs:
- No null pointer exceptions: Option types enforce explicit handling
- No data races: Compiler-enforced thread safety with Send/Sync
- No buffer overflows: Slice and array accesses are bounds-checked (at runtime, unless the compiler can prove the index is in range)
- No use-after-free: Ownership rules prevent memory errors
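A short sketch of how two of these guarantees look in everyday code. The helpers `page_at` and `title_or_default` are hypothetical names invented for this example: a missing value is represented as `Option` and must be handled explicitly, and an out-of-range lookup through `get` yields `None` instead of reading past the end of the slice.

```rust
/// Hypothetical helper: returns the page at `index`, or `None` if out of range.
fn page_at(pages: &[String], index: usize) -> Option<&str> {
    pages.get(index).map(String::as_str)
}

/// Hypothetical helper: the caller must decide what an absent title means.
fn title_or_default(title: Option<&str>) -> &str {
    title.unwrap_or("untitled")
}

fn main() {
    let pages = vec!["Intro".to_string(), "Results".to_string()];
    println!("{}", title_or_default(page_at(&pages, 1)));
    println!("{}", title_or_default(page_at(&pages, 10)));
    // Direct indexing such as `pages[10]` would panic here rather than
    // access memory outside the vector.
}
```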
Concurrency¶
Built-in async/await with Tokio enables efficient parallel processing:
- Batch extraction: Process multiple files concurrently
- Non-blocking I/O: Async file operations do not stall the runtime's worker threads
- Work-stealing: Tokio scheduler maximizes CPU utilization
- Backpressure: Async streams handle large datasets gracefully
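The batch-extraction idea can be sketched without any dependencies. Note the substitution: this example uses scoped threads from the standard library rather than Tokio, so that it stays self-contained, and `extract_stub` is a hypothetical placeholder standing in for real extraction work.

```rust
use std::thread;

/// Hypothetical placeholder for per-document work.
fn extract_stub(input: &str) -> String {
    input.trim().to_uppercase()
}

/// Processes every input on its own thread and returns results in input order.
fn extract_batch(inputs: &[&str]) -> Vec<String> {
    thread::scope(|scope| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = inputs
            .iter()
            .map(|input| scope.spawn(move || extract_stub(input)))
            .collect();
        // Joining in spawn order preserves the order of `inputs`.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}

fn main() {
    let results = extract_batch(&[" first doc ", "second doc"]);
    println!("{results:?}");
}
```

Spawning one thread per input is acceptable for a small batch; an async runtime or a thread pool bounds the number of concurrent workers, which matters once batches grow large.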
Standalone Rust Library¶
The Rust core (crates/kreuzberg) is a fully functional standalone library that can be used directly in Rust projects without any language bindings:
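use kreuzberg::{extract_file_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {
    // Start from the default extraction settings.
    let config = ExtractionConfig::default();
    // Passing `None` as the second argument leaves the MIME type to be detected.
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("Extracted: {}", result.content);
    Ok(())
}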
This makes Kreuzberg suitable for:
- Rust-native applications
- High-performance servers and APIs
- Command-line tools
- Embedded systems (where Python/Node.js are impractical)
Related Documentation¶
- Creating Plugins - Guide to building custom plugins
- API Reference - Python API documentation