Performance

Kreuzberg's Rust-first architecture delivers significant performance improvements over pure Python implementations. This page explains the performance benefits, benchmarking methodology, and optimization techniques.

Performance Benefits

The Rust core provides 10-50x performance improvements across multiple operations:

graph LR
    subgraph "Python-based Libraries"
        Py1["docling<br/>~2.5s per PDF"]
        Py2["unstructured<br/>~3.2s per PDF"]
        Py3["markitdown<br/>~1.8s per PDF"]
    end

    subgraph "Kreuzberg v4"
        Rust["Rust Core<br/>~0.15s per PDF"]
    end

    Py1 -.->|16x slower| Rust
    Py2 -.->|21x slower| Rust
    Py3 -.->|12x slower| Rust

    style Rust fill:#c8e6c9
    style Py1 fill:#ffcdd2
    style Py2 fill:#ffcdd2
    style Py3 fill:#ffcdd2

Benchmark Results

Performance benchmarks compare Kreuzberg against other popular extraction libraries using 94 real-world documents:

PDF Extraction

| Library | Avg Time | Memory (Peak) | Throughput |
| --- | --- | --- | --- |
| Kreuzberg v4 | 0.15s | 45 MB | 6.7 docs/sec |
| Kreuzberg v3 | 1.2s | 120 MB | 0.83 docs/sec |
| extractous | 0.25s | 65 MB | 4.0 docs/sec |
| docling | 2.5s | 450 MB | 0.4 docs/sec |
| unstructured | 3.2s | 380 MB | 0.31 docs/sec |

Key Improvements:

  • 8x faster than Kreuzberg v3 (Rust rewrite)
  • 16-21x faster than Python libraries
  • 62% less memory than v3
  • 88-90% less memory than docling and unstructured

Excel Extraction

| Library | Avg Time | Memory (Peak) |
| --- | --- | --- |
| Kreuzberg v4 | 0.08s | 25 MB |
| openpyxl | 1.2s | 180 MB |
| pandas | 0.45s | 95 MB |

Key Improvements:

  • 15x faster than openpyxl
  • 5.6x faster than pandas
  • 86% less memory than openpyxl

Text Processing

| Operation | Python | Rust | Speedup |
| --- | --- | --- | --- |
| Token reduction | 450ms | 12ms | 37x |
| Quality scoring | 220ms | 8ms | 27x |
| XML streaming (100MB) | 8.5s | 0.4s | 21x |
| Text streaming (500MB) | 15s | 0.8s | 18x |

Why Rust is Faster

1. Native Compilation

Rust compiles to native machine code with aggressive optimizations:

flowchart LR
    subgraph "Python"
        PySource[Python Code] --> Interpret[Interpreter<br/>CPython]
        Interpret --> Execute[Execution]
    end

    subgraph "Rust"
        RustSource[Rust Code] --> Compile[Compiler<br/>LLVM]
        Compile --> Optimize[Optimizations<br/>Inlining, SIMD, etc.]
        Optimize --> Native[Native Machine Code]
        Native --> Execute2[Execution]
    end

    Execute -.->|10-50x slower| Execute2

    style Execute2 fill:#c8e6c9
    style Execute fill:#ffcdd2

Compiler Optimizations:

  • Inlining: Small function calls are replaced with the function body, eliminating call overhead
  • Dead code elimination: Unused code is removed entirely
  • Loop unrolling: Loop bodies are duplicated to reduce branch overhead and keep CPU pipelines full
  • SIMD: Single Instruction Multiple Data executes one operation on several data elements at once (see the sketch below)
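
For a concrete (if generic) illustration, a simple reduction like the one below typically receives all of these treatments in a release build. This is ordinary Rust, not Kreuzberg code:

// Generic illustration, not Kreuzberg code: in a release build
// (cargo build --release), LLVM typically inlines the closure, unrolls the
// loop, and auto-vectorizes it into packed SIMD additions.
pub fn sum(values: &[u32]) -> u32 {
    values.iter().copied().sum()
}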

2. Zero-Copy Operations

Rust's ownership model enables zero-copy string slicing and byte buffer handling:

# Python: slicing copies the substring
text = content[100:500]  # Allocates a new string

// Rust: Zero-copy slice
let text: &str = &content[100..500];  // No allocation

Impact:

  • No memory allocation for substrings
  • No CPU cycles spent copying
  • Better cache locality from fewer allocations

3. SIMD Acceleration

Text processing hot paths use SIMD for parallel operations:

// Process 16 bytes at once (x86_64 SSE2 intrinsics)
let chunk = unsafe { _mm_loadu_si128(ptr as *const __m128i) };
let spaces = unsafe { _mm_cmpeq_epi8(chunk, space_vec) };

SIMD Benefits:

  • Token reduction: 37x faster with SIMD whitespace detection
  • Quality scoring: 27x faster with SIMD character classification
  • String utilities: 15-20x faster character counting
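
Expanded into a self-contained form, the snippet above might look like the following sketch: it counts ASCII spaces 16 bytes at a time with SSE2 and handles the remainder with scalar code. This is illustrative only, not Kreuzberg's actual implementation:

// Self-contained sketch of SIMD whitespace counting on x86_64 (SSE2);
// illustrative only, not Kreuzberg's actual implementation.
#[cfg(target_arch = "x86_64")]
pub fn count_spaces(text: &str) -> usize {
    use std::arch::x86_64::*;

    let bytes = text.as_bytes();
    let mut count = 0usize;
    let mut i = 0usize;

    unsafe {
        let space_vec = _mm_set1_epi8(b' ' as i8); // 16 copies of the space byte
        while i + 16 <= bytes.len() {
            // Load 16 bytes and compare each against ' ' in one instruction
            let chunk = _mm_loadu_si128(bytes.as_ptr().add(i) as *const __m128i);
            let eq = _mm_cmpeq_epi8(chunk, space_vec);
            // Pack the 16 per-byte results into a 16-bit mask, count set bits
            count += (_mm_movemask_epi8(eq) as u32).count_ones() as usize;
            i += 16;
        }
    }

    // Scalar fallback for the remaining tail bytes
    count + bytes[i..].iter().filter(|&&b| b == b' ').count()
}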

4. Async Concurrency

Tokio's work-stealing scheduler enables true parallelism:

# Python: GIL prevents true parallelism
with ThreadPoolExecutor() as executor:
    results = executor.map(extract_file, files)  # Only one thread executes Python at a time

// Rust: True parallel execution
let results = batch_extract_file(&files, None, &config).await?;  // All cores utilized

Concurrency Benefits:

  • Batch extraction: Near-linear scaling with CPU cores
  • No GIL: All cores execute simultaneously
  • Async I/O: Thousands of concurrent file operations
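
The same fan-out/fan-in pattern can be sketched directly against Tokio. Here extract_one is a placeholder for real extraction work, not a Kreuzberg API:

// Generic Tokio sketch of concurrent batch work on the work-stealing
// scheduler; `extract_one` is a placeholder, not a Kreuzberg API.
use tokio::task::JoinSet;

async fn extract_one(path: String) -> std::io::Result<String> {
    tokio::fs::read_to_string(&path).await // stand-in for real extraction
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let files: Vec<String> = std::env::args().skip(1).collect();
    let mut set = JoinSet::new();
    for f in files {
        // Each task is scheduled onto Tokio's thread pool, so work
        // spreads across all cores with no GIL in the way.
        set.spawn(extract_one(f));
    }
    while let Some(joined) = set.join_next().await {
        let text = joined??; // unwrap the JoinError, then the I/O error
        println!("{} chars extracted", text.len());
    }
    Ok(())
}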

5. Memory Efficiency

Rust's ownership model eliminates garbage collection overhead:

graph TB
    subgraph "Python Memory"
        Alloc1[Allocate Object]
        Use1[Use Object]
        GC1[Garbage Collector<br/>Scans + Pauses]
        Free1[Free Memory]
        Alloc1 --> Use1 --> GC1 --> Free1
    end

    subgraph "Rust Memory"
        Alloc2[Allocate Object]
        Use2[Use Object]
        Drop2[Dropped at Scope Exit<br/>Immediate Free]
        Alloc2 --> Use2 --> Drop2
    end

    GC1 -.->|Pauses execution| Alloc1

    style Drop2 fill:#c8e6c9
    style GC1 fill:#ffcdd2

Memory Benefits:

  • No GC pauses: Deterministic performance
  • Lower peak memory: RAII frees resources immediately
  • Better cache utilization: Smaller memory footprint
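
The difference is visible in ordinary Rust code: deallocation happens deterministically at scope exit, with no collector involved. A generic illustration:

// Generic RAII illustration: the buffer is freed the moment it goes out
// of scope, deterministically and without a garbage collector.
fn process_chunk(data: &[u8]) -> usize {
    let buffer: Vec<u8> = data.to_vec(); // heap allocation
    buffer.iter().filter(|&&b| b == b' ').count()
} // `buffer` is dropped here; its memory is returned immediately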

Streaming Parsers

For large files (multi-GB XML, text, archives), Kreuzberg uses streaming parsers that process data incrementally:

flowchart LR
    subgraph "Loading Parser"
        File1[Large File<br/>5 GB] --> Load[Load Entire File<br/>into Memory]
        Load --> Parse1[Parse]
        Parse1 --> Result1[Result]
    end

    subgraph "Streaming Parser"
        File2[Large File<br/>5 GB] --> Stream[Read Chunks<br/>4 KB at a time]
        Stream --> Parse2[Parse Incrementally]
        Parse2 --> Result2[Result]
    end

    Load -.->|5 GB memory| Result1
    Stream -.->|4 KB memory| Result2

    style Result2 fill:#c8e6c9
    style Result1 fill:#ffcdd2

Streaming Benefits:

  • Constant memory: Process a 100 GB file with a ~4 KB buffer
  • Faster startup: Begin processing immediately
  • Better cache performance: Small working set

Streaming Extractors:

  • XMLExtractor: Streams with quick-xml
  • TextExtractor: Line-by-line streaming
  • ArchiveExtractor: Decompresses on-the-fly
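
As a rough sketch of the streaming approach (the event-loop shape that quick-xml encourages, not Kreuzberg's actual extractor), text can be accumulated one event at a time while a single reusable buffer keeps memory constant:

// Sketch of incremental XML text extraction with quick-xml; buffer handling
// and error handling are illustrative, not Kreuzberg's actual extractor.
use quick_xml::events::Event;
use quick_xml::Reader;
use std::fs::File;
use std::io::BufReader;

fn extract_text(path: &str) -> Result<String, Box<dyn std::error::Error>> {
    let mut reader = Reader::from_reader(BufReader::new(File::open(path)?));
    let mut buf = Vec::new(); // reused for every event: memory stays constant
    let mut text = String::new();
    loop {
        match reader.read_event_into(&mut buf)? {
            Event::Text(t) => text.push_str(&t.unescape()?),
            Event::Eof => break,
            _ => {} // skip tags, comments, etc.
        }
        buf.clear(); // keep the working set small
    }
    Ok(text)
}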

Benchmarking Methodology

Kreuzberg's benchmark suite provides comprehensive performance measurement:

Test Dataset

  • 94 real-world documents
  • Multiple formats: PDF, DOCX, XLSX, images, emails
  • Size categories: Small (<1MB), medium (1-10MB), large (>10MB)
  • Variety: Reports, invoices, forms, presentations, spreadsheets

Metrics Tracked

  • Execution time: Wall clock time per extraction
  • CPU usage: Sampled at 100ms intervals
  • Memory usage: Peak RSS (Resident Set Size)
  • Throughput: Documents processed per second
  • Success rate: Percentage of files extracted without errors

Measurement Tools

flowchart TD
    Start[Start Extraction] --> ClearCache[Clear Kreuzberg Cache]
    ClearCache --> StartProfile[Start ResourceProfiler]
    StartProfile --> Extract[Run Extraction]
    Extract --> StopProfile[Stop Profiler]
    StopProfile --> Record[Record Metrics]
    Record --> Report[Generate Report]

    StartProfile -.-> CPU[Sample CPU @ 100ms]
    StartProfile -.-> Memory[Sample Memory @ 100ms]
    CPU --> StopProfile
    Memory --> StopProfile

    style Extract fill:#fff9c4

ResourceProfiler:

  • Samples CPU/memory every 100ms during extraction
  • Tracks peak memory usage
  • Records execution time with microsecond precision
  • 1800s (30-minute) timeout per file

Running Benchmarks

# Install benchmark dependencies
uv sync --all-extras --all-packages

# Run benchmarks
uv run python -m benchmarks.src.cli benchmark \
    --framework kreuzberg_sync,extractous,docling \
    --category all \
    --iterations 3

# Generate reports
uv run python -m benchmarks.src.cli report --output-format html
uv run python -m benchmarks.src.cli visualize

See Advanced Features Guide for details.

Optimization Techniques

Kreuzberg employs several optimization strategies:

1. Lazy Initialization

Expensive resources initialized only when needed:

use once_cell::sync::Lazy;
use tokio::runtime::Runtime;

static GLOBAL_RUNTIME: Lazy<Runtime> = Lazy::new(|| {
    tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()
        .expect("Failed to create runtime")
});
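
A hypothetical caller can then reuse the shared runtime from synchronous code instead of constructing one per call:

// Hypothetical usage sketch: blocking entry points reuse the shared runtime.
fn read_blocking(path: &str) -> std::io::Result<String> {
    GLOBAL_RUNTIME.block_on(async {
        tokio::fs::read_to_string(path).await // stand-in for real extraction
    })
}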

2. Caching

OCR results and extraction results cached by content hash:

  • Hit rate: 85%+ for repeated files
  • Storage: SQLite database (~100MB for 10k files)
  • Invalidation: Content-based (file changes invalidate cache)
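
The lookup shape can be sketched in a few lines. This in-memory version with std's DefaultHasher is illustrative only; the real cache is SQLite-backed:

// Illustrative in-memory sketch of content-addressed caching; the real
// cache persists to SQLite and the hasher here is just an example.
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

fn content_key(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish() // changing a single byte of the file changes the key
}

struct ExtractionCache {
    entries: HashMap<u64, String>, // content hash -> extracted text
}

impl ExtractionCache {
    fn get_or_extract(&mut self, content: &[u8], extract: impl FnOnce() -> String) -> &String {
        self.entries.entry(content_key(content)).or_insert_with(extract)
    }
}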

3. Batch Processing

Process multiple files concurrently with batch_extract_*:

# Sequential: ~5 seconds for 10 files
for file in files:
    result = extract_file(file, config=config)

# Parallel: ~0.8 seconds for 10 files (6.25x faster)
results = batch_extract_file(files, config=config)

4. Fast Hash Maps

Uses ahash instead of std::collections::HashMap:

  • Faster hashing: SipHash → AHash (3-5x faster)
  • SIMD-accelerated: Uses CPU vector instructions
  • DoS resistant: Randomized per-process
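
Swapping the hasher is a one-line change; a minimal sketch using the ahash crate:

// Minimal sketch: AHashMap is a drop-in HashMap with the faster AHash hasher.
use ahash::AHashMap;

fn word_counts(text: &str) -> AHashMap<&str, usize> {
    let mut counts = AHashMap::new();
    for word in text.split_whitespace() {
        *counts.entry(word).or_insert(0) += 1;
    }
    counts
}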

5. Smart String Handling

Uses &str (string slices) over String where possible:

// Returns borrowed string slices: no String allocations
// (the only allocation is the small Vec itself)
pub fn supported_mime_types(&self) -> Vec<&str> {
    vec!["application/pdf", "application/xml"]
}