Performance

Kreuzberg's Rust-first architecture delivers significant performance improvements over pure Python implementations. This page explains the performance benefits, benchmarking methodology, and optimization techniques.

Performance Benefits

The Rust core provides 10-50x performance improvements across multiple operations:

graph LR
    subgraph "Python-based Libraries"
        Py1["docling<br/>~2.5s per PDF"]
        Py2["unstructured<br/>~3.2s per PDF"]
        Py3["markitdown<br/>~1.8s per PDF"]
    end

    subgraph "Kreuzberg v4"
        Rust["Rust Core<br/>~0.15s per PDF"]
    end

    Py1 -.->|16x slower| Rust
    Py2 -.->|21x slower| Rust
    Py3 -.->|12x slower| Rust

    style Rust fill:#c8e6c9
    style Py1 fill:#ffcdd2
    style Py2 fill:#ffcdd2
    style Py3 fill:#ffcdd2

Benchmark Results

Performance benchmarks compare Kreuzberg against other popular extraction libraries using 94 real-world documents:

PDF Extraction

| Library | Avg Time | Memory (Peak) | Throughput |
| --- | --- | --- | --- |
| Kreuzberg v4 | 0.15s | 45 MB | 6.7 docs/sec |
| Kreuzberg v3 | 1.2s | 120 MB | 0.83 docs/sec |
| extractous | 0.25s | 65 MB | 4.0 docs/sec |
| docling | 2.5s | 450 MB | 0.4 docs/sec |
| unstructured | 3.2s | 380 MB | 0.31 docs/sec |

Key Improvements:

  • 8x faster than Kreuzberg v3 (Rust rewrite)
  • 16-21x faster than Python libraries
  • 62% less memory than v3
  • 88-90% less memory than docling and unstructured

Excel Extraction

| Library | Avg Time | Memory (Peak) |
| --- | --- | --- |
| Kreuzberg v4 | 0.08s | 25 MB |
| openpyxl | 1.2s | 180 MB |
| pandas | 0.45s | 95 MB |

Key Improvements:

  • 15x faster than openpyxl
  • 5.6x faster than pandas
  • 86% less memory than openpyxl

Text Processing

| Operation | Python | Rust | Speedup |
| --- | --- | --- | --- |
| Token reduction | 450ms | 12ms | 37x |
| Quality scoring | 220ms | 8ms | 27x |
| XML streaming (100MB) | 8.5s | 0.4s | 21x |
| Text streaming (500MB) | 15s | 0.8s | 18x |

Why Rust is Faster

1. Native Compilation

Rust compiles to native machine code with aggressive optimizations:

flowchart LR
    subgraph "Python"
        PySource[Python Code] --> Interpret[Interpreter<br/>CPython]
        Interpret --> Execute[Execution]
    end

    subgraph "Rust"
        RustSource[Rust Code] --> Compile[Compiler<br/>LLVM]
        Compile --> Optimize[Optimizations<br/>Inlining, SIMD, etc.]
        Optimize --> Native[Native Machine Code]
        Native --> Execute2[Execution]
    end

    Execute -.->|10-50x slower| Execute2

    style Execute2 fill:#c8e6c9
    style Execute fill:#ffcdd2

Compiler Optimizations:

  • Inlining: Small function calls are replaced with the function body, eliminating call overhead
  • Dead code elimination: Unused code is removed entirely
  • Loop unrolling: Loop bodies are duplicated to reduce branch overhead and keep CPU pipelines full
  • SIMD: Single Instruction Multiple Data executes one operation on several data elements at once (see the sketch below)
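
For a concrete (if generic) illustration, a simple reduction like the one below typically receives all of these treatments in a release build. This is ordinary Rust, not Kreuzberg code:

// Generic illustration, not Kreuzberg code: in a release build
// (cargo build --release), LLVM typically inlines the closure, unrolls the
// loop, and auto-vectorizes it into packed SIMD additions.
pub fn sum(values: &[u32]) -> u32 {
    values.iter().copied().sum()
}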

2. Zero-Copy Operations

Rust's ownership model enables zero-copy string slicing and byte buffer handling:

# Python: slicing copies the substring
text = content[100:500]  # Allocates a new string

// Rust: Zero-copy slice
let text: &str = &content[100..500];  // No allocation

Impact:

  • No memory allocation for substrings
  • No CPU cycles spent copying
  • Better cache locality from fewer allocations

3. SIMD Acceleration

Text processing hot paths use SIMD for parallel operations:

// Process 16 bytes at once (x86_64 SSE2 intrinsics)
let chunk = unsafe { _mm_loadu_si128(ptr as *const __m128i) };
let spaces = unsafe { _mm_cmpeq_epi8(chunk, space_vec) };

SIMD Benefits:

  • Token reduction: 37x faster with SIMD whitespace detection
  • Quality scoring: 27x faster with SIMD character classification
  • String utilities: 15-20x faster character counting
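
Expanded into a self-contained form, the snippet above might look like the following sketch: it counts ASCII spaces 16 bytes at a time with SSE2 and handles the remainder with scalar code. This is illustrative only, not Kreuzberg's actual implementation:

// Self-contained sketch of SIMD whitespace counting on x86_64 (SSE2);
// illustrative only, not Kreuzberg's actual implementation.
#[cfg(target_arch = "x86_64")]
pub fn count_spaces(text: &str) -> usize {
    use std::arch::x86_64::*;

    let bytes = text.as_bytes();
    let mut count = 0usize;
    let mut i = 0usize;

    unsafe {
        let space_vec = _mm_set1_epi8(b' ' as i8); // 16 copies of the space byte
        while i + 16 <= bytes.len() {
            // Load 16 bytes and compare each against ' ' in one instruction
            let chunk = _mm_loadu_si128(bytes.as_ptr().add(i) as *const __m128i);
            let eq = _mm_cmpeq_epi8(chunk, space_vec);
            // Pack the 16 per-byte results into a 16-bit mask, count set bits
            count += (_mm_movemask_epi8(eq) as u32).count_ones() as usize;
            i += 16;
        }
    }

    // Scalar fallback for the remaining tail bytes
    count + bytes[i..].iter().filter(|&&b| b == b' ').count()
}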

4. Async Concurrency

Tokio's work-stealing scheduler enables true parallelism:

# Python: GIL prevents true parallelism
with ThreadPoolExecutor() as executor:
    results = executor.map(extract_file, files)  # Only one thread executes Python at a time

// Rust: True parallel execution
let results = batch_extract_file(&files, None, &config).await?;  // All cores utilized

Concurrency Benefits:

  • Batch extraction: Near-linear scaling with CPU cores
  • No GIL: All cores execute simultaneously
  • Async I/O: Thousands of concurrent file operations
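
The same fan-out/fan-in pattern can be sketched directly against Tokio. Here extract_one is a placeholder for real extraction work, not a Kreuzberg API:

// Generic Tokio sketch of concurrent batch work on the work-stealing
// scheduler; `extract_one` is a placeholder, not a Kreuzberg API.
use tokio::task::JoinSet;

async fn extract_one(path: String) -> std::io::Result<String> {
    tokio::fs::read_to_string(&path).await // stand-in for real extraction
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let files: Vec<String> = std::env::args().skip(1).collect();
    let mut set = JoinSet::new();
    for f in files {
        // Each task is scheduled onto Tokio's thread pool, so work
        // spreads across all cores with no GIL in the way.
        set.spawn(extract_one(f));
    }
    while let Some(joined) = set.join_next().await {
        let text = joined??; // unwrap the JoinError, then the I/O error
        println!("{} chars extracted", text.len());
    }
    Ok(())
}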

5. Memory Efficiency

Rust's ownership model eliminates garbage collection overhead:

graph TB
    subgraph "Python Memory"
        Alloc1[Allocate Object]
        Use1[Use Object]
        GC1[Garbage Collector<br/>Scans + Pauses]
        Free1[Free Memory]
        Alloc1 --> Use1 --> GC1 --> Free1
    end

    subgraph "Rust Memory"
        Alloc2[Allocate Object]
        Use2[Use Object]
        Drop2[Dropped at Scope Exit<br/>Immediate Free]
        Alloc2 --> Use2 --> Drop2
    end

    GC1 -.->|Pauses execution| Alloc1

    style Drop2 fill:#c8e6c9
    style GC1 fill:#ffcdd2

Memory Benefits:

  • No GC pauses: Deterministic performance
  • Lower peak memory: RAII frees resources immediately
  • Better cache utilization: Smaller memory footprint
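
The difference is visible in ordinary Rust code: deallocation happens deterministically at scope exit, with no collector involved. A generic illustration:

// Generic RAII illustration: the buffer is freed the moment it goes out
// of scope, deterministically and without a garbage collector.
fn process_chunk(data: &[u8]) -> usize {
    let buffer: Vec<u8> = data.to_vec(); // heap allocation
    buffer.iter().filter(|&&b| b == b' ').count()
} // `buffer` is dropped here; its memory is returned immediately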

Streaming Parsers

For large files (multi-GB XML, text, archives), Kreuzberg uses streaming parsers that process data incrementally:

flowchart LR
    subgraph "Loading Parser"
        File1[Large File<br/>5 GB] --> Load[Load Entire File<br/>into Memory]
        Load --> Parse1[Parse]
        Parse1 --> Result1[Result]
    end

    subgraph "Streaming Parser"
        File2[Large File<br/>5 GB] --> Stream[Read Chunks<br/>4 KB at a time]
        Stream --> Parse2[Parse Incrementally]
        Parse2 --> Result2[Result]
    end

    Load -.->|5 GB memory| Result1
    Stream -.->|4 KB memory| Result2

    style Result2 fill:#c8e6c9
    style Result1 fill:#ffcdd2

Streaming Benefits:

  • Constant memory: Process a 100 GB file with a ~4 KB buffer
  • Faster startup: Begin processing immediately
  • Better cache performance: Small working set

Streaming Extractors:

  • XMLExtractor: Streams with quick-xml
  • TextExtractor: Line-by-line streaming
  • ArchiveExtractor: Decompresses on-the-fly
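
As a rough sketch of the streaming approach (the event-loop shape that quick-xml encourages, not Kreuzberg's actual extractor), text can be accumulated one event at a time while a single reusable buffer keeps memory constant:

// Sketch of incremental XML text extraction with quick-xml; buffer handling
// and error handling are illustrative, not Kreuzberg's actual extractor.
use quick_xml::events::Event;
use quick_xml::Reader;
use std::fs::File;
use std::io::BufReader;

fn extract_text(path: &str) -> Result<String, Box<dyn std::error::Error>> {
    let mut reader = Reader::from_reader(BufReader::new(File::open(path)?));
    let mut buf = Vec::new(); // reused for every event: memory stays constant
    let mut text = String::new();
    loop {
        match reader.read_event_into(&mut buf)? {
            Event::Text(t) => text.push_str(&t.unescape()?),
            Event::Eof => break,
            _ => {} // skip tags, comments, etc.
        }
        buf.clear(); // keep the working set small
    }
    Ok(text)
}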

Benchmarking Methodology

Kreuzberg's benchmark suite provides comprehensive performance measurement:

Test Dataset

  • 94 real-world documents
  • Multiple formats: PDF, DOCX, XLSX, images, emails
  • Size categories: Small (<1MB), medium (1-10MB), large (>10MB)
  • Variety: Reports, invoices, forms, presentations, spreadsheets

Metrics Tracked

  • Execution time: Wall clock time per extraction
  • CPU usage: Sampled at 100ms intervals
  • Memory usage: Peak RSS (Resident Set Size)
  • Throughput: Documents processed per second
  • Success rate: Percentage of files extracted without errors

Measurement Tools

flowchart TD
    Start[Start Extraction] --> ClearCache[Clear Kreuzberg Cache]
    ClearCache --> StartProfile[Start ResourceProfiler]
    StartProfile --> Extract[Run Extraction]
    Extract --> StopProfile[Stop Profiler]
    StopProfile --> Record[Record Metrics]
    Record --> Report[Generate Report]

    StartProfile -.-> CPU[Sample CPU @ 100ms]
    StartProfile -.-> Memory[Sample Memory @ 100ms]
    CPU --> StopProfile
    Memory --> StopProfile

    style Extract fill:#fff9c4

ResourceProfiler:

  • Samples CPU/memory every 100ms during extraction
  • Tracks peak memory usage
  • Records execution time with microsecond precision
  • 1800s (30-minute) timeout per file

Running Benchmarks

# Install benchmark dependencies
uv sync --all-extras --all-packages

# Run benchmarks
uv run python -m benchmarks.src.cli benchmark \
    --framework kreuzberg_sync,extractous,docling \
    --category all \
    --iterations 3

# Generate reports
uv run python -m benchmarks.src.cli report --output-format html
uv run python -m benchmarks.src.cli visualize

See Advanced Features Guide for details.

Optimization Techniques

Kreuzberg employs several optimization strategies:

1. Lazy Initialization

Expensive resources initialized only when needed:

use once_cell::sync::Lazy;
use tokio::runtime::Runtime;

static GLOBAL_RUNTIME: Lazy<Runtime> = Lazy::new(|| {
    tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()
        .expect("Failed to create runtime")
});
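
A hypothetical caller can then reuse the shared runtime from synchronous code instead of constructing one per call:

// Hypothetical usage sketch: blocking entry points reuse the shared runtime.
fn read_blocking(path: &str) -> std::io::Result<String> {
    GLOBAL_RUNTIME.block_on(async {
        tokio::fs::read_to_string(path).await // stand-in for real extraction
    })
}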

2. Caching

OCR results and extraction results cached by content hash:

  • Hit rate: 85%+ for repeated files
  • Storage: SQLite database (~100MB for 10k files)
  • Invalidation: Content-based (file changes invalidate cache)
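
The lookup shape can be sketched in a few lines. This in-memory version with std's DefaultHasher is illustrative only; the real cache is SQLite-backed:

// Illustrative in-memory sketch of content-addressed caching; the real
// cache persists to SQLite and the hasher here is just an example.
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

fn content_key(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish() // changing a single byte of the file changes the key
}

struct ExtractionCache {
    entries: HashMap<u64, String>, // content hash -> extracted text
}

impl ExtractionCache {
    fn get_or_extract(&mut self, content: &[u8], extract: impl FnOnce() -> String) -> &String {
        self.entries.entry(content_key(content)).or_insert_with(extract)
    }
}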

3. Batch Processing

Process multiple files concurrently with batch_extract_*:

# Sequential: ~5 seconds for 10 files
for file in files:
    result = extract_file(file, config=config)

# Parallel: ~0.8 seconds for 10 files (6.25x faster)
results = batch_extract_file(files, config=config)

4. Fast Hash Maps

Uses ahash instead of std::collections::HashMap:

  • Faster hashing: SipHash → AHash (3-5x faster)
  • SIMD-accelerated: Uses CPU vector instructions
  • DoS resistant: Randomized per-process
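
Swapping the hasher is a one-line change; a minimal sketch using the ahash crate:

// Minimal sketch: AHashMap is a drop-in HashMap with the faster AHash hasher.
use ahash::AHashMap;

fn word_counts(text: &str) -> AHashMap<&str, usize> {
    let mut counts = AHashMap::new();
    for word in text.split_whitespace() {
        *counts.entry(word).or_insert(0) += 1;
    }
    counts
}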

5. Smart String Handling

Uses &str (string slices) over String where possible:

// Returns borrowed string slices: no String allocations
// (the only allocation is the small Vec itself)
pub fn supported_mime_types(&self) -> Vec<&str> {
    vec!["application/pdf", "application/xml"]
}