# Benchmarking Methodology
## Test Setup
- Platform: Ubuntu 22.04 (GitHub Actions)
- Iterations: 3 runs per benchmark
- Modes: Single-file (latency) and Batch (throughput); a sketch of the distinction follows this list
- Documents: 30+ test files covering all supported formats
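The two modes measure different things: single-file runs record per-run latency for one document, while batch runs record total wall-clock time and total bytes for a whole fixture set. The sketch below illustrates that distinction using a placeholder `extract` callable; it is not the real harness (the actual measurements come from the Rust `benchmark-harness` crate).

```python
import os
import time
from typing import Callable, Iterable


def single_file_latency(extract: Callable[[str], object], path: str, iterations: int = 3) -> list[float]:
    """Time one document several times; each sample is a latency measurement in milliseconds."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        extract(path)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def batch_throughput(extract: Callable[[str], object], paths: Iterable[str]) -> float:
    """Process a whole fixture set and report megabytes per second."""
    paths = list(paths)
    total_bytes = sum(os.path.getsize(p) for p in paths)
    start = time.perf_counter()
    for p in paths:
        extract(p)
    elapsed = time.perf_counter() - start
    return (total_bytes / 1_000_000) / elapsed
```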
## Frameworks Tested

### Kreuzberg Variants
- Native (Rust direct)
- Python (sync, async, batch)
- TypeScript (async, batch)
- WebAssembly (async, batch)
- Ruby (sync, batch)
- Go (sync, batch)
- Java (sync)
- C# (sync)
### Competitors
- Apache Tika
- Docling
- Unstructured
- MarkItDown
## Metrics Explained
- Duration (p95, p50): 95th and 50th percentile latency in milliseconds
- Throughput: Megabytes processed per second
- Memory (peak, p95, p99): Peak and percentile memory usage in MB
- CPU: Average CPU utilization percentage
- Success Rate: Percentage of files successfully processed
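For reference, this is roughly the arithmetic behind those numbers. The harness computes the metrics internally, so the helper below (a nearest-rank percentile over raw samples) is illustrative only, and the values are placeholders rather than real measurements.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: p50 is the median, p95 the value 95% of samples sit at or below."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Placeholder numbers, not real benchmark results.
durations_ms = [112.0, 98.5, 104.2]      # per-run latencies from a single-file benchmark
print(percentile(durations_ms, 50))      # Duration p50
print(percentile(durations_ms, 95))      # Duration p95

total_mb, total_s = 512.0, 38.4          # megabytes processed and wall-clock time from a batch run
print(total_mb / total_s)                # Throughput in MB/s
```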
## Caveats
- Hardware-dependent: results vary with available CPU and memory
- File size distribution affects throughput calculations
- OCR benchmarks require Tesseract installation
- Network latency not measured (local file I/O only)
## Running Locally
```bash
# Build benchmark harness
cargo build --release -p benchmark-harness

# Run benchmarks
./target/release/benchmark-harness run \
  --fixtures tools/benchmark-harness/fixtures/ \
  --frameworks kreuzberg-native,docling \
  --output ./benchmark-output \
  --format html

# Open results
open benchmark-output/index.html
```
See the Advanced Guide for more options.