Performance Analysis

Overview

This page presents a performance analysis of Kreuzberg based on standardized benchmarks covering 18 file formats. In these benchmarks, Kreuzberg is the fastest Python CPU-based text extraction framework.

Live Benchmarks: View real-time performance comparisons at benchmarks.kreuzberg.dev

Methodology: Full benchmark methodology and source code are available at github.com/Goldziher/python-text-extraction-libs-benchmarks

Executive Summary

In the benchmark suite, Kreuzberg leads on the following metrics:

  • Speed: Fastest Python CPU-based text extraction framework
  • Memory: Lowest memory footprint at ~360MB average
  • Installation: Minimal 71MB package size
  • Reliability: 100% success rate across all tested formats

Technical Performance Metrics

Processing Speed

Throughput by File Size Category

| Category | Kreuzberg Sync | Kreuzberg Async | Technical Notes |
| --- | --- | --- | --- |
| Tiny (<100KB) | 31.78 files/s | 23.94 files/s | Sync faster due to lower overhead |
| Small (100KB-1MB) | 8.91 files/s | 9.31 files/s | Async benefits from concurrency |
| Medium (1-10MB) | 2.42 files/s | 3.16 files/s | Async leverages multiprocessing |
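
The throughput figures above can be turned into an average per-file time and an async-to-sync ratio. The sketch below does only that arithmetic on the numbers from the table; the names `THROUGHPUT`, `mean_latency_ms`, and `async_speedup` are illustrative and not part of Kreuzberg.

```python
# Throughput in files/s, copied from the table above.
THROUGHPUT = {
    "tiny": {"sync": 31.78, "async": 23.94},
    "small": {"sync": 8.91, "async": 9.31},
    "medium": {"sync": 2.42, "async": 3.16},
}


def mean_latency_ms(files_per_second: float) -> float:
    """Average time per file in milliseconds, assuming files run back to back."""
    return 1000.0 / files_per_second


def async_speedup(category: str) -> float:
    """Ratio of async to sync throughput; values above 1 favor async."""
    row = THROUGHPUT[category]
    return row["async"] / row["sync"]


for name, row in THROUGHPUT.items():
    print(
        f"{name}: sync {mean_latency_ms(row['sync']):.1f} ms/file, "
        f"async/sync ratio {async_speedup(name):.2f}"
    )
```

On these figures, async throughput is roughly 25% lower than sync for tiny files and roughly 30% higher for medium files, which matches the notes in the table.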

Processing Architecture

  • Synchronous Mode: Direct execution path with minimal overhead, optimal for single-file operations
  • Asynchronous Mode: Event-loop based with intelligent task scheduling, ideal for concurrent workloads
  • Multiprocessing: Automatic CPU core utilization for compute-intensive operations (OCR, PDF parsing)
  • Memory Management: Streaming architecture prevents memory bloat on large files
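
The split between an event loop and a worker pool described above can be sketched with the standard library alone. This is a generic illustration of the pattern, not Kreuzberg's implementation: `extract_text` is a hypothetical stand-in for a CPU-bound extraction step, and `extract_many` is an assumed helper name.

```python
import asyncio
from concurrent.futures import Executor
from typing import Optional


def extract_text(path: str) -> str:
    """Hypothetical stand-in for a CPU-bound step such as OCR or PDF parsing."""
    return f"text from {path}"


async def extract_many(paths: list[str], executor: Optional[Executor] = None) -> list[str]:
    """Run the blocking extraction for each path in an executor, concurrently.

    With executor=None asyncio uses its default thread pool. Passing a
    concurrent.futures.ProcessPoolExecutor spreads the work across CPU cores.
    """
    loop = asyncio.get_running_loop()
    tasks = [loop.run_in_executor(executor, extract_text, p) for p in paths]
    # gather returns results in the same order as the input paths.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    print(asyncio.run(extract_many(["a.pdf", "b.docx"])))
```

For a single file the executor hop is pure overhead, which is consistent with sync mode being faster in the tiny-file category.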

Memory Efficiency

| Mode | Memory Usage | Characteristics |
| --- | --- | --- |
| Kreuzberg Sync | 359.8 MB | Baseline - minimal overhead, efficient GC |
| Kreuzberg Async | 395.2 MB | +10% due to event loop and concurrent task management |

Memory optimization strategies:

  • Lazy loading of document components
  • Streaming text extraction for large files
  • Automatic garbage collection after each extraction
  • Process pool recycling for long-running operations
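
To illustrate what streaming means in practice, the sketch below reads a text stream in fixed-size chunks so that memory use depends on the chunk size rather than the file size. It is a generic example of the technique under that assumption, not Kreuzberg's internal code; `iter_text_chunks` and `count_words` are hypothetical names.

```python
import io
from typing import Iterator, TextIO


def iter_text_chunks(stream: TextIO, chunk_size: int = 64 * 1024) -> Iterator[str]:
    """Yield successive chunks so the reader holds one chunk at a time."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def count_words(stream: TextIO, chunk_size: int = 64 * 1024) -> int:
    """Count whitespace-separated words without loading the whole stream."""
    total = 0
    carry = ""  # partial word left over at the end of the previous chunk
    for chunk in iter_text_chunks(stream, chunk_size):
        data = carry + chunk
        words = data.split()
        if words and not data[-1].isspace():
            carry = words.pop()  # the last word may continue in the next chunk
        else:
            carry = ""
        total += len(words)
    return total + (1 if carry else 0)


if __name__ == "__main__":
    print(count_words(io.StringIO("alpha beta gamma"), chunk_size=4))
```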

Installation Footprint

Kreuzberg specifications:

  • Package size: 71 MB
  • Dependencies: 43 packages
  • Core components: PDFium, python-docx, python-pptx, pypandoc
  • Optional extras: EasyOCR, PaddleOCR, GMFT (table extraction)

Size optimization achieved through:

  • Selective dependency installation
  • No bundled ML models
  • Efficient binary packaging
  • Modular architecture with optional components

Reliability Metrics

Kreuzberg achieves a 100% success rate across all tested formats:

  • Zero timeouts or failures in benchmark suite
  • Robust error handling with graceful degradation
  • Comprehensive format support (18 file types)
  • Consistent performance across file sizes
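
One way to read "graceful degradation" and "success rate" together is sketched below: try a primary extractor, fall back to a second one on failure, and count a file as failed only if both raise. How the benchmark itself defines success is an assumption here, and `Outcome`, `extract_with_fallback`, and `success_rate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Outcome:
    path: str
    text: str
    error: Optional[str] = None  # None means extraction succeeded


def extract_with_fallback(
    path: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> Outcome:
    """Try the primary extractor, then the fallback; record the errors if both fail."""
    try:
        return Outcome(path, primary(path))
    except Exception as first_error:
        try:
            return Outcome(path, fallback(path))
        except Exception as second_error:
            return Outcome(path, "", error=f"{first_error}; {second_error}")


def success_rate(outcomes: list[Outcome]) -> float:
    """Share of outcomes without a recorded error, as a percentage."""
    if not outcomes:
        return 0.0
    ok = sum(1 for o in outcomes if o.error is None)
    return 100.0 * ok / len(outcomes)
```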

Supported File Formats

Comprehensive format coverage across 6 categories:

| Category | Formats | Features |
| --- | --- | --- |
| Documents | PDF, DOCX, PPTX, XLSX | Full text, metadata, tables |
| Web/Markup | HTML, Markdown, RST | Structure preservation |
| Images | PNG, JPG, JPEG, BMP | OCR with multiple engines |
| Email | EML, MSG | Headers, body, attachments |
| Data | CSV, JSON, YAML | Native parsing |
| Archives | ZIP (containing above) | Recursive extraction |
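
The table above can be expressed as a lookup from file extension to category, which is handy when routing files in a pipeline. The extension spellings (for example `.md` for Markdown) are assumptions, and `FORMAT_CATEGORIES` and `category_for` are hypothetical names rather than Kreuzberg APIs.

```python
from pathlib import Path
from typing import Optional

# Categories and formats as listed in the table above.
# The concrete extensions are assumed, not taken from Kreuzberg.
FORMAT_CATEGORIES = {
    "Documents": {".pdf", ".docx", ".pptx", ".xlsx"},
    "Web/Markup": {".html", ".md", ".rst"},
    "Images": {".png", ".jpg", ".jpeg", ".bmp"},
    "Email": {".eml", ".msg"},
    "Data": {".csv", ".json", ".yaml"},
    "Archives": {".zip"},
}


def category_for(path: str) -> Optional[str]:
    """Return the category for a path based on its extension, or None if unlisted."""
    suffix = Path(path).suffix.lower()
    for category, extensions in FORMAT_CATEGORIES.items():
        if suffix in extensions:
            return category
    return None


if __name__ == "__main__":
    for name in ["Report.PDF", "mail.eml", "photo.jpeg", "tool.exe"]:
        print(name, "->", category_for(name))
```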

Technical Architecture

Performance Optimizations

Speed optimizations:

  • Native C++ libraries (PDFium for PDFs, Tesseract for OCR)
  • Efficient data handling with minimal copies
  • Memory pooling for frequently used objects
  • Parallel processing for multi-page documents

Memory optimizations:

  • Streaming extraction for large files
  • Lazy loading of document components
  • Automatic resource cleanup
  • Bounded memory usage regardless of file size

Async implementation:

  • True async/await support (not just wrapper functions)
  • Intelligent task scheduling
  • Process pool for CPU-intensive operations
  • Non-blocking I/O throughout the pipeline

Production Deployment

Infrastructure Benefits

Resource efficiency:

  • Minimal memory footprint (~360MB) enables higher container density
  • Small installation size (71MB) reduces image build times
  • Fast processing speeds reduce compute costs
  • Predictable resource usage simplifies capacity planning
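
As a rough capacity-planning illustration, the sketch below estimates how many workers fit on a node given the average memory figures reported on this page. The 8 GiB node size and 20% headroom are assumptions chosen for the example, `workers_per_node` is a hypothetical helper, and peak usage may exceed the averages.

```python
def workers_per_node(node_memory_mb: float, per_worker_mb: float, headroom: float = 0.2) -> int:
    """Number of workers that fit after reserving a fraction of memory as headroom."""
    usable = node_memory_mb * (1.0 - headroom)
    return int(usable // per_worker_mb)


NODE_MB = 8 * 1024  # assumed 8 GiB node
SYNC_MB = 359.8     # average from the memory table on this page
ASYNC_MB = 395.2    # average from the memory table on this page

if __name__ == "__main__":
    print("sync workers:", workers_per_node(NODE_MB, SYNC_MB))
    print("async workers:", workers_per_node(NODE_MB, ASYNC_MB))
```

Under these assumptions the node fits 18 sync workers or 16 async workers.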

Deployment options:

  • Docker images for linux/amd64 and linux/arm64
  • Serverless compatible (AWS Lambda, Google Cloud Functions)
  • Native Python package for traditional deployments
  • REST API server for microservice architectures

Operational advantages:

  • Zero external API dependencies
  • Local processing for data sovereignty
  • Configurable resource limits
  • Comprehensive logging and monitoring

Benchmark Methodology

Test Environment

  • Platform: Linux CI runners with standardized hardware
  • Python Version: 3.12-3.13
  • Document Set: 18 file formats across 6 categories
  • Metrics Collected: Processing speed, memory usage, success rate
  • Methodology: Full details and source code at github.com/Goldziher/python-text-extraction-libs-benchmarks
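
The three metrics listed above can be collected with a small harness like the one below. This is a simplified sketch, not the benchmark's actual code: `benchmark` is a hypothetical function, and `tracemalloc` tracks only Python-level allocations, so it will under-report compared with process memory when native libraries are involved.

```python
import time
import tracemalloc
from typing import Callable, Iterable


def benchmark(extract: Callable[[str], str], paths: Iterable[str]) -> dict:
    """Measure throughput, peak traced memory, and success rate for one extractor."""
    items = list(paths)
    successes = 0
    tracemalloc.start()
    start = time.perf_counter()
    for path in items:
        try:
            extract(path)
            successes += 1
        except Exception:
            pass  # a failure counts against the success rate
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "files_per_second": len(items) / elapsed if elapsed > 0 else float("inf"),
        "peak_traced_mb": peak_bytes / (1024 * 1024),
        "success_rate": 100.0 * successes / len(items) if items else 0.0,
    }


if __name__ == "__main__":
    # str.upper stands in for a real extractor in this example.
    print(benchmark(lambda p: p.upper(), ["a.pdf", "b.docx", "c.html"]))
```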

Key Performance Indicators

Kreuzberg demonstrates:

  • Fastest processing: Leading throughput across all file size categories
  • Lowest memory usage: ~360MB average, the lowest among the frameworks benchmarked
  • Smallest footprint: 71MB installation size
  • High reliability: 100% success rate in comprehensive testing
  • Comprehensive format support: 18 file types with consistent performance

Conclusion

Kreuzberg's performance leadership stems from its efficient architecture, optimized implementation, and focus on real-world production needs. The combination of speed, reliability, and resource efficiency makes it the optimal choice for Python-based text extraction workloads.

For the latest benchmark results and to compare performance across different frameworks, visit benchmarks.kreuzberg.dev.


Performance data is based on benchmarking across a real-world document corpus. Results may vary with use case and hardware configuration.