Performance Guide¶
Kreuzberg provides both synchronous and asynchronous APIs, each optimized for different use cases. This guide helps you choose the right approach and understand the performance characteristics.
Quick Reference¶
| Use Case | Recommended API | Reason |
|---|---|---|
| CLI tools | `extract_file_sync()` | Lower overhead, simpler code |
| Backend APIs | `await extract_file()` | Always use async in async contexts |
| Web applications | `await extract_file()` | Better concurrency |
| Simple documents | `extract_file_sync()` | Faster for small files |
| Complex PDFs | `await extract_file()` | Parallelized processing |
| Batch processing | `await batch_extract_file()` | Concurrent execution |
| OCR-heavy workloads | `await extract_file()` | Multiprocessing benefits |
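Both entry points accept a file path and return an extraction result whose text is exposed on a `content` attribute. A minimal sketch of the three calls referenced above (file names are placeholders, and the top-level `kreuzberg` imports and the `content` attribute are assumed here):

```python
import asyncio

from kreuzberg import batch_extract_file, extract_file, extract_file_sync

# Sync: the simplest path for CLI tools and single small documents.
result = extract_file_sync("report.pdf")
print(result.content[:200])


async def main() -> None:
    # Async: preferred inside async applications and for OCR-heavy files.
    result = await extract_file("scanned.pdf")
    print(result.content[:200])

    # Batch: several documents are processed concurrently.
    results = await batch_extract_file(["a.pdf", "b.docx", "c.html"])
    print(f"extracted {len(results)} documents")


asyncio.run(main())
```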
Competitive Performance¶
Live benchmarks (source code) demonstrate Kreuzberg as the fastest Python CPU-based text extraction framework:
- Leading Performance: 31.78 files/second for small documents, 2.42 files/second for medium files
- Minimal Memory: ~360MB average usage, lowest among tested frameworks
- Smallest Installation: 71MB package size for maximum deployment flexibility
- High Reliability: 100% success rate across all 18 tested file formats
- Production Optimized: Built for high-throughput, real-time applications
Internal Benchmark Results¶
All internal benchmarks were conducted on macOS 15.5 with ARM64 (14 cores, 48GB RAM) using Python 3.13.3.
Single Document Processing¶
| Document Type | File Size | Sync Time | Async Time | Speedup | Notes |
|---|---|---|---|---|---|
| Markdown | \<1KB | 0.4ms | 17.5ms | ❌ Async 41x slower | Async overhead dominates |
| HTML | ~1KB | 1.6ms | 1.1ms | ✅ Async 1.5x faster | Minimal parsing overhead |
| PDF (searchable) | ~10KB | 3.4ms | 2.7ms | ✅ Async 1.3x faster | Text extraction only |
| PDF (non-searchable) | ~100KB | 394ms | 652ms | ❌ Sync 1.7x faster | OCR processing |
| PDF (complex) | ~1MB | 39.0s | 8.5s | ✅ Async 4.6x faster | Heavy OCR + processing |
Batch Processing¶
| Operation | Documents | Sync Time | Async Time | Speedup | Notes |
|---|---|---|---|---|---|
| Sequential batch | 3 mixed | 38.6s | N/A | N/A | Sync processes one by one |
| Concurrent batch | 3 mixed | N/A | 8.5s | ✅ Async 4.5x faster | Parallel processing |
Performance Analysis¶
Why Async Wins for Complex Tasks¶
- Multiprocessing: Async implementation uses multiprocessing for CPU-intensive OCR
- Concurrency: I/O operations don't block other tasks
- Resource Management: Better memory and CPU utilization
- Parallel OCR: Multiple pages processed simultaneously
Why Sync Wins for Simple Tasks¶
- No Overhead: Direct function calls without async/await machinery
- Lower Memory: No event loop or task scheduling overhead
- Simpler Path: Direct execution without thread/process coordination
- Fast Startup: Immediate execution for quick operations
Backend API Considerations¶
Important: When working in an async context (like FastAPI, Django async views, aiohttp), always use the async API even for simple documents:
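For example, a minimal FastAPI sketch (the route, the `path` query parameter, and the result's `content` attribute are illustrative; the point is that the handler awaits `extract_file` rather than calling the sync API):

```python
from fastapi import FastAPI

from kreuzberg import extract_file

app = FastAPI()


@app.get("/extract")
async def extract(path: str) -> dict:
    # Awaiting keeps the event loop free to serve other requests
    # while parsing and OCR run in the background.
    result = await extract_file(path)
    return {"content": result.content}
```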
Why this matters:
- Sync operations in async contexts block the entire event loop
- This prevents other requests from being processed concurrently
- Backend throughput drops dramatically
- Use async consistently throughout your async application stack
The Crossover Point¶
The performance crossover occurs at roughly 10KB of file size, or whenever OCR is required.
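One way to apply this heuristic in a script or CLI context is sketched below; the helper name, the 10KB threshold constant, and the `needs_ocr` flag are illustrative and not part of the library.

```python
import asyncio
import os

from kreuzberg import extract_file, extract_file_sync

CROSSOVER_BYTES = 10 * 1024  # ~10KB, per the measurements above


def extract_with_crossover(path: str, needs_ocr: bool = False):
    """Pick the API suggested by the crossover point (sync/CLI context only)."""
    if needs_ocr or os.path.getsize(path) >= CROSSOVER_BYTES:
        # Large or OCR-bound document: async parallelism pays off.
        return asyncio.run(extract_file(path))
    # Small, text-only document: the sync path avoids event-loop overhead.
    return extract_file_sync(path)
```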
Implementation Details¶
Synchronous Implementation¶
The sync API uses a pure synchronous multiprocessing approach:
- Direct execution: No async overhead
- Process pools: For CPU-intensive tasks like OCR
- Memory efficient: Lower baseline memory usage
- Simple debugging: Easier to profile and debug
Asynchronous Implementation¶
The async API leverages Python's asyncio with intelligent task scheduling:
- Event loop integration: Non-blocking I/O operations
- Concurrent processing: Multiple documents simultaneously
- Adaptive multiprocessing: Dynamic worker allocation
- Resource management: Automatic cleanup and optimization
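For instance, several documents can be processed concurrently with plain `asyncio.gather` (file names are placeholders):

```python
import asyncio

from kreuzberg import extract_file


async def extract_many(paths: list[str]):
    # The coroutines run concurrently; CPU-intensive OCR work is
    # delegated to worker processes as described above.
    return await asyncio.gather(*(extract_file(p) for p in paths))


results = asyncio.run(extract_many(["invoice.pdf", "scan-01.png", "notes.docx"]))
```

For batches of files, `batch_extract_file` (see Batch Processing Best Practices below) wraps this pattern with its own scheduling.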
Optimization Strategies¶
For Maximum Performance¶
- Choose the right API based on your use case
- Use batch operations for multiple files
- Configure OCR appropriately for your document types
- Profile your specific workload - results vary by content
Optimized Default Configuration¶
Kreuzberg's default configuration is optimized out-of-the-box for modern PDFs and standard documents:
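In practice that means no configuration object is needed for the common case; a minimal sketch (the file name is a placeholder):

```python
from kreuzberg import extract_file_sync

# The defaults (for example, PSM AUTO_ONLY) target modern PDFs and
# standard documents, so a plain call is usually all that is required.
result = extract_file_sync("quarterly-report.pdf")
print(result.content[:200])
```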
Advanced Configuration Examples¶
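The sketch below illustrates the kind of tuning discussed in the tips that follow. The `ExtractionConfig` and `TesseractConfig` names and the `config` / `ocr_config` keywords are assumptions made for this example; only `ocr_backend=None`, `language_model_ngram_on`, and `tessedit_enable_dict_correction` come from the tips themselves, so check the configuration reference for the exact spelling.

```python
from kreuzberg import ExtractionConfig, TesseractConfig, extract_file_sync

# Text-layer PDFs: skip OCR entirely for a large speedup.
no_ocr = ExtractionConfig(ocr_backend=None)
text_result = extract_file_sync("searchable.pdf", config=no_ocr)

# Scanned documents: trade a little OCR quality for throughput.
fast_ocr = ExtractionConfig(
    ocr_config=TesseractConfig(
        language_model_ngram_on=False,          # large speedup on clean scans
        tessedit_enable_dict_correction=False,  # faster on technical documents
    ),
)
scan_result = extract_file_sync("scanned.pdf", config=fast_ocr)
```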
Performance Optimization Tips¶
Based on comprehensive benchmarking with 138+ documents:
- Disable OCR for text documents: Setting `ocr_backend=None` provides a significant speedup for documents that already have a text layer
- Use PSM `AUTO_ONLY` (default): Optimized for modern documents without orientation-detection overhead
- Language model trade-offs: Disabling `language_model_ngram_on` can provide a 30x+ speedup with minimal quality impact on clean documents
- Dictionary correction: Disabling `tessedit_enable_dict_correction` speeds up processing for technical documents
Batch Processing Best Practices¶
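A minimal sketch of concurrent batch extraction (file names are placeholders; the result's `content` attribute is assumed):

```python
import asyncio

from kreuzberg import batch_extract_file


async def main() -> None:
    paths = ["report.pdf", "scan-01.png", "minutes.docx"]
    # All documents are processed concurrently rather than one by one,
    # which is where the ~4.5x batch speedup measured above comes from.
    results = await batch_extract_file(paths)
    for path, result in zip(paths, results):
        print(path, len(result.content))


asyncio.run(main())
```

Keep batch sizes moderate if memory is a concern (see Memory Usage below), and prefer a single batch call over spinning up a separate event loop per file.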
Memory Usage¶
Sync Memory Profile¶
- Low baseline: ~10-50MB for most operations
- Predictable: Memory usage scales linearly with file size
- Fast cleanup: Immediate garbage collection
Async Memory Profile¶
- Higher baseline: ~50-100MB due to event loop overhead
- Better scaling: More efficient for large batches
- Managed cleanup: Automatic resource management
Benchmarking Your Workload¶
To benchmark your specific use case, use our comprehensive benchmark suite:
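The suite's exact invocation is documented in its README (linked below). As a quick first approximation, a plain-Python timing sketch such as the following also works; the helper name and file path are placeholders:

```python
import statistics
import time

from kreuzberg import extract_file_sync


def time_extraction(path: str, runs: int = 5) -> float:
    """Median wall-clock time for one document, with a warm-up run."""
    extract_file_sync(path)  # warm-up, excluded from timing
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        extract_file_sync(path)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


print(f"{time_extraction('your-document.pdf') * 1000:.1f} ms")
```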
Benchmark Methodology¶
Our benchmarks follow rigorous methodology to ensure accurate results:
- Controlled Environment: Tests run on dedicated CI infrastructure
- Multiple Iterations: Each test runs multiple times for statistical significance
- Memory Monitoring: Peak memory usage tracked throughout execution
- CPU Profiling: Average CPU utilization measured
- Warm-up Runs: Cold-start and caching effects minimized with warm-up iterations
- System Info Collection: Hardware specs recorded for context
Interpreting Results¶
- Duration: Lower is better, measured in seconds/milliseconds
- Memory Peak: Peak memory usage during operation (MB)
- CPU Average: Average CPU utilization percentage during test
- Success Rate: Percentage of benchmark runs that completed successfully
The benchmarks use real-world documents of varying complexity to simulate actual usage patterns.
For complete benchmark suite documentation, methodology details, and CI integration, see the Benchmark Suite README.
Troubleshooting Performance¶
Common Issues¶
- Async slower than expected: Check if async overhead dominates (small files)
- High memory usage: Consider batch size and file types
- Slow OCR: Verify OCR engine configuration and document quality
- CPU bottlenecks: Monitor process pool utilization
Profiling Tools¶
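Python's standard tooling is enough to get started; for example, `cProfile` for CPU time on the sync path (the sample file name and the number of rows printed are arbitrary):

```python
import cProfile
import pstats

from kreuzberg import extract_file_sync

# Profile a single extraction and print the 15 most expensive calls.
profiler = cProfile.Profile()
profiler.enable()
extract_file_sync("sample.pdf")
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)
```

For memory, the standard `tracemalloc` module can be used to confirm the peak-usage figures discussed above.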
Conclusion¶
Choose your API based on your specific needs:
- Sync for simplicity: CLI tools, simple documents, single-threaded applications
- Async for scale: Web applications, batch processing, complex documents
- Async for backends: Always use async in async contexts (FastAPI, Django async, etc.)
- Batch for efficiency: Multiple files, concurrent processing requirements
Key Decision Points¶
- Are you in an async context? → Use async API
- Processing multiple files? → Use batch operations
- Simple single document in sync context? → Sync may be faster
- Complex documents or OCR required? → Use async API
- Building a web API? → Use async API
The performance characteristics will vary based on your specific documents, hardware, and usage patterns. We recommend benchmarking with your actual data to make informed decisions.
Remember: Kreuzberg is benchmarked as one of the fastest text extraction libraries available, delivering superior performance regardless of which API you choose.