OCR (Optical Character Recognition)¶
Extract text from images and scanned PDFs using OCR.
When OCR is Needed¶
flowchart TD
Start[Document File] --> FileType{File Type}
FileType -->|Image| ImageOCR[Always Use OCR]
FileType -->|PDF| CheckPDF{Check PDF}
FileType -->|Other| NoOCR[No OCR Needed]
CheckPDF --> ForceOCR{force_ocr=True?}
ForceOCR -->|Yes| AllPagesOCR[OCR All Pages]
ForceOCR -->|No| TextLayer{Has Text Layer?}
TextLayer -->|No Text| ScannedOCR[OCR Required]
TextLayer -->|Some Text| HybridPDF[Hybrid PDF]
TextLayer -->|All Text| NativeExtract[Native Extraction]
HybridPDF --> PageByPage[Process Pages]
PageByPage --> CheckPage{Page Has Text?}
CheckPage -->|No| PageOCR[OCR This Page]
CheckPage -->|Yes| PageNative[Native Extraction]
ImageOCR --> OCRBackend[OCR Backend Configured?]
ScannedOCR --> OCRBackend
AllPagesOCR --> OCRBackend
PageOCR --> OCRBackend
OCRBackend -->|Yes| ProcessOCR[Process with OCR]
OCRBackend -->|No| Error[MissingDependencyError]
style ImageOCR fill:#FFB6C1
style ScannedOCR fill:#FFB6C1
style AllPagesOCR fill:#FFB6C1
style PageOCR fill:#FFB6C1
style ProcessOCR fill:#90EE90
style Error fill:#FF6B6B Kreuzberg automatically determines when OCR is required:
- Images (
.png,.jpg,.tiff,.bmp,.webp) - Always requires OCR - PDFs with no text layer - Scanned documents automatically trigger OCR
- Hybrid PDFs - Pages without text are processed with OCR, others use native extraction
- Force OCR - Use
force_ocr=Trueto OCR all pages regardless of text layer
Automatic Detection
You don't need to manually enable OCR for images. Kreuzberg detects the file type and applies OCR automatically when an OCR backend is configured.
OCR Backend Comparison¶
flowchart TD
Start[Choose OCR Backend] --> Platform{Platform Support}
Platform -->|All Platforms| Tesseract
Platform -->|Python Only| PythonBackends[EasyOCR/PaddleOCR]
Tesseract --> TessPriority{Priority}
TessPriority -->|Speed| TessSpeed[Tesseract: Fast]
TessPriority -->|Accuracy| TessAccuracy[Tesseract: Good]
TessPriority -->|Production| TessProd[Tesseract: Best Choice]
PythonBackends --> PyPriority{Priority}
PyPriority -->|Highest Accuracy| Easy[EasyOCR: Excellent Accuracy]
PyPriority -->|Speed + Accuracy| Paddle[PaddleOCR: Very Fast + Excellent]
PyPriority -->|GPU Available| GPU[EasyOCR/PaddleOCR with GPU]
style Tesseract fill:#90EE90
style Easy fill:#FFD700
style Paddle fill:#87CEEB Kreuzberg supports three OCR backends with different strengths:
| Feature | Tesseract | EasyOCR | PaddleOCR |
|---|---|---|---|
| Speed | Fast | Moderate | Very Fast |
| Accuracy | Good | Excellent | Excellent |
| Languages | 100+ | 80+ | 80+ |
| Installation | System package | Python package | Python package |
| Model Size | Small (~10MB) | Large (~100MB) | Medium (~50MB) |
| CPU/GPU | CPU only | CPU + GPU | CPU + GPU |
| Platform Support | All | Python only | Python only |
| Best For | General use, production | High accuracy needs | Speed + accuracy |
Recommendation¶
- Production/CLI: Use Tesseract for simplicity and broad platform support
- Python + Accuracy: Use EasyOCR for best accuracy with deep learning models
- Python + Speed: Use PaddleOCR for fast processing with good accuracy
Installation¶
Tesseract (Recommended)¶
Available on all platforms (Python, TypeScript, Rust, Ruby):
Download from GitHub releases
Additional Languages¶
# macOS
brew install tesseract-lang
# Ubuntu/Debian
sudo apt-get install tesseract-ocr-deu # German
sudo apt-get install tesseract-ocr-fra # French
sudo apt-get install tesseract-ocr-spa # Spanish
# List all installed languages
tesseract --list-langs
EasyOCR (Python Only)¶
Available only in Python with deep learning models:
Python 3.14 Compatibility
EasyOCR is not supported on Python 3.14 due to upstream PyTorch compatibility. Use Python 3.10-3.13 or use Tesseract on Python 3.14.
PaddleOCR (Python Only)¶
Available only in Python with optimized deep learning:
Python 3.14 Compatibility
PaddleOCR is not supported on Python 3.14 due to upstream compatibility issues. Use Python 3.10-3.13.
Configuration¶
Basic Configuration¶
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};
fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "tesseract".to_string(),
language: Some("eng".to_string()),
..Default::default()
}),
..Default::default()
};
let result = extract_file_sync("scanned.pdf", None, &config)?;
println!("{}", result.content);
Ok(())
}
Multiple Languages¶
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};
fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "tesseract".to_string(),
language: Some("eng+deu+fra".to_string()),
..Default::default()
}),
..Default::default()
};
let result = extract_file_sync("multilingual.pdf", None, &config)?;
println!("{}", result.content);
Ok(())
}
Force OCR on All Pages¶
Process PDFs with OCR even when they have a text layer:
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};
fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "tesseract".to_string(),
..Default::default()
}),
force_ocr: true,
..Default::default()
};
let result = extract_file_sync("document.pdf", None, &config)?;
println!("{}", result.content);
Ok(())
}
Using EasyOCR (Python Only)¶
GPU Acceleration
EasyOCR and PaddleOCR support GPU acceleration via PyTorch/PaddlePaddle. Set use_gpu=True to enable.
Using PaddleOCR (Python Only)¶
Advanced OCR Options¶
DPI Configuration¶
Control image resolution for OCR processing:
DPI Recommendations
- 150 DPI: Fast processing, lower accuracy
- 300 DPI (default): Balanced speed and accuracy
- 600 DPI: High accuracy, slower processing
Image Preprocessing¶
Kreuzberg automatically preprocesses images for better OCR results:
- Grayscale conversion - Reduces noise
- Contrast enhancement - Improves text visibility
- Noise reduction - Removes artifacts
- Deskewing - Corrects rotation
These are applied automatically and require no configuration.
Troubleshooting¶
Tesseract not found
Error: MissingDependencyError: tesseract
Solution: Install Tesseract OCR:
Language not found
Error: Failed to initialize tesseract with language 'deu'
Solution: Install the language data:
Poor OCR accuracy
Problem: Extracted text has many errors
Solutions:
-
Increase DPI: Try 600 DPI for better quality
-
Try different backend: EasyOCR often has better accuracy
-
Specify correct language: Use the document's language
OCR is very slow
Problem: Processing takes too long
Solutions:
-
Reduce DPI: Use 150 DPI for faster processing
-
Use GPU acceleration (EasyOCR/PaddleOCR):
-
Use batch processing: Process multiple files concurrently
Out of memory with large PDFs
Problem: Memory errors when processing large scanned PDFs
Solutions:
-
Reduce DPI: Lower resolution uses less memory
-
Process pages separately: Extract specific page ranges
-
Increase system memory: OCR is memory-intensive
EasyOCR/PaddleOCR not working on Python 3.14
Error: Installation fails on Python 3.14
Solution: Use Python 3.10-3.13 or switch to Tesseract:
CLI Usage¶
Extract with OCR using the command-line interface:
# Basic OCR extraction
kreuzberg extract scanned.pdf --ocr
# Specify language
kreuzberg extract scanned.pdf --ocr --language deu
# Multiple languages
kreuzberg extract scanned.pdf --ocr --language "eng+deu"
# Force OCR on all pages
kreuzberg extract document.pdf --ocr --force-ocr
# Custom DPI
kreuzberg extract scanned.pdf --ocr --dpi 600
Next Steps¶
- Configuration - All configuration options
- Advanced Features - Chunking, language detection
- Advanced Features - Advanced Kreuzberg features
- Extraction Basics - Core extraction API