Features¶
Kreuzberg is a comprehensive document intelligence library supporting 50+ file formats with advanced extraction, OCR, and processing capabilities. This page documents all features and their availability across language bindings.
Core Extraction Features¶
File Format Support¶
Kreuzberg extracts text, tables, and metadata from over 50 file formats:
Documents - PDF (.pdf) - Native text extraction with optional OCR fallback - Microsoft Word (.docx, .doc) - Modern and legacy formats - LibreOffice Writer (.odt) - OpenDocument text - Plain text (.txt, .md, .markdown) - With metadata extraction for Markdown
Spreadsheets - Excel (.xlsx, .xls, .xlsm, .xlsb) - Modern and legacy formats - LibreOffice Calc (.ods) - OpenDocument spreadsheet - CSV (.csv) - Comma-separated values - TSV (.tsv) - Tab-separated values
Presentations - PowerPoint (.pptx, .ppt) - Modern and legacy formats
Images - Common formats: JPEG, PNG, GIF, BMP, TIFF, WebP - Advanced formats: JPEG 2000 (.jp2, .jpx, .jpm, .mj2) - Portable formats: PNM, PBM, PGM, PPM
Email - EML (.eml) - RFC 822 email format - MSG (.msg) - Microsoft Outlook format
Web & Markup - HTML (.html, .htm) - Converted to Markdown - XML (.xml) - Streaming parser for large files - SVG (.svg) - Scalable vector graphics
Structured Data - JSON (.json) - JavaScript Object Notation - YAML (.yaml, .yml) - YAML Ain't Markup Language - TOML (.toml) - Tom's Obvious Minimal Language
Archives - ZIP (.zip) - ZIP archives - TAR (.tar, .tgz) - Tape archives - GZIP (.gz) - GNU zip - 7-Zip (.7z) - 7-Zip archives
Extraction Capabilities¶
Text Extraction - Native text extraction from all supported formats - Preserves formatting and structure where applicable - Handles multi-byte character encodings (UTF-8, UTF-16, etc.) - Mojibake detection and correction
Table Extraction - Structured table data from PDFs, spreadsheets, and Word documents - Cell-level extraction with row/column indexing - Markdown and JSON output formats - Merged cell support
Metadata Extraction - Document properties (title, author, creation date, etc.) - Page count, word count, character count - MIME type detection - Format-specific metadata (Excel sheets, PDF annotations, etc.)
Image Extraction - Extract embedded images from PDFs and Office documents - Image preprocessing for OCR optimization - Format conversion and resolution normalization
OCR (Optical Character Recognition)¶
Tesseract OCR¶
Native Tesseract integration available in all language bindings.
Features: - 100+ language support via Tesseract language packs - Page segmentation modes (PSM) for different layouts - OCR Engine Modes (OEM) for accuracy tuning - Confidence scoring per word/line - hOCR output format support - Automatic image preprocessing
Configuration: - Language selection (single or multi-language) - PSM and OEM mode selection - Custom Tesseract configuration strings - Whitelist/blacklist character sets
Python-Specific OCR Backends¶
Python bindings provide two additional OCR backends via optional dependencies:
EasyOCR (pip install kreuzberg[easyocr]) - Deep learning-based OCR engine - 80+ language support - GPU acceleration support (CUDA) - Better accuracy for certain scripts (CJK, Arabic, etc.) - Requires Python <3.14
PaddleOCR (pip install kreuzberg[paddleocr]) - Production-ready OCR from PaddlePaddle - Ultra-lightweight models - 80+ language support - Mobile deployment capability - Requires Python <3.14
OCR Features¶
- Automatic fallback: Use OCR when native text extraction fails
- Force OCR mode: Override native extraction with OCR
- Caching: OCR results cached to disk for performance
- Image preprocessing: Automatic contrast, deskew, and noise reduction
- Multi-language detection: Process documents with mixed languages
Advanced Processing Features¶
Language Detection¶
Automatic language detection for extracted text using fast-langdetect.
Capabilities: - 60+ language detection - Confidence scoring - Multi-language detection (detect all languages in document) - Configurable confidence thresholds - ISO 639-1 and ISO 639-3 code support
Configuration:
Content Chunking¶
Split extracted text into semantic chunks for LLM processing.
Chunking Strategies: - Recursive: Split by paragraphs, sentences, then words - Semantic: Preserve semantic boundaries - Token-aware: Respect token limits for LLMs
Features: - Configurable chunk size and overlap - Metadata preservation per chunk - Character position tracking - Optional embedding generation
Configuration:
Embeddings¶
Generate vector embeddings for chunks using FastEmbed.
Embedding Models: - Preset models: "fast", "balanced", "quality" - FastEmbed models: Any model from FastEmbed catalog - Custom models: Bring your own embedding model
Features: - Local embedding generation (no API calls) - Automatic model download and caching - Multiple embedding dimensions (384, 512, 768, 1024) - Batch processing for performance - Optional L2 normalization
Configuration:
Token Reduction¶
Reduce token count while preserving semantic meaning using extractive summarization.
Reduction Modes: - Light ("light"): ~15% reduction, minimal information loss - Moderate ("moderate"): ~30% reduction, balanced approach - Aggressive ("aggressive"): ~50% reduction, maximum compression
Algorithm: - TF-IDF based sentence scoring - Stopword filtering with language-specific lists - Position-aware scoring (preserve important sections) - Configurable reduction targets
Configuration:
Quality Processing¶
Enhance extraction quality with text normalization and cleanup.
Processing Steps: - Unicode normalization (NFC/NFD/NFKC/NFKD) - Whitespace normalization - Line break standardization - Encoding detection and correction - Mojibake fixing - Character set validation
Configuration:
Keyword Extraction¶
Extract keywords and key phrases from documents.
Algorithms: - YAKE (Yet Another Keyword Extractor): Unsupervised, language-independent - RAKE (Rapid Automatic Keyword Extraction): Fast statistical method
Features: - Configurable number of keywords - N-gram support (1-3 word phrases) - Language-specific stopword filtering - Relevance scoring
Configuration:
Batch Processing¶
Parallel Extraction¶
Process multiple documents concurrently using async/await or thread pools.
Async API:
Features: - Automatic concurrency based on CPU count - Configurable worker limits - Error handling per document - Progress tracking - Memory-efficient streaming for large batches
Caching¶
Intelligent caching system for expensive operations.
Cached Operations: - OCR results (per image hash) - Language detection results - Embedding vectors - Extracted metadata
Cache Features: - Disk-based storage - Automatic cache invalidation - Configurable cache directory - Cache statistics and management - LRU eviction policy
Configuration:
Configuration & Discovery¶
Configuration Methods¶
Kreuzberg supports four configuration methods:
- Programmatic: Create configuration objects in code
- TOML files:
kreuzberg.toml - YAML files:
kreuzberg.yaml - JSON files:
kreuzberg.json
Automatic Discovery¶
Configuration files automatically discovered in order:
- Current directory:
./kreuzberg.{toml,yaml,json} - User config:
~/.config/kreuzberg/config.{toml,yaml,json} - System config:
/etc/kreuzberg/config.{toml,yaml,json}
Discovery API:
Environment Variables¶
Override configuration via environment variables:
KREUZBERG_CONFIG_PATH: Path to config fileKREUZBERG_CACHE_DIR: Cache directoryKREUZBERG_OCR_BACKEND: OCR backend selectionKREUZBERG_OCR_LANGUAGE: OCR language
Plugin System¶
Plugin Types¶
Extensible architecture supporting four plugin types:
Document Extractors - Add support for custom file formats - Override default extractors - Priority-based selection
OCR Backends - Integrate cloud OCR services - Custom OCR engines - Preprocessing pipelines
Post Processors - Transform extraction results - Add custom metadata - Filter or enhance content
Validators - Validate extraction results - Enforce quality standards - Custom error handling
Plugin Registration¶
Rust:
let registry = get_document_extractor_registry();
registry.register("custom", Arc::new(MyExtractor), 50)?;
Python:
Plugin Discovery¶
Automatic plugin discovery from: - Python entry points - Configuration files - Environment variables
Server Modes¶
HTTP REST API Server¶
Production-ready RESTful API server.
Endpoints: - POST /extract - Extract from uploaded files - GET /health - Health check - GET /info - Server information - GET /cache/stats - Cache statistics - POST /cache/clear - Clear cache
Features: - File upload support - JSON/multipart request handling - CORS configuration - Request logging and metrics - Graceful shutdown
Start Server:
Model Context Protocol (MCP) Server¶
Stdio-based MCP server for AI agent integration.
Tools: - extract_file - Extract from file path - extract_bytes - Extract from base64 bytes - batch_extract - Extract from multiple files
Features: - Stdio transport (Claude Desktop, Continue.dev, etc.) - JSON-RPC 2.0 protocol - Streaming results - Error handling
Start Server:
Claude Desktop Configuration:
Language Binding Comparison¶
Feature Availability¶
| Feature | Python | TypeScript | Rust | Ruby |
|---|---|---|---|---|
| Core Extraction | ✓ | ✓ | ✓ | ✓ |
| All file formats | ✓ | ✓ | ✓ | ✓ |
| Table extraction | ✓ | ✓ | ✓ | ✓ |
| Metadata extraction | ✓ | ✓ | ✓ | ✓ |
| OCR | ||||
| Tesseract | ✓ | ✓ | ✓ | ✓ |
| EasyOCR | ✓ (optional) | ✗ | ✗ | ✗ |
| PaddleOCR | ✓ (optional) | ✗ | ✗ | ✗ |
| Processing | ||||
| Language detection | ✓ | ✓ | ✓ | ✓ |
| Content chunking | ✓ | ✓ | ✓ | ✓ |
| Embeddings | ✓ | ✓ | ✓ | ✓ |
| Token reduction | ✓ | ✓ | ✓ | ✓ |
| Quality processing | ✓ | ✓ | ✓ | ✓ |
| Keyword extraction | ✓ | ✓ | ✓ | ✓ |
| Configuration | ||||
| Programmatic config | ✓ | ✓ | ✓ | ✓ |
| File-based config | ✓ | ✓ | ✓ | ✓ |
| Config discovery | ✓ | ✓ | ✓ | ✓ |
| Plugin System | ||||
| Document extractors | ✓ | ✓ | ✓ | ✓ |
| OCR backends | ✓ | ✓ | ✓ | ✓ |
| Post processors | ✓ | ✓ | ✓ | ✓ |
| Validators | ✓ | ✓ | ✓ | ✓ |
| Servers | ||||
| HTTP REST API | ✓ | ✓ | ✓ | ✓ |
| MCP Server | ✓ | ✓ | ✓ | ✓ |
| APIs | ||||
| Sync API | ✓ | ✓ | ✓ | ✓ |
| Async API | ✓ | ✓ | ✓ | ✗ |
| Batch processing | ✓ | ✓ | ✓ | ✓ |
| Streaming | ✗ | ✗ | ✓ | ✗ |
Package Distribution¶
| Language | Package Manager | Modular Features | Full Package |
|---|---|---|---|
| Python | PyPI (pip) | ✗ | ✓ (default) |
| TypeScript | npm | ✗ | ✓ (default) |
| Rust | crates.io | ✓ | ✗ (opt-in) |
| Ruby | RubyGems (gem) | ✗ | ✓ (default) |
Rust Feature Flags¶
Rust provides fine-grained control over included components via Cargo features:
Format Extractors: - pdf - PDF extraction (pdfium) - excel - Excel/spreadsheet support - office - Office document support (Word, PowerPoint) - email - Email extraction (EML, MSG) - html - HTML to Markdown conversion - xml - XML streaming parser - archives - Archive extraction (ZIP, TAR, 7z)
Processing Features: - ocr - Tesseract OCR integration - language-detection - Language detection - chunking - Content chunking - embeddings - Embedding generation (requires chunking) - quality - Quality processing and text normalization - keywords - Keyword extraction (YAKE + RAKE) - stopwords - Stopword filtering
Server Features: - api - HTTP REST API server - mcp - Model Context Protocol server
Convenience Bundles: - full - All format extractors + all processing features - server - Server features + common extractors - cli - CLI features + common extractors
Example Cargo.toml:
Default: No features enabled (minimal build)
Python Optional Dependencies¶
Python bindings include all core features by default. Optional OCR backends require separate installation:
# Core package (Tesseract OCR only)
pip install kreuzberg
# With EasyOCR
pip install kreuzberg[easyocr]
# With PaddleOCR
pip install kreuzberg[paddleocr]
# All optional features
pip install kreuzberg[all]
Note: EasyOCR and PaddleOCR require Python <3.14 due to PyTorch dependencies.
TypeScript/Ruby Packages¶
TypeScript and Ruby bindings include all features in a single package. No optional dependencies or modular installation.
Performance Characteristics¶
Rust Core Performance¶
Kreuzberg's Rust core provides 10-50x performance improvements over pure Python implementations:
Benchmarked Operations: - Text extraction: 15-25x faster - XML parsing: 30-50x faster (streaming) - Token reduction: 10-15x faster - Quality processing: 12-20x faster
Memory Efficiency¶
Streaming Parsers: - XML: Constant memory usage regardless of file size - Plain text: Line-by-line streaming for multi-GB files - Archives: Extract on-demand without loading entire archive
Memory Limits: - PDF extraction: ~2x file size in memory - Image OCR: ~100MB per image - Embedding generation: Batch processing with configurable limits
Caching Benefits¶
OCR caching provides significant performance gains:
- First extraction: 5-30 seconds (depending on document)
- Cached extraction: <100ms
Cache hit rates typically exceed 80% for document collections with repeated processing.
CLI Tools¶
Extract Command¶
Primary CLI for document extraction.
Features: - Batch processing with glob patterns - Parallel processing (--parallel) - Output format selection (text, JSON) - OCR configuration - Progress reporting
Serve Command¶
Start HTTP REST API server.
MCP Command¶
Start Model Context Protocol server.
Cache Management¶
See CLI Usage for complete documentation.
System Requirements¶
Runtime Dependencies¶
Required (all platforms): - Tesseract OCR (4.0+) for OCR functionality - LibreOffice (optional) for legacy Office formats (.doc, .ppt) - Pandoc (optional) for advanced document conversion
Installation:
# macOS
brew install tesseract libreoffice pandoc
# Ubuntu/Debian
apt-get install tesseract-ocr libreoffice pandoc
# RHEL/CentOS/Fedora
dnf install tesseract libreoffice pandoc
# Windows (Chocolatey)
choco install tesseract libreoffice pandoc
Python Requirements¶
- Python 3.10+
- Optional: CUDA toolkit for GPU-accelerated OCR (EasyOCR)
TypeScript/Node.js Requirements¶
- Node.js 18+
- Native module support (node-gyp)
Rust Requirements¶
- Rust 1.80+ (edition 2024)
- Cargo for building from source
Ruby Requirements¶
- Ruby 3.3+
- Native extension support
Docker Images¶
Pre-built Docker images available on Docker Hub:
Variants: - goldziher/kreuzberg:latest - Core + Tesseract - goldziher/kreuzberg:latest-all - All features
Usage:
See Installation Guide for detailed instructions.
Next Steps¶
- Installation - Install Kreuzberg
- Quick Start - Get started in 5 minutes
- Configuration - Configure extraction behavior
- API Reference - Complete API documentation