Migrating from v3 to v4¶
Kreuzberg v4 represents a complete architectural rewrite with a Rust-first design. This guide helps you migrate from v3 to v4.
Overview of Changes¶
v4 introduces several major changes:
- Rust Core: Complete rewrite of core extraction logic in Rust, with speedups of roughly 6x to 50x depending on the operation (see Performance Improvements)
- Multi-Language Support: Native support for Python, TypeScript, and Rust
- Plugin System: Trait-based plugin architecture for extensibility
- Type Safety: Improved type definitions across all languages
- Breaking API Changes: Several API changes for consistency and better ergonomics
Quick Migration Checklist¶
- Update dependencies to v4
- Update import statements (some modules reorganized)
- Update configuration (new dataclasses/types)
- Update error handling (exception hierarchy changed)
- Migrate custom extractors to new plugin system
- Test thoroughly (behavior may differ in edge cases)
Installation¶
Python¶
# v3
pip install "kreuzberg==3.*"
# v4
pip install "kreuzberg>=4.0"  # quoted so the shell does not treat >= as a redirect
# With features
pip install "kreuzberg[all]"
TypeScript (New in v4)¶
Rust (New in v4)¶
API Changes¶
Python API¶
Import Changes¶
# v3
from kreuzberg import extract_file, ExtractionConfig
# v4 (same, but internal structure changed)
from kreuzberg import extract_file, ExtractionConfig
Configuration Changes¶
# v3
from kreuzberg import ExtractionConfig
config = ExtractionConfig(
enable_ocr=True,
ocr_language="eng",
use_quality_processing=True,
)
# v4
from kreuzberg import ExtractionConfig, OcrConfig
config = ExtractionConfig(
ocr=OcrConfig(
backend="tesseract",
language="eng",
),
enable_quality_processing=True,
)
Batch Processing¶
# v3
from kreuzberg import batch_extract
results = batch_extract(["file1.pdf", "file2.pdf"])
# v4
from kreuzberg import batch_extract_files
results = batch_extract_files(["file1.pdf", "file2.pdf"])
Error Handling¶
# v3
from kreuzberg import KreuzbergException
try:
    result = extract_file("doc.pdf")
except KreuzbergException as e:
    print(f"Error: {e}")
# v4
from kreuzberg import KreuzbergError, ParsingError, ValidationError
try:
    result = extract_file("doc.pdf")
except ParsingError as e:
    print(f"Parsing error: {e}")
except ValidationError as e:
    print(f"Validation error: {e}")
except KreuzbergError as e:
    print(f"Error: {e}")
OCR Configuration¶
# v3
config = ExtractionConfig(
enable_ocr=True,
ocr_language="eng",
ocr_psm=6,
)
# v4
from kreuzberg import OcrConfig, TesseractConfig
config = ExtractionConfig(
ocr=OcrConfig(
backend="tesseract",
language="eng",
tesseract_config=TesseractConfig(
psm=6,
oem=3,
),
),
)
Complete Configuration (v4)¶
v4 provides extensive configuration options across all features:
from kreuzberg import (
ExtractionConfig,
OcrConfig,
TesseractConfig,
ChunkingConfig,
ImageExtractionConfig,
PdfConfig,
TokenReductionConfig,
LanguageDetectionConfig,
PostProcessorConfig,
)
config = ExtractionConfig(
# Caching
use_cache=True,
# Quality processing
enable_quality_processing=True,
# OCR configuration
ocr=OcrConfig(
backend="tesseract",
language="eng",
tesseract_config=TesseractConfig(
psm=6,
oem=3,
),
),
force_ocr=False, # Force OCR even for text-based PDFs
# Chunking
chunking=ChunkingConfig(
max_chars=1000,
max_overlap=100,
),
# Image extraction
images=ImageExtractionConfig(
extract_images=True,
target_dpi=300,
max_image_dimension=4096,
auto_adjust_dpi=True,
min_dpi=72,
),
# PDF options
pdf_options=PdfConfig(
extract_images=True,
passwords=["password1", "password2"], # Try multiple passwords
extract_metadata=True,
),
# Token reduction
token_reduction=TokenReductionConfig(
mode="moderate", # "off", "light", "moderate", "aggressive"
preserve_important_words=True,
),
# Language detection
language_detection=LanguageDetectionConfig(
enabled=True,
min_confidence=0.7,
detect_multiple=True,
),
# PostProcessor configuration
postprocessor=PostProcessorConfig(
enabled=True,
),
)
Metadata Access¶
# v3
result = extract_file("doc.pdf")
if "pdf" in result.metadata:
    pages = result.metadata["pdf"]["page_count"]
# v4
result = extract_file("doc.pdf")
if result.metadata.pdf:
    pages = result.metadata.pdf.page_count
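While a codebase is mid-migration, it can help to read the page count through one helper that accepts either metadata shape. The sketch below is only an illustration: `pdf_page_count` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins for real extraction results so the snippet runs without Kreuzberg installed.

```python
from types import SimpleNamespace
from typing import Any, Optional


def pdf_page_count(metadata: Any) -> Optional[int]:
    """Return the PDF page count from dict-style (v3) or attribute-style (v4) metadata."""
    if isinstance(metadata, dict):
        # v3 shape: metadata["pdf"]["page_count"]
        pdf = metadata.get("pdf")
        return pdf.get("page_count") if pdf else None
    # v4 shape: metadata.pdf.page_count
    pdf = getattr(metadata, "pdf", None)
    return getattr(pdf, "page_count", None) if pdf else None


# Stand-in objects; real metadata comes from extract_file().
v3_meta = {"pdf": {"page_count": 12}}
v4_meta = SimpleNamespace(pdf=SimpleNamespace(page_count=12))
```

Once every call site is on v4, the helper can be deleted in favor of direct attribute access.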
TypeScript API (New in v4)¶
TypeScript support is brand new in v4:
import {
extractFile,
extractFileSync,
ExtractionConfig,
OcrConfig,
} from '@goldziher/kreuzberg';
// Async extraction
const result = await extractFile('document.pdf');
// Sync extraction
const result2 = extractFileSync('document.pdf');
// With configuration
const config = new ExtractionConfig({
ocr: new OcrConfig({
backend: 'tesseract',
language: 'eng',
}),
});
const result3 = await extractFile('document.pdf', null, config);
Rust API (New in v4)¶
The Rust core is now available as a standalone library:
use kreuzberg::{extract_file_sync, ExtractionConfig};
fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("Content: {}", result.content);
    Ok(())
}
Feature Changes¶
Custom Extractors¶
v3 had limited support for custom extractors. v4 introduces a comprehensive plugin system.
Python¶
# v4 - Custom extractor
from kreuzberg import ExtractionResult, register_document_extractor
class CustomExtractor:
    def name(self) -> str:
        return "custom"
    def supported_mime_types(self) -> list[str]:
        return ["application/x-custom"]
    def extract(self, data: bytes, mime_type: str, config) -> ExtractionResult:
        # Implementation
        pass
register_document_extractor(CustomExtractor())
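Because the extractor above is duck-typed, a missing method may only surface at registration or extraction time. One way to catch that earlier is a structural check with `typing.Protocol`. This is a sketch under assumptions: `DocumentExtractorLike` is a hypothetical name, it lists only the three methods shown in this guide, and it is not how Kreuzberg itself validates plugins.

```python
from typing import Any, List, Protocol, runtime_checkable


@runtime_checkable
class DocumentExtractorLike(Protocol):
    """Hypothetical protocol listing the methods used in this guide's example."""

    def name(self) -> str: ...
    def supported_mime_types(self) -> List[str]: ...
    def extract(self, data: bytes, mime_type: str, config: Any) -> Any: ...


class CustomExtractor:
    def name(self) -> str:
        return "custom"

    def supported_mime_types(self) -> List[str]:
        return ["application/x-custom"]

    def extract(self, data: bytes, mime_type: str, config: Any) -> Any:
        raise NotImplementedError  # real extraction logic goes here


class Incomplete:
    """Lacks supported_mime_types() and extract(), so the check rejects it."""

    def name(self) -> str:
        return "incomplete"


def looks_like_extractor(obj: object) -> bool:
    # isinstance() on a runtime-checkable Protocol only checks that the
    # methods exist; it does not verify their signatures.
    return isinstance(obj, DocumentExtractorLike)
```

Running `looks_like_extractor(...)` in a unit test before calling the registration function gives a clearer failure than an error raised inside the plugin system.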
TypeScript¶
// v4 - Custom PostProcessor
import { registerPostProcessor, PostProcessorProtocol, ExtractionResult } from '@goldziher/kreuzberg';
class CustomProcessor implements PostProcessorProtocol {
  name(): string {
    return 'custom';
  }
  process(result: ExtractionResult): ExtractionResult {
    // Implementation
    return result;
  }
}
registerPostProcessor(new CustomProcessor());
OCR Backends¶
# v3 - Only Tesseract
config = ExtractionConfig(enable_ocr=True)
# v4 - Multiple backends
from kreuzberg import OcrConfig
# Tesseract
config = ExtractionConfig(
ocr=OcrConfig(backend="tesseract", language="eng")
)
# EasyOCR (requires kreuzberg[easyocr])
config = ExtractionConfig(
ocr=OcrConfig(backend="easyocr", language="en")
)
# PaddleOCR (requires kreuzberg[paddleocr])
config = ExtractionConfig(
ocr=OcrConfig(backend="paddleocr", language="en")
)
# Custom OCR backend
from kreuzberg import register_ocr_backend
class MyOCR:
    def name(self) -> str:
        return "my_ocr"
    def extract_text(self, image: bytes, language: str) -> str:
        # Implementation
        pass
register_ocr_backend(MyOCR())
Language Detection¶
# v3 - Not available
# v4 - Automatic language detection
from kreuzberg import ExtractionConfig, LanguageDetectionConfig
config = ExtractionConfig(
language_detection=LanguageDetectionConfig(
min_confidence=0.7,
),
)
result = extract_file("document.pdf", config=config)
print(result.detected_languages) # ['eng', 'deu']
Chunking¶
# v3 - Manual chunking
result = extract_file("doc.pdf")
chunks = [result.content[i:i+1000] for i in range(0, len(result.content), 1000)]
# v4 - Built-in chunking
from kreuzberg import ChunkingConfig
config = ExtractionConfig(
chunking=ChunkingConfig(
max_chars=1000,
max_overlap=100,
),
)
result = extract_file("doc.pdf", config=config)
for chunk in result.chunks:
    print(f"Chunk: {len(chunk)} chars")
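To make the two chunking parameters concrete, here is a minimal sliding-window sketch of what `max_chars` and `max_overlap` mean. It is an illustration only: `chunk_text` is a hypothetical helper, and the built-in chunker may split differently (for example at word or sentence boundaries).

```python
from typing import List


def chunk_text(text: str, max_chars: int, max_overlap: int) -> List[str]:
    """Split text into windows of up to max_chars that share max_overlap characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= max_overlap < max_chars:
        raise ValueError("max_overlap must be in the range [0, max_chars)")

    step = max_chars - max_overlap  # how far each window advances
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks
```

Unlike the v3 list comprehension, consecutive chunks here repeat the last `max_overlap` characters of the previous chunk, which helps keep context across chunk boundaries.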
Password-Protected PDFs¶
# v3 - Not available
# v4 - Password support (requires kreuzberg[crypto])
from kreuzberg import PdfConfig
config = ExtractionConfig(
pdf_options=PdfConfig(
passwords=["password1", "password2"], # Try multiple passwords in order
extract_metadata=True,
),
)
result = extract_file("encrypted.pdf", config=config)
Token Reduction¶
# v3 - Not available
# v4 - Token reduction for LLM processing
from kreuzberg import TokenReductionConfig
config = ExtractionConfig(
token_reduction=TokenReductionConfig(
mode="aggressive", # "off", "light", "moderate", "aggressive"
preserve_important_words=True,
),
)
result = extract_file("document.pdf", config=config)
# Content is automatically reduced while preserving meaning
Extract from Bytes¶
# v3 - Limited support
# v4 - Full bytes extraction API
from kreuzberg import extract_bytes, extract_bytes_sync
# Read file into memory
with open("document.pdf", "rb") as f:
    data = f.read()
# Sync extraction
result = extract_bytes_sync(data, "application/pdf")
# Async extraction
import asyncio
result = asyncio.run(extract_bytes(data, "application/pdf"))
# With MIME type auto-detection
result = extract_bytes_sync(data, None) # Auto-detect MIME type
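When the bytes come from a file whose name is known, the MIME type can often be derived from the extension before calling the bytes API. The sketch below uses only the standard library's `mimetypes` module; `guess_mime` is a hypothetical helper name, and returning `None` is meant to line up with the auto-detect call shown above.

```python
import mimetypes
from typing import Optional


def guess_mime(path: str) -> Optional[str]:
    """Guess a MIME type from the file extension; None if the extension is unknown."""
    mime, _encoding = mimetypes.guess_type(path)
    return mime
```

The return value can be passed as the second argument to `extract_bytes_sync`, so unknown extensions fall back to auto-detection.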
Table Extraction¶
# v3 - Limited table support
result = extract_file("doc.pdf")
# Tables mixed into content
# v4 - Structured table extraction
result = extract_file("doc.pdf")
for table in result.tables:
    print(table.markdown)  # Markdown format
    print(table.cells)  # Structured data
Performance Improvements¶
v4 is significantly faster than v3:
| Operation | v3 Time | v4 Time | Improvement |
|---|---|---|---|
| PDF Extraction | 2.5s | 0.15s | 16x faster |
| OCR Processing | 5.0s | 0.8s | 6x faster |
| XML Parsing | 3.0s | 0.06s | 50x faster |
| Batch (100 files) | 180s | 12s | 15x faster |
These improvements are due to:
- Rust core implementation
- Streaming parsers for large files
- Optimized memory management
- SIMD text processing
- Efficient caching
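The improvement column in the table above is simply the v3 time divided by the v4 time. The short calculation below reproduces it from the listed timings. The exact ratios come out at roughly 16.7x, 6.25x, 50x and 15x, so the table's figures appear to be rounded down.

```python
# Timings in seconds, copied from the table above: (v3, v4).
timings = {
    "PDF Extraction": (2.5, 0.15),
    "OCR Processing": (5.0, 0.8),
    "XML Parsing": (3.0, 0.06),
    "Batch (100 files)": (180.0, 12.0),
}

# Speedup factor = v3 time / v4 time.
speedups = {op: v3 / v4 for op, (v3, v4) in timings.items()}

for op, factor in speedups.items():
    print(f"{op}: {factor:.2f}x faster")
```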
New Features in v4¶
Plugin System¶
Four plugin types:
- DocumentExtractor - Custom file format extractors
- OcrBackend - Custom OCR engines
- PostProcessor - Data transformation and enrichment
- Validator - Fail-fast validation
Multi-Language Support¶
v4 provides native APIs for:
- Python - PyO3 bindings
- TypeScript/Node.js - NAPI-RS bindings
- Rust - Direct library usage
Configuration Discovery¶
# v4 - Automatic config discovery
# Looks for kreuzberg.toml, kreuzberg.yaml, kreuzberg.json
result = extract_file("doc.pdf") # Uses discovered config
# Manual config
from kreuzberg import load_config
config = load_config("custom-config.toml")
result = extract_file("doc.pdf", config=config)
Image Extraction¶
# v3 - Basic image extraction
# v4 - Advanced image extraction with DPI control
from kreuzberg import ImageExtractionConfig
config = ExtractionConfig(
images=ImageExtractionConfig(
extract_images=True,
target_dpi=300, # Target DPI for extracted images
max_image_dimension=4096, # Max dimension in pixels
auto_adjust_dpi=True, # Auto-adjust DPI for memory efficiency
min_dpi=72, # Minimum DPI threshold
),
)
result = extract_file("document.pdf", config=config)
# Images are extracted with optimized DPI settings
API Server¶
# v3 - Not available
# v4 - Built-in REST API server
pip install "kreuzberg[api]"
python -m kreuzberg serve --host 0.0.0.0 --port 8000
# Or via CLI binary
kreuzberg serve --port 8000
# Or via Docker
docker run -p 8000:8000 goldziher/kreuzberg:latest
MCP Server¶
# v3 - Not available
# v4 - Model Context Protocol server for Claude Desktop
python -m kreuzberg mcp
# Or via CLI binary
kreuzberg mcp
# Configure in Claude Desktop:
# macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
# Add: {"mcpServers": {"kreuzberg": {"command": "/path/to/kreuzberg", "args": ["mcp"]}}}
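If `claude_desktop_config.json` already lists other MCP servers, the entry should be merged in rather than pasted over the file. A minimal sketch, assuming the JSON shape shown in the comment above; `add_kreuzberg_server` is a hypothetical helper, and `/path/to/kreuzberg` remains a placeholder for your own binary path.

```python
import json
from typing import Any, Dict


def add_kreuzberg_server(config: Dict[str, Any], command: str) -> Dict[str, Any]:
    """Return a copy of config with a kreuzberg entry added under mcpServers."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))  # keep any existing servers
    servers["kreuzberg"] = {"command": command, "args": ["mcp"]}
    merged["mcpServers"] = servers
    return merged


existing = {"mcpServers": {"other": {"command": "other-server", "args": []}}}
updated = add_kreuzberg_server(existing, "/path/to/kreuzberg")
print(json.dumps(updated, indent=2))
```

The input dictionary is left unmodified, so the original configuration can be kept as a backup before the new one is written.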
Breaking Changes¶
Configuration Structure¶
v3 used flat configuration. v4 uses nested dataclasses:
# v3
config = ExtractionConfig(
enable_ocr=True,
ocr_language="eng",
ocr_psm=6,
use_cache=True,
)
# v4
config = ExtractionConfig(
ocr=OcrConfig(
backend="tesseract",
language="eng",
tesseract_config=TesseractConfig(psm=6),
),
use_cache=True,
)
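For projects with many v3 configurations, the flat-to-nested mapping can be expressed as a small translation step. The sketch below works on plain dictionaries so it runs without Kreuzberg installed. `v3_to_v4_config` is a hypothetical helper that covers only the renames shown in this guide (`enable_ocr`, `ocr_language`, `ocr_psm`, `use_quality_processing`); every other option is passed through unchanged and should be reviewed by hand.

```python
from typing import Any, Dict


def v3_to_v4_config(v3: Dict[str, Any]) -> Dict[str, Any]:
    """Translate flat v3-style options into the nested v4 layout (plain dicts)."""
    remaining = dict(v3)
    v4: Dict[str, Any] = {}

    # use_quality_processing was renamed to enable_quality_processing.
    if "use_quality_processing" in remaining:
        v4["enable_quality_processing"] = remaining.pop("use_quality_processing")

    # enable_ocr / ocr_language / ocr_psm collapse into a nested "ocr" section.
    enable_ocr = remaining.pop("enable_ocr", False)
    language = remaining.pop("ocr_language", None)
    psm = remaining.pop("ocr_psm", None)
    if enable_ocr:
        ocr: Dict[str, Any] = {"backend": "tesseract"}  # v3 supported only Tesseract
        if language is not None:
            ocr["language"] = language
        if psm is not None:
            ocr["tesseract_config"] = {"psm": psm}
        v4["ocr"] = ocr

    # Options shown unchanged in this guide (for example use_cache) pass through.
    v4.update(remaining)
    return v4
```

The resulting dictionary mirrors the nesting of `ExtractionConfig`, `OcrConfig` and `TesseractConfig`, so it can serve as a checklist when rewriting the real constructor calls.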
Metadata Structure¶
v3 used dictionaries. v4 uses typed dataclasses with attribute access, as shown under Metadata Access above.
Error Hierarchy¶
# v3
KreuzbergException (base)
# v4
KreuzbergError (base)
├── ValidationError
├── ParsingError
├── OCRError
├── MissingDependencyError
├── PluginError
└── ConfigurationError
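With a hierarchy like this, the order of `except` clauses matters: subclasses must be listed before the base class, or the base clause catches everything first. The sketch below demonstrates this with local stand-in classes that mirror part of the tree above. They are defined here for illustration and are not imported from Kreuzberg.

```python
# Local stand-ins mirroring part of the v4 hierarchy; NOT the real kreuzberg classes.
class KreuzbergError(Exception):
    """Stand-in for the v4 base exception."""


class ParsingError(KreuzbergError):
    """Stand-in for a parsing failure."""


class ValidationError(KreuzbergError):
    """Stand-in for a validation failure."""


def classify(exc: Exception) -> str:
    """Show which except clause handles a given exception."""
    try:
        raise exc
    except ParsingError:
        return "parsing"
    except ValidationError:
        return "validation"
    except KreuzbergError:
        # Base class last: catches any other library error.
        return "other-kreuzberg"
    except Exception:
        return "unrelated"
```

Code that only needs the v3 behavior can catch `KreuzbergError` alone, since every exception in the tree derives from it.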
Function Names¶
| v3 | v4 |
|---|---|
| batch_extract() | batch_extract_files() |
| extract_bytes() | extract_bytes() (same) |
| extract_file() | extract_file() (same) |
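In a large codebase, the identifier renames can be applied mechanically. The sketch below is a hypothetical helper, not a tool shipped with Kreuzberg. It covers only the two renames named in this guide (`batch_extract` and `KreuzbergException`) and uses word boundaries so identifiers that are already migrated stay untouched.

```python
import re

# v3 identifier -> v4 identifier, taken from this guide.
RENAMES = {
    "batch_extract": "batch_extract_files",
    "KreuzbergException": "KreuzbergError",
}

# \b on both sides keeps "batch_extract_files" from matching "batch_extract".
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def rename_v3_identifiers(source: str) -> str:
    """Replace v3 identifiers in a source string with their v4 names."""
    return _PATTERN.sub(lambda match: RENAMES[match.group(1)], source)
```

A plain text substitution like this also rewrites matches inside strings and comments, so review the resulting diff before committing.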
Removed Features¶
GMFT (Give Me Formatted Tables)¶
v3's vision-based table extraction, which used TATR models, has been removed. It is replaced by Tesseract OCR table detection:
# v4 - Tesseract table detection
config = ExtractionConfig(
ocr=OcrConfig(
tesseract_config=TesseractConfig(enable_table_detection=True)
)
)
result = extract_file("doc.pdf", config=config)
# result.tables -> list[ExtractedTable] with .cells and .markdown
Entity Extraction, Keyword Extraction, Document Classification¶
Removed. Use external libraries (spaCy, KeyBERT, etc.) with postprocessors if needed.
Other¶
- ExtractorRegistry: Custom extractors must be Rust plugins
- HTMLToMarkdownConfig, JSONExtractionConfig: Now use defaults
- ImageOCRConfig: Replaced by ImageExtractionConfig
Migration Examples¶
Basic Extraction¶
# v3
from kreuzberg import extract_file
result = extract_file("document.pdf")
print(result["content"])
print(result["metadata"])
# v4
from kreuzberg import extract_file
result = extract_file("document.pdf")
print(result.content)
print(result.metadata)
OCR Extraction¶
# v3
from kreuzberg import extract_file, ExtractionConfig
config = ExtractionConfig(
enable_ocr=True,
ocr_language="eng",
)
result = extract_file("scanned.pdf", config=config)
# v4
from kreuzberg import extract_file, ExtractionConfig, OcrConfig
config = ExtractionConfig(
ocr=OcrConfig(
backend="tesseract",
language="eng",
),
)
result = extract_file("scanned.pdf", config=config)
Batch Processing¶
# v3
from kreuzberg import batch_extract
results = batch_extract(["doc1.pdf", "doc2.pdf", "doc3.pdf"])
for result in results:
    print(result["content"])
# v4
from kreuzberg import batch_extract_files
results = batch_extract_files(["doc1.pdf", "doc2.pdf", "doc3.pdf"])
for result in results:
    print(result.content)
Error Handling¶
# v3
from kreuzberg import extract_file, KreuzbergException
try:
    result = extract_file("doc.pdf")
except KreuzbergException as e:
    print(f"Error: {e}")
# v4
from kreuzberg import extract_file, KreuzbergError, ParsingError
try:
    result = extract_file("doc.pdf")
except ParsingError as e:
    print(f"Parsing error: {e}")
    # Handle parsing-specific error
except KreuzbergError as e:
    print(f"Error: {e}")
    # Handle other errors
Testing Your Migration¶
Automated Testing¶
# test_migration.py
import pytest
from kreuzberg import extract_file, ExtractionConfig
def test_basic_extraction():
    """Test that basic extraction works"""
    result = extract_file("tests/fixtures/sample.pdf")
    assert result.content
    assert result.mime_type == "application/pdf"
def test_ocr_extraction():
    """Test OCR extraction"""
    from kreuzberg import OcrConfig
    config = ExtractionConfig(
        ocr=OcrConfig(backend="tesseract", language="eng"),
    )
    result = extract_file("tests/fixtures/scanned.pdf", config=config)
    assert result.content
    assert result.metadata.ocr
def test_batch_processing():
    """Test batch processing"""
    from kreuzberg import batch_extract_files
    files = ["tests/fixtures/doc1.pdf", "tests/fixtures/doc2.pdf"]
    results = batch_extract_files(files)
    assert len(results) == 2
    for result in results:
        assert result.content
def test_error_handling():
    """Test error handling"""
    from kreuzberg import ParsingError
    with pytest.raises(ParsingError):
        extract_file("tests/fixtures/corrupted.pdf")
Performance Testing¶
import time
from kreuzberg import extract_file, batch_extract_files
# Single file
start = time.time()
result = extract_file("large_document.pdf")
print(f"Single file: {time.time() - start:.2f}s")
# Batch processing
files = [f"document{i}.pdf" for i in range(100)]
start = time.time()
results = batch_extract_files(files)
print(f"Batch (100 files): {time.time() - start:.2f}s")
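For repeated measurements, the timing logic above can be wrapped in a small helper. The sketch below uses `time.perf_counter()`, a monotonic clock intended for measuring short intervals, in place of `time.time()`. `timed` is a hypothetical helper name and works with any callable, so it is shown here with a built-in function and no Kreuzberg import.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Demonstration with a built-in; swap in extract_file or batch_extract_files.
total, elapsed = timed(sum, range(1_000))
print(f"sum took {elapsed:.6f}s")
```

Running the same call several times and comparing the fastest runs reduces noise from caching and disk warm-up.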
Getting Help¶
- Documentation: https://docs.kreuzberg.dev
- Examples: See Python API Reference, TypeScript API Reference, Rust API Reference
- Issues: GitHub Issues
- Changelog: CHANGELOG.md
Deprecation Timeline¶
- v3.x: Maintenance mode (bug fixes only)
- v4.0: Current stable release
- v3 EOL: June 2025 (no further updates)
We recommend migrating to v4 as soon as possible to benefit from performance improvements and new features.