Kreuzberg¶

Kreuzberg is a document intelligence framework that transforms unstructured documents into structured, machine-readable data. Built on a foundation of established open source technologies—PDFium for PDF processing, Tesseract for optical character recognition, and Pandoc for universal document conversion—Kreuzberg provides a unified interface for extracting text, metadata, and structural information from diverse document formats.

The framework emphasizes extensibility, allowing developers to integrate custom extractors and processing hooks while maintaining consistent APIs and error handling across all document types.

Core Capabilities¶

Kreuzberg addresses the complete document intelligence pipeline through a modular, extensible architecture designed for production environments.

Performance Characteristics¶

Throughput: Process 30+ documents per second (small files), with linear scaling to larger documents
Resource Efficiency: 71MB installation footprint, ~360MB runtime memory usage
Reliability: 100% extraction success rate across 18 tested file formats
Architecture: Native C extensions (PDFium, Tesseract) with Python async/await support
Benchmarks: Comprehensive performance analysis

Engineering Principles¶

Test Coverage: 95%+ coverage with comprehensive test suites
API Design: True async/await implementation alongside synchronous APIs
Error Handling: Consistent exception hierarchy with detailed context
Type Safety: Full type annotations for enhanced developer experience
Profiling: Continuous performance monitoring and optimization

Developer Integration¶

Zero Configuration: Functional defaults with progressive configuration options
AI Tool Integration: Native Model Context Protocol (MCP) server implementation
IDE Support: Complete type annotations and docstrings for intelligent code completion
Documentation: Comprehensive API reference with practical examples

Deployment Architecture¶

Containerization: Multi-architecture Docker images (linux/amd64, linux/arm64)
Serverless: Optimized for AWS Lambda, Google Cloud Functions, Azure Functions
Processing Modes: CPU-based with optional GPU acceleration (EasyOCR, PaddleOCR)
Data Sovereignty: Local processing without external API dependencies
Interface Options: Command-line interface, REST API, MCP server

Document Intelligence Features¶

Format Support: 18 document types including PDF, DOCX, PPTX, images, HTML, and structured data formats
OCR Engines: Tesseract (default), EasyOCR, PaddleOCR with automatic fallback strategies
Data Extraction: Text content, document metadata, table structures, and embedded resources
Processing Capabilities: Content chunking for RAG pipelines, language detection, format preservation
Extensibility: Plugin architecture for custom extractors and hooks

Architecture Philosophy¶

Kreuzberg builds upon established open source foundations, leveraging Pandoc's universal document conversion capabilities, PDFium's robust PDF handling, and Tesseract's proven OCR technology. This approach ensures reliability while enabling rapid feature development.

The framework is designed for modern document processing workflows including Retrieval Augmented Generation (RAG) pipelines, batch document analysis, and real-time content extraction in cloud-native environments.