User Guide

This guide provides comprehensive documentation for the Kreuzberg document intelligence framework, covering core concepts, configuration options, and integration patterns.

Contents

Basic Usage - Essential usage patterns and concepts (API)
Extraction Configuration - Configure the extraction process (API)
Metadata Extraction - Document metadata extraction (API)
Content Chunking - Split documents into manageable chunks
Token Reduction - Optimize text for LLMs and storage (API)
Document Classification - Automatic document type detection
OCR Configuration - Configure OCR settings (API)
OCR Backends - Choose and configure different OCR engines
Supported Formats - All supported document formats
MCP Server - Model Context Protocol server for AI integration
API Server - REST API for document extraction
Docker - Using Kreuzberg with Docker

Best Practices

Use the async API for better performance in web applications and concurrent extraction
Configure OCR language settings to match your document languages for better accuracy
For large documents, consider file streaming methods to reduce memory usage
When processing many similar documents, reuse configuration objects for consistency
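
The first and last points above can be combined in one minimal sketch: a single configuration object is created once and shared by several concurrent extractions. The helper `extract_many` and its `max_concurrency` parameter are hypothetical names, not part of Kreuzberg. The extractor is passed in as an argument so the snippet runs without Kreuzberg installed, and `fake_extract` is a stand-in that only assumes the call shape `extract(path, config=...)`.

```python
import asyncio
from typing import Any, Awaitable, Callable, Sequence


async def extract_many(
    extract: Callable[..., Awaitable[Any]],
    paths: Sequence[str],
    config: Any,
    max_concurrency: int = 4,
) -> list:
    """Call extract(path, config=config) for every path, a few at a time."""
    # Cap the number of extractions in flight (hypothetical default of 4)
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(path: str) -> Any:
        async with semaphore:
            # Every call receives the same config object
            return await extract(path, config=config)

    # gather returns results in the same order as the input paths
    return await asyncio.gather(*(worker(p) for p in paths))


async def fake_extract(path: str, config: Any = None) -> dict:
    """Stand-in extractor for illustration only; does no real work."""
    await asyncio.sleep(0)
    return {"path": path, "config_id": id(config)}


shared_config = object()  # placeholder for a reusable configuration object
results = asyncio.run(
    extract_many(fake_extract, ["a.pdf", "b.pdf", "c.pdf"], shared_config)
)
```

With Kreuzberg installed, the same helper could presumably be called as `await extract_many(extract_file, paths, ExtractionConfig())`, since `extract_file` is awaited with a path and a `config` keyword in the Document Analysis example in this guide.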

Common Use Cases

Document Analysis:

from kreuzberg import extract_file, ExtractionConfig


async def analyze_document(file_path):
    """Extract a document and return its text with basic statistics."""
    result = await extract_file(file_path, config=ExtractionConfig())

    # Extracted text content
    text = result.content

    # Metadata fields may be missing, so fall back to defaults;
    # `or` also covers an "authors" list that is present but empty
    title = result.metadata.get("title", "Untitled")
    authors = result.metadata.get("authors") or ["Unknown"]

    return {
        "title": title,
        "author": authors[0],
        "content": text,
        "word_count": len(text.split()),
        "char_count": len(text),
    }
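
The metadata fallbacks can be checked without running an extraction. The sketch below uses a stand-in result object, so `FakeResult` and `summarize` are hypothetical names for illustration; only the `content` and `metadata` attributes are taken from the example above. It includes the case where `authors` is present but empty, which indexing the result of `.get("authors", default)` directly would not handle.

```python
from dataclasses import dataclass, field


@dataclass
class FakeResult:
    """Stand-in for an extraction result; not a Kreuzberg class."""
    content: str
    metadata: dict = field(default_factory=dict)


def summarize(result) -> dict:
    """Build the summary fields from any object with content and metadata."""
    text = result.content
    # `or` falls back when the key is missing or the list is empty
    authors = result.metadata.get("authors") or ["Unknown"]
    return {
        "title": result.metadata.get("title", "Untitled"),
        "author": authors[0],
        "word_count": len(text.split()),
        "char_count": len(text),
    }


with_meta = summarize(
    FakeResult("Quarterly revenue rose.", {"title": "Q3 Report", "authors": ["A. Lee"]})
)
without_meta = summarize(FakeResult("two words", {"authors": []}))
```

Keeping the summary logic in a plain function like this makes it easy to unit test separately from the extraction call itself.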