Installation
Kreuzberg is a modular document intelligence framework with a core package and optional components for specialized functionality.
System Dependencies
Pandoc
Pandoc is a required system dependency: Kreuzberg uses it for document conversion and text extraction across many document formats. Install Pandoc for your platform:
Ubuntu/Debian
macOS
Windows
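The per-platform commands are not reproduced here, so the following is only a minimal sketch of a preflight check. It reports whether Pandoc is on the PATH and, if it is not, prints (without running) a commonly used install command for the detected platform. The helper name pandoc_hint is hypothetical, and the apt-get, Homebrew, and Chocolatey commands are assumptions rather than commands taken from this page.

```shell
#!/bin/sh
# Hypothetical helper: report whether Pandoc is on PATH. If it is missing,
# print (do not run) a commonly used install command for this platform.
# The package-manager commands below are assumptions, not from this page.
pandoc_hint() {
  if command -v pandoc >/dev/null 2>&1; then
    echo "pandoc found: $(pandoc --version | head -n 1)"
    return 0
  fi
  case "$(uname -s)" in
    Linux*)  echo "pandoc missing; try: sudo apt-get install pandoc" ;;  # Debian/Ubuntu assumed
    Darwin*) echo "pandoc missing; try: brew install pandoc" ;;          # Homebrew assumed
    *)       echo "pandoc missing; try: choco install -y pandoc" ;;      # Chocolatey assumed
  esac
  return 1
}

# Run the check; "|| true" keeps a missing Pandoc from aborting a strict script.
pandoc_hint || true
```

Printing the command instead of executing it leaves the choice of package manager, and any need for elevated privileges, to the person running the check.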
Kreuzberg Core Package
Install the Kreuzberg core package with pip:
pip install kreuzberg
Optional Features
OCR
OCR is an optional feature for extracting text from images and non-searchable PDFs. Kreuzberg supports multiple OCR backends. To understand the differences between these backends, please read the OCR Backends documentation.
Tesseract OCR
Tesseract OCR is built into Kreuzberg and doesn't require additional Python packages. However, you must install Tesseract 5.0 or higher on your system:
Ubuntu/Debian
macOS
Windows
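Because the requirement above is a minimum version and not just presence, it can help to check the installed major version after installing. The sketch below assumes the first line of `tesseract --version` has the form "tesseract 5.x.y" (optionally with a leading "v"). The helper name tesseract_ok is hypothetical, and the package names in the hint line are assumptions. Some distribution repositories may still ship 4.x, in which case the check fails.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if Tesseract is on PATH and its major
# version is at least 5, the minimum stated above.
tesseract_ok() {
  command -v tesseract >/dev/null 2>&1 || { echo "tesseract: missing"; return 1; }
  # Older releases print the version banner on stderr, so merge both streams
  # and pull the major version out of the first line.
  major="$(tesseract --version 2>&1 | sed -n '1s/^tesseract v\{0,1\}\([0-9][0-9]*\).*/\1/p')"
  if [ -n "$major" ] && [ "$major" -ge 5 ]; then
    echo "tesseract: major version $major, OK"
  else
    echo "tesseract: version ${major:-unknown} is too old or unrecognized"
    return 1
  fi
}

# Package names in this hint are assumptions; check your package manager.
tesseract_ok || echo "hint: sudo apt-get install tesseract-ocr, or brew install tesseract"
```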
Language Support
Tesseract includes English language support by default. Kreuzberg Docker images come pre-configured with 12 common business languages: English, Spanish, French, German, Italian, Portuguese, Chinese (Simplified & Traditional), Japanese, Arabic, Russian, and Hindi.
For local installations requiring additional languages, you must install the appropriate language data files:
- Ubuntu/Debian: sudo apt-get install tesseract-ocr-deu (for German)
- macOS: brew install tesseract-lang
- Windows: See the Tesseract documentation
For more details on language installation and configuration, refer to the Tesseract documentation.
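To confirm that a language pack was picked up, the installed languages can be listed with `tesseract --list-langs`. The following sketch wraps that in a small check. The helper name has_tess_lang is hypothetical, and deu is used only because German is the example above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given language code (e.g. deu) appears
# as a whole line in the output of `tesseract --list-langs`.
has_tess_lang() {
  # Merge stderr in case the list is printed there; -x matches whole lines,
  # -F treats the code as a literal string.
  tesseract --list-langs 2>&1 | grep -qxF -- "$1"
}

if has_tess_lang deu; then
  echo "German (deu) data is installed"
else
  echo "German (deu) data not found"
fi
```

If Tesseract itself is not installed, the check simply reports the language as not found.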
Chunking
Chunking is an optional feature, useful for RAG applications among others. Kreuzberg uses the excellent semantic-text-splitter package for chunking. To add chunking support, install Kreuzberg with its chunking extra.
Language Detection
Language detection is an optional feature that automatically detects the language of extracted text. It uses the fast-langdetect package. To add language detection support, install Kreuzberg with its language detection extra.
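Optional features such as chunking and language detection are installed as pip extras, and several extras can be combined in one requirement. The sketch below builds such a requirement and prints the resulting command for review instead of running it. The extra names chunking and langdetect are assumptions; confirm them against the package metadata before installing.

```shell
#!/bin/sh
# Extras to enable. These names are assumptions; confirm them against the
# package metadata before installing.
extras="chunking,langdetect"

# Keep the requirement quoted: unquoted square brackets are glob characters
# in some shells (zsh, for example).
req="kreuzberg[${extras}]"

# Print the command for review instead of running it.
echo "pip install \"${req}\""
```

To perform the installation, run the printed command, or replace the echo line with a pip call that passes "$req" as its argument.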
Document Classification
For automatic document type detection (invoice, contract, receipt, etc.), install the document classification extra.
This feature uses Google Translate for multi-language support and requires explicit opt-in: set auto_detect_document_type=True in your configuration.
All Optional Dependencies
To install Kreuzberg with all optional dependencies, use the all extra group:
pip install "kreuzberg[all]"
This is equivalent to: