Format Support¶
Kreuzberg supports 118+ file extensions across 11 major categories, providing comprehensive document intelligence capabilities through native Rust extractors, Pandoc integration, and LibreOffice conversion.
Overview¶
Kreuzberg v4 uses a high-performance Rust core with three extraction methods:
- Native Rust Extractors: Fast, memory-efficient extractors for common formats
- Pandoc Integration: Support for 30+ academic and publishing formats
- LibreOffice Conversion: Legacy Microsoft Office format support (.doc, .ppt)
All formats support async/await and batch processing. Image formats and PDFs support optional OCR when configured.
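As a rough illustration of how the three methods divide the work, the sketch below routes a handful of MIME types from the support matrix that follows to an extraction method. The names `ExtractionMethod` and `pick_method` are hypothetical and the sets are a small illustrative subset; this is not Kreuzberg's internal routing code.

```python
from enum import Enum


class ExtractionMethod(Enum):
    NATIVE_RUST = "native-rust"
    PANDOC = "pandoc"
    LIBREOFFICE = "libreoffice"


# Illustrative subsets copied from the support matrix; NOT the full 118+ mapping.
LEGACY_OFFICE_TYPES = {
    "application/msword",
    "application/vnd.ms-powerpoint",
}
PANDOC_TYPES = {
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.oasis.opendocument.text",
    "application/epub+zip",
    "application/x-latex",
    "text/x-rst",
}
NATIVE_TYPES = {
    "application/pdf",
    "application/json",
    "application/zip",
    "image/png",
    "message/rfc822",
    "text/html",
    "text/plain",
}


def pick_method(mime_type: str) -> ExtractionMethod:
    """Return the extraction method this sketch associates with a MIME type."""
    if mime_type in LEGACY_OFFICE_TYPES:
        # Legacy formats are converted by LibreOffice before extraction.
        return ExtractionMethod.LIBREOFFICE
    if mime_type in PANDOC_TYPES:
        return ExtractionMethod.PANDOC
    if mime_type in NATIVE_TYPES:
        return ExtractionMethod.NATIVE_RUST
    raise ValueError(f"unsupported MIME type in this sketch: {mime_type}")
```

For example, `pick_method("application/msword")` returns `ExtractionMethod.LIBREOFFICE`, while an unlisted type such as `video/mp4` raises `ValueError`.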
Format Support Matrix¶
Office Documents¶
| Format | Extensions | MIME Type | Extraction Method | OCR Support | Special Features |
|---|---|---|---|---|---|
| PDF | .pdf | application/pdf | Native Rust (pdfium-render) | Yes | Metadata extraction, image extraction, text layer detection |
| Excel | .xlsx, .xlsm, .xlsb, .xls, .xlam, .xla, .ods | Various Excel MIME types | Native Rust (calamine) | No | Multi-sheet support, formula preservation |
| PowerPoint | .pptx, .pptm, .ppsx | application/vnd.openxmlformats-officedocument.presentationml.presentation | Native Rust (roxmltree) | Yes (for embedded images) | Slide extraction, image OCR, table detection |
| Word (Modern) | .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document | Pandoc | No | Preserves formatting, extracts metadata |
| Word (Legacy) | .doc | application/msword | LibreOffice + Pandoc | No | Converts to DOCX then extracts |
| PowerPoint (Legacy) | .ppt | application/vnd.ms-powerpoint | LibreOffice + Native Rust | No | Converts to PPTX then extracts |
| OpenDocument Text | .odt | application/vnd.oasis.opendocument.text | Pandoc | No | Full OpenDocument support |
| OpenDocument Spreadsheet | .ods | application/vnd.oasis.opendocument.spreadsheet | Native Rust (calamine) | No | Multi-sheet support |
Text & Markup¶
| Format | Extensions | MIME Type | Extraction Method | OCR Support | Special Features |
|---|---|---|---|---|---|
| Plain Text | .txt | text/plain | Native Rust (streaming) | No | Line/word/character counting, memory-efficient streaming |
| Markdown | .md, .markdown | text/markdown, text/x-markdown | Native Rust (streaming) | No | Header extraction, link detection, code block detection |
| HTML | .html, .htm | text/html, application/xhtml+xml | Native Rust (html-to-markdown-rs) | No | Converts to Markdown, metadata extraction |
| XML | .xml | application/xml, text/xml | Native Rust (quick-xml streaming) | No | Element counting, unique element tracking |
| SVG | .svg | image/svg+xml | Native Rust (XML parser) | No | Treated as XML document |
| reStructuredText | .rst | text/x-rst | Pandoc | No | Full reST syntax support |
| Org Mode | .org | text/x-org | Pandoc | No | Emacs Org mode support |
| Rich Text Format | .rtf | application/rtf, text/rtf | Pandoc | No | RTF 1.x support |
Structured Data¶
| Format | Extensions | MIME Type | Extraction Method | OCR Support | Special Features |
|---|---|---|---|---|---|
| JSON | .json | application/json, text/json | Native Rust (serde_json) | No | Field counting, nested structure extraction |
| YAML | .yaml, .yml | application/x-yaml, text/yaml, text/x-yaml | Native Rust (serde_yaml) | No | Multi-document support, field counting |
| TOML | .toml | application/toml, text/toml | Native Rust (toml crate) | No | Configuration file support |
| CSV | .csv | text/csv | Native Rust (via Pandoc) | No | Tabular data extraction |
| TSV | .tsv | text/tab-separated-values | Native Rust (via Pandoc) | No | Tab-separated data extraction |
Email¶
| Format | Extensions | MIME Type | Extraction Method | OCR Support | Special Features |
|---|---|---|---|---|---|
| EML | .eml | message/rfc822 | Native Rust (mail-parser) | No | Header extraction, attachment listing, body text |
| MSG | .msg | application/vnd.ms-outlook | Native Rust (mail-parser) | No | Outlook message support, metadata extraction |
Images¶
All image formats support OCR when the ocr parameter is set in ExtractionConfig.
| Format | Extensions | MIME Type | Extraction Method | OCR Support | Special Features |
|---|---|---|---|---|---|
| PNG | .png | image/png | Native Rust (image-rs) | Yes | EXIF metadata extraction |
| JPEG | .jpg, .jpeg | image/jpeg, image/jpg | Native Rust (image-rs) | Yes | EXIF metadata extraction |
| WebP | .webp | image/webp | Native Rust (image-rs) | Yes | Modern format support |
| BMP | .bmp | image/bmp, image/x-bmp, image/x-ms-bmp | Native Rust (image-rs) | Yes | Uncompressed format |
| TIFF | .tiff, .tif | image/tiff, image/x-tiff | Native Rust (image-rs) | Yes | Multi-page support |
| GIF | .gif | image/gif | Native Rust (image-rs) | Yes | Animation frame extraction |
| JPEG 2000 | .jp2, .jpx, .jpm, .mj2 | image/jp2, image/jpx, image/jpm, image/mj2 | Native Rust (image-rs) | Yes | Advanced JPEG format |
| PNM Family | .pnm, .pbm, .pgm, .ppm | image/x-portable-anymap, etc. | Native Rust (image-rs) | Yes | NetPBM formats |
Archives¶
| Format | Extensions | MIME Type | Extraction Method | OCR Support | Special Features |
|---|---|---|---|---|---|
| ZIP | .zip | application/zip, application/x-zip-compressed | Native Rust (zip crate) | No | File listing, text content extraction |
| TAR | .tar, .tgz | application/x-tar, application/tar, application/x-gtar, application/x-ustar | Native Rust (tar crate) | No | Unix archive support, compression detection |
| 7-Zip | .7z | application/x-7z-compressed | Native Rust (sevenz-rust) | No | High compression format support |
| Gzip | .gz | application/gzip | Native Rust | No | Gzip compression support |
Academic & Publishing (via Pandoc)¶
| Format | Extensions | MIME Type | Extraction Method | OCR Support | Special Features |
|---|---|---|---|---|---|
| LaTeX | .tex, .latex | application/x-latex, text/x-tex | Pandoc | No | Full LaTeX document support |
| EPUB | .epub | application/epub+zip | Pandoc | No | E-book format, metadata extraction |
| BibTeX | .bib | application/x-bibtex, application/x-biblatex | Pandoc | No | Bibliography database support |
| Typst | .typst | application/x-typst | Pandoc | No | Modern typesetting format |
| Jupyter Notebook | .ipynb | application/x-ipynb+json | Pandoc | No | Code cells, markdown cells, output extraction |
| FictionBook | - | application/x-fictionbook+xml | Pandoc | No | XML-based e-book format |
| DocBook | - | application/docbook+xml | Pandoc | No | Technical documentation format |
| JATS | - | application/x-jats+xml | Pandoc | No | Journal article XML format |
| OPML | - | application/x-opml+xml | Pandoc | No | Outline format |
| RIS | - | application/x-research-info-systems | Pandoc | No | Citation format |
| EndNote XML | - | application/x-endnote+xml | Pandoc | No | Reference manager format |
| CSL JSON | - | application/csl+json | Pandoc | No | Citation Style Language JSON |
Markdown Variants (via Pandoc)¶
| Format | MIME Type | Extraction Method | Special Features |
|---|---|---|---|
| CommonMark | text/x-commonmark | Pandoc | Standard Markdown spec |
| GitHub Flavored Markdown | text/x-gfm | Pandoc | GFM extensions (tables, strikethrough, etc.) |
| MultiMarkdown | text/x-multimarkdown | Pandoc | MMD extensions |
| Markdown Extra | text/x-markdown-extra | Pandoc | PHP Markdown Extra extensions |
Other Formats¶
| Format | MIME Type | Extraction Method | Special Features |
|---|---|---|---|
| Man Pages | text/x-mdoc | Pandoc | Unix manual page format |
| Troff | text/troff | Pandoc | Unix document format |
| POD | text/x-pod | Pandoc | Perl documentation format |
| DokuWiki | text/x-dokuwiki | Pandoc | Wiki markup format |
Architecture Diagram¶
graph TD
A[File Input] --> B{MIME Detection}
B --> C{Extraction Method}
C -->|Native Format| D[Rust Core Extractors]
C -->|Pandoc Format| E[Pandoc Subprocess]
C -->|Legacy Office| F[LibreOffice Conversion]
D --> G[PDF Extractor]
D --> H[Excel Extractor]
D --> I[Image Extractor]
D --> J[XML/Text/HTML Extractors]
D --> K[Email Extractor]
D --> L[Archive Extractor]
E --> M[DOCX/ODT/EPUB/LaTeX]
F --> N[Convert DOC→DOCX]
F --> O[Convert PPT→PPTX]
N --> E
O --> D
G --> P{OCR Needed?}
I --> P
P -->|Yes| Q[Tesseract OCR]
P -->|No| R[Text Output]
Q --> R
H --> R
J --> R
K --> R
L --> R
M --> R
R --> S[Post-Processing Pipeline]
S --> T[Final Result]
Feature Flags¶
Kreuzberg uses Cargo feature flags to enable optional format support:
| Feature Flag | Formats Enabled | Default |
|---|---|---|
| pdf | PDF documents | No |
| excel | Excel spreadsheets (all variants) | No |
| office | PowerPoint, Pandoc formats | No |
| ocr | OCR for images and PDFs | No |
| email | EML, MSG email formats | No |
| html | HTML to Markdown conversion | No |
| xml | XML document parsing | No |
| archives | ZIP, TAR, 7z archive support | No |
Note: No features are enabled by default (default = []). You must explicitly enable the features you need.
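To enable specific features, list them in the features array of the kreuzberg dependency in Cargo.toml. To enable every feature at once, pass Cargo's --all-features flag when building. Alternatively, use one of the convenience bundles: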
[dependencies]
# Choose ONE of the kreuzberg entries below; TOML does not allow duplicate keys.
# All format extraction features (no server)
kreuzberg = { version = "4.0", features = ["full"] }
# Server features (API, MCP) with common formats
kreuzberg = { version = "4.0", features = ["server"] }
# CLI features with common formats
kreuzberg = { version = "4.0", features = ["cli"] }
System Dependencies¶
Some formats require external system tools:
Tesseract OCR (Optional)¶
Required for OCR on images and PDFs:
# macOS
brew install tesseract
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
# RHEL/CentOS/Fedora
sudo dnf install tesseract
# Windows (Scoop)
scoop install tesseract
Pandoc (Optional)¶
Required for academic and publishing formats (DOCX, EPUB, LaTeX, etc.):
# macOS
brew install pandoc
# Ubuntu/Debian
sudo apt-get install pandoc
# RHEL/CentOS/Fedora
sudo dnf install pandoc
# Windows (Scoop)
scoop install pandoc
Minimum version: Pandoc 2.x or later
LibreOffice (Optional)¶
Required for legacy Microsoft Office formats (.doc, .ppt):
# macOS
brew install libreoffice
# Ubuntu/Debian
sudo apt-get install libreoffice
# RHEL/CentOS/Fedora
sudo dnf install libreoffice
# Windows
# Download from https://www.libreoffice.org/download/
Docker Note: All system dependencies are pre-installed in official Kreuzberg Docker images.
Format Detection¶
Kreuzberg automatically detects file formats using:
- File Extension Mapping: 118+ extensions mapped to MIME types
- mime_guess Crate: Fallback for unknown extensions
- Manual Override: Explicit MIME type can be provided
Example with manual override:
Python:
from kreuzberg import extract_file

# Auto-detect from extension
result = extract_file("document.pdf")

# Manual MIME type override
result = extract_file("document.dat", mime_type="application/pdf")

TypeScript:
import { extractFile } from 'kreuzberg';

// Auto-detect from extension
const result = await extractFile('document.pdf');

// Manual MIME type override
const result2 = await extractFile('document.dat', { mimeType: 'application/pdf' });

Rust:
use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();

    // Auto-detect from extension
    let result = extract_file("document.pdf", None, &config).await?;

    // Manual MIME type override
    let result = extract_file("document.dat", Some("application/pdf"), &config).await?;

    Ok(())
}
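The detection steps listed above can be sketched in a few lines. This is a minimal illustration, not Kreuzberg's implementation: it assumes that an explicit override takes precedence, uses a four-entry stand-in for the real extension table, and substitutes Python's standard `mimetypes` module for the Rust mime_guess crate. The function name `detect_mime_type` is hypothetical.

```python
import mimetypes
from pathlib import Path
from typing import Optional

# Tiny illustrative subset of the extension table (the real one has 118+ entries).
EXTENSION_TO_MIME = {
    ".pdf": "application/pdf",
    ".md": "text/markdown",
    ".eml": "message/rfc822",
    ".json": "application/json",
}


def detect_mime_type(path: str, mime_type: Optional[str] = None) -> str:
    """Resolve a MIME type: explicit override, then extension table, then fallback."""
    # 1. Manual override (assumed to win over any detection).
    if mime_type is not None:
        return mime_type

    # 2. Extension mapping, case-insensitive.
    suffix = Path(path).suffix.lower()
    if suffix in EXTENSION_TO_MIME:
        return EXTENSION_TO_MIME[suffix]

    # 3. Fallback guess; mimetypes stands in for mime_guess here.
    guessed, _encoding = mimetypes.guess_type(path)
    if guessed is None:
        raise ValueError(f"cannot determine MIME type for {path!r}")
    return guessed
```

Calling `detect_mime_type("document.dat", mime_type="application/pdf")` skips detection entirely, which mirrors the override examples above.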
OCR Support¶
OCR is available for:
- All image formats (PNG, JPEG, WebP, BMP, TIFF, GIF, etc.)
- PDF documents (with automatic fallback for scanned PDFs)
- Embedded images in PowerPoint presentations
Configuration¶
from kreuzberg import extract_file, ExtractionConfig, OcrConfig, TesseractConfig

config = ExtractionConfig(
    ocr=OcrConfig(
        tesseract_config=TesseractConfig(
            lang="eng+deu",  # Multiple languages
            psm=3,           # Page segmentation mode
            oem=1,           # OCR Engine mode
        )
    ),
    force_ocr=False,  # Only use OCR when native text is insufficient
)

result = extract_file("scanned_document.pdf", config=config)
Automatic OCR Decision¶
For PDFs, Kreuzberg automatically decides whether OCR is needed by analyzing native text:
- No OCR: Document has substantial, meaningful text (>64 non-whitespace chars, >32 chars/page average)
- OCR Fallback: Document appears scanned (mostly punctuation, very low alphanumeric ratio)
Override with force_ocr=True to always use OCR regardless of native text quality.
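The decision rule above can be approximated as follows. This is a sketch under stated assumptions rather than Kreuzberg's actual heuristic: the 64-character and 32-characters-per-page thresholds come from the text, while the alphanumeric-ratio cutoff (`MIN_ALNUM_RATIO`) and the function name `needs_ocr` are invented for illustration.

```python
MIN_NON_WHITESPACE_CHARS = 64  # from the documentation above
MIN_CHARS_PER_PAGE = 32        # from the documentation above
MIN_ALNUM_RATIO = 0.3          # ASSUMPTION: the real cutoff is not documented here


def needs_ocr(native_text: str, page_count: int, force_ocr: bool = False) -> bool:
    """Return True when the native text layer looks too thin to trust."""
    if force_ocr:
        return True

    non_whitespace = [ch for ch in native_text if not ch.isspace()]
    total = len(non_whitespace)

    # Too little text overall: treat the document as scanned.
    if total <= MIN_NON_WHITESPACE_CHARS:
        return True

    # Too little text per page on average.
    if total / max(page_count, 1) <= MIN_CHARS_PER_PAGE:
        return True

    # Mostly punctuation or symbols: the text layer is probably noise.
    alnum = sum(1 for ch in non_whitespace if ch.isalnum())
    return alnum / total < MIN_ALNUM_RATIO
```

A one-page document with a few hundred ordinary words returns `False`, whereas an empty text layer or a page consisting mostly of punctuation returns `True`.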
Performance Characteristics¶
Native Rust Extractors¶
- PDF: 10-50x faster than Python libraries
- Excel: Streaming parser, handles multi-GB files
- XML: Streaming parser, memory-efficient for large documents
- Text/Markdown: Streaming parser with lazy regex compilation
- Archives: Efficient extraction without full decompression
Pandoc Extractors¶
- Subprocess overhead (~50-200ms per file)
- Good for batch processing with concurrent execution
- Memory-efficient for large documents
LibreOffice Extractors¶
- Higher overhead (~500-2000ms per file)
- Only used for legacy formats (.doc, .ppt)
- Automatic conversion to modern formats
Batch Processing¶
All formats support concurrent batch processing:
from kreuzberg import batch_extract_file, ExtractionConfig
paths = ["file1.pdf", "file2.docx", "file3.xlsx"]
config = ExtractionConfig(max_concurrent_extractions=8)
results = batch_extract_file(paths, config=config)
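To make the effect of a concurrency cap concrete, the sketch below runs a per-file function over a list of paths with at most `max_concurrent` calls in flight. It is a generic Python illustration using a thread pool, not how Kreuzberg's Rust core schedules work; `batch_extract_sketch` and `fake_extract` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def batch_extract_sketch(
    paths: Sequence[str],
    extract_one: Callable[[str], T],
    max_concurrent: int = 8,
) -> List[T]:
    """Apply extract_one to every path with a bounded number of workers."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # pool.map yields results in input order, regardless of completion order.
        return list(pool.map(extract_one, paths))


def fake_extract(path: str) -> str:
    # Hypothetical stand-in for a real single-file extraction call.
    return f"text of {path}"
```

Because results come back in input order, `batch_extract_sketch(paths, fake_extract)` can be zipped with `paths` directly.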
Format Limitations¶
Known Limitations¶
- Password-Protected PDFs: Requires the crypto extra (pip install kreuzberg[crypto])
- Legacy Excel (.xls): Formula evaluation not supported (values only)
- Encrypted Office Documents: Password protection not supported
- Multi-page TIFF: OCR processes first page only (configurable)
- Animated GIF: Extracts first frame only
Unsupported Formats¶
- Video formats (MP4, AVI, MOV, etc.)
- Audio formats (MP3, WAV, FLAC, etc.)
- CAD formats (DWG, DXF, etc.)
- Database files (MDB, ACCDB, etc.)
- Compressed Office formats without proper headers
Adding New Formats¶
Kreuzberg's plugin system allows adding custom format extractors:
Python Plugin¶
from kreuzberg import DocumentExtractor, ExtractionResult, Metadata

class CustomExtractor(DocumentExtractor):
    def name(self) -> str:
        return "custom-format-extractor"

    def supported_mime_types(self) -> list[str]:
        return ["application/x-custom"]

    def extract_bytes(self, content: bytes, mime_type: str, config) -> ExtractionResult:
        # Your extraction logic here.
        # parse_custom_format is a placeholder for your own parser; it is not part of Kreuzberg.
        text = parse_custom_format(content)
        return ExtractionResult(
            content=text,
            mime_type=mime_type,
            metadata=Metadata()
        )

# Register plugin
from kreuzberg import get_document_extractor_registry

registry = get_document_extractor_registry()
registry.register(CustomExtractor())
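The link between `supported_mime_types()` and registration can be pictured with a toy registry keyed by MIME type. This is only a conceptual sketch: `SketchRegistry` and `ToyExtractor` are hypothetical, and Kreuzberg's real registry may resolve extractors differently (for example by priority).

```python
from typing import Dict, List


class ToyExtractor:
    """Hypothetical extractor exposing the same two methods as the plugin above."""

    def name(self) -> str:
        return "custom-format-extractor"

    def supported_mime_types(self) -> List[str]:
        return ["application/x-custom"]


class SketchRegistry:
    """Toy registry mapping each MIME type to the extractor that handles it."""

    def __init__(self) -> None:
        self._by_mime: Dict[str, ToyExtractor] = {}

    def register(self, extractor: ToyExtractor) -> None:
        # Index the extractor under every MIME type it declares.
        for mime_type in extractor.supported_mime_types():
            self._by_mime[mime_type] = extractor

    def get(self, mime_type: str) -> ToyExtractor:
        try:
            return self._by_mime[mime_type]
        except KeyError:
            raise LookupError(f"no extractor registered for {mime_type}") from None


sketch_registry = SketchRegistry()
sketch_registry.register(ToyExtractor())
```

After registration, `sketch_registry.get("application/x-custom")` returns the toy extractor, and any other MIME type raises `LookupError`.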
Rust Plugin¶
use async_trait::async_trait;
use kreuzberg::plugins::registry::get_document_extractor_registry;
use kreuzberg::plugins::{DocumentExtractor, Plugin};
use kreuzberg::types::ExtractionResult;
use kreuzberg::ExtractionConfig;
use std::sync::Arc;

pub struct CustomExtractor;

impl Plugin for CustomExtractor {
    fn name(&self) -> &str {
        "custom-format-extractor"
    }

    fn version(&self) -> String {
        "1.0.0".to_string()
    }
}

#[async_trait]
impl DocumentExtractor for CustomExtractor {
    async fn extract_bytes(
        &self,
        content: &[u8],
        mime_type: &str,
        config: &ExtractionConfig,
    ) -> kreuzberg::Result<ExtractionResult> {
        // Your extraction logic here.
        // parse_custom_format is a placeholder for your own parser; it is not part of Kreuzberg.
        let text = parse_custom_format(content)?;
        Ok(ExtractionResult {
            content: text,
            mime_type: mime_type.to_string(),
            ..Default::default()
        })
    }

    fn supported_mime_types(&self) -> &[&str] {
        &["application/x-custom"]
    }
}

// Register plugin. The call is wrapped in a function (name is illustrative)
// so that the `?` operator has a Result to propagate into.
fn register_custom_extractor() -> kreuzberg::Result<()> {
    let registry = get_document_extractor_registry();
    registry.write().unwrap().register(Arc::new(CustomExtractor))?;
    Ok(())
}
See Also¶
- Configuration Reference - Detailed configuration options
- Extraction Guide - Extraction examples
- OCR Guide - OCR configuration and usage
- Plugin System - Custom extractor development