Skip to content

Format Support

Kreuzberg supports 118+ file extensions across 11 major categories, providing comprehensive document intelligence capabilities through native Rust extractors, Pandoc integration, and LibreOffice conversion.

Overview

Kreuzberg v4 uses a high-performance Rust core with three extraction methods:

  • Native Rust Extractors: Fast, memory-efficient extractors for common formats
  • Pandoc Integration: Support for 30+ academic and publishing formats
  • LibreOffice Conversion: Legacy Microsoft Office format support (.doc, .ppt)

All formats support async/await and batch processing. Image formats and PDFs support optional OCR when configured.

Format Support Matrix

Office Documents

Format Extensions MIME Type Extraction Method OCR Support Special Features
PDF .pdf application/pdf Native Rust (pdfium-render) Yes Metadata extraction, image extraction, text layer detection
Excel .xlsx, .xlsm, .xlsb, .xls, .xlam, .xla, .ods Various Excel MIME types Native Rust (calamine) No Multi-sheet support, formula preservation
PowerPoint .pptx, .pptm, .ppsx application/vnd.openxmlformats-officedocument.presentationml.presentation Native Rust (roxmltree) Yes (for embedded images) Slide extraction, image OCR, table detection
Word (Modern) .docx application/vnd.openxmlformats-officedocument.wordprocessingml.document Pandoc No Preserves formatting, extracts metadata
Word (Legacy) .doc application/msword LibreOffice + Pandoc No Converts to DOCX then extracts
PowerPoint (Legacy) .ppt application/vnd.ms-powerpoint LibreOffice + Pandoc No Converts to PPTX then extracts
OpenDocument Text .odt application/vnd.oasis.opendocument.text Pandoc No Full OpenDocument support
OpenDocument Spreadsheet .ods application/vnd.oasis.opendocument.spreadsheet Native Rust (calamine) No Multi-sheet support

Text & Markup

Format Extensions MIME Type Extraction Method OCR Support Special Features
Plain Text .txt text/plain Native Rust (streaming) No Line/word/character counting, memory-efficient streaming
Markdown .md, .markdown text/markdown, text/x-markdown Native Rust (streaming) No Header extraction, link detection, code block detection
HTML .html, .htm text/html, application/xhtml+xml Native Rust (html-to-markdown-rs) No Converts to Markdown, metadata extraction
XML .xml application/xml, text/xml Native Rust (quick-xml streaming) No Element counting, unique element tracking
SVG .svg image/svg+xml Native Rust (XML parser) No Treated as XML document
reStructuredText .rst text/x-rst Pandoc No Full reST syntax support
Org Mode .org text/x-org Pandoc No Emacs Org mode support
Rich Text Format .rtf application/rtf, text/rtf Pandoc No RTF 1.x support

Structured Data

Format Extensions MIME Type Extraction Method OCR Support Special Features
JSON .json application/json, text/json Native Rust (serde_json) No Field counting, nested structure extraction
YAML .yaml, .yml application/x-yaml, text/yaml, text/x-yaml Native Rust (serde_yaml) No Multi-document support, field counting
TOML .toml application/toml, text/toml Native Rust (toml crate) No Configuration file support
CSV .csv text/csv Native Rust (via Pandoc) No Tabular data extraction
TSV .tsv text/tab-separated-values Native Rust (via Pandoc) No Tab-separated data extraction

Email

Format Extensions MIME Type Extraction Method OCR Support Special Features
EML .eml message/rfc822 Native Rust (mail-parser) No Header extraction, attachment listing, body text
MSG .msg application/vnd.ms-outlook Native Rust (mail-parser) No Outlook message support, metadata extraction

Images

All image formats support OCR when configured with ocr parameter in ExtractionConfig.

Format Extensions MIME Type Extraction Method OCR Support Special Features
PNG .png image/png Native Rust (image-rs) Yes EXIF metadata extraction
JPEG .jpg, .jpeg image/jpeg, image/jpg Native Rust (image-rs) Yes EXIF metadata extraction
WebP .webp image/webp Native Rust (image-rs) Yes Modern format support
BMP .bmp image/bmp, image/x-bmp, image/x-ms-bmp Native Rust (image-rs) Yes Uncompressed format
TIFF .tiff, .tif image/tiff, image/x-tiff Native Rust (image-rs) Yes Multi-page support
GIF .gif image/gif Native Rust (image-rs) Yes Animation frame extraction
JPEG 2000 .jp2, .jpx, .jpm, .mj2 image/jp2, image/jpx, image/jpm, image/mj2 Native Rust (image-rs) Yes Advanced JPEG format
PNM Family .pnm, .pbm, .pgm, .ppm image/x-portable-anymap, etc. Native Rust (image-rs) Yes NetPBM formats

Archives

Format Extensions MIME Type Extraction Method OCR Support Special Features
ZIP .zip application/zip, application/x-zip-compressed Native Rust (zip crate) No File listing, text content extraction
TAR .tar, .tgz application/x-tar, application/tar, application/x-gtar, application/x-ustar Native Rust (tar crate) No Unix archive support, compression detection
7-Zip .7z application/x-7z-compressed Native Rust (sevenz-rust) No High compression format support
Gzip .gz application/gzip Native Rust No Gzip compression support

Academic & Publishing (via Pandoc)

Format Extensions MIME Type Extraction Method OCR Support Special Features
LaTeX .tex, .latex application/x-latex, text/x-tex Pandoc No Full LaTeX document support
EPUB .epub application/epub+zip Pandoc No E-book format, metadata extraction
BibTeX .bib application/x-bibtex, application/x-biblatex Pandoc No Bibliography database support
Typst .typst application/x-typst Pandoc No Modern typesetting format
Jupyter Notebook .ipynb application/x-ipynb+json Pandoc No Code cells, markdown cells, output extraction
FictionBook - application/x-fictionbook+xml Pandoc No XML-based e-book format
DocBook - application/docbook+xml Pandoc No Technical documentation format
JATS - application/x-jats+xml Pandoc No Journal article XML format
OPML - application/x-opml+xml Pandoc No Outline format
RIS - application/x-research-info-systems Pandoc No Citation format
EndNote XML - application/x-endnote+xml Pandoc No Reference manager format
CSL JSON - application/csl+json Pandoc No Citation Style Language JSON

Markdown Variants (via Pandoc)

Format MIME Type Extraction Method Special Features
CommonMark text/x-commonmark Pandoc Standard Markdown spec
GitHub Flavored Markdown text/x-gfm Pandoc GFM extensions (tables, strikethrough, etc.)
MultiMarkdown text/x-multimarkdown Pandoc MMD extensions
Markdown Extra text/x-markdown-extra Pandoc PHP Markdown Extra extensions

Other Formats

Format MIME Type Extraction Method Special Features
Man Pages text/x-mdoc Pandoc Unix manual page format
Troff text/troff Pandoc Unix document format
POD text/x-pod Pandoc Perl documentation format
DokuWiki text/x-dokuwiki Pandoc Wiki markup format

Architecture Diagram

graph TD
    A[File Input] --> B{MIME Detection}
    B --> C{Extraction Method}

    C -->|Native Format| D[Rust Core Extractors]
    C -->|Pandoc Format| E[Pandoc Subprocess]
    C -->|Legacy Office| F[LibreOffice Conversion]

    D --> G[PDF Extractor]
    D --> H[Excel Extractor]
    D --> I[Image Extractor]
    D --> J[XML/Text/HTML Extractors]
    D --> K[Email Extractor]
    D --> L[Archive Extractor]

    E --> M[DOCX/ODT/EPUB/LaTeX]

    F --> N[Convert DOC→DOCX]
    F --> O[Convert PPT→PPTX]
    N --> E
    O --> D

    G --> P{OCR Needed?}
    I --> P
    P -->|Yes| Q[Tesseract OCR]
    P -->|No| R[Text Output]
    Q --> R

    H --> R
    J --> R
    K --> R
    L --> R
    M --> R

    R --> S[Post-Processing Pipeline]
    S --> T[Final Result]

Feature Flags

Kreuzberg uses Cargo feature flags to enable optional format support:

Feature Flag Formats Enabled Default
pdf PDF documents No
excel Excel spreadsheets (all variants) No
office PowerPoint, Pandoc formats No
ocr OCR for images and PDFs No
email EML, MSG email formats No
html HTML to Markdown conversion No
xml XML document parsing No
archives ZIP, TAR, 7z archive support No

Note: No features are enabled by default (default = []). You must explicitly enable the features you need.

To enable specific features:

[dependencies]
kreuzberg = { version = "4.0", features = ["pdf", "excel"] }

To enable all features with --all-features:

cargo build --all-features

Or use the convenience bundles:

[dependencies]
# All format extraction features (no server)
kreuzberg = { version = "4.0", features = ["full"] }

# Server features (API, MCP) with common formats
kreuzberg = { version = "4.0", features = ["server"] }

# CLI features with common formats
kreuzberg = { version = "4.0", features = ["cli"] }

System Dependencies

Some formats require external system tools:

Tesseract OCR (Optional)

Required for OCR on images and PDFs:

# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# RHEL/CentOS/Fedora
sudo dnf install tesseract

# Windows (Scoop)
scoop install tesseract

Pandoc (Optional)

Required for academic and publishing formats (DOCX, EPUB, LaTeX, etc.):

# macOS
brew install pandoc

# Ubuntu/Debian
sudo apt-get install pandoc

# RHEL/CentOS/Fedora
sudo dnf install pandoc

# Windows (Scoop)
scoop install pandoc

Minimum version: Pandoc 2.x or later

LibreOffice (Optional)

Required for legacy Microsoft Office formats (.doc, .ppt):

# macOS
brew install libreoffice

# Ubuntu/Debian
sudo apt-get install libreoffice

# RHEL/CentOS/Fedora
sudo dnf install libreoffice

# Windows
# Download from https://www.libreoffice.org/download/

Docker Note: All system dependencies are pre-installed in official Kreuzberg Docker images.

Format Detection

Kreuzberg automatically detects file formats using:

  1. File Extension Mapping: 118+ extensions mapped to MIME types
  2. mime_guess Crate: Fallback for unknown extensions
  3. Manual Override: Explicit MIME type can be provided

Example with manual override:

from kreuzberg import extract_file

# Auto-detect from extension
result = extract_file("document.pdf")

# Manual MIME type override
result = extract_file("document.dat", mime_type="application/pdf")
import { extractFile } from 'kreuzberg';

// Auto-detect from extension
const result = await extractFile('document.pdf');

// Manual MIME type override
const result2 = await extractFile('document.dat', { mimeType: 'application/pdf' });
use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();

    // Auto-detect from extension
    let result = extract_file("document.pdf", None, &config).await?;

    // Manual MIME type override
    let result = extract_file("document.dat", Some("application/pdf"), &config).await?;

    Ok(())
}

OCR Support

OCR is available for:

  • All image formats (PNG, JPEG, WebP, BMP, TIFF, GIF, etc.)
  • PDF documents (with automatic fallback for scanned PDFs)
  • Embedded images in PowerPoint presentations

Configuration

from kreuzberg import extract_file, ExtractionConfig, OcrConfig, TesseractConfig

config = ExtractionConfig(
    ocr=OcrConfig(
        tesseract_config=TesseractConfig(
            lang="eng+deu",  # Multiple languages
            psm=3,           # Page segmentation mode
            oem=1            # OCR Engine mode
        )
    ),
    force_ocr=False  # Only use OCR when native text is insufficient
)

result = extract_file("scanned_document.pdf", config=config)

Automatic OCR Decision

For PDFs, Kreuzberg automatically decides whether OCR is needed by analyzing native text:

  • No OCR: Document has substantial, meaningful text (>64 non-whitespace chars, >32 chars/page average)
  • OCR Fallback: Document appears scanned (mostly punctuation, very low alphanumeric ratio)

Override with force_ocr=True to always use OCR regardless of native text quality.

Performance Characteristics

Native Rust Extractors

  • PDF: 10-50x faster than Python libraries
  • Excel: Streaming parser, handles multi-GB files
  • XML: Streaming parser, memory-efficient for large documents
  • Text/Markdown: Streaming parser with lazy regex compilation
  • Archives: Efficient extraction without full decompression

Pandoc Extractors

  • Subprocess overhead (~50-200ms per file)
  • Good for batch processing with concurrent execution
  • Memory-efficient for large documents

LibreOffice Extractors

  • Higher overhead (~500-2000ms per file)
  • Only used for legacy formats (.doc, .ppt)
  • Automatic conversion to modern formats

Batch Processing

All formats support concurrent batch processing:

from kreuzberg import batch_extract_file, ExtractionConfig

paths = ["file1.pdf", "file2.docx", "file3.xlsx"]
config = ExtractionConfig(max_concurrent_extractions=8)

results = batch_extract_file(paths, config=config)

Format Limitations

Known Limitations

  • Password-Protected PDFs: Requires crypto extra (pip install kreuzberg[crypto])
  • Legacy Excel (.xls): Formula evaluation not supported (values only)
  • Encrypted Office Documents: Password protection not supported
  • Multi-page TIFF: OCR processes first page only (configurable)
  • Animated GIF: Extracts first frame only

Unsupported Formats

  • Video formats (MP4, AVI, MOV, etc.)
  • Audio formats (MP3, WAV, FLAC, etc.)
  • CAD formats (DWG, DXF, etc.)
  • Database files (MDB, ACCDB, etc.)
  • Compressed Office formats without proper headers

Adding New Formats

Kreuzberg's plugin system allows adding custom format extractors:

Python Plugin

from kreuzberg import DocumentExtractor, ExtractionResult, Metadata

class CustomExtractor(DocumentExtractor):
    def name(self) -> str:
        return "custom-format-extractor"

    def supported_mime_types(self) -> list[str]:
        return ["application/x-custom"]

    def extract_bytes(self, content: bytes, mime_type: str, config) -> ExtractionResult:
        # Your extraction logic here
        text = parse_custom_format(content)
        return ExtractionResult(
            content=text,
            mime_type=mime_type,
            metadata=Metadata()
        )

# Register plugin
from kreuzberg import get_document_extractor_registry
registry = get_document_extractor_registry()
registry.register(CustomExtractor())

Rust Plugin

use kreuzberg::plugins::{DocumentExtractor, Plugin};
use kreuzberg::types::ExtractionResult;
use async_trait::async_trait;

pub struct CustomExtractor;

impl Plugin for CustomExtractor {
    fn name(&self) -> &str {
        "custom-format-extractor"
    }

    fn version(&self) -> String {
        "1.0.0".to_string()
    }
}

#[async_trait]
impl DocumentExtractor for CustomExtractor {
    async fn extract_bytes(
        &self,
        content: &[u8],
        mime_type: &str,
        config: &ExtractionConfig,
    ) -> kreuzberg::Result<ExtractionResult> {
        // Your extraction logic here
        let text = parse_custom_format(content)?;
        Ok(ExtractionResult {
            content: text,
            mime_type: mime_type.to_string(),
            ..Default::default()
        })
    }

    fn supported_mime_types(&self) -> &[&str] {
        &["application/x-custom"]
    }
}

// Register plugin
use kreuzberg::plugins::registry::get_document_extractor_registry;
use std::sync::Arc;

let registry = get_document_extractor_registry();
registry.write().unwrap().register(Arc::new(CustomExtractor))?;

See Also