Supported Formats

Kreuzberg handles a wide range of document, image, text, and structured data formats.

Document Formats

  • PDF (.pdf, both searchable and scanned) - includes detailed metadata extraction
  • Microsoft Word (.docx)
  • PowerPoint presentations (.pptx)
  • OpenDocument Text (.odt)
  • Rich Text Format (.rtf)
  • EPUB (.epub)
  • DocBook XML (.dbk, .xml)
  • FictionBook (.fb2)
  • LaTeX (.tex, .latex)
  • Typst (.typ)

Markup and Text Formats

  • HTML (.html, .htm)
  • Plain text (.txt) and Markdown (.md, .markdown)
  • reStructuredText (.rst)
  • Org-mode (.org)
  • DokuWiki (.txt)
  • Pod (.pod)
  • Troff/Man pages (numeric section extensions: .1, .2, etc.)

Data and Research Formats

  • Spreadsheets (.xlsx, .xls, .xlsm, .xlsb, .xlam, .xla, .ods)
  • CSV (.csv) and TSV (.tsv) files
  • OPML files (.opml)
  • Jupyter Notebooks (.ipynb)
  • BibTeX and BibLaTeX (.bib)
  • CSL-JSON (.json)
  • EndNote and JATS XML (.xml)
  • RIS (.ris)

Structured Data Formats

  • JSON (.json) - High-performance extraction using msgspec with schema analysis
  • YAML (.yaml, .yml) - Full YAML 1.2 support with nested structure extraction
  • TOML (.toml) - Configuration and metadata files with type-aware processing

These formats benefit from:

  • Schema extraction: Automatically analyze and extract the structure of your data
  • Custom field detection: Configure additional text fields for specialized extraction
  • Type information: Optionally include data type annotations in extracted content
  • Performance optimization: Uses msgspec for efficient JSON parsing
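To make "schema extraction" and "type information" concrete, here is a minimal sketch of the idea using only the Python standard library. It is not Kreuzberg's implementation or output format, which this page does not specify; `describe_schema` is a hypothetical helper, and it assumes arrays are homogeneous, so only the first element is inspected.

```python
import json


def describe_schema(value):
    """Recursively replace each value with the name of its JSON type."""
    if isinstance(value, dict):
        return {key: describe_schema(item) for key, item in value.items()}
    if isinstance(value, list):
        # Assumption: arrays are homogeneous, so the first element is representative.
        return [describe_schema(value[0])] if value else []
    if isinstance(value, bool):
        # Checked before int because bool is a subclass of int in Python.
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    return "null"


doc = json.loads(
    '{"title": "Report", "pages": 12, "tags": ["a", "b"], "meta": {"draft": false}}'
)
schema = describe_schema(doc)
print(schema)
```

The same walk would apply to YAML or TOML once the file has been parsed into dictionaries and lists.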

Image Formats

  • JPEG (.jpg, .jpeg, .pjpeg)
  • PNG (.png)
  • TIFF (.tiff, .tif)
  • BMP (.bmp)
  • GIF (.gif)
  • JPEG 2000 family (.jp2, .jpm, .jpx, .mj2)
  • WebP (.webp)
  • Portable anymap formats (.pbm, .pgm, .ppm, .pnm)
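Several extensions in the lists above appear under more than one format (.json, .xml, .txt, .bib), so a file extension alone does not always identify a format. The sketch below transcribes the extensions from this page into a lookup table to show this. The table and the `categories_for` helper are illustrative assumptions, not part of Kreuzberg's API, and the numeric Troff/Man extensions are left out for brevity.

```python
from pathlib import PurePath

# Extensions copied from the lists on this page, grouped by section.
# Illustrative only: this is not how Kreuzberg itself detects formats.
CATEGORIES = {
    "document": {".pdf", ".docx", ".pptx", ".odt", ".rtf", ".epub",
                 ".dbk", ".xml", ".fb2", ".tex", ".latex", ".typ"},
    "markup_text": {".html", ".htm", ".txt", ".md", ".markdown",
                    ".rst", ".org", ".pod"},
    "data_research": {".xlsx", ".xls", ".xlsm", ".xlsb", ".xlam", ".xla",
                      ".ods", ".csv", ".tsv", ".opml", ".ipynb", ".bib",
                      ".json", ".xml", ".ris"},
    "structured_data": {".json", ".yaml", ".yml", ".toml"},
    "image": {".jpg", ".jpeg", ".pjpeg", ".png", ".tiff", ".tif", ".bmp",
              ".gif", ".jp2", ".jpm", ".jpx", ".mj2", ".webp",
              ".pbm", ".pgm", ".ppm", ".pnm"},
}


def categories_for(path: str) -> list[str]:
    """Return every category whose extension list contains the file's suffix."""
    suffix = PurePath(path).suffix.lower()
    return sorted(name for name, exts in CATEGORIES.items() if suffix in exts)


print(categories_for("report.PDF"))
print(categories_for("references.json"))
```

A result with more than one category signals that the file's content, not just its name, is needed to choose the right format.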

Image Extraction Support: Kreuzberg can extract embedded images from the following document types:

  • PDF documents (embedded images and graphics)
  • PowerPoint presentations (PPTX - slide images, charts, shapes)
  • HTML documents (inline images and base64-encoded images)
  • Microsoft Word documents (DOCX - embedded images and charts)
  • Email messages (EML, MSG - image attachments and inline images)
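For readers unfamiliar with the term, a "base64-encoded image" in HTML is an `<img>` tag whose `src` is a `data:` URI carrying the image bytes inline. The following standard-library sketch shows what recovering such an image involves. It is an illustration under that assumption, not Kreuzberg's extraction code; the `DataUriImages` class is hypothetical, and the sample payload is just the GIF signature standing in for a real image.

```python
import base64
from html.parser import HTMLParser


class DataUriImages(HTMLParser):
    """Collect (mime_type, raw_bytes) for each <img> with a base64 data URI."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        # Skip ordinary URLs; only inline base64 image data is decoded here.
        if not src.startswith("data:image/") or ";base64," not in src:
            return
        header, payload = src.split(";base64,", 1)
        mime_type = header[len("data:"):]
        self.images.append((mime_type, base64.b64decode(payload)))


sample_bytes = b"GIF89a"  # placeholder bytes, not a complete image
encoded = base64.b64encode(sample_bytes).decode("ascii")
html = (
    f'<p><img src="data:image/gif;base64,{encoded}">'
    '<img src="https://example.com/logo.png"></p>'
)

parser = DataUriImages()
parser.feed(html)
print(parser.images)
```

The second `<img>` is ignored because its `src` is a regular URL, which would have to be fetched rather than decoded.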

See the Image Extraction guide for configuration options.