# Kreuzberg Capabilities

Kreuzberg extracts text, tables, metadata, and chunks/embeddings from documents and source code. The same Rust core powers the open-source library and the managed cloud API.

Some advanced backends (vision-language-model OCR, fully custom embedding providers) are available in the self-hosted library but not exposed on the cloud — see the per-feature notes below.

## Supported file formats (92 total)

**Documents**: PDF (native and scanned), EPUB, FB2, HWP/HWPX, RTF, plain text, Markdown, MDX, Djot, RST, Org-mode, LaTeX, Typst, JATS, DocBook, OPML, troff, manpages.

**Microsoft Office**: DOCX, DOCM, DOTX, DOTM, DOT, XLSX, XLSM, XLSB, XLS, XLA, XLAM, XLTM, XLTX, XLT, PPTX, PPTM, PPSX, POTX, POTM, POT.

**Apple iWork**: Pages, Numbers, Keynote.

**OpenDocument**: ODT, ODS.

**Images**: PNG, JPG/JPEG, GIF, WEBP, BMP, TIFF/TIF, JP2/JPX/JPM/MJ2, JBIG2/JB2, PNM/PBM/PGM/PPM, SVG.

**Web / structured**: HTML, XHTML, XML, JSON, YAML, TOML.

**Tabular**: CSV, TSV.

**Email**: EML, MSG.

**Archives**: ZIP, TAR, TGZ, GZ, 7Z (extracted recursively).

**Bibliographic**: BIB, RIS, NBIB, ENW, CSL.

**Notebooks**: IPYNB.

**Database**: DBF.

The canonical list of all 92 formats is the `fileFormats` array in the marketing site source.

## Code intelligence (305 programming languages)

Extract functions, classes, imports, and symbols from source code across 305 languages. Output is structured for semantic chunking and RAG ingestion.
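
To give a sense of what this kind of structured output can look like, here is a minimal sketch that pulls functions, classes, and imports out of Python source with the standard-library `ast` module. It is a stand-in technique for illustration only: it is not Kreuzberg's implementation, and the `extract_symbols` helper and its output shape are assumptions rather than the library's API.

```python
import ast


def extract_symbols(source: str) -> dict:
    """Collect function, class, and import names from Python source.

    Hypothetical helper: the returned shape is illustrative, not Kreuzberg's.
    """
    tree = ast.parse(source)
    symbols = {"functions": [], "classes": [], "imports": []}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            symbols["functions"].append({"name": node.name, "line": node.lineno})
        elif isinstance(node, ast.ClassDef):
            symbols["classes"].append({"name": node.name, "line": node.lineno})
        elif isinstance(node, ast.Import):
            # `import a, b` yields one alias per imported module.
            symbols["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Relative imports have a non-zero level and possibly no module name.
            symbols["imports"].append("." * node.level + (node.module or ""))
    return symbols
```

Each entry carries a name and a source line, which is the sort of anchor a semantic chunker could use to split code at function or class boundaries.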

## Extraction features

- **OCR for scanned PDFs and images**.
  - Cloud: Tesseract only. The API rejects any other OCR backend at validation time.
  - Self-hosted: Tesseract plus optional VLM (vision-language-model) backends. These help with complex layouts but are configured locally, not through the cloud API.
- **Layout detection**: page structure, reading order, headings, paragraphs.
- **Table extraction**: structured rows and columns from PDFs, Office docs, and HTML.
- **Metadata extraction**: title, author, subject, page count, language, document properties.
- **Chunking**: configurable text chunking for downstream RAG pipelines (`extraction_config.chunking`).
- **Embeddings**: inline embedding generation via `extraction_config.embedding`. Cloud uses managed model presets; the self-hosted library can target arbitrary providers/local models.
- **Token reduction, language detection, image extraction, force/disable OCR** — all expressible via `ExtractionConfig`.
- **NER (named entity recognition)**: available in the open-source library.
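
As a rough illustration of how the cloud and self-hosted options above differ, the sketch below builds a request payload and checks the OCR backend client-side before anything is sent. Only `extraction_config`, `chunking`, and `embedding` come from this page; the inner keys (`ocr.backend`, `max_chars`, `max_overlap`, `preset`), the `target` values, and the `build_extraction_config` helper are hypothetical placeholders, so consult the Extraction API reference for the real schema.

```python
from typing import Any, Dict, Optional

# Per this page, the cloud API accepts Tesseract only.
CLOUD_OCR_BACKENDS = {"tesseract"}


def build_extraction_config(
    target: str,
    ocr_backend: str = "tesseract",
    chunk_chars: int = 1000,
    chunk_overlap: int = 100,
    embedding_preset: Optional[str] = None,
) -> Dict[str, Any]:
    """Build an illustrative payload. Keys under `extraction_config` other
    than `chunking` and `embedding` are assumptions, not the documented schema."""
    if target not in ("cloud", "self-hosted"):
        raise ValueError(f"unknown target: {target!r}")
    if target == "cloud" and ocr_backend not in CLOUD_OCR_BACKENDS:
        # Mirrors the server-side validation described above.
        raise ValueError(f"cloud supports only Tesseract OCR, got {ocr_backend!r}")

    config: Dict[str, Any] = {
        "ocr": {"backend": ocr_backend},  # hypothetical key names
        "chunking": {"max_chars": chunk_chars, "max_overlap": chunk_overlap},
    }
    if embedding_preset is not None:
        config["embedding"] = {"preset": embedding_preset}  # hypothetical key
    return {"extraction_config": config}
```

Failing fast on the client saves a round trip, but the API's own validation remains the source of truth.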

## Performance characteristics

- Rust core; CPU-optimised binary, no GPU required for default OCR.
- Most documents process in milliseconds.
- Designed for batch workloads: thousands of pages per hour per API key on the cloud platform.
- Cloud autoscales on queue depth so bursty traffic doesn't stall.

## Polyglot SDKs (12)

Native bindings ship for: Rust, Python, TypeScript, JavaScript, Node.js, PHP, Ruby, Elixir, Go, C#, R, WebAssembly.

## Output

- JSON response with full document structure (text, tables, images, metadata, chunks, embeddings when configured).
- Webhook delivery for async cloud workflows. Body is HMAC-SHA256-signed when a `webhook.secret` is supplied; signature header is `X-Webhook-Signature: sha256=<hex>`.
- Direct integration with embedding pipelines and RAG frameworks.
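
A receiver can check the webhook signature described above with a few lines of standard-library Python. This is a minimal sketch under stated assumptions: that the HMAC is computed over the raw request body bytes, that the secret is UTF-8 encoded, and that the hex digest is lowercase. The function names are illustrative and not part of any Kreuzberg SDK.

```python
import hashlib
import hmac


def sign_body(secret: str, body: bytes) -> str:
    """Return the value expected in X-Webhook-Signature for this body.

    Assumes the signature covers the raw body bytes with a UTF-8 secret.
    """
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return "sha256=" + digest


def verify_webhook(secret: str, body: bytes, signature_header: str) -> bool:
    """Compare the received header against a locally computed signature."""
    if not signature_header:
        return False
    expected = sign_body(secret, body)
    # compare_digest runs in constant time, which avoids timing side channels.
    return hmac.compare_digest(
        expected.encode("utf-8"), signature_header.strip().encode("utf-8")
    )
```

Verify against the bytes exactly as received, before any JSON parsing, since re-serialising the payload can change whitespace or key order and invalidate the signature.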

## See also

- [Pricing](https://kreuzberg.dev/llms/pricing.md)
- [Extraction API](https://kreuzberg.dev/llms/api.md)
- [Getting started](https://kreuzberg.dev/llms/getting-started.md)
