Extract structured content from any document.

Open source. High-performance extraction. Built for AI pipelines and applications.

97 file formats, 16 bindings, 305 programming languages.

pip install kreuzberg
Open Source

The full stack

Document extraction

Extract text, tables, images, and metadata from 97+ file formats. PDF, Office, images, HTML, archives, email, and more - one API handles them all.
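As a rough illustration of the "one API" idea, here is a minimal Python sketch. It assumes the `kreuzberg` package exposes a synchronous `extract_file_sync` function whose result carries the extracted text in a `.content` attribute; check the official documentation for the exact names in your version. The `extract_text` and `extract_many` helpers and the `extractor` parameter are hypothetical and exist only so the sketch can be tried without the library installed.

```python
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, Optional


def extract_text(path: str, extractor: Optional[Callable[[str], Any]] = None) -> str:
    """Return the extracted text of a single document.

    `extractor` is a hypothetical injection point: any callable that takes
    a path and returns an object with a `.content` attribute.
    """
    if extractor is None:
        # Assumption: the Python binding exposes `extract_file_sync`.
        from kreuzberg import extract_file_sync

        extractor = extract_file_sync
    result = extractor(path)
    # Assumption: the result object holds the text in `.content`.
    return result.content


def extract_many(
    paths: Iterable[str], extractor: Optional[Callable[[str], Any]] = None
) -> Dict[str, str]:
    """Run the same call over mixed formats, keyed by file name."""
    return {Path(p).name: extract_text(p, extractor) for p in paths}
```

With the library installed, `extract_many(["report.pdf", "slides.pptx", "scan.png"])` would send every file through the same call regardless of format, which is the point of the single-API design.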

Code intelligence

Understand code structure across 248 programming languages. Extract functions, classes, imports, symbols, and docstrings with semantic chunking.

Web crawling

Scrape and crawl any website with structured output. Text, metadata, links, and clean Markdown - ready for AI pipelines.

Powered by kreuzcrawl

LLM integration

Connect to 142 LLM providers from any language. VLM OCR, structured JSON extraction, and embeddings - one unified API.

Powered by liter-llm

Runs on

Windows
Linux
macOS
Android
iOS
Flutter
Kotlin
CLI
Docker
Homebrew

Bindings

In your language

Lightning-fast Rust core with polyglot bindings. Build document intelligence pipelines in your language of choice.

Rust
Python
TypeScript
PHP
JavaScript
Ruby
Elixir
Go
C#
Node.js
R
WASM
Dart
Kotlin
Swift
Java
Zig

Integrations

Fits into your workflow

LangChain

Document loader for LangChain pipelines

LlamaIndex

Native reader for LlamaIndex RAG workflows

Haystack

File converter for Haystack pipelines

CrewAI

Document tool for CrewAI agents

txtAI

Extractor for txtAI semantic search

SurrealDB

Document ingestion for SurrealDB

Open WebUI

Built-in extraction for Open WebUI

See how Kreuzberg performs

Benchmarked across format types and document sizes.

Frequently Asked Questions

How fast is 'fast'?
Kreuzberg is built on a high-performance Rust core, so most documents are processed in milliseconds rather than seconds. For bulk jobs, that's thousands of pages per hour on a single API key.
What file types do you support?
PDFs (native and scanned), images (JPG, PNG), Microsoft Office (DOCX, PPTX, XLSX), web content, and plain text. We detect document type automatically and optimize extraction for each format.
Do you handle scanned documents?
Yes. Built-in OCR recognizes text in images and scanned PDFs. No additional configuration is needed: send the file and get structured output back.
Is Kreuzberg open source?
Yes. You can use it freely for personal projects, internal tools, and commercial applications. From v4.8.0 onward it is licensed under the Elastic License 2.0 (ELv2); the one restriction is that you cannot offer it as a managed service to third parties. Versions v4.7.x and below are MIT-licensed.

Already using the open source library? Kreuzberg Cloud runs the infrastructure for you.

Kreuzberg Cloud runs the extraction layer for you. Start with 10,000 free pages.
