Xberg is built on a high-performance Rust core, so most documents are processed almost instantly, in milliseconds instead of seconds. For bulk jobs, that's thousands of pages per hour on a single API key.

What file types do you support?

PDFs (native and scanned), images (JPG, PNG), Microsoft Office (DOCX, PPTX, XLSX), web content, and plain text. We detect document type automatically and optimize extraction for each format.
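As a rough illustration of what automatic type detection can look like, here is a minimal sketch that inspects a file's leading bytes. The `sniff_format` helper and its labels are hypothetical and are not part of the Xberg API; the page does not describe how Xberg actually detects formats.

```python
# Hypothetical helper for illustration only; not an Xberg function.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"PK\x03\x04", "zip-container"),  # DOCX, PPTX, and XLSX are ZIP archives
]


def sniff_format(data: bytes) -> str:
    """Guess a coarse format label from the leading bytes of a file."""
    for signature, label in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return label
    # Fall back to plain text when the bytes decode cleanly as UTF-8.
    try:
        data.decode("utf-8")
        return "text"
    except UnicodeDecodeError:
        return "unknown"
```

A real detector would also look inside ZIP containers to tell DOCX, PPTX, and XLSX apart; this sketch stops at the container level.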

Do you handle scanned documents?

Yes. Built-in OCR recognizes text in images and scanned PDFs. No additional configuration needed—just send the file and get structured output back.

What happens to my documents?

Documents are processed in memory and deleted immediately after extraction. No storage, no indexing. We don't train on your data or use it for model improvement.

I already use your open-source library with good results. Why should I try Xberg cloud?

The open-source engine is fully usable and powerful on its own. Xberg Enterprise removes the operational complexity, so you can run it in production without worrying about managing infrastructure.

Document intelligence & RAG

Crawl, extract, enrich, retrieve. For AI agents.

A self-hostable document intelligence and RAG platform — an integral component in AI meshes. Open-source primitives, one managed backend. 96 file formats, 306 programming languages, RAG built in.

Get the libraries

96 file formats · EXTRACT

306 programming languages · CODE INTELLIGENCE

143 LLM providers · ENRICH

14 language bindings · OPEN SOURCE

The pipeline

Four stages. One backend. Open underneath.

Acquire

Crawl the web. HTTP + headless Chrome.

crawlberg

Extract

96 formats. 306 code languages. Lossless HTML.

xberg html-to-markdown tree-sitter-language-pack

Enrich

LLM structured extraction. VLM OCR. Summarize. Redact.

xberg liter-llm

Embed & Retrieve

Collections. Hybrid retrieval. Reranking.

services/rag
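Hybrid retrieval usually means merging a keyword ranking with a vector ranking. The page does not say which fusion method Xberg uses, so the sketch below assumes reciprocal rank fusion (RRF), a common choice. The function name and the `k=60` constant are illustrative, not taken from Xberg.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in (rank starts
    at 1); scores are summed and documents are returned best first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Break score ties by document ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["a", "b", "c"]  # e.g. from a keyword index
vector_hits = ["b", "c", "a"]   # e.g. from an embedding search
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

A reranking stage, as listed above, would then reorder the top fused results with a more expensive model.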

Live playground

Try every use case on a real document.

Everything teams build on the pipeline — run any of them and see structured output. No account, no setup.

RAG Pipeline Ingestion

Turn a pile of PDFs, Office docs, and HTML into clean, chunked, embedded data for your vector database — in one call.

RAG · Chunking · Embeddings

Pipeline stages

Acquire · Extract · Enrich · Embed & Retrieve

Input · Documents → chunked, embedded data

Drop a file here or click to browse. 1 page per document · under 1 MB · 10 demo runs per IP

Integrations

Fits into your workflow.

Drop-in integrations for the AI frameworks and tools you already build with.

langchain.py

from langchain_xberg import XbergLoader
 
# Load any document as LangChain Documents
loader = XbergLoader("report.pdf")
docs = loader.load()
 
# Drop straight into your vector store
# (db is assumed to be an existing LangChain vector store instance)
db.add_documents(docs)

pip install langchain-xberg

For developers

One API.
Every language.

Install Xberg in your stack and call the same extraction API — from Python to Rust to the browser.

12 languages

Python, TypeScript, Rust, Go, Java, Ruby, and more.

306 code formats

Functions, classes, imports, symbols — all parsed.

Docker & CLI

Run our images or the single-binary CLI.

ELv2 licensed

Free for personal, internal, and commercial use.

terminal — Python

$ pip install xberg

Read the docs | View on GitHub

Two doors

Adopt the way that fits.

Free · Self-host

Open source

All primitives, in 14 languages. Embed in your app or self-host the extraction server. Use any subset of the pipeline.

Get the libraries

Invitation · Curated

Design Partner program

Direct roadmap input, early access to new pipeline stages, 24-month locked pricing. We work closely with a small cohort across commercial, non-profit, and research teams. Free-tier access available at our discretion in exchange for attribution.

Apply for partnership

Capabilities

Why Xberg?
Built for AI pipelines.

Speed That Unblocks Your Team

Process documents in milliseconds instead of seconds. Your RAG pipeline moves at the speed of API calls, not extraction bottlenecks. Index millions of documents without waiting weeks for processing to complete.

Batch-Processing at Scale

Process large volumes of documents in bulk. Xberg is built for batch processing, and our cloud infrastructure is designed to scale.

Embeddings

Ultra-fast embeddings via a Rust-native ONNX engine. 4 presets out of the box, extensible to any model. No separate embedding pipeline needed.

Chunking and Metadata

Semantic chunking across code, markdown, and plain text. Token reduction, keyword extraction, and rich metadata — structured output ready for any AI pipeline.
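To make the chunking idea concrete, here is a minimal sketch that groups paragraphs into chunks under a word budget. It is a simple paragraph-boundary chunker, not Xberg's semantic chunker; the `chunk_paragraphs` name and the `max_words` parameter are assumptions for illustration.

```python
def chunk_paragraphs(text: str, max_words: int = 200) -> list[str]:
    """Group blank-line-separated paragraphs into chunks of at most max_words.

    A single paragraph longer than max_words is kept whole rather than split.
    """
    chunks, current, count = [], [], 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        words = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "one two three\n\nfour five\n\nsix seven eight nine"
chunks = chunk_paragraphs(sample, max_words=5)
```

Splitting on paragraph boundaries keeps each chunk readable on its own; a production chunker would typically count model tokens rather than words.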

Code Intelligence

Extract functions, classes, imports, and symbols from code files across 306 programming languages. Structured output, ready for semantic chunking and RAG pipelines.
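The sketch below shows the kind of structured output this describes, for one language only. It uses Python's standard-library `ast` module as a stand-in; it is not Xberg's parser, and the `extract_symbols` name and output shape are assumptions.

```python
import ast


def extract_symbols(source: str) -> dict:
    """Collect function, class, and import names from Python source code."""
    symbols = {"functions": [], "classes": [], "imports": []}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            symbols["functions"].append(node.name)
        elif isinstance(node, ast.ClassDef):
            symbols["classes"].append(node.name)
        elif isinstance(node, ast.Import):
            symbols["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Relative imports such as "from . import x" have no module name.
            symbols["imports"].append(node.module)
    return symbols


sample = (
    "import os\n"
    "from pathlib import Path\n"
    "\n"
    "class Loader:\n"
    "    def load(self):\n"
    "        return []\n"
    "\n"
    "def main():\n"
    "    pass\n"
)
symbols = extract_symbols(sample)
```

Methods are reported alongside top-level functions here because `ast.walk` visits every node; a chunker could use the same walk to cut code at function and class boundaries.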

LLM-Powered Intelligence

Go beyond extraction. Use vision language models as an OCR backend, extract structured JSON from documents using a schema, and generate embeddings — all via 143 LLM providers, including local models with zero API key configuration.

Comparison

Beyond the extraction box.

Textract, Document AI and Azure DI stop at OCR — and send your files to their cloud. Xberg runs the whole pipeline, open source, on your own infrastructure.

Capability	Xberg	Google Document AI	Amazon Textract	Azure Doc Intelligence
Deployment	Self-host or managed	Cloud only (GCP)	Cloud only (AWS)	Cloud only (Azure)
Source model	Open source	Proprietary	Proprietary	Proprietary
Pipeline scope	Acquire → retrieve	Extract / OCR	Extract / OCR	Extract / OCR
Data residency	Stays in your infra	Sent to vendor	Sent to vendor	Sent to vendor
Formats	96 + 306 code langs	Docs / forms	PDF / images	Docs / forms
Code intelligence	Built in	—	—	—
Pricing	Usage or self-host	Per page	Per page	Per page

Want the numbers — latency, throughput, accuracy? Every figure is reproducible from our open-source harness.

See the head-to-head benchmarks

Use Cases

Built for how teams actually work.

RAG Systems

Feed your vector database with semantically accurate document chunks. Preserve table structure so your AI understands relationships. Bulk-process your knowledge base in hours instead of weeks.

Document Classification & Routing

Extract content, metadata, and structure to power smart routing. Automatically categorize incoming documents. Reduce manual sorting and classification workflows.

Compliance & Document Review

Extract and structure compliance documents, contracts, and regulatory filings. Preserve table relationships and metadata for audit trails. Support for scanned documents with OCR means nothing falls through the cracks.

See how teams put document intelligence to work. Explore all use cases

Trusted by

Our first design partners are onboarding now. Want your logo here? Apply for partnership

Open-source primitives, composed into one backend. Curated cohort of design partners. Apply to work with us.
