# Kreuzberg

> Document text extraction for AI pipelines. Extract text, tables, and metadata from PDFs, images, Office documents, and other files across 92 formats, plus code intelligence for 305 programming languages. Available as an open-source library (Rust core, 12 language SDKs) and as a managed cloud API at <https://kreuzberg.dev>.

This is a single-document version of <https://kreuzberg.dev/llms.txt> — agents that prefer one fetch can read everything here.

## Positioning

- **Product**: Kreuzberg
- **Categories**: document extraction API, OCR API, PDF parsing for LLMs, RAG document ingestion, code intelligence
- **Open source**: <https://github.com/kreuzberg-dev/kreuzberg> (Rust core, 12 language bindings)
- **Managed cloud**: <https://kreuzberg.dev> — pay-as-you-go API at `https://api.kreuzberg.dev`, $0.008/page, first 10K pages free
- **Status**: production-ready; same Rust core in OSS and cloud. Some advanced backends (VLM OCR, fully custom embedding providers) are available in OSS but not exposed on the cloud.

## Pricing tiers

### Open Source — Self-hosted

- Free.
- Elastic License v2 from v4.8.0 onward (free for personal, internal, and commercial use; cannot be offered as a managed service to third parties). Versions ≤ v4.7.x are MIT.
- Includes the Rust core, SDKs for 12 languages (Rust, Python, TypeScript, JavaScript, Node.js, PHP, Ruby, Elixir, Go, C#, R, WebAssembly), Docker images, and a CLI.

### Cloud — Pay-as-you-go

- $0.008 per page.
- First 10,000 pages free per project. No monthly minimum.
- 1 MB inline file size limit on `POST /v1/extract`; presigned uploads support larger files.
- Webhooks (HMAC-SHA256 signed), usage analytics, quota dashboard.
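
The pricing arithmetic above can be sketched as follows. This is only an estimate: it assumes the free allowance is subtracted from a project's cumulative page count before the per-page rate applies, which is not spelled out here. `estimate_cost_usd` is a hypothetical helper, not part of any Kreuzberg SDK.

```python
from decimal import Decimal

PRICE_PER_PAGE = Decimal("0.008")  # USD per page, from the pricing above
FREE_PAGES = 10_000                # free allowance per project


def estimate_cost_usd(total_pages: int) -> Decimal:
    """Estimate pay-as-you-go cost for a project's cumulative page count."""
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    # Assumption: only pages beyond the free allowance are billed.
    billable = max(0, total_pages - FREE_PAGES)
    return billable * PRICE_PER_PAGE
```

Under that assumption, a project that has processed 25,000 pages pays for 15,000 of them. `Decimal` is used to avoid floating-point rounding on currency values.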

### Enterprise

- Custom pricing for 100,000+ pages/month.
- Discounted per-page rate, dedicated support, custom SLAs, regional deployment, SSO.
- Contact: <https://calendar.app.google/VsQYEkSq8qUhHzfR9> or <contact@kreuzberg.dev>.

## Capability matrix

### File formats (92 total)

- **Documents**: PDF (native and scanned), EPUB, FB2, HWP/HWPX, RTF, plain text, Markdown, MDX, Djot, RST, Org-mode, LaTeX, Typst, JATS, DocBook, OPML, troff, manpages.
- **Microsoft Office**: DOCX, DOCM, DOTX, DOTM, DOT, XLSX, XLSM, XLSB, XLS, XLA, XLAM, XLTM, XLTX, XLT, PPTX, PPTM, PPSX, POTX, POTM, POT.
- **Apple iWork**: Pages, Numbers, Keynote.
- **OpenDocument**: ODT, ODS.
- **Images**: PNG, JPG/JPEG, GIF, WEBP, BMP, TIFF/TIF, JP2/JPX/JPM/MJ2, JBIG2/JB2, PNM/PBM/PGM/PPM, SVG.
- **Web / structured**: HTML, XHTML, XML, JSON, YAML, TOML.
- **Tabular**: CSV, TSV.
- **Email**: EML, MSG.
- **Archives**: ZIP, TAR, TGZ, GZ, 7Z (extracted recursively).
- **Bibliographic**: BIB, RIS, NBIB, ENW, CSL.
- **Notebooks**: IPYNB.
- **Database**: DBF.

### Code intelligence (305 languages)

Functions, classes, imports, and symbols extracted from source code across 305 programming languages. Output is structured for semantic chunking and RAG ingestion.

### Extraction features

- **OCR**: Tesseract only on the cloud. Requests that specify a VLM OCR backend are rejected at the API boundary; those backends remain available in the self-hosted library.
- **Layout detection**: page structure, reading order, headings, paragraphs.
- **Table extraction**: structured rows and columns.
- **Metadata extraction**: title, author, page count, language, document properties.
- **Chunking**: configurable text chunking via `extraction_config.chunking`.
- **Embeddings**: inline generation via `extraction_config.embedding`. Cloud uses managed model presets; the self-hosted library supports arbitrary providers and local models.
- **Token reduction, language detection, image extraction, force/disable OCR** — expressible via `ExtractionConfig`.
- **NER**: named-entity recognition, available in the OSS toolbox.

### Performance

- Rust core, CPU-optimised, no GPU required for default OCR.
- Most documents process in milliseconds.
- Designed for batch workloads: thousands of pages per hour per API key on the cloud platform.
- Cloud autoscales on queue depth.

### Polyglot SDKs (12)

Rust, Python, TypeScript, JavaScript, Node.js, PHP, Ruby, Elixir, Go, C#, R, WebAssembly.

## Cloud Extraction API

Base URL: `https://api.kreuzberg.dev`. The API is asynchronous: submit documents, receive job IDs, then either wait for a webhook or poll `GET /v1/jobs/{id}`. Authentication uses a bearer API key (`kbg_live_…`). The full OpenAPI spec is published at the API host under `/api-doc/openapi.json`.

### `POST /v1/extract`

Inline submission — `application/json` or `multipart/form-data`. Maximum **10 documents per request**.

JSON body:

```json
{
  "documents": [{
    "filename": "invoice.pdf",
    "mime_type": "application/pdf",
    "data": "<base64>"
  }],
  "options": {
    "extraction_config": {
      "output_format": "markdown",
      "ocr": { "backend": "tesseract", "language": "eng" }
    }
  },
  "webhook": {
    "url": "https://example.com/hook",
    "secret": "shared-secret",
    "metadata": { "request_id": "abc123" }
  }
}
```

Response (HTTP 202): `{ "job_ids": [...], "status": "pending" }`. Use this path for files ≤ 1 MB; for larger files use the presigned-upload flow.

`webhook` is optional for JSON requests. For `multipart/form-data` it is **required** — multipart submissions without a `webhook` field return 400.

`ocr.backend` only accepts `"tesseract"` on the cloud; any other value (`vlm`, `easyocr`, `paddleocr`, etc.) is rejected.
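
The two request limits above (10 documents per request, 1 MB per inline file) suggest a small client-side batching step. The sketch below is one way to do it. It assumes "1 MB" means 1,000,000 bytes measured on the raw file before base64 encoding, and it guesses MIME types from file extensions; neither detail is confirmed here. `build_extract_bodies` is a hypothetical helper.

```python
import base64
import mimetypes
from pathlib import Path

MAX_INLINE_BYTES = 1_000_000   # assumption: "1 MB" read as 10**6 raw bytes
MAX_DOCS_PER_REQUEST = 10      # documented per-request limit


def build_extract_bodies(paths, extraction_config=None, webhook=None):
    """Return (bodies, too_large): JSON-ready bodies for POST /v1/extract,
    plus the paths that exceed the inline limit and need presigned upload."""
    docs, too_large = [], []
    for path in map(Path, paths):
        raw = path.read_bytes()
        if len(raw) > MAX_INLINE_BYTES:
            too_large.append(path)
            continue
        mime, _ = mimetypes.guess_type(path.name)
        docs.append({
            "filename": path.name,
            # Assumption: a generic fallback type; the API may reject it with 400.
            "mime_type": mime or "application/octet-stream",
            "data": base64.b64encode(raw).decode("ascii"),
        })

    bodies = []
    for start in range(0, len(docs), MAX_DOCS_PER_REQUEST):
        body = {"documents": docs[start:start + MAX_DOCS_PER_REQUEST]}
        if extraction_config is not None:
            body["options"] = {"extraction_config": extraction_config}
        if webhook is not None:
            body["webhook"] = webhook
        bodies.append(body)
    return bodies, too_large
```

Each returned body can be sent as the JSON payload of one `POST /v1/extract` call; files in `too_large` go through the presigned-upload flow instead.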

### `POST /v1/uploads/presign` and `POST /v1/uploads/confirm`

Two-phase upload for larger files. Note: presign uses a top-level `config` field (not `options.extraction_config`).

Presign request:

```json
{
  "documents": [{ "filename": "report.pdf", "mime_type": "application/pdf" }],
  "config": {
    "output_format": "markdown",
    "ocr": { "backend": "tesseract", "language": "eng" }
  },
  "webhook": { "url": "https://example.com/hook", "secret": "..." }
}
```

Presign response:

```json
{
  "batch_id": "...",
  "uploads": [
    {
      "job_id": "550e8400-e29b-41d4-a716-446655440000",
      "upload_url": "https://...",
      "object_key": "...",
      "method": "PUT",
      "expires_in_secs": 900
    }
  ]
}
```

Upload each file directly to its `upload_url` with `PUT`, then confirm:

```json
{ "batch_id": "<batch_id from presign>" }
```

Confirm response (HTTP 202): `{ "job_ids": [...], "status": "pending" }`.

### `GET /v1/jobs/{id}`

Returns job state and, on completion, the full extraction result:

```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "filename": "invoice.pdf",
  "status": "completed",
  "created_at": "2026-05-08T10:00:00Z",
  "processing_time_ms": 1234,
  "result": {
    "content": [{ "text": "Invoice total: $1,234.56", "page_number": 1, "confidence": 0.95 }],
    "tables": [],
    "images": [],
    "metadata": { "title": "Invoice #12345", "page_count": 1 }
  }
}
```

`processing_time_ms` and `result` are only present once the job reaches `completed` or `partial_success`.

Job statuses: `awaiting_upload`, `pending`, `processing`, `chunking`, `aggregating`, `completed`, `partial_success`, `failed`, `cancelled`.

Recommended polling cadence: every 1 second for the first ~10 seconds, then every 5 seconds. `chunking` and `aggregating` are normal in-progress states.
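
A minimal polling loop that follows the recommended cadence might look like this. `fetch_job` is a hypothetical callable that performs `GET /v1/jobs/{id}` and returns the parsed JSON. Treating `completed`, `partial_success`, `failed`, and `cancelled` as terminal is an assumption based on the status names.

```python
import time
from typing import Callable, Iterator

# Assumption: these four statuses never transition further.
TERMINAL = {"completed", "partial_success", "failed", "cancelled"}


def poll_delays(fast_window: float = 10.0, fast: float = 1.0,
                slow: float = 5.0) -> Iterator[float]:
    """Yield sleep intervals: `fast` until `fast_window` seconds have elapsed, then `slow`."""
    elapsed = 0.0
    while True:
        delay = fast if elapsed < fast_window else slow
        yield delay
        elapsed += delay


def wait_for_job(fetch_job: Callable[[], dict], timeout: float = 300.0,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the job is terminal, or raise once the next sleep would exceed `timeout`."""
    waited = 0.0
    for delay in poll_delays():
        job = fetch_job()
        if job["status"] in TERMINAL:
            return job
        if waited + delay > timeout:
            raise TimeoutError(f"job still {job['status']!r} after {waited:.0f}s")
        sleep(delay)
        waited += delay
```

The `sleep` parameter is injectable so the loop can be exercised in tests without real delays. The 300-second default timeout is arbitrary.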

### `GET /v1/usage`

Project usage and quota for a date range. Query params: `start` and `end` (`YYYY-MM-DD`; default to the current calendar month).

Response:

```json
{
  "period_start": "2026-05-01",
  "period_end": "2026-06-01",
  "total_pages": 1234,
  "total_documents": 56,
  "total_failed": 0,
  "quota_limit": 10000,
  "quota_remaining": 8766,
  "by_mime_type": {
    "application/pdf": { "documents": 40, "pages": 1100, "failed": 0 },
    "image/png":       { "documents": 16, "pages": 134,  "failed": 0 }
  }
}
```

`quota_limit` and `quota_remaining` are `null` for projects with unlimited quota.
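
Because the quota fields can be `null`, client code that reports quota consumption needs to handle the unlimited case explicitly. A small sketch, with `quota_used_fraction` as a hypothetical helper operating on the parsed response:

```python
from typing import Optional


def quota_used_fraction(usage: dict) -> Optional[float]:
    """Fraction of quota consumed, or None when the project has unlimited quota."""
    limit = usage.get("quota_limit")
    remaining = usage.get("quota_remaining")
    if limit is None or remaining is None:
        return None  # JSON null: unlimited quota
    if limit == 0:
        return 1.0   # avoid division by zero; a zero quota is treated as fully used
    return (limit - remaining) / limit
```

For the sample response above (limit 10,000, remaining 8,766), 1,234 of 10,000 pages have been consumed.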

### `GET /healthz` / `GET /readyz`

Liveness and readiness probes; no auth.

### Errors

Error responses have the shape `{ "error": "<message>" }`. Status codes:

- **400**: invalid request (malformed JSON, unsupported MIME type, OCR backend other than Tesseract, more than 10 documents, missing multipart `webhook`).
- **401**: missing or invalid API key.
- **404**: not found.
- **429**: free credit exhausted.
- **500 / 503**: internal or upstream failure.
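
One way to consume these errors on the client side is sketched below. The retry policy is an assumption, not documented behaviour: only 500 and 503 are treated as retryable, since 400, 401, and 404 indicate a request that will keep failing and 429 here signals exhausted free credit rather than rate limiting. `describe_error` is a hypothetical helper.

```python
import json

# Assumption: only internal/upstream failures are worth retrying.
RETRYABLE_STATUSES = {500, 503}


def describe_error(status_code: int, body_text: str) -> tuple[str, bool]:
    """Return (message, retryable) for a non-2xx response."""
    try:
        payload = json.loads(body_text)
        message = payload["error"] if isinstance(payload, dict) else body_text
    except (json.JSONDecodeError, KeyError):
        # Fall back to the raw body if it is not the documented shape.
        message = body_text
    return str(message), status_code in RETRYABLE_STATUSES
```

A caller would typically retry with backoff when the second element is `True` and surface the message to the user otherwise.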

### Webhook payload

When a job completes or fails, Kreuzberg POSTs to `webhook.url`.

- Signature header: `X-Webhook-Signature: sha256=<hex>` — HMAC-SHA256 over the raw request body using `webhook.secret` as the key, hex-encoded. Header is only present when a `secret` was supplied.
- Body matches the `GET /v1/jobs/{id}` shape and includes any `webhook.metadata` you supplied at submit time.

## Platform / management API

Used by the Kreuzberg Cloud dashboard. Most integrations should use the extraction API above; this surface is for project administration.

- Hostname is the dashboard backend (e.g. `https://app.kreuzberg.dev`), distinct from the extraction API at `https://api.kreuzberg.dev`.
- **Auth**: `POST /auth/login` (OIDC ID token → JWT), `DELETE /auth/account`.
- **Projects**: CRUD, analytics, quota, settings.
- **API keys**: create, list, revoke, regenerate (these keys authenticate calls to the extraction API).
- **Members and invitations**: invite by email, role management, accept invitations via token.
- **Webhooks**: register endpoints, list deliveries.
- **Billing**: quota status, Stripe Checkout session, Stripe Customer Portal session.

The OpenAPI spec for this surface is committed at `frontend/openapi-backend.yaml`.

## Quickstart

```bash
B64=$(base64 < invoice.pdf | tr -d '\n')
curl -X POST https://api.kreuzberg.dev/v1/extract \
  -H "Authorization: Bearer $KREUZBERG_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "documents": [{
      "filename": "invoice.pdf",
      "mime_type": "application/pdf",
      "data": "'"$B64"'"
    }],
    "options": {
      "extraction_config": {
        "output_format": "markdown",
        "ocr": { "backend": "tesseract", "language": "eng" }
      }
    },
    "webhook": { "url": "https://your-app.invalid/hook", "secret": "shared-secret" }
  }'
```

Python:

```python
import base64
import os

import httpx

api_key = os.environ["KREUZBERG_API_KEY"]  # same variable as the curl example

with open("invoice.pdf", "rb") as f:
    data = base64.b64encode(f.read()).decode()

response = httpx.post(
    "https://api.kreuzberg.dev/v1/extract",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "documents": [{"filename": "invoice.pdf", "mime_type": "application/pdf", "data": data}],
        "options": {"extraction_config": {"output_format": "markdown", "ocr": {"backend": "tesseract", "language": "eng"}}},
        "webhook": {"url": "https://your-app.invalid/hook", "secret": "shared-secret"},
    },
)
response.raise_for_status()
print(response.json()["job_ids"])  # HTTP 202 body: {"job_ids": [...], "status": "pending"}
```

TypeScript:

```ts
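import { readFile } from "node:fs/promises";

const apiKey = process.env.KREUZBERG_API_KEY; // same variable as the curl example
// readFile already returns a Buffer, so it can be base64-encoded directly.
const data = (await readFile("invoice.pdf")).toString("base64");
const res = await fetch("https://api.kreuzberg.dev/v1/extract", {
  method: "POST",
  headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
  body: JSON.stringify({
    documents: [{ filename: "invoice.pdf", mime_type: "application/pdf", data }],
    options: { extraction_config: { output_format: "markdown", ocr: { backend: "tesseract", language: "eng" } } },
    webhook: { url: "https://your-app.invalid/hook", secret: "shared-secret" },
  }),
});
if (!res.ok) throw new Error(`extract failed: ${res.status}`);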
```

## Webhook signature verification

```python
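import hmac, hashlib

def verify_signature(secret: str, raw_body: bytes, signature_header: str) -> bool:
    """Check X-Webhook-Signature; returns a bool rather than using assert, which `python -O` strips."""
    expected = "sha256=" + hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), signature_header.encode())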
```

## Data handling

Documents are processed in memory and deleted immediately after extraction. No storage, no indexing. Customer data is never used to train models.

## FAQ

**How fast is "fast"?** Most documents process in milliseconds; thousands of pages per hour per API key on the cloud platform.

**What file types?** PDFs (native and scanned), images (JPG/PNG/TIFF), Office (DOCX/PPTX/XLSX), web (HTML/XML), plain text. 92 formats total, autodetected.

**Scanned documents?** Yes — built-in Tesseract OCR. The cloud is fixed to Tesseract; the self-hosted library can also use VLM OCR backends.

**What happens to my documents?** Processed in memory, deleted immediately. No storage, no training.

**License?** OSS library: Elastic License v2 from v4.8.0 (free for personal, internal, commercial; no managed-service resale). Versions ≤ v4.7.x are MIT. Kreuzberg Cloud has its own commercial terms.

**OSS vs Cloud?** Same Rust core. Cloud removes operational complexity (autoscaling, OCR backends, webhook delivery, quota). The cloud restricts OCR to Tesseract; the OSS library has more configuration freedom.

## Links

- Site: <https://kreuzberg.dev>
- Extraction API: `https://api.kreuzberg.dev`
- OpenAPI spec: `https://api.kreuzberg.dev/api-doc/openapi.json`
- Documentation: <https://docs.kreuzberg.dev/>
- GitHub: <https://github.com/kreuzberg-dev/kreuzberg>
- Discord: <https://discord.gg/xt9WY3GnKR>
- LinkedIn: <https://www.linkedin.com/company/kreuzberg-dev/>
- Contact: <contact@kreuzberg.dev>