# Kreuzberg

> Document text extraction for AI pipelines. Extract text, tables, and metadata from PDFs, images, and Office documents across 92 file formats, plus source code in 305 programming languages for code intelligence. Available as an open-source library (Rust core, 12 language SDKs) and as a managed cloud API.

This is a single-document version; agents that prefer one fetch can read everything here.

## Positioning

- **Product**: Kreuzberg
- **Categories**: document extraction API, OCR API, PDF parsing for LLMs, RAG document ingestion, code intelligence
- **Open source**: Rust core, 12 language bindings
- **Managed cloud**: pay-as-you-go API at `https://api.kreuzberg.dev`, $0.008/page, first 10K pages free
- **Status**: production-ready; same Rust core in OSS and cloud. Some advanced backends (VLM OCR, fully custom embedding providers) are available in OSS but not exposed on the cloud.

## Pricing tiers

### Open Source: Self-hosted

- Free.
- Elastic License v2 from v4.8.0 onward (free for personal, internal, and commercial use; cannot be offered as a managed service to third parties). Versions ≤ v4.7.x are MIT.
- Includes the Rust core, SDKs for 12 languages (Rust, Python, TypeScript, JavaScript, Node.js, PHP, Ruby, Elixir, Go, C#, R, WebAssembly), Docker images, and a CLI.

### Cloud: Pay-as-you-go

- $0.008 per page.
- First 10,000 pages free per project. No monthly minimum.
- 1 MB inline file size limit on `POST /v1/extract`; presigned uploads support larger files.
- Webhooks (HMAC-SHA256 signed), usage analytics, quota dashboard.

### Enterprise

- Custom pricing for 100,000+ pages/month.
- Discounted per-page rate, dedicated support, custom SLAs, regional deployment, SSO.

## Capability matrix

### File formats (92 total)

- **Documents**: PDF (native and scanned), EPUB, FB2, HWP/HWPX, RTF, plain text, Markdown, MDX, Djot, RST, Org-mode, LaTeX, Typst, JATS, DocBook, OPML, troff, manpages.
- **Microsoft Office**: DOCX, DOCM, DOTX, DOTM, DOT, XLSX, XLSM, XLSB, XLS, XLA, XLAM, XLTM, XLTX, XLT, PPTX, PPTM, PPSX, POTX, POTM, POT.
- **Apple iWork**: Pages, Numbers, Keynote.
- **OpenDocument**: ODT, ODS.
- **Images**: PNG, JPG/JPEG, GIF, WEBP, BMP, TIFF/TIF, JP2/JPX/JPM/MJ2, JBIG2/JB2, PNM/PBM/PGM/PPM, SVG.
- **Web / structured**: HTML, XHTML, XML, JSON, YAML, TOML.
- **Tabular**: CSV, TSV.
- **Email**: EML, MSG.
- **Archives**: ZIP, TAR, TGZ, GZ, 7Z (extracted recursively).
- **Bibliographic**: BIB, RIS, NBIB, ENW, CSL.
- **Notebooks**: IPYNB.
- **Database**: DBF.

### Code intelligence (305 languages)

Functions, classes, imports, and symbols extracted from source code across 305 programming languages. Output is structured for semantic chunking and RAG ingestion.

### Extraction features

- **OCR**: the cloud supports Tesseract only. Requests that name a VLM OCR backend are rejected at the API boundary; those backends remain available in the self-hosted library.
- **Layout detection**: page structure, reading order, headings, paragraphs.
- **Table extraction**: structured rows and columns.
- **Metadata extraction**: title, author, page count, language, document properties.
- **Chunking**: configurable text chunking via `extraction_config.chunking`.
- **Embeddings**: inline generation via `extraction_config.embedding`. Cloud uses managed model presets; the self-hosted library supports arbitrary providers and local models.
- **Token reduction, language detection, image extraction, force/disable OCR**: expressible via `ExtractionConfig`.
- **NER**: in the OSS toolbox.

### Performance

- Rust core, CPU-optimised, no GPU required for default OCR.
- Most documents process in milliseconds.
- Designed for batch workloads: thousands of pages per hour per API key on the cloud platform.
- Cloud autoscales on queue depth.

### Polyglot SDKs (12)

Rust, Python, TypeScript, JavaScript, Node.js, PHP, Ruby, Elixir, Go, C#, R, WebAssembly.

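## Cost estimate sketch

The pay-as-you-go pricing described earlier ($0.008 per page, first 10,000 pages free per project) can be turned into a rough estimate. This is a minimal sketch, not an official calculator: the function name and the `free_pages_left` parameter are hypothetical, and it assumes the free pages are a per-project allowance that is consumed before any page is billed.

```python
from decimal import Decimal

PRICE_PER_PAGE = Decimal("0.008")  # USD per page, from the pay-as-you-go tier
FREE_PAGES = 10_000                # free allowance per project


def estimate_cost(total_pages: int, free_pages_left: int = FREE_PAGES) -> Decimal:
    """Estimated USD charge for `total_pages` after any remaining free pages.

    Assumption: free pages are consumed first, then every page is billed
    at the flat per-page rate. Enterprise discounts are not modelled.
    """
    if total_pages < 0 or free_pages_left < 0:
        raise ValueError("page counts must be non-negative")
    billable = max(0, total_pages - free_pages_left)
    return billable * PRICE_PER_PAGE
```

For example, `estimate_cost(25_000)` bills 15,000 pages, which comes to 120 USD under these assumptions.
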
## Cloud Extraction API

Base URL: `https://api.kreuzberg.dev`. Async model: submit documents, get job IDs, then either receive a webhook or poll `GET /v1/jobs/{id}`. Authentication is by bearer API key (`kbg_live_…`). The full OpenAPI spec is published at the API host under `/api-doc/openapi.json`.

### `POST /v1/extract`

Inline submission as `application/json` or `multipart/form-data`. Maximum **10 documents per request**.

JSON body:

```json
{
  "documents": [{ "filename": "invoice.pdf", "mime_type": "application/pdf", "data": "<base64-encoded file>" }],
  "options": {
    "extraction_config": {
      "output_format": "markdown",
      "ocr": { "backend": "tesseract", "language": "eng" }
    }
  },
  "webhook": {
    "url": "https://example.com/hook",
    "secret": "shared-secret",
    "metadata": { "request_id": "abc123" }
  }
}
```

Response (HTTP 202): `{ "job_ids": [...], "status": "pending" }`.

Use this path for files ≤ 1 MB; for larger files use the presigned-upload flow.

`webhook` is optional for JSON requests. For `multipart/form-data` it is **required**; multipart submissions without a `webhook` field return 400.

`ocr.backend` only accepts `"tesseract"` on the cloud; any other value (`vlm`, `easyocr`, `paddleocr`, etc.) is rejected.

### `POST /v1/uploads/presign` and `POST /v1/uploads/confirm`

Two-phase upload for larger files. Note: presign uses a top-level `config` field (not `options.extraction_config`).

Presign request:

```json
{
  "documents": [{ "filename": "report.pdf", "mime_type": "application/pdf" }],
  "config": {
    "output_format": "markdown",
    "ocr": { "backend": "tesseract", "language": "eng" }
  },
  "webhook": { "url": "https://example.com/hook", "secret": "..." }
}
```

Presign response:

```json
{
  "batch_id": "...",
  "uploads": [
    {
      "job_id": "550e8400-e29b-41d4-a716-446655440000",
      "upload_url": "https://...",
      "object_key": "...",
      "method": "PUT",
      "expires_in_secs": 900
    }
  ]
}
```

Upload each file directly to its `upload_url` with `PUT`, then confirm:

```json
{ "batch_id": "<batch_id>" }
```

Confirm response (HTTP 202): `{ "job_ids": [...], "status": "pending" }`.

### `GET /v1/jobs/{id}`

Returns job state and, on completion, the full extraction result:

```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "filename": "invoice.pdf",
  "status": "completed",
  "created_at": "2026-05-08T10:00:00Z",
  "processing_time_ms": 1234,
  "result": {
    "content": [{ "text": "Invoice total: $1,234.56", "page_number": 1, "confidence": 0.95 }],
    "tables": [],
    "images": [],
    "metadata": { "title": "Invoice #12345", "page_count": 1 }
  }
}
```

`processing_time_ms` and `result` are only present once the job reaches `completed` or `partial_success`.

Job statuses: `awaiting_upload`, `pending`, `processing`, `chunking`, `aggregating`, `completed`, `partial_success`, `failed`, `cancelled`. `chunking` and `aggregating` are normal in-progress states.

Recommended polling cadence: every 1 second for the first ~10 seconds, then every 5 seconds.

### `GET /v1/usage`

Project usage and quota for a date range. Query params: `start` and `end` (`YYYY-MM-DD`; default to the current calendar month).

Response:

```json
{
  "period_start": "2026-05-01",
  "period_end": "2026-06-01",
  "total_pages": 1234,
  "total_documents": 56,
  "total_failed": 0,
  "quota_limit": 10000,
  "quota_remaining": 8766,
  "by_mime_type": {
    "application/pdf": { "documents": 40, "pages": 1100, "failed": 0 },
    "image/png": { "documents": 16, "pages": 134, "failed": 0 }
  }
}
```

`quota_limit` and `quota_remaining` are `null` for projects with unlimited quota.

### `GET /healthz` / `GET /readyz`

Liveness and readiness probes; no auth.

### Errors

`{ "error": "" }`.

Codes:

- 400: invalid request (malformed JSON, unsupported MIME type, OCR backend other than Tesseract, more than 10 documents, missing multipart `webhook`)
- 401: missing or invalid key
- 404: not found
- 429: free credit exhausted
- 500/503: internal or upstream failure

### Webhook payload

When a job completes or fails, Kreuzberg POSTs to `webhook.url`.

- Signature header: `X-Webhook-Signature: sha256=<hex digest>`, where the digest is an HMAC-SHA256 over the raw request body using `webhook.secret` as the key, hex-encoded. The header is only present when a `secret` was supplied.
- Body matches the `GET /v1/jobs/{id}` shape and includes any `webhook.metadata` you supplied at submit time.

## Platform / management API

Used by the Kreuzberg Cloud dashboard. Most integrations should use the extraction API above; this surface is for project administration.

- Hostname is the dashboard backend (e.g. `https://app.kreuzberg.dev`), distinct from the extraction API at `https://api.kreuzberg.dev`.
- **Auth**: `POST /auth/login` (OIDC ID token → JWT), `DELETE /auth/account`.
- **Projects**: CRUD, analytics, quota, settings.
- **API keys**: create, list, revoke, regenerate (these keys authenticate calls to the extraction API).
- **Members and invitations**: invite by email, role management, accept invitations via token.
- **Webhooks**: register endpoints, list deliveries.
- **Billing**: quota status, Stripe Checkout session, Stripe Customer Portal session.

The OpenAPI spec for this surface is committed at `frontend/openapi-backend.yaml`.

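## Polling sketch

For integrations that poll the extraction API's `GET /v1/jobs/{id}` instead of receiving a webhook, the recommended cadence (every 1 second for roughly the first 10 seconds, then every 5 seconds) might be implemented along these lines. This is a sketch under stated assumptions: `poll_delays`, `is_terminal`, and `wait_for_job` are hypothetical helper names, not part of any Kreuzberg SDK, and treating `completed`, `partial_success`, `failed`, and `cancelled` as the terminal statuses is inferred from their names.

```python
import itertools
from typing import Callable, Iterator

# Assumption: these four statuses are final; all others are still in progress.
TERMINAL_STATUSES = {"completed", "partial_success", "failed", "cancelled"}


def poll_delays(fast_window_secs: float = 10.0) -> Iterator[float]:
    """Yield sleep intervals: 1 s while inside the fast window, then 5 s."""
    elapsed = 0.0
    while True:
        delay = 1.0 if elapsed < fast_window_secs else 5.0
        yield delay
        elapsed += delay


def is_terminal(status: str) -> bool:
    """True once a job has reached a status that should not change again."""
    return status in TERMINAL_STATUSES


def wait_for_job(
    fetch_status: Callable[[], str],
    sleep: Callable[[float], None],
    max_polls: int = 100,
) -> str:
    """Poll until a terminal status is seen.

    `fetch_status` and `sleep` are injected by the caller, for example an
    HTTP GET wrapper that returns the job's `status` field, and `time.sleep`.
    """
    for delay in itertools.islice(poll_delays(), max_polls):
        status = fetch_status()
        if is_terminal(status):
            return status
        sleep(delay)
    raise TimeoutError("job did not reach a terminal status within max_polls")
```

Injecting `fetch_status` and `sleep` keeps the schedule testable without network access; in production, `fetch_status` would read the `status` field from the `GET /v1/jobs/{id}` response.
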
## Quickstart

```bash
B64=$(base64 < invoice.pdf | tr -d '\n')
curl -X POST https://api.kreuzberg.dev/v1/extract \
  -H "Authorization: Bearer $KREUZBERG_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "documents": [{ "filename": "invoice.pdf", "mime_type": "application/pdf", "data": "'"$B64"'" }],
    "options": { "extraction_config": { "output_format": "markdown", "ocr": { "backend": "tesseract", "language": "eng" } } },
    "webhook": { "url": "https://your-app.invalid/hook", "secret": "shared-secret" }
  }'
```

Python:

```python
import base64
import os

import httpx

api_key = os.environ["KREUZBERG_API_KEY"]  # same variable as the curl example
webhook_url = "https://your-app.invalid/hook"  # placeholder endpoint
webhook_secret = "shared-secret"  # placeholder secret

# Inline submission expects the file as base64 text
with open("invoice.pdf", "rb") as f:
    data = base64.b64encode(f.read()).decode()

httpx.post(
    "https://api.kreuzberg.dev/v1/extract",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "documents": [{"filename": "invoice.pdf", "mime_type": "application/pdf", "data": data}],
        "options": {"extraction_config": {"output_format": "markdown", "ocr": {"backend": "tesseract", "language": "eng"}}},
        "webhook": {"url": webhook_url, "secret": webhook_secret},
    },
).raise_for_status()
```

TypeScript:

```ts
import { promises as fs } from "node:fs";

// apiKey, webhookUrl, and webhookSecret are assumed to be defined by the caller.
const data = (await fs.readFile("invoice.pdf")).toString("base64");

await fetch("https://api.kreuzberg.dev/v1/extract", {
  method: "POST",
  headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
  body: JSON.stringify({
    documents: [{ filename: "invoice.pdf", mime_type: "application/pdf", data }],
    options: { extraction_config: { output_format: "markdown", ocr: { backend: "tesseract", language: "eng" } } },
    webhook: { url: webhookUrl, secret: webhookSecret },
  }),
});
```

## Webhook signature verification

```python
import hashlib
import hmac

# secret: str; raw_body: bytes of the unparsed request body; request: your framework's request object
expected = "sha256=" + hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
# Constant-time comparison; .get() avoids a KeyError when the header is absent
if not hmac.compare_digest(expected, request.headers.get("X-Webhook-Signature", "")):
    raise ValueError("invalid webhook signature")
```

## Data handling

Documents are processed in memory and deleted immediately after extraction. No storage, no indexing. Customer data is never used to train models.

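## Inline vs presigned sketch

The quickstart requests use the inline path, which is limited to 10 documents per request and 1 MB per file; larger files go through the presigned-upload flow. A client might sort files between the two paths roughly as follows. The names `Plan` and `plan_submissions` are hypothetical, and the sketch assumes that "1 MB" means 1,000,000 bytes measured on the raw file before base64 encoding, which the documentation above does not specify.

```python
from typing import NamedTuple

MAX_DOCS_PER_REQUEST = 10        # documented cap for POST /v1/extract
INLINE_LIMIT_BYTES = 1_000_000   # assumption: "1 MB" read as 10**6 raw bytes


class Plan(NamedTuple):
    inline_batches: list[list[str]]  # each inner list is one POST /v1/extract call
    presigned: list[str]             # files to send via /v1/uploads/presign


def plan_submissions(sizes: dict[str, int]) -> Plan:
    """Split files (name -> size in bytes) between the inline and presigned paths."""
    inline = [name for name, size in sizes.items() if size <= INLINE_LIMIT_BYTES]
    presigned = [name for name, size in sizes.items() if size > INLINE_LIMIT_BYTES]
    # Chunk the inline files so no request exceeds the 10-document cap.
    batches = [
        inline[i:i + MAX_DOCS_PER_REQUEST]
        for i in range(0, len(inline), MAX_DOCS_PER_REQUEST)
    ]
    return Plan(batches, presigned)
```

The presigned list is left unbatched because the text above does not state a per-request document cap for `POST /v1/uploads/presign`.
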
## FAQ

**How fast is "fast"?** Most documents process in milliseconds; thousands of pages per hour per API key on the cloud platform.

**What file types?** PDFs (native and scanned), images (JPG/PNG/TIFF), Office (DOCX/PPTX/XLSX), web (HTML/XML), plain text. 92 formats total, autodetected.

**Scanned documents?** Yes, via built-in Tesseract OCR. The cloud is fixed to Tesseract; the self-hosted library can also use VLM OCR backends.

**What happens to my documents?** Processed in memory, deleted immediately. No storage, no training.

**License?** OSS library: Elastic License v2 from v4.8.0 (free for personal, internal, commercial; no managed-service resale). Versions ≤ v4.7.x are MIT. Kreuzberg Cloud has its own commercial terms.

**OSS vs Cloud?** Same Rust core. Cloud removes operational complexity (autoscaling, OCR backends, webhook delivery, quota). The cloud restricts OCR to Tesseract; the OSS library has more configuration freedom.

## Links

- Extraction API: `https://api.kreuzberg.dev`
- OpenAPI spec: `https://api.kreuzberg.dev/api-doc/openapi.json`