# Kreuzberg Cloud Extraction API

Public document extraction API. Base URL: `https://api.kreuzberg.dev`. The full live OpenAPI specification is published at `/api-doc/openapi.json`; integration guides live at <https://docs.kreuzberg.dev/>.

## Authentication

All endpoints require a project API key passed as a Bearer token:

```text
Authorization: Bearer kbg_live_<your-api-key>
```

API keys are created in the Kreuzberg Cloud dashboard at <https://kreuzberg.dev/dashboard>.
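As a minimal sketch of attaching the key, the helper below builds a request with the `Authorization` header but does not send it. It uses only the Python standard library; the helper name `authed_request` and the key `kbg_live_example` are illustrative placeholders, not part of the API.

```python
import urllib.request
from typing import Optional

BASE_URL = "https://api.kreuzberg.dev"


def authed_request(path: str, api_key: str, method: str = "GET",
                   body: Optional[bytes] = None) -> urllib.request.Request:
    """Build (but do not send) a request that carries the Bearer token."""
    headers = {"Authorization": f"Bearer {api_key}"}
    if body is not None:
        # This sketch only covers JSON bodies, not multipart uploads.
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=body,
                                  headers=headers, method=method)


# Placeholder key; a real one comes from the dashboard.
req = authed_request("/v1/usage", "kbg_live_example")
```

Passing `req` to `urllib.request.urlopen` would perform the call; any other HTTP client works the same way as long as it sends the header unchanged.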

## Async model

Extraction is asynchronous. You submit documents and receive job IDs immediately (HTTP 202). Results are delivered in one of two ways:

1. **Webhook** — provide a `webhook.url` in the request and receive an HMAC-SHA256-signed POST when each job finishes. Header: `X-Webhook-Signature: sha256=<hex>`.
2. **Polling** — call `GET /v1/jobs/{id}` until status is `completed`, `partial_success`, or `failed`.

Recommended polling cadence: every 1 second for the first ~10 seconds, then back off to every 5 seconds. `chunking` and `aggregating` are normal in-progress states between `processing` and `completed`.

Job statuses: `awaiting_upload`, `pending`, `processing`, `chunking`, `aggregating`, `completed`, `partial_success`, `failed`, `cancelled`.
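The recommended cadence can be sketched as a small polling loop. `fetch_job` is a hypothetical callable that performs `GET /v1/jobs/{id}` and returns the parsed JSON; it is injected so the loop stays independent of any HTTP client. Treating `cancelled` as a terminal state is an assumption, since the polling rule above lists only `completed`, `partial_success`, and `failed`.

```python
import time
from typing import Callable, Iterator

# "cancelled" is assumed to be terminal; the other three come from the polling rule.
TERMINAL = {"completed", "partial_success", "failed", "cancelled"}


def poll_delays(fast_window: float = 10.0, fast: float = 1.0,
                slow: float = 5.0) -> Iterator[float]:
    """Yield 1 s delays for roughly the first 10 s, then 5 s delays."""
    elapsed = 0.0
    while True:
        delay = fast if elapsed < fast_window else slow
        yield delay
        elapsed += delay


def wait_for_job(fetch_job: Callable[[str], dict], job_id: str,
                 sleep: Callable[[float], None] = time.sleep,
                 max_polls: int = 200) -> dict:
    """Poll until the job reaches a terminal status or max_polls is hit."""
    delays = poll_delays()
    for _ in range(max_polls):
        job = fetch_job(job_id)
        if job["status"] in TERMINAL:
            return job
        # pending, processing, chunking, aggregating: keep waiting.
        sleep(next(delays))
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

The `max_polls` cap is a design choice of this sketch, not an API requirement; it simply keeps a stuck job from blocking the caller forever.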

## Endpoints

### `POST /v1/extract` — Submit documents inline

Accepts `application/json` or `multipart/form-data`. Maximum **10 documents per request**; submit larger batches as multiple requests or use the presigned-upload flow.

**JSON body**:

```json
{
	"documents": [
		{
			"filename": "invoice.pdf",
			"mime_type": "application/pdf",
			"data": "<base64-encoded bytes>"
		}
	],
	"options": {
		"extraction_config": {
			"output_format": "markdown",
			"ocr": { "backend": "tesseract", "language": "eng" }
		}
	},
	"webhook": {
		"url": "https://example.com/hooks/kreuzberg",
		"secret": "shared-secret-for-hmac",
		"metadata": { "request_id": "abc123" }
	}
}
```

`webhook` is optional for JSON requests (poll if you skip it). For `multipart/form-data` the `webhook` part is **required** — the API rejects multipart submissions without it.

`ocr.backend` only accepts `"tesseract"` on the cloud. Requests that specify a VLM or any other backend fail validation and are rejected with `400`.
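A minimal sketch of assembling the JSON body shown above from raw file bytes follows. The function name `build_extract_body` and its parameters are illustrative; the client-side check only mirrors the documented 10-document limit and is not a substitute for server validation.

```python
import base64
from typing import List, Optional, Tuple

MAX_DOCS = 10  # documented per-request limit


def build_extract_body(files: List[Tuple[str, str, bytes]],
                       output_format: str = "markdown",
                       ocr_language: str = "eng",
                       webhook: Optional[dict] = None) -> dict:
    """files is a list of (filename, mime_type, raw_bytes) tuples."""
    if len(files) > MAX_DOCS:
        raise ValueError(f"at most {MAX_DOCS} documents per request")
    body = {
        "documents": [
            {
                "filename": name,
                "mime_type": mime,
                # The API expects the file bytes base64-encoded.
                "data": base64.b64encode(raw).decode("ascii"),
            }
            for name, mime, raw in files
        ],
        "options": {
            "extraction_config": {
                "output_format": output_format,
                # Only the Tesseract backend is accepted on the cloud.
                "ocr": {"backend": "tesseract", "language": ocr_language},
            }
        },
    }
    if webhook is not None:
        body["webhook"] = webhook  # optional for JSON requests
    return body
```

Serializing the result with `json.dumps` gives the payload for `POST /v1/extract`.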

**Response (HTTP 202)**:

```json
{
	"job_ids": ["550e8400-e29b-41d4-a716-446655440000"],
	"status": "pending"
}
```

Inline extraction is best for files ≤ 1 MB. For larger files, use the presigned-upload flow.
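One way to apply both the size guidance and the 10-document limit is to plan submissions up front, as in the sketch below. `plan_submission` is a hypothetical helper, and reading "1 MB" as 1,000,000 bytes is an assumption; the source does not say whether it means 10^6 or 2^20 bytes.

```python
from typing import List, Tuple

INLINE_MAX_BYTES = 1_000_000  # assumed reading of "1 MB"
MAX_DOCS = 10                 # documented per-request limit

FileInfo = Tuple[str, int]    # (filename, size in bytes)


def plan_submission(files: List[FileInfo]) -> Tuple[List[List[FileInfo]], List[FileInfo]]:
    """Split files into inline batches of at most 10 and a presigned-upload list."""
    inline = [f for f in files if f[1] <= INLINE_MAX_BYTES]
    presigned = [f for f in files if f[1] > INLINE_MAX_BYTES]
    batches = [inline[i:i + MAX_DOCS] for i in range(0, len(inline), MAX_DOCS)]
    return batches, presigned


small = [(f"scan-{i}.png", 200_000) for i in range(23)]
large = [("report.pdf", 8_000_000), ("archive.pdf", 2_500_000)]
batches, presigned = plan_submission(small + large)
```

Each inline batch maps to one `POST /v1/extract` call; the remaining files go through the presigned-upload flow.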

### `POST /v1/uploads/presign` — Get presigned upload URLs

Use this flow for files larger than the recommended inline size (about 1 MB). Send only document metadata; receive a `batch_id` and one presigned `PUT` URL per document.

**Request**:

```json
{
	"documents": [{ "filename": "report.pdf", "mime_type": "application/pdf" }],
	"config": {
		"output_format": "markdown",
		"ocr": { "backend": "tesseract", "language": "eng" }
	},
	"webhook": { "url": "https://example.com/hooks/kreuzberg", "secret": "..." }
}
```

Note that the presign body uses a top-level `config` field, not the `options.extraction_config` nesting that `POST /v1/extract` uses.

**Response**:

```json
{
	"batch_id": "...",
	"uploads": [
		{
			"job_id": "550e8400-e29b-41d4-a716-446655440000",
			"upload_url": "https://...",
			"object_key": "...",
			"method": "PUT",
			"expires_in_secs": 900
		}
	]
}
```

Upload each file directly to its `upload_url` with `PUT`, then call `/v1/uploads/confirm` with the `batch_id`.
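The three steps (presign, upload, confirm) can be sketched as request builders that do not touch the network. The helper names are illustrative. Two assumptions are baked in: that `uploads` come back in the same order as the submitted `documents`, and that the `PUT` needs no extra headers; neither is stated here, so check the OpenAPI specification before relying on them.

```python
import urllib.request
from typing import List, Tuple


def build_presign_body(documents: List[Tuple[str, str]],
                       output_format: str = "markdown",
                       ocr_language: str = "eng") -> dict:
    """documents is a list of (filename, mime_type); no file bytes are sent."""
    return {
        "documents": [{"filename": n, "mime_type": m} for n, m in documents],
        # Top-level "config", not "options.extraction_config".
        "config": {
            "output_format": output_format,
            "ocr": {"backend": "tesseract", "language": ocr_language},
        },
    }


def build_put_requests(presign_response: dict,
                       payloads: List[bytes]) -> List[urllib.request.Request]:
    """Pair each upload slot with its file bytes (order is assumed to match)."""
    uploads = presign_response["uploads"]
    if len(uploads) != len(payloads):
        raise ValueError("one payload is needed per upload slot")
    return [
        urllib.request.Request(u["upload_url"], data=data, method=u["method"])
        for u, data in zip(uploads, payloads)
    ]


def build_confirm_body(presign_response: dict) -> dict:
    """Body for POST /v1/uploads/confirm."""
    return {"batch_id": presign_response["batch_id"]}
```

Because each URL expires after `expires_in_secs`, the uploads and the confirm call should follow the presign call promptly.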

### `POST /v1/uploads/confirm` — Trigger processing

```json
{ "batch_id": "<batch_id from presign>" }
```

**Response (HTTP 202)**:

```json
{
	"job_ids": ["..."],
	"status": "pending"
}
```

### `GET /v1/jobs/{id}` — Poll job status and results

Returns job metadata and, on completion, the full extraction result:

```json
{
	"id": "550e8400-e29b-41d4-a716-446655440000",
	"filename": "invoice.pdf",
	"status": "completed",
	"created_at": "2026-05-08T10:00:00Z",
	"processing_time_ms": 1234,
	"result": {
		"content": [{ "text": "Invoice total: $1,234.56", "page_number": 1, "confidence": 0.95 }],
		"tables": [],
		"images": [],
		"metadata": { "title": "Invoice #12345", "page_count": 1 }
	}
}
```

`processing_time_ms` and `result` are only present once the job reaches `completed` or `partial_success`.
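A minimal sketch of reading the result: the helper below groups extracted text by page and returns `None` while the job is still in progress. It assumes every entry in `result.content` carries `text` and `page_number`, as in the sample response; `text_by_page` is an illustrative name.

```python
from typing import Dict, List, Optional

# Statuses for which "result" is documented to be present.
HAS_RESULT = {"completed", "partial_success"}


def text_by_page(job: dict) -> Optional[Dict[int, str]]:
    """Return {page_number: text} or None if the job has no result yet."""
    if job.get("status") not in HAS_RESULT:
        return None
    pages: Dict[int, List[str]] = {}
    for block in job["result"]["content"]:
        pages.setdefault(block["page_number"], []).append(block["text"])
    # Join blocks per page, with pages in ascending order.
    return {page: "\n".join(parts) for page, parts in sorted(pages.items())}


sample_job = {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "completed",
    "result": {
        "content": [
            {"text": "Invoice total: $1,234.56", "page_number": 1, "confidence": 0.95}
        ],
        "tables": [],
        "images": [],
        "metadata": {"title": "Invoice #12345", "page_count": 1},
    },
}
pages = text_by_page(sample_job)
```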

### `GET /v1/usage` — Project usage and quota

Query params: `start` and `end` (both `YYYY-MM-DD`; default to the current calendar month).

**Response**:

```json
{
	"period_start": "2026-05-01",
	"period_end": "2026-06-01",
	"total_pages": 1234,
	"total_documents": 56,
	"total_failed": 0,
	"quota_limit": 10000,
	"quota_remaining": 8766,
	"by_mime_type": {
		"application/pdf": { "documents": 40, "pages": 1100, "failed": 0 },
		"image/png": { "documents": 16, "pages": 134, "failed": 0 }
	}
}
```

`quota_limit` and `quota_remaining` are `null` when the project has unlimited quota.
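The sketch below builds the query string and computes the share of quota consumed, handling the `null` (unlimited) case. The helper names are illustrative. Whether `end` is inclusive or exclusive is not stated here, so the dates are passed through unchanged.

```python
from datetime import date
from typing import Optional
from urllib.parse import urlencode


def usage_path(start: date, end: date) -> str:
    """Path and query for GET /v1/usage with YYYY-MM-DD dates."""
    return "/v1/usage?" + urlencode({"start": start.isoformat(),
                                     "end": end.isoformat()})


def quota_used_fraction(usage: dict) -> Optional[float]:
    """Fraction of quota consumed, or None when the quota is unlimited."""
    limit = usage["quota_limit"]
    remaining = usage["quota_remaining"]
    if limit is None or remaining is None:
        return None  # JSON null: unlimited quota
    if limit == 0:
        return 1.0   # avoid dividing by zero on an empty quota
    return (limit - remaining) / limit


sample_usage = {"total_pages": 1234, "quota_limit": 10000, "quota_remaining": 8766}
used = quota_used_fraction(sample_usage)
```

With the sample response above, `used` works out to 1234 / 10000, which matches `total_pages`.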

### `GET /healthz`, `GET /readyz`

Liveness and readiness probes — open to all callers, no auth required.

## Errors

Errors are returned as JSON of the form `{"error": "<message>"}`. Status codes:

- `400` — invalid request (bad MIME type, missing field, malformed JSON, OCR backend other than Tesseract, more than 10 documents, missing multipart `webhook`)
- `401` — missing or invalid API key
- `404` — job/resource not found
- `429` — free page credit exhausted; add a payment method to continue
- `500` / `503` — internal or upstream failure
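A client can fold the table above into one small helper, sketched here. Marking only `500` and `503` as retryable is a policy assumption of this sketch, not something the API prescribes. Note that `429` signals exhausted free credit rather than rate limiting, so retrying it without a billing change is unlikely to help.

```python
import json

# Assumed retry policy: only server-side failures are worth retrying.
RETRYABLE = {500, 503}


def describe_error(status: int, body: bytes) -> dict:
    """Extract the message from an error response and flag retryable codes."""
    try:
        message = json.loads(body)["error"]
    except (ValueError, KeyError, TypeError):
        # Body was not the documented {"error": "..."} shape; keep the raw text.
        message = body.decode("utf-8", "replace")
    return {
        "status": status,
        "message": message,
        "retryable": status in RETRYABLE,
    }


err = describe_error(400, b'{"error": "more than 10 documents"}')
```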

## Webhook payload

When a job completes (or fails), Kreuzberg POSTs to your `webhook.url`.

- **Signature**: HMAC-SHA256 over the raw request body, using `webhook.secret` as the key, hex-encoded. Sent as `X-Webhook-Signature: sha256=<hex>`. The header is only present when you supply a `secret`.
- **Body**: same shape as `GET /v1/jobs/{id}`, plus any `webhook.metadata` you supplied at submit time.
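The signature check described above can be sketched with Python's standard `hmac` module. The function name `verify_webhook` is illustrative. Encoding the secret as UTF-8 and accepting the hex digest case-insensitively are assumptions. The check must run over the raw request bytes, before any JSON parsing or re-serialization.

```python
import hashlib
import hmac
from typing import Optional

PREFIX = "sha256="


def verify_webhook(raw_body: bytes, signature_header: Optional[str],
                   secret: str) -> bool:
    """Return True if X-Webhook-Signature matches HMAC-SHA256 of the raw body."""
    if not signature_header or not signature_header.startswith(PREFIX):
        return False
    received = signature_header[len(PREFIX):].strip().lower()
    # Assumption: the secret string is used as UTF-8 bytes for the HMAC key.
    expected = hmac.new(secret.encode("utf-8"), raw_body,
                        hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("ascii"),
                               received.encode("utf-8"))
```

In a web handler, pass the unparsed body and the `X-Webhook-Signature` header value, and reject the delivery when the function returns `False`.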

## See also

- [Capabilities](https://kreuzberg.dev/llms/capabilities.md)
- [Getting started](https://kreuzberg.dev/llms/getting-started.md)
- [Pricing](https://kreuzberg.dev/llms/pricing.md)
- [Full documentation](https://docs.kreuzberg.dev/)
