# Kreuzberg Cloud Extraction API

Public API for document extraction. Base URL: `https://api.kreuzberg.dev`. The full live OpenAPI specification is published at `/api-doc/openapi.json`; integration guides live at <https://docs.kreuzberg.dev/>.

## Authentication

All endpoints require a project API key passed as a Bearer token:

```text
Authorization: Bearer kbg_live_<your-api-key>
```

API keys are created in the Kreuzberg Cloud dashboard at <https://kreuzberg.dev/dashboard>.
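
As a minimal sketch of the header above, the following helper attaches the Bearer token to a standard-library request. The helper name `build_request` and the `Content-Type: application/json` default for bodies are assumptions made for illustration; they are not part of the API.

```python
import urllib.request
from typing import Optional

BASE_URL = "https://api.kreuzberg.dev"


def build_request(
    path: str,
    api_key: str,
    method: str = "GET",
    body: Optional[bytes] = None,
) -> urllib.request.Request:
    """Build an authenticated request. Nothing is sent until urlopen() is called."""
    headers = {"Authorization": f"Bearer {api_key}"}
    if body is not None:
        # Assumption: JSON bodies are sent as application/json.
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=body, headers=headers, method=method
    )
```

Passing the returned object to `urllib.request.urlopen` performs the call; keeping construction separate makes the header logic easy to check without network access.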

## Async model

Extraction is asynchronous. You submit documents and receive job IDs immediately (HTTP 202). Results are delivered in one of two ways:

1. **Webhook** — provide a `webhook.url` in the request and receive an HMAC-SHA256-signed POST when each job finishes. Header: `X-Webhook-Signature: sha256=<hex>`.
2. **Polling** — call `GET /v1/jobs/{id}` until status is `completed`, `partial_success`, or `failed`.

Recommended polling cadence: every 1 second for the first ~10 seconds, then back off to every 5 seconds. `chunking` and `aggregating` are normal in-progress states between `processing` and `completed`.

Job statuses: `awaiting_upload`, `pending`, `processing`, `chunking`, `aggregating`, `completed`, `partial_success`, `failed`, `cancelled`.
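
The polling cadence described above could be sketched as follows. The function names, the 300-second default time budget, and the choice to treat `cancelled` as a terminal status are assumptions of this example rather than documented behavior.

```python
import json
import time
import urllib.request

# "completed", "partial_success" and "failed" are the stop conditions named
# above. Treating "cancelled" as terminal too is an assumption of this sketch.
TERMINAL_STATUSES = {"completed", "partial_success", "failed", "cancelled"}


def poll_delay(elapsed_secs: float) -> float:
    """1 s between polls for roughly the first 10 s, then 5 s."""
    return 1.0 if elapsed_secs < 10.0 else 5.0


def fetch_job(job_id: str, api_key: str) -> dict:
    """GET /v1/jobs/{id} and decode the JSON body."""
    request = urllib.request.Request(
        f"https://api.kreuzberg.dev/v1/jobs/{job_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_job(job_id, api_key, timeout_secs=300.0,
                 fetch=fetch_job, sleep=time.sleep):
    """Poll until the job reaches a terminal status or the budget runs out.

    `elapsed` counts only time spent sleeping, not time spent in requests.
    """
    elapsed = 0.0
    while True:
        job = fetch(job_id, api_key)
        if job["status"] in TERMINAL_STATUSES:
            return job
        delay = poll_delay(elapsed)
        if elapsed + delay > timeout_secs:
            raise TimeoutError(
                f"job {job_id} still {job['status']} after {elapsed:.0f}s"
            )
        sleep(delay)
        elapsed += delay
```

The `fetch` and `sleep` parameters are injectable so the loop can be exercised without network access or real waiting.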

## Endpoints

### `POST /v1/extract` — Submit documents inline

Accepts `application/json` or `multipart/form-data`. Maximum **10 documents per request**; submit larger batches as multiple requests or use the presigned-upload flow.

**JSON body**:

```json
{
  "documents": [
    {
      "filename": "invoice.pdf",
      "mime_type": "application/pdf",
      "data": "<base64-encoded bytes>"
    }
  ],
  "options": {
    "extraction_config": {
      "output_format": "markdown",
      "ocr": { "backend": "tesseract", "language": "eng" }
    }
  },
  "webhook": {
    "url": "https://example.com/hooks/kreuzberg",
    "secret": "shared-secret-for-hmac",
    "metadata": { "request_id": "abc123" }
  }
}
```

`webhook` is optional for JSON requests (poll if you skip it). For `multipart/form-data` the `webhook` part is **required** — the API rejects multipart submissions without it.

`ocr.backend` only accepts `"tesseract"` on the cloud. VLM and other backends are validated and rejected.

**Response (HTTP 202)**:

```json
{
  "job_ids": ["550e8400-e29b-41d4-a716-446655440000"],
  "status": "pending"
}
```

Inline extraction is best for files ≤ 1 MB. For larger files, use the presigned-upload flow.
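
A sketch of how a client might assemble JSON bodies for this endpoint while respecting the 10-document limit. The helper name `build_extract_payloads` and the `(filename, mime_type, raw_bytes)` input shape are hypothetical, and the example assumes `data` uses the standard base64 alphabet.

```python
import base64
from typing import List, Optional, Tuple

MAX_DOCS_PER_REQUEST = 10  # per-request limit stated above


def build_extract_payloads(
    files: List[Tuple[str, str, bytes]],
    output_format: str = "markdown",
    webhook: Optional[dict] = None,
) -> List[dict]:
    """Split (filename, mime_type, raw_bytes) tuples into request bodies
    of at most 10 documents each."""
    payloads = []
    for start in range(0, len(files), MAX_DOCS_PER_REQUEST):
        batch = files[start:start + MAX_DOCS_PER_REQUEST]
        payload = {
            "documents": [
                {
                    "filename": name,
                    "mime_type": mime_type,
                    # Assumption: standard base64, decoded to an ASCII string.
                    "data": base64.b64encode(raw).decode("ascii"),
                }
                for name, mime_type, raw in batch
            ],
            "options": {
                "extraction_config": {
                    "output_format": output_format,
                    # Only "tesseract" is accepted on the cloud.
                    "ocr": {"backend": "tesseract", "language": "eng"},
                }
            },
        }
        if webhook is not None:
            payload["webhook"] = webhook  # optional for JSON requests
        payloads.append(payload)
    return payloads
```

Each returned dictionary can be serialized with `json.dumps` and sent as one `POST /v1/extract` request.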

### `POST /v1/uploads/presign` — Get presigned upload URLs

Use this flow for files larger than the roughly 1 MB recommended for inline submission. Send only document metadata; receive a `batch_id` and one presigned `PUT` URL per document.

**Request**:

```json
{
  "documents": [
    { "filename": "report.pdf", "mime_type": "application/pdf" }
  ],
  "config": {
    "output_format": "markdown",
    "ocr": { "backend": "tesseract", "language": "eng" }
  },
  "webhook": { "url": "https://example.com/hooks/kreuzberg", "secret": "..." }
}
```

Note that the presign body uses a top-level `config` field, not the `options.extraction_config` nesting that `POST /v1/extract` uses.

**Response**:

```json
{
  "batch_id": "...",
  "uploads": [
    {
      "job_id": "550e8400-e29b-41d4-a716-446655440000",
      "upload_url": "https://...",
      "object_key": "...",
      "method": "PUT",
      "expires_in_secs": 900
    }
  ]
}
```

Upload each file directly to its `upload_url` with `PUT` before the URL expires (see `expires_in_secs`), then call `POST /v1/uploads/confirm` with the `batch_id`.

### `POST /v1/uploads/confirm` — Trigger processing

```json
{ "batch_id": "<batch_id from presign>" }
```

**Response (HTTP 202)**:

```json
{
  "job_ids": ["..."],
  "status": "pending"
}
```
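
The presign, upload, and confirm sequence might be wired together roughly as below. This sketch assumes that `uploads` comes back in the same order as the submitted `documents`, and that the storage backend accepts a `Content-Type` header matching the declared MIME type; neither is stated by the API description. The helper names are hypothetical.

```python
import json
import urllib.request
from typing import List, Tuple


def build_upload_requests(
    presign_response: dict,
    documents: List[Tuple[str, bytes]],
) -> List[urllib.request.Request]:
    """Pair each presigned URL with its (mime_type, raw_bytes) document.

    Assumption: `uploads` preserves the order of the presign request.
    """
    uploads = presign_response["uploads"]
    if len(uploads) != len(documents):
        raise ValueError("presign response does not match the document count")
    requests = []
    for upload, (mime_type, raw) in zip(uploads, documents):
        requests.append(
            urllib.request.Request(
                upload["upload_url"],
                data=raw,
                method=upload.get("method", "PUT"),
                # Set explicitly: urllib would otherwise default to
                # application/x-www-form-urlencoded for requests with a body.
                headers={"Content-Type": mime_type},
            )
        )
    return requests


def confirm_body(presign_response: dict) -> bytes:
    """JSON body for POST /v1/uploads/confirm."""
    return json.dumps({"batch_id": presign_response["batch_id"]}).encode("utf-8")
```

Each request can be sent with `urllib.request.urlopen`; note that the presigned `PUT` goes to the storage URL and, in this sketch, carries no `Authorization` header.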

### `GET /v1/jobs/{id}` — Poll job status and results

Returns job metadata and, on completion, the full extraction result:

```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "filename": "invoice.pdf",
  "status": "completed",
  "created_at": "2026-05-08T10:00:00Z",
  "processing_time_ms": 1234,
  "result": {
    "content": [{ "text": "Invoice total: $1,234.56", "page_number": 1, "confidence": 0.95 }],
    "tables": [],
    "images": [],
    "metadata": { "title": "Invoice #12345", "page_count": 1 }
  }
}
```

`processing_time_ms` and `result` are only present once the job reaches `completed` or `partial_success`.
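
As an illustration of consuming this response, the sketch below groups extracted text by page. It assumes every entry in `result.content` carries `text` and `page_number` fields as in the example above; the function name `text_by_page` is hypothetical.

```python
from typing import Dict, List


def text_by_page(job: dict) -> Dict[int, str]:
    """Join the text blocks of a finished job, keyed by page number.

    Returns an empty dict while `result` is absent (job not finished).
    """
    result = job.get("result")
    if result is None:
        return {}
    pages: Dict[int, List[str]] = {}
    for block in result.get("content", []):
        pages.setdefault(block["page_number"], []).append(block["text"])
    # Sort so that pages come out in reading order.
    return {page: "\n".join(texts) for page, texts in sorted(pages.items())}
```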

### `GET /v1/usage` — Project usage and quota

Query params: `start` and `end` (both `YYYY-MM-DD`; default to the current calendar month).

**Response**:

```json
{
  "period_start": "2026-05-01",
  "period_end": "2026-06-01",
  "total_pages": 1234,
  "total_documents": 56,
  "total_failed": 0,
  "quota_limit": 10000,
  "quota_remaining": 8766,
  "by_mime_type": {
    "application/pdf": { "documents": 40, "pages": 1100, "failed": 0 },
    "image/png":       { "documents": 16, "pages": 134,  "failed": 0 }
  }
}
```

`quota_limit` and `quota_remaining` are `null` when the project has unlimited quota.
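
A small sketch of building the query string and handling the nullable quota fields. The helper names `usage_url` and `quota_used_fraction` are hypothetical, and treating a zero limit as fully used is a choice made for this example.

```python
from datetime import date
from typing import Optional
from urllib.parse import urlencode


def usage_url(start: Optional[date] = None, end: Optional[date] = None) -> str:
    """Build the /v1/usage URL; omitted dates fall back to the API defaults."""
    params = {}
    if start is not None:
        params["start"] = start.isoformat()  # YYYY-MM-DD
    if end is not None:
        params["end"] = end.isoformat()
    base = "https://api.kreuzberg.dev/v1/usage"
    return f"{base}?{urlencode(params)}" if params else base


def quota_used_fraction(usage: dict) -> Optional[float]:
    """Fraction of quota consumed, or None for unlimited projects."""
    limit = usage.get("quota_limit")
    remaining = usage.get("quota_remaining")
    if limit is None or remaining is None:
        return None  # null fields mean unlimited quota
    if limit == 0:
        return 1.0  # example choice: a zero limit counts as fully used
    return (limit - remaining) / limit
```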

### `GET /healthz`, `GET /readyz`

Liveness and readiness probes — open to all callers, no auth required.

## Errors

Errors are returned as JSON of the form `{"error": "<message>"}`. Status codes:

- `400` — invalid request (bad MIME type, missing field, malformed JSON, OCR backend other than Tesseract, more than 10 documents, missing multipart `webhook`)
- `401` — missing or invalid API key
- `404` — job/resource not found
- `429` — free page credit exhausted; add a payment method to continue
- `500` / `503` — internal or upstream failure
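
One way a client might interpret these responses is sketched below. Retrying only on `500` and `503` is an assumption of this example, not a documented recommendation, and the name `describe_error` is hypothetical.

```python
import json
from typing import Tuple

# Assumption: only internal/upstream failures are worth retrying.
RETRYABLE_STATUSES = {500, 503}


def describe_error(status: int, body: bytes) -> Tuple[str, bool]:
    """Return (message, should_retry) for a non-2xx response."""
    try:
        message = json.loads(body).get("error", "")
    except (ValueError, AttributeError):
        # Body was not a JSON object; fall back to the raw text.
        message = body.decode("utf-8", errors="replace")
    return message, status in RETRYABLE_STATUSES
```

Because `429` here signals exhausted free page credit, the sketch does not retry it; waiting would not change the outcome until a payment method is added.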

## Webhook payload

When a job completes (or fails), Kreuzberg POSTs to your `webhook.url`.

- **Signature**: HMAC-SHA256 over the raw request body, using `webhook.secret` as the key, hex-encoded. Sent as `X-Webhook-Signature: sha256=<hex>`. The header is only present when you supply a `secret`.
- **Body**: same shape as `GET /v1/jobs/{id}`, plus any `webhook.metadata` you supplied at submit time.
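
A receiver could check the signature along these lines. This is a minimal sketch that assumes the secret is UTF-8 encoded before use as the HMAC key and that hex-digit case is not significant; the function name `verify_signature` is hypothetical.

```python
import hashlib
import hmac


def verify_signature(raw_body: bytes, secret: str, header_value: str) -> bool:
    """Check an `X-Webhook-Signature: sha256=<hex>` header against the raw body."""
    prefix = "sha256="
    if not header_value.startswith(prefix):
        return False
    # Assumption: the secret string is used as UTF-8 bytes for the HMAC key.
    expected = hmac.new(
        secret.encode("utf-8"), raw_body, hashlib.sha256
    ).hexdigest()
    received = header_value[len(prefix):].lower()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("ascii"), received.encode("utf-8"))
```

The check must run on the exact bytes received, before any JSON parsing or re-serialization, since the signature covers the raw request body.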

## See also

- [Capabilities](https://kreuzberg.dev/llms/capabilities.md)
- [Getting started](https://kreuzberg.dev/llms/getting-started.md)
- [Pricing](https://kreuzberg.dev/llms/pricing.md)
- [Full documentation](https://docs.kreuzberg.dev/)
