API Server¶
Kreuzberg includes a built-in REST API server powered by Litestar for document extraction over HTTP.
Installation¶
Install Kreuzberg with the API extra:
Running the API Server¶
Using Python¶
Using Litestar CLI¶
With Custom Settings¶
API Endpoints¶
Health Check¶
Returns the server status:
Extract Files¶
Extract text from one or more files.
Request:
- Method: POST
- Content-Type: multipart/form-data
- Body: One or more files with field name data
- Maximum file size: Configurable via the KREUZBERG_MAX_UPLOAD_SIZE environment variable (default: 1GB per file)
Response:
- Status: 201 Created
- Body: Array of extraction results
Example:
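As a minimal sketch of the request described above, the following builds (but does not send) a multipart POST using only the Python standard library. The helper name build_extract_request, the sample file, and the base URL http://localhost:8000 are illustrative assumptions; only the /extract path, the POST method, and the data field name come from this page.

```python
import uuid
from urllib.request import Request


def build_extract_request(files, base_url="http://localhost:8000"):
    """Build a multipart/form-data POST to /extract without sending it.

    files: list of (filename, content_bytes, mime_type) tuples.
    Every file is attached under the form field name "data".
    """
    boundary = uuid.uuid4().hex
    parts = []
    for filename, content, mime_type in files:
        part_header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="data"; filename="{filename}"\r\n'
            f"Content-Type: {mime_type}\r\n\r\n"
        )
        parts.append(part_header.encode("utf-8") + content + b"\r\n")
    # The closing boundary marks the end of the multipart body.
    body = b"".join(parts) + f"--{boundary}--\r\n".encode("utf-8")
    return Request(
        f"{base_url}/extract",
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


request = build_extract_request([("notes.txt", b"hello", "text/plain")])
print(request.get_method(), request.full_url)
```

With a server running, passing the request to urllib.request.urlopen should yield the 201 response and the JSON array of results described above.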
Response Format:
Runtime Configuration¶
The /extract endpoint supports runtime configuration via query parameters and HTTP headers, allowing you to customize extraction behavior without requiring static configuration files.
Query Parameters¶
Configure extraction options directly via URL query parameters:
Enable chunking with custom settings:
Extract entities and keywords:
Force OCR with specific backend:
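The three cases above can be sketched as URLs built with the standard library. The helper extract_url, the base URL, and the specific values (500, 50, 5) are illustrative assumptions; the parameter names are the documented ones.

```python
from urllib.parse import urlencode

# Assumed local address; adjust to wherever the server is listening.
BASE_URL = "http://localhost:8000/extract"


def extract_url(**params):
    """Return the /extract URL, serializing Python booleans as true/false."""
    encoded = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
    }
    return f"{BASE_URL}?{urlencode(encoded)}"


# Enable chunking with custom settings
chunking_url = extract_url(chunk_content=True, max_chars=500, max_overlap=50)
# Extract entities and keywords
entities_url = extract_url(extract_entities=True, extract_keywords=True, keyword_count=5)
# Force OCR with a specific backend
ocr_url = extract_url(force_ocr=True, ocr_backend="tesseract")

print(chunking_url)
```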
Image Extraction and OCR¶
Kreuzberg can extract embedded images from various document formats and optionally run OCR on them to extract text content:
Image Extraction Response Format:
When image extraction is enabled, the response includes additional fields:
Advanced Image OCR Configuration:
For complex image OCR scenarios, use header-based configuration:
Supported Document Types for Image Extraction:
- PDF documents: Embedded images, graphics, and charts
- PowerPoint presentations (PPTX): Slide images, shapes, and media
- HTML documents: Inline images and base64-encoded images
- Microsoft Word documents (DOCX): Embedded images and charts
- Email files (EML, MSG): Image attachments and inline images
Enable language detection:
Supported Query Parameters:
- chunk_content (boolean): Enable content chunking
- max_chars (integer): Maximum characters per chunk
- max_overlap (integer): Overlap between chunks in characters
- extract_tables (boolean): Enable table extraction
- extract_entities (boolean): Enable named entity extraction
- extract_keywords (boolean): Enable keyword extraction
- keyword_count (integer): Number of keywords to extract
- force_ocr (boolean): Force OCR processing
- ocr_backend (string): OCR engine (tesseract, easyocr, paddleocr)
- auto_detect_language (boolean): Enable automatic language detection
- pdf_password (string): Password for encrypted PDFs
- extract_images (boolean): Extract embedded images from supported formats (PDF, PPTX, HTML, Office, Email)
- ocr_extracted_images (boolean): Run OCR on extracted images to get text content
- image_ocr_backend (string): OCR engine to use for images (tesseract, easyocr, paddleocr)
- image_ocr_min_width / image_ocr_min_height (integer): Minimum image dimensions for OCR eligibility
- image_ocr_max_width / image_ocr_max_height (integer): Maximum image dimensions for OCR processing
- deduplicate_images (boolean): Remove duplicate images by content hash (enabled by default)
Boolean Parameter Formats:
Query parameters accept flexible boolean values:
- true, false
- 1, 0
- yes, no
- on, off
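To make the accepted values concrete, here is a small client-side sketch of a parser that recognizes the same pairs. It is not the server's implementation; the function name parse_bool is hypothetical, and treating input as case-insensitive is an assumption.

```python
TRUE_VALUES = {"true", "1", "yes", "on"}
FALSE_VALUES = {"false", "0", "no", "off"}


def parse_bool(raw):
    """Map a documented boolean string to True/False, rejecting anything else."""
    value = raw.strip().lower()  # assumption: matching ignores case and whitespace
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError(f"not a recognized boolean value: {raw!r}")


print(parse_bool("yes"), parse_bool("0"))
```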
Header Configuration¶
For complex nested configurations, use the X-Extraction-Config header with JSON format:
Basic header configuration:
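A minimal sketch of preparing that header on the client side follows. It assumes the JSON keys mirror the query parameter names listed earlier; the helper name extraction_config_header and the chosen values are illustrative.

```python
import json


def extraction_config_header(config):
    """Serialize a config dict into the X-Extraction-Config header."""
    # Compact separators keep the header value short.
    return {"X-Extraction-Config": json.dumps(config, separators=(",", ":"))}


headers = extraction_config_header({"chunk_content": True, "max_chars": 1000})
print(headers["X-Extraction-Config"])
```

The resulting dictionary can be merged into the headers of any HTTP client request sent to /extract.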
Advanced OCR configuration:
GMFT Deprecation
GMFT-based table extraction is deprecated and scheduled for removal in Kreuzberg v4.0. The example below exists for legacy users; plan to migrate to the TATR-based table extraction pipeline before upgrading.
Table extraction with GMFT configuration:
Configuration Precedence¶
When multiple configuration sources are present, they are merged with the following precedence:
- Header config (highest priority) - X-Extraction-Config header
- Query params - URL query parameters
- Static config - kreuzberg.toml or pyproject.toml files
- Defaults (lowest priority) - Built-in default values
Header overrides query parameters:
Result: max_chars will be 500 (from header)
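The precedence order can be illustrated with a shallow dictionary merge in which later sources overwrite earlier ones. This is a sketch of the rule, not the server's code: the function merge_config and all sample values are made up for illustration, and how nested keys are merged is not specified here.

```python
def merge_config(defaults, static, query, header):
    """Merge config sources from lowest to highest priority."""
    merged = {}
    for source in (defaults, static, query, header):
        merged.update(source)  # later (higher-priority) sources win
    return merged


effective = merge_config(
    defaults={"chunk_content": False, "max_chars": 2000},  # illustrative defaults
    static={"max_chars": 1500},                            # e.g. from kreuzberg.toml
    query={"chunk_content": True, "max_chars": 1000},      # URL query parameters
    header={"max_chars": 500},                             # X-Extraction-Config
)
print(effective)  # {'chunk_content': True, 'max_chars': 500}
```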
Interactive API Documentation¶
Kreuzberg automatically generates comprehensive OpenAPI documentation that you can access through your web browser when the API server is running.
Accessing the Documentation¶
Once the API server is running, you can access interactive documentation at:
- OpenAPI Schema: http://localhost:8000/schema/openapi.json
- Swagger UI: http://localhost:8000/schema/swagger
- ReDoc Documentation: http://localhost:8000/schema/redoc
- Stoplight Elements: http://localhost:8000/schema/elements
- RapiDoc: http://localhost:8000/schema/rapidoc
Features¶
The interactive documentation provides:
- Complete API Reference: All endpoints with detailed parameter descriptions
- Try It Out: Test API endpoints directly from the browser
- Request/Response Examples: Sample requests and responses for each endpoint
- Schema Validation: Interactive validation of request parameters
- Download Options: Export the OpenAPI specification
Example Usage¶
The documentation includes examples for all configuration options, making it easy to understand the full capabilities of the extraction API.
Error Handling¶
Invalid configuration returns appropriate error responses:
Error Handling¶
The API uses standard HTTP status codes:
- 200 OK: Successful health check
- 201 Created: Successful extraction
- 400 Bad Request: Validation error (e.g., invalid file format)
- 422 Unprocessable Entity: Parsing error (e.g., corrupted file)
- 500 Internal Server Error: Unexpected error
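A client might translate these codes into readable outcomes along the following lines. The names STATUS_MEANINGS, describe_status, and is_success are hypothetical, and the sketch covers only the codes listed above.

```python
# Documented status codes and what they indicate.
STATUS_MEANINGS = {
    200: "successful health check",
    201: "successful extraction",
    400: "validation error (e.g., invalid file format)",
    422: "parsing error (e.g., corrupted file)",
    500: "unexpected server error",
}


def describe_status(status):
    """Return the documented meaning of a status code, if any."""
    return STATUS_MEANINGS.get(status, "undocumented status")


def is_success(status):
    """True for the two documented success codes."""
    return status in (200, 201)


print(describe_status(422), is_success(201))
```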
Error responses include:
Debugging 500 Errors¶
For detailed error information when 500 Internal Server Errors occur, set the DEBUG environment variable:
When DEBUG=1 is set, 500 errors will include:
- Full stack traces
- Detailed error context
- Internal state information
- Request debugging details
⚠️ Warning: Only enable debug mode in development environments. Debug output may expose sensitive details, so never enable it in production.
Features¶
- Runtime Configuration: Configure extraction via query parameters and HTTP headers
- Batch Processing: Extract from multiple files in a single request
- Automatic Format Detection: Detects file types from MIME types
- OCR Support: Automatically applies OCR to images and scanned PDFs
- Configuration Precedence: Flexible configuration merging with clear precedence
- Structured Logging: Uses structlog for detailed logging
- OpenTelemetry: Built-in observability support
- Async Processing: High-performance async request handling
Configuration¶
The API server uses the default Kreuzberg extraction configuration:
- Tesseract OCR is included by default
- PDF, image, and document extraction are supported
- Table extraction with GMFT (if installed)
Environment Variables¶
The API server can be configured using environment variables for production deployments:
Server Configuration¶
| Variable | Description | Default | Example |
|---|---|---|---|
| KREUZBERG_MAX_UPLOAD_SIZE | Maximum upload size in bytes | 1073741824 (1GB) | 2147483648 (2GB) |
| KREUZBERG_ENABLE_OPENTELEMETRY | Enable OpenTelemetry tracing | true | false |
Usage Examples¶
Note: Boolean environment variables accept true/false, 1/0, yes/no, or on/off values.
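As a rough illustration of how these two variables and their defaults fit together, the sketch below reads them from an environment mapping. It is not the server's own loading code: the function read_server_settings and the returned key names are hypothetical, and treating unrecognized boolean strings as false is an assumption.

```python
import os

DEFAULT_MAX_UPLOAD_SIZE = 1_073_741_824  # 1GB, the documented default
TRUTHY = {"true", "1", "yes", "on"}


def read_server_settings(env=None):
    """Read the documented variables from env (defaults to os.environ)."""
    env = os.environ if env is None else env
    max_upload = int(env.get("KREUZBERG_MAX_UPLOAD_SIZE", DEFAULT_MAX_UPLOAD_SIZE))
    otel_raw = str(env.get("KREUZBERG_ENABLE_OPENTELEMETRY", "true"))
    return {
        "max_upload_size": max_upload,
        # Assumption: anything outside the truthy set counts as disabled.
        "enable_opentelemetry": otel_raw.strip().lower() in TRUTHY,
    }


custom = read_server_settings(
    {"KREUZBERG_MAX_UPLOAD_SIZE": "2147483648", "KREUZBERG_ENABLE_OPENTELEMETRY": "off"}
)
print(custom)  # {'max_upload_size': 2147483648, 'enable_opentelemetry': False}
```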
To use custom configuration, modify the extraction call in your own API wrapper:
Production Deployment¶
For production use, consider:
- Reverse Proxy: Use nginx or similar for SSL termination
- Process Manager: Use systemd, supervisor, or similar
- Workers: Run multiple workers with uvicorn or gunicorn
- Monitoring: Enable OpenTelemetry exporters
- Rate Limiting: Add rate limiting middleware
- Authentication: Add authentication middleware if needed
- Security: Ensure the DEBUG environment variable is not set
Example production command: