Docker
Kreuzberg provides official Docker images for easy deployment and containerized usage.
Available Images
Docker images are available on Docker Hub:
goldziher/kreuzberg:latest
- Core image with API server and Tesseract OCR goldziher/kreuzberg:latest-easyocr
- With EasyOCR support goldziher/kreuzberg:latest-paddle
- With PaddleOCR support goldziher/kreuzberg:latest-gmft
- With GMFT table extraction goldziher/kreuzberg:latest-all
- With all optional dependencies
Note: Specific version tags are also available (e.g., v3.8.0
, v3.8.0-easyocr
)
Quick Start
Running the API Server
| # Pull the latest image
docker pull goldziher/kreuzberg:latest
# Run the API server
docker run -p 8000:8000 goldziher/kreuzberg:latest
|
The API server will be available at http://localhost:8000
.
| # Extract a single file
curl -X POST http://localhost:8000/extract \
-F "data=@document.pdf"
# Extract multiple files
curl -X POST http://localhost:8000/extract \
-F "data=@document1.pdf" \
-F "data=@document2.docx"
|
Docker Compose
Create a docker-compose.yml
:
| services:
kreuzberg:
image: goldziher/kreuzberg:latest
ports:
- "8000:8000"
environment:
- PYTHONUNBUFFERED=1
restart: unless-stopped
|
Run with:
Using Different OCR Engines
EasyOCR
| docker run -p 8000:8000 goldziher/kreuzberg:latest-easyocr
|
PaddleOCR
| docker run -p 8000:8000 goldziher/kreuzberg:latest-paddle
|
All Features
| docker run -p 8000:8000 goldziher/kreuzberg:latest-all
|
Configuration
Using Configuration Files
You can provide configuration files to customize Kreuzberg's behavior:
| # Create a configuration file
cat > kreuzberg.toml << EOF
force_ocr = false
chunk_content = true
extract_tables = true
ocr_backend = "tesseract"
auto_detect_language = true
[tesseract]
language = "eng"
psm = 6
[gmft]
detector_base_threshold = 0.9
remove_null_rows = true
EOF
# Mount the configuration file
docker run -p 8000:8000 \
-v "$(pwd)/kreuzberg.toml:/app/kreuzberg.toml" \
goldziher/kreuzberg:latest
|
Docker Compose with Configuration
| services:
kreuzberg:
image: goldziher/kreuzberg:latest
ports:
- "8000:8000"
volumes:
- "./kreuzberg.toml:/app/kreuzberg.toml"
environment:
- PYTHONUNBUFFERED=1
restart: unless-stopped
|
Configuration Examples
For OCR-heavy workloads:
| force_ocr = true
ocr_backend = "tesseract"
[tesseract]
language = "eng+deu+fra" # Multiple languages
psm = 6 # Uniform block of text
|
For table extraction:
| extract_tables = true
chunk_content = true
max_chars = 2000
[gmft]
detector_base_threshold = 0.85
remove_null_rows = true
enable_multi_header = true
|
For multilingual documents:
| auto_detect_language = true
extract_entities = true
extract_keywords = true
[language_detection]
multilingual = true
top_k = 5
[spacy_entity_extraction]
language_models = { en = "en_core_web_sm", de = "de_core_news_sm" }
|
Checking Configuration
You can verify what configuration is loaded:
| # Check configuration via API
curl http://localhost:8000/config
|
Building Custom Images
If you need a custom configuration, you can build your own image:
| FROM goldziher/kreuzberg:latest
# Add custom dependencies
RUN pip install your-custom-package
# Add custom configuration file
COPY kreuzberg.toml /app/kreuzberg.toml
# Or copy a pyproject.toml with [tool.kreuzberg] section
COPY pyproject.toml /app/pyproject.toml
|
Image Details
Base Image
- Based on
python:3.13-bookworm
(requires Python 3.10+) - Includes system dependencies:
pandoc
, tesseract-ocr
- Runs as non-root user
appuser
- Exposes port 8000
Included Dependencies
All images include:
- Kreuzberg core library
- Litestar API framework
- Tesseract OCR
- Pandoc for document conversion
Additional dependencies by variant:
- easyocr: EasyOCR deep learning models
- paddle: PaddleOCR and PaddlePaddle
- gmft: GMFT for table extraction
- all: All optional dependencies
Health Check
All Docker images include a health check endpoint:
| # Check API health
curl http://localhost:8000/health
|
Returns a JSON response with service status and version information.
Observability
The Docker images include built-in OpenTelemetry instrumentation via Litestar:
- Tracing: Automatic request/response tracing
- Metrics: Performance and usage metrics
- Logging: Structured JSON logging
Configure via standard OpenTelemetry environment variables:
| docker run -p 8000:8000 \
-e OTEL_SERVICE_NAME=kreuzberg-api \
-e OTEL_EXPORTER_OTLP_ENDPOINT=http://your-collector:4317 \
goldziher/kreuzberg:latest
|
Environment Variables
PYTHONUNBUFFERED=1
- Ensures proper logging output PYTHONDONTWRITEBYTECODE=1
- Prevents .pyc file creation UV_LINK_MODE=copy
- Optimizes package installation
Production Deployment
With nginx Reverse Proxy
| server {
listen 80;
server_name api.example.com;
location / {
proxy_pass http://localhost:8000;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# File upload settings
client_max_body_size 100M;
proxy_read_timeout 300s;
}
# Health check endpoint
location /health {
proxy_pass http://localhost:8000/health;
access_log off;
}
}
|
Kubernetes Deployment
| apiVersion: apps/v1
kind: Deployment
metadata:
name: kreuzberg-api
spec:
replicas: 3
selector:
matchLabels:
app: kreuzberg-api
template:
metadata:
labels:
app: kreuzberg-api
spec:
containers:
- name: kreuzberg
image: goldziher/kreuzberg:latest
ports:
- containerPort: 8000
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 5
periodSeconds: 5
env:
- name: OTEL_SERVICE_NAME
value: "kreuzberg-api"
resources:
requests:
memory: "512Mi"
cpu: "500m"
limits:
memory: "2Gi"
cpu: "2000m"
---
apiVersion: v1
kind: Service
metadata:
name: kreuzberg-api
spec:
selector:
app: kreuzberg-api
ports:
- port: 80
targetPort: 8000
type: LoadBalancer
|
Resource Requirements
Minimum Requirements
- CPU: 1 core
- Memory: 512MB
- Storage: 1GB (more for OCR models)
Recommended for Production
- CPU: 2+ cores
- Memory: 2GB+ (4GB+ for EasyOCR/PaddleOCR)
- Storage: 5GB+ (depends on OCR models)
OCR Model Sizes
- Tesseract: ~100MB
- EasyOCR: ~64MB-2.5GB per language
- PaddleOCR: ~400MB
Troubleshooting
Container Logs
| docker logs <container-id>
|
Shell Access
| docker exec -it <container-id> /bin/bash
|
Common Issues
- Out of Memory: Increase Docker memory allocation or use a smaller OCR engine
- Slow Startup: First run downloads OCR models; subsequent runs are faster
- Permission Denied: Ensure mounted volumes have correct permissions
Security Considerations
- Runs as non-root user by default
- No external API calls or cloud dependencies
- Process files locally within the container
- Consider adding authentication for production use
- Use volume mounts carefully to limit file system access