Docker¶

Kreuzberg provides official Docker images for easy deployment and containerized usage.

Available Images¶

Docker images are available on Docker Hub:

goldziher/kreuzberg:latest - Core image with API server and Tesseract OCR
goldziher/kreuzberg:latest-easyocr - With EasyOCR support
goldziher/kreuzberg:latest-paddle - With PaddleOCR support
goldziher/kreuzberg:latest-gmft - With GMFT table extraction
goldziher/kreuzberg:latest-all - With all optional dependencies

Note: Specific version tags are also available (e.g., v3.8.0, v3.8.0-easyocr)

Quick Start¶

Running the API Server¶

# Pull the latest image
docker pull goldziher/kreuzberg:latest

# Run the API server
docker run -p 8000:8000 goldziher/kreuzberg:latest

The API server will be available at http://localhost:8000.

Extract Files¶

# Extract a single file
curl -X POST http://localhost:8000/extract \
  -F "data=@document.pdf"

# Extract multiple files
curl -X POST http://localhost:8000/extract \
  -F "data=@document1.pdf" \
  -F "data=@document2.docx"

Docker Compose¶

Create a docker-compose.yml:

services:
  kreuzberg:
    image: goldziher/kreuzberg:latest
    ports:
      - "8000:8000"
    environment:
      - PYTHONUNBUFFERED=1
    restart: unless-stopped

Run with:

1	`docker-compose up -d`

Using Different OCR Engines¶

EasyOCR¶

docker run -p 8000:8000 goldziher/kreuzberg:latest-easyocr

PaddleOCR¶

docker run -p 8000:8000 goldziher/kreuzberg:latest-paddle

All Features¶

docker run -p 8000:8000 goldziher/kreuzberg:latest-all

Configuration¶

Using Configuration Files¶

You can provide configuration files to customize Kreuzberg's behavior:

# Create a configuration file
cat > kreuzberg.toml << EOF
force_ocr = false
chunk_content = true
extract_tables = true
ocr_backend = "tesseract"
auto_detect_language = true

[tesseract]
language = "eng"
psm = 6

[gmft]
detector_base_threshold = 0.9
remove_null_rows = true
EOF

# Mount the configuration file
docker run -p 8000:8000 \
  -v "$(pwd)/kreuzberg.toml:/app/kreuzberg.toml" \
  goldziher/kreuzberg:latest

Docker Compose with Configuration¶

services:
  kreuzberg:
    image: goldziher/kreuzberg:latest
    ports:
      - "8000:8000"
    volumes:
      - "./kreuzberg.toml:/app/kreuzberg.toml"
    environment:
      - PYTHONUNBUFFERED=1
    restart: unless-stopped

Configuration Examples¶

For OCR-heavy workloads:

force_ocr = true
ocr_backend = "tesseract"

[tesseract]
language = "eng+deu+fra"  # Multiple languages
psm = 6                   # Uniform block of text

For table extraction:

extract_tables = true
chunk_content = true
max_chars = 2000

[gmft]
detector_base_threshold = 0.85
remove_null_rows = true
enable_multi_header = true

For multilingual documents:

auto_detect_language = true
extract_entities = true
extract_keywords = true

[language_detection]
multilingual = true
top_k = 5

[spacy_entity_extraction]
language_models = { en = "en_core_web_sm", de = "de_core_news_sm" }

Checking Configuration¶

You can verify what configuration is loaded:

1 2	`# Check configuration via API curl http://localhost:8000/config`

Building Custom Images¶

If you need a custom configuration, you can build your own image:

FROM goldziher/kreuzberg:latest

# Add custom dependencies
RUN pip install your-custom-package

# Add custom configuration file
COPY kreuzberg.toml /app/kreuzberg.toml

# Or copy a pyproject.toml with [tool.kreuzberg] section
COPY pyproject.toml /app/pyproject.toml

Image Details¶

Base Image¶

Based on python:3.13-bookworm (requires Python 3.10+)
Includes system dependencies: pandoc, tesseract-ocr
Runs as non-root user appuser
Exposes port 8000

Included Dependencies¶

All images include:

Kreuzberg core library
Litestar API framework
Tesseract OCR
Pandoc for document conversion

Additional dependencies by variant:

easyocr: EasyOCR deep learning models
paddle: PaddleOCR and PaddlePaddle
gmft: GMFT for table extraction
all: All optional dependencies

Health Check¶

All Docker images include a health check endpoint:

1 2	`# Check API health curl http://localhost:8000/health`

Returns a JSON response with service status and version information.

Observability¶

The Docker images include built-in OpenTelemetry instrumentation via Litestar:

Tracing: Automatic request/response tracing
Metrics: Performance and usage metrics
Logging: Structured JSON logging

Configure via standard OpenTelemetry environment variables:

docker run -p 8000:8000 \
  -e OTEL_SERVICE_NAME=kreuzberg-api \
  -e OTEL_EXPORTER_OTLP_ENDPOINT=http://your-collector:4317 \
  goldziher/kreuzberg:latest

Environment Variables¶

PYTHONUNBUFFERED=1 - Ensures proper logging output
PYTHONDONTWRITEBYTECODE=1 - Prevents .pyc file creation
UV_LINK_MODE=copy - Optimizes package installation

Production Deployment¶

With nginx Reverse Proxy¶

server {
    listen 80;
    server_name api.example.com;

    location / {
        proxy_pass http://localhost:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # File upload settings
        client_max_body_size 100M;
        proxy_read_timeout 300s;
    }

    # Health check endpoint
    location /health {
        proxy_pass http://localhost:8000/health;
        access_log off;
    }
}

Kubernetes Deployment¶

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kreuzberg-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: kreuzberg-api
  template:
    metadata:
      labels:
        app: kreuzberg-api
    spec:
      containers:
      - name: kreuzberg
        image: goldziher/kreuzberg:latest
        ports:
        - containerPort: 8000
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
        env:
        - name: OTEL_SERVICE_NAME
          value: "kreuzberg-api"
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "2000m"
---
apiVersion: v1
kind: Service
metadata:
  name: kreuzberg-api
spec:
  selector:
    app: kreuzberg-api
  ports:
    - port: 80
      targetPort: 8000
  type: LoadBalancer

Resource Requirements¶

Minimum Requirements¶

CPU: 1 core
Memory: 512MB
Storage: 1GB (more for OCR models)

Recommended for Production¶

CPU: 2+ cores
Memory: 2GB+ (4GB+ for EasyOCR/PaddleOCR)
Storage: 5GB+ (depends on OCR models)

OCR Model Sizes¶

Tesseract: ~100MB
EasyOCR: ~64MB-2.5GB per language
PaddleOCR: ~400MB

Troubleshooting¶

Container Logs¶

1	`docker logs <container-id>`

Shell Access¶

docker exec -it <container-id> /bin/bash

Common Issues¶

Out of Memory: Increase Docker memory allocation or use a smaller OCR engine
Slow Startup: First run downloads OCR models; subsequent runs are faster
Permission Denied: Ensure mounted volumes have correct permissions

Security Considerations¶

Runs as non-root user by default
No external API calls or cloud dependencies
Process files locally within the container
Consider adding authentication for production use
Use volume mounts carefully to limit file system access