Skip to content

Docker

Kreuzberg provides official Docker images for easy deployment and containerized usage.

Available Images

Docker images are available on Docker Hub:

  • goldziher/kreuzberg:latest - Core image with API server and Tesseract OCR
  • goldziher/kreuzberg:latest-easyocr - With EasyOCR support
  • goldziher/kreuzberg:latest-paddle - With PaddleOCR support
  • goldziher/kreuzberg:latest-gmft - With GMFT table extraction
  • goldziher/kreuzberg:latest-all - With all optional dependencies

Note: Specific version tags are also available (e.g., v3.8.0, v3.8.0-easyocr)

Quick Start

Running the API Server

1
2
3
4
5
# Pull the latest image
docker pull goldziher/kreuzberg:latest

# Run the API server
docker run -p 8000:8000 goldziher/kreuzberg:latest

The API server will be available at http://localhost:8000.

Extract Files

1
2
3
4
5
6
7
8
# Extract a single file
curl -X POST http://localhost:8000/extract \
  -F "data=@document.pdf"

# Extract multiple files
curl -X POST http://localhost:8000/extract \
  -F "data=@document1.pdf" \
  -F "data=@document2.docx"

Docker Compose

Create a docker-compose.yml:

1
2
3
4
5
6
7
8
services:
  kreuzberg:
    image: goldziher/kreuzberg:latest
    ports:
      - "8000:8000"
    environment:
      - PYTHONUNBUFFERED=1
    restart: unless-stopped

Run with:

docker-compose up -d

Using Different OCR Engines

EasyOCR

docker run -p 8000:8000 goldziher/kreuzberg:latest-easyocr

PaddleOCR

docker run -p 8000:8000 goldziher/kreuzberg:latest-paddle

All Features

docker run -p 8000:8000 goldziher/kreuzberg:latest-all

Configuration

Using Configuration Files

You can provide configuration files to customize Kreuzberg's behavior:

# Create a configuration file
cat > kreuzberg.toml << EOF
force_ocr = false
chunk_content = true
extract_tables = true
ocr_backend = "tesseract"
auto_detect_language = true

[tesseract]
language = "eng"
psm = 6

[gmft]
detector_base_threshold = 0.9
remove_null_rows = true
EOF

# Mount the configuration file
docker run -p 8000:8000 \
  -v "$(pwd)/kreuzberg.toml:/app/kreuzberg.toml" \
  goldziher/kreuzberg:latest

Docker Compose with Configuration

services:
  kreuzberg:
    image: goldziher/kreuzberg:latest
    ports:
      - "8000:8000"
    volumes:
      - "./kreuzberg.toml:/app/kreuzberg.toml"
    environment:
      - PYTHONUNBUFFERED=1
    restart: unless-stopped

Configuration Examples

For OCR-heavy workloads:

1
2
3
4
5
6
force_ocr = true
ocr_backend = "tesseract"

[tesseract]
language = "eng+deu+fra"  # Multiple languages
psm = 6                   # Uniform block of text

For table extraction:

1
2
3
4
5
6
7
8
extract_tables = true
chunk_content = true
max_chars = 2000

[gmft]
detector_base_threshold = 0.85
remove_null_rows = true
enable_multi_header = true

For multilingual documents:

auto_detect_language = true
extract_entities = true
extract_keywords = true

[language_detection]
multilingual = true
top_k = 5

[spacy_entity_extraction]
language_models = { en = "en_core_web_sm", de = "de_core_news_sm" }

Checking Configuration

You can verify what configuration is loaded:

# Check configuration via API
curl http://localhost:8000/config

Building Custom Images

If you need a custom configuration, you can build your own image:

FROM goldziher/kreuzberg:latest

# Add custom dependencies
RUN pip install your-custom-package

# Add custom configuration file
COPY kreuzberg.toml /app/kreuzberg.toml

# Or copy a pyproject.toml with [tool.kreuzberg] section
COPY pyproject.toml /app/pyproject.toml

Image Details

Base Image

  • Based on python:3.13-bookworm (requires Python 3.10+)
  • Includes system dependencies: pandoc, tesseract-ocr
  • Runs as non-root user appuser
  • Exposes port 8000

Included Dependencies

All images include:

  • Kreuzberg core library
  • Litestar API framework
  • Tesseract OCR
  • Pandoc for document conversion

Additional dependencies by variant:

  • easyocr: EasyOCR deep learning models
  • paddle: PaddleOCR and PaddlePaddle
  • gmft: GMFT for table extraction
  • all: All optional dependencies

Health Check

All Docker images include a health check endpoint:

# Check API health
curl http://localhost:8000/health

Returns a JSON response with service status and version information.

Observability

The Docker images include built-in OpenTelemetry instrumentation via Litestar:

  • Tracing: Automatic request/response tracing
  • Metrics: Performance and usage metrics
  • Logging: Structured JSON logging

Configure via standard OpenTelemetry environment variables:

1
2
3
4
docker run -p 8000:8000 \
  -e OTEL_SERVICE_NAME=kreuzberg-api \
  -e OTEL_EXPORTER_OTLP_ENDPOINT=http://your-collector:4317 \
  goldziher/kreuzberg:latest

Environment Variables

  • PYTHONUNBUFFERED=1 - Ensures proper logging output
  • PYTHONDONTWRITEBYTECODE=1 - Prevents .pyc file creation
  • UV_LINK_MODE=copy - Optimizes package installation

Production Deployment

With nginx Reverse Proxy

server {
    listen 80;
    server_name api.example.com;

    location / {
        proxy_pass http://localhost:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # File upload settings
        client_max_body_size 100M;
        proxy_read_timeout 300s;
    }

    # Health check endpoint
    location /health {
        proxy_pass http://localhost:8000/health;
        access_log off;
    }
}

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kreuzberg-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: kreuzberg-api
  template:
    metadata:
      labels:
        app: kreuzberg-api
    spec:
      containers:
      - name: kreuzberg
        image: goldziher/kreuzberg:latest
        ports:
        - containerPort: 8000
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
        env:
        - name: OTEL_SERVICE_NAME
          value: "kreuzberg-api"
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "2000m"
---
apiVersion: v1
kind: Service
metadata:
  name: kreuzberg-api
spec:
  selector:
    app: kreuzberg-api
  ports:
    - port: 80
      targetPort: 8000
  type: LoadBalancer

Resource Requirements

Minimum Requirements

  • CPU: 1 core
  • Memory: 512MB
  • Storage: 1GB (more for OCR models)
  • CPU: 2+ cores
  • Memory: 2GB+ (4GB+ for EasyOCR/PaddleOCR)
  • Storage: 5GB+ (depends on OCR models)

OCR Model Sizes

  • Tesseract: ~100MB
  • EasyOCR: ~64MB-2.5GB per language
  • PaddleOCR: ~400MB

Troubleshooting

Container Logs

docker logs <container-id>

Shell Access

docker exec -it <container-id> /bin/bash

Common Issues

  1. Out of Memory: Increase Docker memory allocation or use a smaller OCR engine
  2. Slow Startup: First run downloads OCR models; subsequent runs are faster
  3. Permission Denied: Ensure mounted volumes have correct permissions

Security Considerations

  • Runs as non-root user by default
  • No external API calls or cloud dependencies
  • Process files locally within the container
  • Consider adding authentication for production use
  • Use volume mounts carefully to limit file system access