Skip to content

CLI Usage

The Kreuzberg CLI provides command-line access to all extraction features. This guide covers installation, basic usage, and advanced features.

Installation

brew install kreuzberg
cargo install kreuzberg-cli
docker pull goldziher/kreuzberg:latest
docker run -v $(pwd):/data goldziher/kreuzberg:latest extract /data/document.pdf

Basic Usage

Extract from Single File

# Extract to stdout
kreuzberg extract document.pdf

# Save to file
kreuzberg extract document.pdf -o output.txt

# Extract with metadata
kreuzberg extract document.pdf --metadata

Extract from Multiple Files

# Multiple files
kreuzberg extract doc1.pdf doc2.docx doc3.pptx

# Glob patterns
kreuzberg extract documents/**/*.pdf

# Directory (all files)
kreuzberg extract documents/

Output Formats

# Plain text (default)
kreuzberg extract document.pdf

# JSON output
kreuzberg extract document.pdf --format json

# JSON with metadata
kreuzberg extract document.pdf --format json --metadata

# Pretty JSON
kreuzberg extract document.pdf --format json --pretty

OCR Extraction

Enable OCR

# Basic OCR with Tesseract
kreuzberg extract scanned.pdf --ocr

# Specify language
kreuzberg extract scanned.pdf --ocr --language eng

# Multiple languages
kreuzberg extract scanned.pdf --ocr --language eng+deu+fra

Force OCR

Force OCR even for PDFs with text layer:

kreuzberg extract document.pdf --ocr --force-ocr

OCR Configuration

# Custom Tesseract config
kreuzberg extract scanned.pdf --ocr --tesseract-config "--psm 6"

# Available page segmentation modes (--psm):
# 0 = Orientation and script detection (OSD) only
# 1 = Automatic page segmentation with OSD
# 3 = Fully automatic page segmentation (default)
# 6 = Assume a single uniform block of text
# 11 = Sparse text. Find as much text as possible

Configuration Files

Using Config Files

Kreuzberg automatically discovers configuration files:

# Searches in order:
# 1. ./kreuzberg.toml
# 2. ./kreuzberg.yaml
# 3. ./kreuzberg.json
# 4. ./.kreuzberg/config.toml
# 5. ~/.config/kreuzberg/config.toml

kreuzberg extract document.pdf  # Uses config if found

Specify Config File

kreuzberg extract document.pdf --config my-config.toml

Example Config Files

kreuzberg.toml:

# OCR settings
[ocr]
backend = "tesseract"
language = "eng"
tesseract_config = "--psm 3"

# Quality processing
enable_quality_processing = true

# Caching
use_cache = true

# Chunking
[chunking]
max_chunk_size = 1000
overlap = 100

# Token reduction
[token_reduction]
enabled = true
target_reduction = 0.3

# Language detection
[language_detection]
enabled = true
detect_multiple = true

kreuzberg.yaml:

ocr:
  backend: tesseract
  language: eng
  tesseract_config: "--psm 3"

enable_quality_processing: true
use_cache: true

chunking:
  max_chunk_size: 1000
  overlap: 100

token_reduction:
  enabled: true
  target_reduction: 0.3

language_detection:
  enabled: true
  detect_multiple: true

kreuzberg.json:

{
  "ocr": {
    "backend": "tesseract",
    "language": "eng",
    "tesseract_config": "--psm 3"
  },
  "enable_quality_processing": true,
  "use_cache": true,
  "chunking": {
    "max_chunk_size": 1000,
    "overlap": 100
  },
  "token_reduction": {
    "enabled": true,
    "target_reduction": 0.3
  },
  "language_detection": {
    "enabled": true,
    "detect_multiple": true
  }
}

Batch Processing

Process Multiple Files

# All PDFs in directory
kreuzberg extract documents/*.pdf -o output/

# Recursive search
kreuzberg extract documents/**/*.pdf -o output/

# Multiple file types
kreuzberg extract documents/**/*.{pdf,docx,txt}

Batch with JSON Output

# Single JSON array with all results
kreuzberg extract documents/*.pdf --format json --output results.json

# One JSON file per input
kreuzberg extract documents/*.pdf --format json --output-dir results/

Parallel Processing

# Process files in parallel (default: CPU count)
kreuzberg extract documents/*.pdf --parallel

# Specify worker count
kreuzberg extract documents/*.pdf --parallel --workers 4

Advanced Features

Language Detection

# Enable language detection
kreuzberg extract document.pdf --detect-language

# JSON output shows detected languages
kreuzberg extract document.pdf --detect-language --format json

Content Chunking

# Enable chunking for LLM processing
kreuzberg extract document.pdf --chunk --chunk-size 1000

# With overlap
kreuzberg extract document.pdf --chunk --chunk-size 1000 --chunk-overlap 100

# JSON output includes chunks
kreuzberg extract document.pdf --chunk --format json

Token Reduction

# Enable token reduction
kreuzberg extract document.pdf --reduce-tokens --reduction-target 0.3

# Reduces content by ~30% while preserving meaning

Quality Processing

# Enable quality processing (formatting, cleanup)
kreuzberg extract document.pdf --quality-processing

Caching

# Enable caching (default: enabled)
kreuzberg extract scanned.pdf --ocr --cache

# Disable caching
kreuzberg extract scanned.pdf --ocr --no-cache

# Clear cache
kreuzberg cache clear

Output Options

Standard Output

# Print to stdout
kreuzberg extract document.pdf

# Redirect to file
kreuzberg extract document.pdf > output.txt

File Output

# Single file output
kreuzberg extract document.pdf -o output.txt

# Directory output (preserves structure)
kreuzberg extract documents/*.pdf -o output_dir/

JSON Output

# Compact JSON
kreuzberg extract document.pdf --format json

# Pretty JSON
kreuzberg extract document.pdf --format json --pretty

# JSON with metadata
kreuzberg extract document.pdf --format json --metadata

JSON Output Structure:

{
  "content": "Extracted text content...",
  "metadata": {
    "mime_type": "application/pdf",
    "page_count": 10,
    "author": "John Doe"
  },
  "tables": [
    {
      "cells": [["Name", "Age"], ["Alice", "30"]],
      "markdown": "| Name | Age |\n|------|-----|\n| Alice | 30 |"
    }
  ],
  "chunks": [],
  "detected_languages": ["eng"],
  "keywords": []
}

Table Extraction

# Extract tables
kreuzberg extract document.pdf --tables

# JSON output includes table data
kreuzberg extract document.pdf --tables --format json

# Markdown tables in output
kreuzberg extract document.pdf --tables --table-format markdown

Error Handling

Verbose Output

# Show detailed error messages
kreuzberg extract document.pdf --verbose

# Show debug information
kreuzberg extract document.pdf --debug

Continue on Errors

# Continue processing other files on error
kreuzberg extract documents/*.pdf --continue-on-error

# Show summary of errors
kreuzberg extract documents/*.pdf --continue-on-error --show-errors

Timeout

# Set extraction timeout (seconds)
kreuzberg extract document.pdf --timeout 30

# Useful for hanging documents
kreuzberg extract problematic/*.pdf --timeout 10 --continue-on-error

Examples

Extract All PDFs in Directory

kreuzberg extract documents/*.pdf -o output/

OCR All Scanned Documents

kreuzberg extract scans/*.pdf --ocr --language eng -o ocr_output/

Extract with Full Metadata

kreuzberg extract document.pdf --format json --metadata --pretty

Process Documents for LLM

kreuzberg extract documents/*.pdf \
  --chunk --chunk-size 1000 --chunk-overlap 100 \
  --reduce-tokens --reduction-target 0.3 \
  --format json -o llm_ready/

Extract Tables from Spreadsheets

kreuzberg extract data/*.xlsx --tables --format json --pretty

Multilingual OCR

kreuzberg extract international/*.pdf \
  --ocr --language eng+deu+fra+spa \
  --detect-language \
  --format json -o results/

Batch Processing with Progress

kreuzberg extract large_dataset/**/*.pdf \
  --parallel --workers 8 \
  --continue-on-error \
  --verbose \
  -o processed/

Environment Variables

Set default configuration via environment variables:

# OCR settings
export KREUZBERG_OCR_BACKEND=tesseract
export KREUZBERG_OCR_LANGUAGE=eng

# Cache settings
export KREUZBERG_CACHE_DIR=~/.cache/kreuzberg
export KREUZBERG_CACHE_ENABLED=true

# Parallel processing
export KREUZBERG_WORKERS=4

# Use in commands
kreuzberg extract document.pdf --ocr  # Uses env vars

Shell Integration

Bash Completion

# Generate completion script
kreuzberg completion bash > ~/.local/share/bash-completion/completions/kreuzberg

# Or add to .bashrc
eval "$(kreuzberg completion bash)"

Zsh Completion

# Add to .zshrc
eval "$(kreuzberg completion zsh)"

Fish Completion

kreuzberg completion fish > ~/.config/fish/completions/kreuzberg.fish

Docker Usage

Basic Docker

# Mount current directory
docker run -v $(pwd):/data goldziher/kreuzberg:latest \
  extract /data/document.pdf

# Save output to host
docker run -v $(pwd):/data goldziher/kreuzberg:latest \
  extract /data/document.pdf -o /data/output.txt

Docker with OCR

# Use OCR-enabled image
docker run -v $(pwd):/data goldziher/kreuzberg:latest \
  extract /data/scanned.pdf --ocr --language eng

Docker Compose

docker-compose.yml:

version: '3.8'

services:
  kreuzberg:
    image: goldziher/kreuzberg:latest
    volumes:
      - ./documents:/input
      - ./output:/output
    command: extract /input --ocr -o /output

Run:

docker-compose up

Performance Tips

Optimize for Large Files

# Disable quality processing for speed
kreuzberg extract large.pdf --no-quality-processing

# Increase timeout
kreuzberg extract large.pdf --timeout 300

# Use parallel processing
kreuzberg extract large_files/*.pdf --parallel --workers 8

Optimize for Small Files

# Single thread for small files (less overhead)
kreuzberg extract small_files/*.txt --no-parallel

# Disable caching
kreuzberg extract small_files/*.txt --no-cache

Memory Management

# Process files one at a time (low memory)
kreuzberg extract huge_files/*.pdf --workers 1

# Stream large outputs
kreuzberg extract huge_file.pdf | gzip > output.txt.gz

Troubleshooting

Check Installation

# Verify CLI is installed
kreuzberg --version

# Check dependencies
kreuzberg doctor

Common Issues

Issue: "Tesseract not found"

# Install Tesseract
brew install tesseract  # macOS
sudo apt-get install tesseract-ocr  # Ubuntu

Issue: "Pandoc not found"

# Install Pandoc
brew install pandoc  # macOS
sudo apt-get install pandoc  # Ubuntu

Issue: "Out of memory"

# Process one file at a time
kreuzberg extract large_files/*.pdf --workers 1 --no-parallel

Issue: "Extraction timeout"

# Increase timeout
kreuzberg extract slow_file.pdf --timeout 300

Server Commands

Start API Server

The serve command starts a RESTful HTTP API server:

# Start server (default: 127.0.0.1:8000)
kreuzberg serve

# Specify host and port
kreuzberg serve --host 0.0.0.0 --port 3000

# With custom config
kreuzberg serve --config production.toml
# Start server via Python CLI proxy
python -m kreuzberg serve

# Specify host and port
python -m kreuzberg serve --host 0.0.0.0 --port 3000

# With custom config
python -m kreuzberg serve --config production.toml
# Start server via TypeScript CLI proxy
npx kreuzberg serve

# Specify host and port
npx kreuzberg serve --host 0.0.0.0 --port 3000

# With custom config
npx kreuzberg serve --config production.toml

The server provides endpoints for: - /extract - Extract text from uploaded files - /health - Health check - /info - Server information - /cache/stats - Cache statistics - /cache/clear - Clear cache

See API Server Guide for full API details.

Start MCP Server

The mcp command starts a Model Context Protocol server for AI integration:

# Start MCP server (stdio transport)
kreuzberg mcp

# With custom config
kreuzberg mcp --config kreuzberg.toml
# Start MCP server via Python CLI proxy
python -m kreuzberg mcp

# With custom config
python -m kreuzberg mcp --config kreuzberg.toml
# Start MCP server via TypeScript CLI proxy
npx kreuzberg mcp

# With custom config
npx kreuzberg mcp --config kreuzberg.toml

The MCP server provides tools for AI agents: - extract_file - Extract text from a file path - extract_bytes - Extract text from base64-encoded bytes - batch_extract - Extract from multiple files

See API Server Guide for MCP integration details.

Cache Management

View Cache Statistics

# Show cache statistics
kreuzberg cache stats

# Specify cache directory
kreuzberg cache stats --cache-dir /path/to/cache

# JSON output
kreuzberg cache stats --format json

Clear Cache

# Clear all cached files
kreuzberg cache clear

# Specify cache directory
kreuzberg cache clear --cache-dir /path/to/cache

# Show what was cleared
kreuzberg cache clear --format json

Getting Help

CLI Help

# General help
kreuzberg --help

# Command help
kreuzberg extract --help
kreuzberg serve --help
kreuzberg mcp --help
kreuzberg cache --help

# List all options
kreuzberg extract --help-all

Version Information

# Show version
kreuzberg --version

# Show detailed version info
kreuzberg version --format json

Next Steps