CLI Usage

The Kreuzberg CLI provides command-line access to all extraction features. This guide covers installation, basic usage, and advanced features.

Installation

Bash
# Homebrew
brew install kreuzberg-dev/tap/kreuzberg

Bash
# Cargo
cargo install kreuzberg-cli

Bash
# Docker
docker pull goldziher/kreuzberg:latest
docker run -v "$(pwd)":/data goldziher/kreuzberg:latest extract /data/document.pdf

Bash
# Go (adds the Go package to the current module)
go get github.com/kreuzberg-dev/kreuzberg/packages/go/kreuzberg@latest

Basic Usage

Extract from Single File

Terminal
# Extract text content to stdout
kreuzberg extract document.pdf

# Extract and save to output file
kreuzberg extract document.pdf -o output.txt

# Extract with document metadata included
kreuzberg extract document.pdf --metadata

Extract from Multiple Files

Terminal
# Extract from multiple files at once
kreuzberg extract doc1.pdf doc2.docx doc3.pptx

# Extract using glob patterns
# (in bash, ** only recurses after `shopt -s globstar`; zsh recurses by default)
kreuzberg extract documents/**/*.pdf

# Extract all files in a directory
kreuzberg extract documents/

Output Formats

Terminal
# Output as plain text (default format)
kreuzberg extract document.pdf

# Output as JSON
kreuzberg extract document.pdf --format json

# Output as JSON including document metadata
kreuzberg extract document.pdf --format json --metadata

# Output as formatted JSON with indentation
kreuzberg extract document.pdf --format json --pretty

OCR Extraction

Enable OCR

Terminal
# Enable OCR using Tesseract backend
kreuzberg extract scanned.pdf --ocr

# Extract with specific OCR language
kreuzberg extract scanned.pdf --ocr --language eng

# Extract with multiple OCR languages
kreuzberg extract scanned.pdf --ocr --language eng+deu+fra

Force OCR

Force OCR even for PDFs that already have a text layer:

Terminal
kreuzberg extract document.pdf --ocr --force-ocr

OCR Configuration

Terminal
# Extract with custom Tesseract page segmentation mode
kreuzberg extract scanned.pdf --ocr --tesseract-config "--psm 6"

# Page segmentation modes (--psm):
# 0  = Orientation and script detection only
# 1  = Automatic page segmentation with OSD
# 3  = Fully automatic page segmentation, no OSD (default)
# 6  = Single uniform block of text
# 11 = Sparse text detection

Configuration Files

Using Config Files

Kreuzberg automatically discovers configuration files:

Terminal
# Configuration file search order:
# 1. ./kreuzberg.toml
# 2. ./kreuzberg.yaml
# 3. ./kreuzberg.json
# 4. ./.kreuzberg/config.toml
# 5. ~/.config/kreuzberg/config.toml

# Extract using discovered configuration
kreuzberg extract document.pdf
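
The search order above reads as a first-match lookup. The sketch below illustrates that reading, assuming the first existing file wins and later candidates are ignored (the list gives the order but does not state this rule). `find_config` is a hypothetical helper, not part of Kreuzberg.

```python
from pathlib import Path
from typing import Optional

# Candidate locations in the documented search order.
# The first four are relative to the working directory, the last to the home directory.
PROJECT_CANDIDATES = [
    "kreuzberg.toml",
    "kreuzberg.yaml",
    "kreuzberg.json",
    ".kreuzberg/config.toml",
]
USER_CANDIDATE = ".config/kreuzberg/config.toml"


def find_config(cwd: Path, home: Path) -> Optional[Path]:
    """Return the first existing config file, or None if there is none.

    Assumption: first match wins; this is an illustration, not Kreuzberg's code.
    """
    candidates = [cwd / name for name in PROJECT_CANDIDATES]
    candidates.append(home / USER_CANDIDATE)
    for path in candidates:
        if path.is_file():
            return path
    return None


if __name__ == "__main__":
    print(find_config(Path.cwd(), Path.home()))
```

Running the script from a project directory prints the file this lookup would pick, which can help when two config files coexist.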

Specify Config File

Terminal
kreuzberg extract document.pdf --config my-config.toml

Example Config Files

kreuzberg.toml:

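# Top-level options must come before the first [table] header in TOML;
# otherwise they are parsed as keys of that table.

# Enable quality processing for better output
enable_quality_processing = true

# Enable result caching
use_cache = true

# OCR configuration
[ocr]
backend = "tesseract"
language = "eng"
tesseract_config = "--psm 3"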

# Text chunking configuration
[chunking]
max_chunk_size = 1000
overlap = 100

# Token reduction for LLM processing
[token_reduction]
enabled = true
target_reduction = 0.3

# Automatic language detection
[language_detection]
enabled = true
detect_multiple = true

kreuzberg.yaml:

ocr:
  backend: tesseract
  language: eng
  tesseract_config: "--psm 3"

enable_quality_processing: true
use_cache: true

chunking:
  max_chunk_size: 1000
  overlap: 100

token_reduction:
  enabled: true
  target_reduction: 0.3

language_detection:
  enabled: true
  detect_multiple: true

kreuzberg.json:

{
  "ocr": {
    "backend": "tesseract",
    "language": "eng",
    "tesseract_config": "--psm 3"
  },
  "enable_quality_processing": true,
  "use_cache": true,
  "chunking": {
    "max_chunk_size": 1000,
    "overlap": 100
  },
  "token_reduction": {
    "enabled": true,
    "target_reduction": 0.3
  },
  "language_detection": {
    "enabled": true,
    "detect_multiple": true
  }
}

Batch Processing

Process Multiple Files

Terminal
# Extract all PDFs in directory
kreuzberg extract documents/*.pdf -o output/

# Extract PDFs recursively from subdirectories
# (in bash, ** only recurses after `shopt -s globstar`)
kreuzberg extract documents/**/*.pdf -o output/

# Extract multiple file types
kreuzberg extract documents/**/*.{pdf,docx,txt}

Batch with JSON Output

Terminal
# Output all results as single JSON array
kreuzberg extract documents/*.pdf --format json --output results.json

# Output separate JSON file per input document
kreuzberg extract documents/*.pdf --format json --output-dir results/

Parallel Processing

Terminal
# Enable parallel processing with automatic worker count
kreuzberg extract documents/*.pdf --parallel

# Process with specific number of worker threads
kreuzberg extract documents/*.pdf --parallel --workers 4

Advanced Features

Language Detection

Terminal
# Extract with automatic language detection
kreuzberg extract document.pdf --detect-language

# Output detected languages in JSON format
kreuzberg extract document.pdf --detect-language --format json

Content Chunking

Terminal
# Split content into chunks for LLM processing
kreuzberg extract document.pdf --chunk --chunk-size 1000

# Split content with overlapping chunks
kreuzberg extract document.pdf --chunk --chunk-size 1000 --chunk-overlap 100

# Output chunked content as JSON
kreuzberg extract document.pdf --chunk --format json
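
To see how `--chunk-size` and `--chunk-overlap` interact, here is a minimal sliding-window sketch. It assumes sizes are counted in characters and that chunks are cut at fixed offsets. Kreuzberg's actual chunker may count differently or respect sentence boundaries, so treat `chunk_text` as a hypothetical illustration only.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Each window starts (chunk_size - overlap) characters after the previous
    one, so consecutive chunks share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "x" * 2500
    sizes = [len(c) for c in chunk_text(sample, chunk_size=1000, overlap=100)]
    print(sizes)  # [1000, 1000, 700]
```

With a size of 1000 and an overlap of 100, each new chunk begins 900 characters after the previous one, so a 2500-character input yields three chunks.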

Token Reduction

Terminal
# Target a 30% reduction in token count (0.3 = 30%)
kreuzberg extract document.pdf --reduce-tokens --reduction-target 0.3

Quality Processing

Terminal
# Apply quality processing for improved formatting and cleanup
kreuzberg extract document.pdf --quality-processing

Caching

Terminal
# Extract with result caching enabled (default)
kreuzberg extract scanned.pdf --ocr --cache

# Extract without caching results
kreuzberg extract scanned.pdf --ocr --no-cache

# Clear all cached results
kreuzberg cache clear

Output Options

Standard Output

Terminal
# Extract and print content to stdout
kreuzberg extract document.pdf

# Extract and redirect output to file
kreuzberg extract document.pdf > output.txt

File Output

Terminal
# Extract and save to single output file
kreuzberg extract document.pdf -o output.txt

# Extract multiple files preserving directory structure
kreuzberg extract documents/*.pdf -o output_dir/

JSON Output

Terminal
# Output as compact JSON
kreuzberg extract document.pdf --format json

# Output as formatted JSON with indentation
kreuzberg extract document.pdf --format json --pretty

# Output as JSON including document metadata
kreuzberg extract document.pdf --format json --metadata

JSON Output Structure:

JSON Response
{
  "content": "Extracted text content...",
  "metadata": {
    "mime_type": "application/pdf",
    "page_count": 10,
    "author": "John Doe"
  },
  "tables": [
    {
      "cells": [["Name", "Age"], ["Alice", "30"]],
      "markdown": "| Name | Age |\n|------|-----|\n| Alice | 30 |"
    }
  ],
  "chunks": [],
  "detected_languages": ["eng"],
  "keywords": []
}
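
A script that consumes this output might read the fields as follows. This is a sketch that parses the sample structure above rather than live CLI output. It assumes a single document yields one JSON object of this shape, and `summarize` is a hypothetical helper.

```python
import json

# Sample copied from the structure shown above. Real input would come from
# `kreuzberg extract document.pdf --format json`.
SAMPLE = r'''
{
  "content": "Extracted text content...",
  "metadata": {"mime_type": "application/pdf", "page_count": 10, "author": "John Doe"},
  "tables": [
    {
      "cells": [["Name", "Age"], ["Alice", "30"]],
      "markdown": "| Name | Age |\n|------|-----|\n| Alice | 30 |"
    }
  ],
  "chunks": [],
  "detected_languages": ["eng"],
  "keywords": []
}
'''


def summarize(result: dict) -> dict:
    """Pull out a few fields, tolerating keys that are absent or null."""
    metadata = result.get("metadata") or {}
    tables = result.get("tables") or []
    return {
        "mime_type": metadata.get("mime_type"),
        "pages": metadata.get("page_count"),
        "characters": len(result.get("content") or ""),
        "table_rows": [len(t.get("cells") or []) for t in tables],
        "languages": result.get("detected_languages") or [],
    }


if __name__ == "__main__":
    print(summarize(json.loads(SAMPLE)))
```

Reading keys with `.get` keeps the script working when a field such as `tables` is missing because the matching option was not enabled.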

Table Extraction

Terminal
# Extract tables from document
kreuzberg extract document.pdf --tables

# Extract tables and output as JSON
kreuzberg extract document.pdf --tables --format json

# Extract tables formatted as markdown
kreuzberg extract document.pdf --tables --table-format markdown
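
In the JSON output each table carries a `cells` grid alongside its `markdown` rendering. The sketch below shows how one can be derived from the other, assuming the first row is the header. The separator style is this example's own choice and may differ from what Kreuzberg emits. `cells_to_markdown` is a hypothetical helper.

```python
from typing import List


def cells_to_markdown(cells: List[List[str]]) -> str:
    """Render a rows-of-cells grid as a Markdown table.

    The first row is treated as the header. Pipe characters inside cells
    are not escaped in this sketch.
    """
    if not cells:
        return ""
    header, *rows = cells
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


if __name__ == "__main__":
    print(cells_to_markdown([["Name", "Age"], ["Alice", "30"]]))
```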

Error Handling

Verbose Output

Terminal
# Extract with detailed error messages
kreuzberg extract document.pdf --verbose

# Extract with debug-level logging
kreuzberg extract document.pdf --debug

Continue on Errors

Terminal
# Process all files even if some fail
kreuzberg extract documents/*.pdf --continue-on-error

# Process all files and display error summary
kreuzberg extract documents/*.pdf --continue-on-error --show-errors

Timeout

Terminal
# Set maximum extraction time per document (seconds)
kreuzberg extract document.pdf --timeout 30

# Process problematic files with timeout and error tolerance
kreuzberg extract problematic/*.pdf --timeout 10 --continue-on-error
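
The combination of a per-document timeout and continuing past failures can also be reproduced from a wrapper script, for example when each file needs its own log entry. The sketch below is an assumption-laden illustration: `run_each` is a hypothetical helper, its error strings are made up for this example, and it calls the CLI once per file instead of using the flags above.

```python
import subprocess
import sys
from typing import Dict, List, Sequence, Tuple


def run_each(command: Sequence[str], files: Sequence[str],
             timeout: float) -> Tuple[List[str], Dict[str, str]]:
    """Run `command <file>` once per file and collect failures instead of stopping."""
    succeeded: List[str] = []
    failed: Dict[str, str] = {}
    for name in files:
        try:
            # Output is captured and discarded here; a real script would store it.
            subprocess.run([*command, name], check=True, timeout=timeout,
                           capture_output=True)
            succeeded.append(name)
        except subprocess.TimeoutExpired:
            failed[name] = f"timed out after {timeout}s"
        except subprocess.CalledProcessError as exc:
            failed[name] = f"exit code {exc.returncode}"
        except FileNotFoundError:
            failed[name] = f"command not found: {command[0]}"
    return succeeded, failed


if __name__ == "__main__":
    ok, errors = run_each(["kreuzberg", "extract"], sys.argv[1:], timeout=30)
    print(f"{len(ok)} succeeded, {len(errors)} failed")
    for name, reason in errors.items():
        print(f"  {name}: {reason}")
```

Each file runs in its own process, so one slow or corrupt document cannot block the rest of the batch.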

Examples

Extract All PDFs in Directory

Extract all PDFs from directory
kreuzberg extract documents/*.pdf -o output/

OCR All Scanned Documents

OCR extraction from scanned documents
kreuzberg extract scans/*.pdf --ocr --language eng -o ocr_output/

Extract with Full Metadata

Extract with complete metadata as pretty JSON
kreuzberg extract document.pdf --format json --metadata --pretty

Process Documents for LLM

Prepare documents for LLM with chunking and token reduction
kreuzberg extract documents/*.pdf \
  --chunk --chunk-size 1000 --chunk-overlap 100 \
  --reduce-tokens --reduction-target 0.3 \
  --format json -o llm_ready/

Extract Tables from Spreadsheets

Extract table data from spreadsheets
kreuzberg extract data/*.xlsx --tables --format json --pretty

Multilingual OCR

OCR with multiple languages and detection
kreuzberg extract international/*.pdf \
  --ocr --language eng+deu+fra+spa \
  --detect-language \
  --format json -o results/

Batch Processing with Progress

Parallel batch processing with error handling
kreuzberg extract large_dataset/**/*.pdf \
  --parallel --workers 8 \
  --continue-on-error \
  --verbose \
  -o processed/

Environment Variables

Set default configuration via environment variables:

Terminal
# Configure default OCR settings
export KREUZBERG_OCR_BACKEND=tesseract
export KREUZBERG_OCR_LANGUAGE=eng

# Configure cache location and behavior
export KREUZBERG_CACHE_DIR=~/.cache/kreuzberg
export KREUZBERG_CACHE_ENABLED=true

# Configure parallel processing
export KREUZBERG_WORKERS=4

# Extract using configured environment variables
kreuzberg extract document.pdf --ocr
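
A script that wraps the CLI may want to inspect the same variables, for instance to log the effective defaults before a batch run. The sketch below only reads the five variables listed above. The boolean spelling it accepts is an assumption, and `read_defaults` is a hypothetical helper.

```python
import os
from typing import Mapping


def read_defaults(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the KREUZBERG_* variables listed above; unset ones become None."""
    workers = env.get("KREUZBERG_WORKERS")
    cache_enabled = env.get("KREUZBERG_CACHE_ENABLED")
    return {
        "ocr_backend": env.get("KREUZBERG_OCR_BACKEND"),
        "ocr_language": env.get("KREUZBERG_OCR_LANGUAGE"),
        "cache_dir": env.get("KREUZBERG_CACHE_DIR"),
        # Assumption: "true" in any letter case means enabled, anything else disabled.
        "cache_enabled": None if cache_enabled is None else cache_enabled.lower() == "true",
        "workers": None if workers is None else int(workers),
    }


if __name__ == "__main__":
    for key, value in read_defaults().items():
        print(f"{key}: {value}")
```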

Shell Integration

Bash Completion

Terminal
# Generate and save bash completion script (create the directory if it is missing)
mkdir -p ~/.local/share/bash-completion/completions
kreuzberg completion bash > ~/.local/share/bash-completion/completions/kreuzberg

# Enable completion in current session
eval "$(kreuzberg completion bash)"

Zsh Completion

Terminal
# Enable zsh completion (add to .zshrc)
eval "$(kreuzberg completion zsh)"

Fish Completion

Terminal
# Generate and save fish completion script
kreuzberg completion fish > ~/.config/fish/completions/kreuzberg.fish

Docker Usage

Basic Docker

Terminal
# Extract document using Docker with mounted directory
docker run -v "$(pwd)":/data goldziher/kreuzberg:latest \
  extract /data/document.pdf

# Extract and save output to host directory
docker run -v "$(pwd)":/data goldziher/kreuzberg:latest \
  extract /data/document.pdf -o /data/output.txt

Docker with OCR

Terminal
# Extract with OCR using Docker
docker run -v "$(pwd)":/data goldziher/kreuzberg:latest \
  extract /data/scanned.pdf --ocr --language eng

Docker Compose

docker-compose.yaml:

version: '3.8'

services:
  kreuzberg:
    image: goldziher/kreuzberg:latest
    volumes:
      - ./documents:/input
      - ./output:/output
    command: extract /input --ocr -o /output

Run:

Terminal
docker-compose up

Performance Tips

Optimize for Large Files

Terminal
# Extract without quality processing for faster speed
kreuzberg extract large.pdf --no-quality-processing

# Extract with extended timeout for large files
kreuzberg extract large.pdf --timeout 300

# Extract using parallel processing for multiple large files
kreuzberg extract large_files/*.pdf --parallel --workers 8

Optimize for Small Files

Terminal
# Extract small files without parallel overhead
kreuzberg extract small_files/*.txt --no-parallel

# Extract without caching for quick one-off processing
kreuzberg extract small_files/*.txt --no-cache

Memory Management

Terminal
# Extract large files sequentially to minimize memory usage
kreuzberg extract huge_files/*.pdf --workers 1

# Extract and compress output to save disk space
kreuzberg extract huge_file.pdf | gzip > output.txt.gz

Troubleshooting

Check Installation

Terminal
# Display installed version
kreuzberg --version

# Verify system dependencies
kreuzberg doctor

Common Issues

Issue: "Tesseract not found"

Terminal
# Install Tesseract OCR engine on macOS
brew install tesseract

# Install Tesseract OCR engine on Ubuntu
sudo apt-get install tesseract-ocr

Issue: "Out of memory"

Terminal
# Reduce memory usage by processing sequentially
kreuzberg extract large_files/*.pdf --workers 1 --no-parallel

Issue: "Extraction timeout"

Terminal
# Extend timeout for slow documents
kreuzberg extract slow_file.pdf --timeout 300

Server Commands

Start API Server

The serve command starts a RESTful HTTP API server:

Bash
# Start server (default: 127.0.0.1:8000)
kreuzberg serve

# Specify host and port
kreuzberg serve --host 0.0.0.0 --port 3000

# With custom config
kreuzberg serve --config production.toml

Bash
# Start server via Python CLI proxy
python -m kreuzberg serve

# Specify host and port
python -m kreuzberg serve --host 0.0.0.0 --port 3000

# With custom config
python -m kreuzberg serve --config production.toml

Bash
# Start server via TypeScript CLI proxy
npx kreuzberg serve

# Specify host and port
npx kreuzberg serve --host 0.0.0.0 --port 3000

# With custom config
npx kreuzberg serve --config production.toml

Go
package main

import (
    "log"
    "os/exec"
)

func main() {
    cmd := exec.Command("kreuzberg", "serve", "--host", "0.0.0.0", "--port", "3000")
    cmd.Stdout = log.Writer()
    cmd.Stderr = log.Writer()
    if err := cmd.Run(); err != nil {
        log.Fatalf("server exited: %v", err)
    }
}

Java
// Java bindings not yet available
// Use the Rust CLI or Docker for now

Ruby
require 'kreuzberg'

# Start API server on port 8000
Kreuzberg::APIProxy.run(port: 8000, host: '0.0.0.0') do |server|
  puts "API server running on http://localhost:8000"
  # Server runs while block executes
  # Make HTTP requests to endpoint
  sleep
end

The server provides the following endpoints:

- /extract - Extract text from uploaded files
- /health - Health check
- /info - Server information
- /cache/stats - Cache statistics
- /cache/clear - Clear cache

See API Server Guide for full API details.
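
A client script can derive full URLs from the endpoint paths listed above and probe the server before sending work. The sketch below assumes that `/health` answers a plain GET with status 200 when the server is up, which the list does not state. `endpoint_url` and `is_healthy` are hypothetical helpers, and the default base URL is the default address mentioned for `kreuzberg serve`.

```python
import urllib.request
from urllib.parse import urljoin

# Paths taken from the endpoint list; the dictionary keys are this example's own names.
ENDPOINTS = {
    "extract": "/extract",
    "health": "/health",
    "info": "/info",
    "cache_stats": "/cache/stats",
    "cache_clear": "/cache/clear",
}


def endpoint_url(base_url: str, name: str) -> str:
    """Join the server base URL with one of the known endpoint paths."""
    return urljoin(base_url.rstrip("/") + "/", ENDPOINTS[name].lstrip("/"))


def is_healthy(base_url: str = "http://127.0.0.1:8000", timeout: float = 5.0) -> bool:
    """Return True if GET /health answers 200 (assumed behavior), else False."""
    try:
        with urllib.request.urlopen(endpoint_url(base_url, "health"),
                                    timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError, refused connections, timeouts
        return False


if __name__ == "__main__":
    print(endpoint_url("http://127.0.0.1:8000", "cache_stats"))
    print("healthy:", is_healthy())
```

Request methods and payload formats for the other endpoints are not covered here; the API Server Guide is the place to confirm them.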

Start MCP Server

The mcp command starts a Model Context Protocol server for AI integration:

Bash
# Start MCP server (stdio transport)
kreuzberg mcp

# With custom config
kreuzberg mcp --config kreuzberg.toml

Bash
# Start MCP server via Python CLI proxy
python -m kreuzberg mcp

# With custom config
python -m kreuzberg mcp --config kreuzberg.toml

Bash
# Start MCP server via TypeScript CLI proxy
npx kreuzberg mcp

# With custom config
npx kreuzberg mcp --config kreuzberg.toml

Go
package main

import (
    "log"
    "os/exec"
)

func main() {
    cmd := exec.Command("kreuzberg", "mcp")
    cmd.Stdout = log.Writer()
    cmd.Stderr = log.Writer()
    if err := cmd.Run(); err != nil {
        log.Fatalf("mcp exited: %v", err)
    }
}

Java
// Java bindings not yet available
// Use the Rust CLI or Docker for now

Ruby
require 'kreuzberg'

# Start MCP server for Claude Desktop
server = Kreuzberg::MCPProxy::Server.new(transport: 'stdio')
server.start
# Server communicates via stdio for Claude integration

The MCP server provides the following tools for AI agents:

- extract_file - Extract text from a file path
- extract_bytes - Extract text from base64-encoded bytes
- batch_extract - Extract from multiple files

See API Server Guide for MCP integration details.
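
MCP clients invoke tools such as `extract_file` with JSON-RPC 2.0 `tools/call` requests. The sketch below builds one such message as a single line, as used on a stdio transport. The `path` argument name is a guess made for illustration; the real argument schema is whatever the server reports through `tools/list`. A real session must also complete the MCP `initialize` handshake first, which this sketch omits. `tool_call` is a hypothetical helper.

```python
import json
from itertools import count

# Monotonic request ids; JSON-RPC uses the id to match responses to requests.
_ids = count(1)


def tool_call(name: str, arguments: dict) -> str:
    """Serialize one MCP `tools/call` request as a single-line JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(message)


if __name__ == "__main__":
    # "path" is an assumed argument name, not confirmed by this guide.
    print(tool_call("extract_file", {"path": "document.pdf"}))
```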

Cache Management

View Cache Statistics

Terminal
# Display cache usage statistics
kreuzberg cache stats

# Display statistics for specific cache directory
kreuzberg cache stats --cache-dir /path/to/cache

# Output cache statistics as JSON
kreuzberg cache stats --format json

Clear Cache

Terminal
# Remove all cached extraction results
kreuzberg cache clear

# Clear specific cache directory
kreuzberg cache clear --cache-dir /path/to/cache

# Clear cache and display removal details
kreuzberg cache clear --format json

Getting Help

CLI Help

Terminal
# Display general CLI help
kreuzberg --help

# Display command-specific help
kreuzberg extract --help
kreuzberg serve --help
kreuzberg mcp --help
kreuzberg cache --help

# Display all available options
kreuzberg extract --help-all

Version Information

Terminal
# Display version number
kreuzberg --version

# Display detailed version information as JSON
kreuzberg version --format json

Next Steps