CLI Usage¶
The Kreuzberg CLI provides command-line access to all extraction features. This guide covers installation, basic usage, and advanced features.
Installation¶
Basic Usage¶
Extract from Single File¶
# Extract to stdout
kreuzberg extract document.pdf
# Save to file
kreuzberg extract document.pdf -o output.txt
# Extract with metadata
kreuzberg extract document.pdf --metadata
Extract from Multiple Files¶
# Multiple files
kreuzberg extract doc1.pdf doc2.docx doc3.pptx
# Glob patterns
kreuzberg extract documents/**/*.pdf
# Directory (all files)
kreuzberg extract documents/
Output Formats¶
# Plain text (default)
kreuzberg extract document.pdf
# JSON output
kreuzberg extract document.pdf --format json
# JSON with metadata
kreuzberg extract document.pdf --format json --metadata
# Pretty JSON
kreuzberg extract document.pdf --format json --pretty
OCR Extraction¶
Enable OCR¶
# Basic OCR with Tesseract
kreuzberg extract scanned.pdf --ocr
# Specify language
kreuzberg extract scanned.pdf --ocr --language eng
# Multiple languages
kreuzberg extract scanned.pdf --ocr --language eng+deu+fra
Force OCR¶
Force OCR even for PDFs with text layer:
OCR Configuration¶
# Custom Tesseract config
kreuzberg extract scanned.pdf --ocr --tesseract-config "--psm 6"
# Available page segmentation modes (--psm):
# 0 = Orientation and script detection (OSD) only
# 1 = Automatic page segmentation with OSD
# 3 = Fully automatic page segmentation (default)
# 6 = Assume a single uniform block of text
# 11 = Sparse text. Find as much text as possible
Configuration Files¶
Using Config Files¶
Kreuzberg automatically discovers configuration files:
# Searches in order:
# 1. ./kreuzberg.toml
# 2. ./kreuzberg.yaml
# 3. ./kreuzberg.json
# 4. ./.kreuzberg/config.toml
# 5. ~/.config/kreuzberg/config.toml
kreuzberg extract document.pdf # Uses config if found
Specify Config File¶
Example Config Files¶
kreuzberg.toml:
# OCR settings
[ocr]
backend = "tesseract"
language = "eng"
tesseract_config = "--psm 3"
# Quality processing
enable_quality_processing = true
# Caching
use_cache = true
# Chunking
[chunking]
max_chunk_size = 1000
overlap = 100
# Token reduction
[token_reduction]
enabled = true
target_reduction = 0.3
# Language detection
[language_detection]
enabled = true
detect_multiple = true
kreuzberg.yaml:
ocr:
backend: tesseract
language: eng
tesseract_config: "--psm 3"
enable_quality_processing: true
use_cache: true
chunking:
max_chunk_size: 1000
overlap: 100
token_reduction:
enabled: true
target_reduction: 0.3
language_detection:
enabled: true
detect_multiple: true
kreuzberg.json:
{
"ocr": {
"backend": "tesseract",
"language": "eng",
"tesseract_config": "--psm 3"
},
"enable_quality_processing": true,
"use_cache": true,
"chunking": {
"max_chunk_size": 1000,
"overlap": 100
},
"token_reduction": {
"enabled": true,
"target_reduction": 0.3
},
"language_detection": {
"enabled": true,
"detect_multiple": true
}
}
Batch Processing¶
Process Multiple Files¶
# All PDFs in directory
kreuzberg extract documents/*.pdf -o output/
# Recursive search
kreuzberg extract documents/**/*.pdf -o output/
# Multiple file types
kreuzberg extract documents/**/*.{pdf,docx,txt}
Batch with JSON Output¶
# Single JSON array with all results
kreuzberg extract documents/*.pdf --format json --output results.json
# One JSON file per input
kreuzberg extract documents/*.pdf --format json --output-dir results/
Parallel Processing¶
# Process files in parallel (default: CPU count)
kreuzberg extract documents/*.pdf --parallel
# Specify worker count
kreuzberg extract documents/*.pdf --parallel --workers 4
Advanced Features¶
Language Detection¶
# Enable language detection
kreuzberg extract document.pdf --detect-language
# JSON output shows detected languages
kreuzberg extract document.pdf --detect-language --format json
Content Chunking¶
# Enable chunking for LLM processing
kreuzberg extract document.pdf --chunk --chunk-size 1000
# With overlap
kreuzberg extract document.pdf --chunk --chunk-size 1000 --chunk-overlap 100
# JSON output includes chunks
kreuzberg extract document.pdf --chunk --format json
Token Reduction¶
# Enable token reduction
kreuzberg extract document.pdf --reduce-tokens --reduction-target 0.3
# Reduces content by ~30% while preserving meaning
Quality Processing¶
# Enable quality processing (formatting, cleanup)
kreuzberg extract document.pdf --quality-processing
Caching¶
# Enable caching (default: enabled)
kreuzberg extract scanned.pdf --ocr --cache
# Disable caching
kreuzberg extract scanned.pdf --ocr --no-cache
# Clear cache
kreuzberg cache clear
Output Options¶
Standard Output¶
# Print to stdout
kreuzberg extract document.pdf
# Redirect to file
kreuzberg extract document.pdf > output.txt
File Output¶
# Single file output
kreuzberg extract document.pdf -o output.txt
# Directory output (preserves structure)
kreuzberg extract documents/*.pdf -o output_dir/
JSON Output¶
# Compact JSON
kreuzberg extract document.pdf --format json
# Pretty JSON
kreuzberg extract document.pdf --format json --pretty
# JSON with metadata
kreuzberg extract document.pdf --format json --metadata
JSON Output Structure:
{
"content": "Extracted text content...",
"metadata": {
"mime_type": "application/pdf",
"page_count": 10,
"author": "John Doe"
},
"tables": [
{
"cells": [["Name", "Age"], ["Alice", "30"]],
"markdown": "| Name | Age |\n|------|-----|\n| Alice | 30 |"
}
],
"chunks": [],
"detected_languages": ["eng"],
"keywords": []
}
Table Extraction¶
# Extract tables
kreuzberg extract document.pdf --tables
# JSON output includes table data
kreuzberg extract document.pdf --tables --format json
# Markdown tables in output
kreuzberg extract document.pdf --tables --table-format markdown
Error Handling¶
Verbose Output¶
# Show detailed error messages
kreuzberg extract document.pdf --verbose
# Show debug information
kreuzberg extract document.pdf --debug
Continue on Errors¶
# Continue processing other files on error
kreuzberg extract documents/*.pdf --continue-on-error
# Show summary of errors
kreuzberg extract documents/*.pdf --continue-on-error --show-errors
Timeout¶
# Set extraction timeout (seconds)
kreuzberg extract document.pdf --timeout 30
# Useful for hanging documents
kreuzberg extract problematic/*.pdf --timeout 10 --continue-on-error
Examples¶
Extract All PDFs in Directory¶
OCR All Scanned Documents¶
Extract with Full Metadata¶
Process Documents for LLM¶
kreuzberg extract documents/*.pdf \
--chunk --chunk-size 1000 --chunk-overlap 100 \
--reduce-tokens --reduction-target 0.3 \
--format json -o llm_ready/
Extract Tables from Spreadsheets¶
Multilingual OCR¶
kreuzberg extract international/*.pdf \
--ocr --language eng+deu+fra+spa \
--detect-language \
--format json -o results/
Batch Processing with Progress¶
kreuzberg extract large_dataset/**/*.pdf \
--parallel --workers 8 \
--continue-on-error \
--verbose \
-o processed/
Environment Variables¶
Set default configuration via environment variables:
# OCR settings
export KREUZBERG_OCR_BACKEND=tesseract
export KREUZBERG_OCR_LANGUAGE=eng
# Cache settings
export KREUZBERG_CACHE_DIR=~/.cache/kreuzberg
export KREUZBERG_CACHE_ENABLED=true
# Parallel processing
export KREUZBERG_WORKERS=4
# Use in commands
kreuzberg extract document.pdf --ocr # Uses env vars
Shell Integration¶
Bash Completion¶
# Generate completion script
kreuzberg completion bash > ~/.local/share/bash-completion/completions/kreuzberg
# Or add to .bashrc
eval "$(kreuzberg completion bash)"
Zsh Completion¶
Fish Completion¶
Docker Usage¶
Basic Docker¶
# Mount current directory
docker run -v $(pwd):/data goldziher/kreuzberg:latest \
extract /data/document.pdf
# Save output to host
docker run -v $(pwd):/data goldziher/kreuzberg:latest \
extract /data/document.pdf -o /data/output.txt
Docker with OCR¶
# Use OCR-enabled image
docker run -v $(pwd):/data goldziher/kreuzberg:latest \
extract /data/scanned.pdf --ocr --language eng
Docker Compose¶
docker-compose.yml:
version: '3.8'
services:
kreuzberg:
image: goldziher/kreuzberg:latest
volumes:
- ./documents:/input
- ./output:/output
command: extract /input --ocr -o /output
Run:
Performance Tips¶
Optimize for Large Files¶
# Disable quality processing for speed
kreuzberg extract large.pdf --no-quality-processing
# Increase timeout
kreuzberg extract large.pdf --timeout 300
# Use parallel processing
kreuzberg extract large_files/*.pdf --parallel --workers 8
Optimize for Small Files¶
# Single thread for small files (less overhead)
kreuzberg extract small_files/*.txt --no-parallel
# Disable caching
kreuzberg extract small_files/*.txt --no-cache
Memory Management¶
# Process files one at a time (low memory)
kreuzberg extract huge_files/*.pdf --workers 1
# Stream large outputs
kreuzberg extract huge_file.pdf | gzip > output.txt.gz
Troubleshooting¶
Check Installation¶
Common Issues¶
Issue: "Tesseract not found"
Issue: "Pandoc not found"
Issue: "Out of memory"
Issue: "Extraction timeout"
Server Commands¶
Start API Server¶
The serve command starts a RESTful HTTP API server:
The server provides endpoints for: - /extract - Extract text from uploaded files - /health - Health check - /info - Server information - /cache/stats - Cache statistics - /cache/clear - Clear cache
See API Server Guide for full API details.
Start MCP Server¶
The mcp command starts a Model Context Protocol server for AI integration:
The MCP server provides tools for AI agents: - extract_file - Extract text from a file path - extract_bytes - Extract text from base64-encoded bytes - batch_extract - Extract from multiple files
See API Server Guide for MCP integration details.
Cache Management¶
View Cache Statistics¶
# Show cache statistics
kreuzberg cache stats
# Specify cache directory
kreuzberg cache stats --cache-dir /path/to/cache
# JSON output
kreuzberg cache stats --format json
Clear Cache¶
# Clear all cached files
kreuzberg cache clear
# Specify cache directory
kreuzberg cache clear --cache-dir /path/to/cache
# Show what was cleared
kreuzberg cache clear --format json
Getting Help¶
CLI Help¶
# General help
kreuzberg --help
# Command help
kreuzberg extract --help
kreuzberg serve --help
kreuzberg mcp --help
kreuzberg cache --help
# List all options
kreuzberg extract --help-all
Version Information¶
Next Steps¶
- API Server Guide - API and MCP server setup
- Advanced Features - Advanced Kreuzberg features
- Plugin Development - Extend Kreuzberg functionality
- API Reference - Programmatic access