CLI Usage

The Kreuzberg CLI extracts text from documents directly on the command line.

Installation

Install Kreuzberg with CLI support:

pip install "kreuzberg[cli]"

Or install all optional dependencies:

pip install "kreuzberg[all]"

Basic Usage

Extract from a file

kreuzberg extract document.pdf

Extract to a file

kreuzberg extract document.pdf -o output.txt

Extract from stdin

cat document.pdf | kreuzberg extract
# or pass '-' to name stdin explicitly
cat document.pdf | kreuzberg extract -
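
When the CLI is driven from a script, the same stdin form can be reached through `subprocess`. The following is a minimal sketch, assuming the `kreuzberg` executable is on `PATH` and that the extracted text arrives on stdout as UTF-8 (neither is confirmed here beyond stdout being the default output). The helper name `extract_from_bytes` is illustrative and not part of Kreuzberg.

```python
import subprocess
from collections.abc import Sequence

# "-" asks `kreuzberg extract` to read the document from stdin.
STDIN_COMMAND = ("kreuzberg", "extract", "-")


def extract_from_bytes(data: bytes, command: Sequence[str] = STDIN_COMMAND) -> str:
    """Pipe raw document bytes to a command and return what it prints to stdout."""
    result = subprocess.run(
        list(command),
        input=data,           # written to the child's stdin
        capture_output=True,  # collect stdout and stderr
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    # Assumption: the CLI emits UTF-8 text.
    return result.stdout.decode("utf-8")
```

A call might look like `extract_from_bytes(Path("document.pdf").read_bytes())`. The `command` parameter exists so the piping logic can be exercised without Kreuzberg installed.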

Command-Line Options

General Options

-o, --output PATH: Output file path (default: stdout)
--output-format [text|json]: Output format
--show-metadata: Include metadata in output
-v, --verbose: Verbose output for debugging
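
For scripted use, the general options above can be assembled into an argument list before being handed to `subprocess`. This is only a sketch: the function `build_extract_command` and its parameter names are hypothetical, and it uses no flags beyond the ones listed above.

```python
from __future__ import annotations


def build_extract_command(
    path: str = "-",
    output: str | None = None,
    output_format: str | None = None,
    show_metadata: bool = False,
    verbose: bool = False,
) -> list[str]:
    """Assemble an argv list for `kreuzberg extract` from the general options."""
    if output_format not in (None, "text", "json"):
        raise ValueError("output_format must be 'text' or 'json'")
    command = ["kreuzberg", "extract", path]
    if output is not None:
        command += ["--output", output]
    if output_format is not None:
        command += ["--output-format", output_format]
    if show_metadata:
        command.append("--show-metadata")
    if verbose:
        command.append("--verbose")
    return command


print(build_extract_command("document.pdf", output="out.json", output_format="json"))
# ['kreuzberg', 'extract', 'document.pdf', '--output', 'out.json', '--output-format', 'json']
```

Building a list rather than a single shell string avoids quoting problems with paths that contain spaces.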

Processing Options

--force-ocr: Force OCR processing
--chunk-content: Enable content chunking
--extract-tables: Enable table extraction
--max-chars INTEGER: Maximum characters per chunk (default: 2000)
--max-overlap INTEGER: Maximum overlap between chunks (default: 100)
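
To make the interaction between `--max-chars` and `--max-overlap` concrete, here is a deliberately naive sliding-window sketch. It is not Kreuzberg's chunking algorithm. It cuts at fixed character offsets and always uses the full overlap, whereas a boundary-aware splitter would likely treat both values as upper bounds. The function name `chunk_text` is illustrative.

```python
def chunk_text(text: str, max_chars: int = 2000, max_overlap: int = 100) -> list[str]:
    """Split text into windows of at most max_chars that share max_overlap characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= max_overlap < max_chars:
        raise ValueError("max_overlap must be >= 0 and smaller than max_chars")

    step = max_chars - max_overlap  # how far each window advances
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", max_chars=4, max_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the context-preserving effect the overlap setting is meant to provide.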

OCR Backend Options

--ocr-backend [tesseract|easyocr|paddleocr|none]: OCR backend to use

Tesseract Options

--tesseract-lang TEXT: Language(s) (e.g., 'eng+deu')
--tesseract-psm INTEGER: Page segmentation mode (0-13)

EasyOCR Options

--easyocr-languages TEXT: Language codes (comma-separated, e.g., 'en,de')

PaddleOCR Options

--paddleocr-languages TEXT: Language codes (comma-separated, e.g., 'en,german')
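
The three backends take their languages through different flags and with different separators (`+` for Tesseract, commas for EasyOCR and PaddleOCR). The sketch below gathers that mapping in one place. The helper `ocr_arguments` is hypothetical, and the choice to ignore languages for the `none` backend is an assumption.

```python
from __future__ import annotations

# Language flag and separator per backend, taken from the options listed above.
LANGUAGE_FLAGS = {
    "tesseract": ("--tesseract-lang", "+"),
    "easyocr": ("--easyocr-languages", ","),
    "paddleocr": ("--paddleocr-languages", ","),
}


def ocr_arguments(backend: str, languages: list[str] | None = None) -> list[str]:
    """Return the CLI arguments that select an OCR backend and its languages."""
    if backend == "none":
        # Assumption: language settings are irrelevant when OCR is disabled.
        return ["--ocr-backend", "none"]
    if backend not in LANGUAGE_FLAGS:
        raise ValueError(f"unknown OCR backend: {backend!r}")
    flag, separator = LANGUAGE_FLAGS[backend]
    arguments = ["--ocr-backend", backend]
    if languages:
        arguments += [flag, separator.join(languages)]
    return arguments


print(ocr_arguments("tesseract", ["eng", "deu"]))
# ['--ocr-backend', 'tesseract', '--tesseract-lang', 'eng+deu']
```

Note that the language codes themselves also differ between backends (`deu`, `de`, and `german` in the examples above), so a code valid for one backend cannot be assumed to work for another.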

Configuration File

Kreuzberg can load configuration from a pyproject.toml file:

[tool.kreuzberg]
force_ocr = false
chunk_content = true
extract_tables = false
max_chars = 5000
ocr_backend = "tesseract"

[tool.kreuzberg.tesseract]
language = "eng+deu"
psm = 3

[tool.kreuzberg.gmft]
verbosity = 1
cell_required_confidence = 50

Use a specific config file:

kreuzberg extract document.pdf --config custom-config.toml

Examples

Basic text extraction

kreuzberg extract report.pdf -o report.txt

OCR with a specific language

kreuzberg extract scan.jpg --force-ocr --tesseract-lang deu

Extract tables to JSON

kreuzberg extract spreadsheet.pdf --extract-tables --output-format json -o tables.json

Extract with metadata

kreuzberg extract document.pdf --show-metadata --output-format json
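
JSON output is convenient to post-process, but the exact key names are not documented on this page. A cautious first step is to inspect the top-level structure rather than hard-code keys. In the sketch below, the helper `summarize_payload` and the example payload are both hypothetical, and the real output may be shaped differently.

```python
import json


def summarize_payload(raw: str) -> dict[str, str]:
    """Map each top-level key of a JSON object to the type name of its value."""
    payload = json.loads(raw)
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object at the top level")
    return {key: type(value).__name__ for key, value in payload.items()}


# Hypothetical payload, for illustration only; the real key names may differ.
example = '{"content": "Hello", "metadata": {"title": "Demo"}}'
print(summarize_payload(example))
# {'content': 'str', 'metadata': 'dict'}
```

Running this once against real CLI output shows which keys are available before any downstream code depends on them.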

Using the EasyOCR backend

kreuzberg extract image.png --ocr-backend easyocr --easyocr-languages en,de

Extract with chunking

kreuzberg extract large-document.pdf --chunk-content --max-chars 1000

Module Execution

You can also run Kreuzberg as a Python module:

python -m kreuzberg extract document.pdf
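
The module form is handy when launching Kreuzberg from another Python program, because `sys.executable` pins the call to the interpreter (and therefore the virtual environment) that is already running. A small sketch, with the helper name `module_command` chosen for illustration:

```python
import sys


def module_command(*arguments: str) -> list[str]:
    """Build `python -m kreuzberg ...` using the interpreter running this script."""
    return [sys.executable, "-m", "kreuzberg", *arguments]


command = module_command("extract", "document.pdf")
print(command[1:])
# ['-m', 'kreuzberg', 'extract', 'document.pdf']
```

The resulting list can be passed to `subprocess.run(command, check=True)`, which requires Kreuzberg to be installed in that same environment.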

Command Reference

`kreuzberg extract`

Extract text from a document.

Usage:

kreuzberg extract [OPTIONS] [FILE]

Arguments:

FILE: Path to document or '-' for stdin (optional, defaults to stdin)

`kreuzberg config`

Show current configuration.