Configuration Reference¶
This page provides a comprehensive reference for all Kreuzberg configuration types. For usage guides and examples, see the Configuration Guide.
Overview¶
Kreuzberg supports multiple configuration methods:
- TOML files - Preferred format, clear syntax
- YAML files - Alternative format
- JSON files - For programmatic generation
- Programmatic - Direct object instantiation
Configuration Discovery¶
Kreuzberg automatically discovers configuration files in this order:
- Current directory: ./kreuzberg.{toml,yaml,yml,json}
- User config: ~/.config/kreuzberg/config.{toml,yaml,yml,json}
- System config: /etc/kreuzberg/config.{toml,yaml,yml,json}
For complete examples, see the examples directory.
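The lookup order described in this section can be sketched in a few lines of Python. This is a minimal illustration, not Kreuzberg's actual discovery code: the `find_config` helper and its parameters are hypothetical, and the order in which extensions are tried within a single directory is an assumption.

```python
from pathlib import Path
from typing import Optional

# Extensions in the order they appear in the documented patterns.
# Assumption: this is also the tie-break order inside one directory.
EXTENSIONS = ("toml", "yaml", "yml", "json")


def find_config(cwd: Path, home: Path, etc: Path = Path("/etc")) -> Optional[Path]:
    """Return the first existing config file, following the documented order."""
    candidates = (
        (cwd, "kreuzberg"),                          # ./kreuzberg.*
        (home / ".config" / "kreuzberg", "config"),  # ~/.config/kreuzberg/config.*
        (etc / "kreuzberg", "config"),               # /etc/kreuzberg/config.*
    )
    for directory, stem in candidates:
        for ext in EXTENSIONS:
            path = directory / f"{stem}.{ext}"
            if path.is_file():
                return path
    return None  # nothing found at any of the three locations
```

Calling `find_config(Path.cwd(), Path.home())` returns the path that would win under these assumptions, which can help when debugging why a particular file is or is not picked up.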
ExtractionConfig¶
Main extraction configuration controlling all aspects of document processing.
| Field | Type | Default | Description |
|---|---|---|---|
| use_cache | bool | true | Enable caching of extraction results for faster re-processing |
| enable_quality_processing | bool | true | Enable quality post-processing (deduplication, mojibake fixing, etc.) |
| force_ocr | bool | false | Force OCR even for searchable PDFs with text layers |
| ocr | OcrConfig? | None | OCR configuration (if None, OCR disabled) |
| pdf_options | PdfConfig? | None | PDF-specific configuration options |
| images | ImageExtractionConfig? | None | Image extraction configuration |
| chunking | ChunkingConfig? | None | Text chunking configuration for splitting into chunks |
| token_reduction | TokenReductionConfig? | None | Token reduction configuration for optimizing LLM context |
| language_detection | LanguageDetectionConfig? | None | Automatic language detection configuration |
| postprocessor | PostProcessorConfig? | None | Post-processing pipeline configuration |
| pages | PageConfig? | None | Page extraction and tracking configuration |
| max_concurrent_extractions | int? | None | Maximum concurrent batch extractions (defaults to num_cpus * 2) |
Example¶
package main
import (
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/kreuzberg"
)
func main() {
useCache := true
enableQP := true
result, err := kreuzberg.ExtractFileSync("document.pdf", &kreuzberg.ExtractionConfig{
UseCache: &useCache,
EnableQualityProcessing: &enableQP,
})
if err != nil {
log.Fatalf("extract failed: %v", err)
}
log.Println("content length:", len(result.Content))
}
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
ExtractionConfig config = ExtractionConfig.builder()
.useCache(true)
.enableQualityProcessing(true)
.build();
ExtractionResult result = Kreuzberg.extractFile("document.pdf", config);
use kreuzberg::{extract_file, ExtractionConfig};
#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
use_cache: true,
enable_quality_processing: true,
..Default::default()
};
let result = extract_file("document.pdf", None, &config).await?;
println!("{}", result.content);
Ok(())
}
OcrConfig¶
Configuration for OCR (Optical Character Recognition) processing on images and scanned PDFs.
| Field | Type | Default | Description |
|---|---|---|---|
| backend | str | "tesseract" | OCR backend to use: "tesseract", "easyocr", "paddleocr" |
| language | str | "eng" | Language code(s) for OCR, e.g., "eng", "eng+fra", "eng+deu+fra" |
| tesseract_config | TesseractConfig? | None | Tesseract-specific configuration options |
Example¶
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;
import dev.kreuzberg.config.TesseractConfig;
ExtractionConfig config = ExtractionConfig.builder()
.ocr(OcrConfig.builder()
.backend("tesseract")
.language("eng+fra")
.tesseractConfig(TesseractConfig.builder()
.psm(3)
.build())
.build())
.build();
import asyncio
from kreuzberg import ExtractionConfig, OcrConfig, TesseractConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        ocr=OcrConfig(
            backend="tesseract", language="eng+fra",
            tesseract_config=TesseractConfig(psm=3)
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(result.content)

asyncio.run(main())
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};
fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "tesseract".to_string(),
language: Some("eng+deu+fra".to_string()),
..Default::default()
}),
..Default::default()
};
let result = extract_file_sync("multilingual.pdf", None, &config)?;
println!("{}", result.content);
Ok(())
}
TesseractConfig¶
Tesseract OCR engine configuration with fine-grained control over recognition parameters.
| Field | Type | Default | Description |
|---|---|---|---|
| language | str | "eng" | Language code(s), e.g., "eng", "eng+fra" |
| psm | int | 3 | Page Segmentation Mode (0-13, see below) |
| output_format | str | "markdown" | Output format: "text", "markdown", "hocr" |
| oem | int | 3 | OCR Engine Mode (0-3, see below) |
| min_confidence | float | 0.0 | Minimum confidence threshold (0.0-100.0) |
| preprocessing | ImagePreprocessingConfig? | None | Image preprocessing configuration |
| enable_table_detection | bool | true | Enable automatic table detection and reconstruction |
| table_min_confidence | float | 0.0 | Minimum confidence for table cell recognition (0.0-1.0) |
| table_column_threshold | int | 50 | Pixel threshold for detecting table columns |
| table_row_threshold_ratio | float | 0.5 | Row threshold ratio for table detection (0.0-1.0) |
| use_cache | bool | true | Enable OCR result caching for faster re-processing |
| classify_use_pre_adapted_templates | bool | true | Use pre-adapted templates for character classification |
| language_model_ngram_on | bool | false | Enable N-gram language model for better word recognition |
| tessedit_dont_blkrej_good_wds | bool | true | Don't reject good words during block-level processing |
| tessedit_dont_rowrej_good_wds | bool | true | Don't reject good words during row-level processing |
| tessedit_enable_dict_correction | bool | true | Enable dictionary-based word correction |
| tessedit_char_whitelist | str | "" | Allowed characters (empty = all allowed) |
| tessedit_char_blacklist | str | "" | Forbidden characters (empty = none forbidden) |
| tessedit_use_primary_params_model | bool | true | Use primary language params model |
| textord_space_size_is_variable | bool | true | Enable variable-width space detection |
| thresholding_method | bool | false | Use adaptive thresholding method |
Page Segmentation Modes (PSM)¶
- 0: Orientation and script detection only (no OCR)
- 1: Automatic page segmentation with OSD (Orientation and Script Detection)
- 2: Automatic page segmentation (no OSD, no OCR)
- 3: Fully automatic page segmentation (default, best for most documents)
- 4: Single column of text of variable sizes
- 5: Single uniform block of vertically aligned text
- 6: Single uniform block of text (best for clean documents)
- 7: Single text line
- 8: Single word
- 9: Single word in a circle
- 10: Single character
- 11: Sparse text with no particular order (best for forms, invoices)
- 12: Sparse text with OSD
- 13: Raw line (bypass Tesseract's layout analysis)
OCR Engine Modes (OEM)¶
- 0: Legacy Tesseract engine only (pre-2016)
- 1: Neural nets LSTM engine only (recommended for best quality)
- 2: Legacy + LSTM engines combined
- 3: Default based on what's available (recommended for compatibility)
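Because psm and oem are plain integers, an out-of-range value is easy to introduce by accident. The sketch below checks both against the documented ranges before a config is built. `validate_tesseract_modes` and `RECOMMENDED_PSM` are hypothetical names used for illustration; whether Kreuzberg performs a similar check internally is not stated here.

```python
# Ranges taken from the PSM (0-13) and OEM (0-3) lists in this section.
PSM_RANGE = range(0, 14)
OEM_RANGE = range(0, 4)

# Shortcuts for the "best for ..." hints in the PSM list (illustrative keys).
RECOMMENDED_PSM = {"general": 3, "clean": 6, "sparse": 11}


def validate_tesseract_modes(psm: int = 3, oem: int = 3) -> dict:
    """Check psm/oem against the documented ranges and return them as a dict."""
    if psm not in PSM_RANGE:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    if oem not in OEM_RANGE:
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    return {"psm": psm, "oem": oem}
```

For example, `validate_tesseract_modes(psm=RECOMMENDED_PSM["sparse"], oem=1)` yields values suitable for a form or invoice with the LSTM engine.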
Example¶
using Kreuzberg;
var config = new ExtractionConfig
{
Ocr = new OcrConfig
{
Language = "eng+fra+deu",
TesseractConfig = new TesseractConfig
{
Psm = 6,
Oem = 1,
MinConfidence = 0.8m,
EnableTableDetection = true
}
}
};
var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Content: {result.Content[..Math.Min(100, result.Content.Length)]}");
package main
import (
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/kreuzberg"
)
func main() {
psm := 6
oem := 1
minConf := 0.8
lang := "eng+fra+deu"
whitelist := "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?"
config := &kreuzberg.ExtractionConfig{
OCR: &kreuzberg.OCRConfig{
Backend: "tesseract",
Language: &lang,
Tesseract: &kreuzberg.TesseractConfig{
PSM: &psm,
OEM: &oem,
MinConfidence: &minConf,
EnableTableDetection: kreuzberg.BoolPtr(true),
TesseditCharWhitelist: whitelist,
},
},
}
result, err := kreuzberg.ExtractFileSync("document.pdf", config)
if err != nil {
log.Fatalf("extract failed: %v", err)
}
log.Println("content length:", len(result.Content))
}
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;
import dev.kreuzberg.config.TesseractConfig;
ExtractionConfig config = ExtractionConfig.builder()
.ocr(OcrConfig.builder()
.language("eng+fra+deu")
.tesseractConfig(TesseractConfig.builder()
.psm(6)
.oem(1)
.minConfidence(0.8)
.tesseditCharWhitelist("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?")
.enableTableDetection(true)
.build())
.build())
.build();
import asyncio
from kreuzberg import ExtractionConfig, OcrConfig, TesseractConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        ocr=OcrConfig(
            language="eng+fra+deu",
            tesseract_config=TesseractConfig(
                psm=6,
                oem=1,
                min_confidence=0.8,
                enable_table_detection=True,
            ),
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(f"Content: {result.content[:100]}")

asyncio.run(main())
require 'kreuzberg'
config = Kreuzberg::Config::Extraction.new(
ocr: Kreuzberg::Config::OCR.new(
language: 'eng+fra+deu',
tesseract_config: Kreuzberg::Config::Tesseract.new(
psm: 6,
oem: 1,
min_confidence: 0.8,
tessedit_char_whitelist: 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?',
enable_table_detection: true
)
)
)
use kreuzberg::{ExtractionConfig, OcrConfig, TesseractConfig};
fn main() {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
language: Some("eng+fra+deu".to_string()),
tesseract_config: Some(TesseractConfig {
psm: Some(6),
oem: Some(1),
min_confidence: Some(0.8),
tessedit_char_whitelist: Some("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?".to_string()),
enable_table_detection: Some(true),
..Default::default()
}),
..Default::default()
}),
..Default::default()
};
println!("{:?}", config.ocr);
}
import { extractFile } from '@kreuzberg/node';
const config = {
ocr: {
backend: 'tesseract',
language: 'eng+fra+deu',
tesseractConfig: {
psm: 6,
tesseditCharWhitelist: 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?',
enableTableDetection: true,
},
},
};
const result = await extractFile('document.pdf', null, config);
console.log(result.content);
ChunkingConfig¶
Configuration for splitting extracted text into overlapping chunks, useful for vector databases and LLM processing.
| Field | Type | Default | Description |
|---|---|---|---|
| max_chars | int | 1000 | Maximum characters per chunk |
| max_overlap | int | 200 | Overlap between consecutive chunks in characters |
| embedding | EmbeddingConfig? | None | Optional embedding generation for each chunk |
| preset | str? | None | Chunking preset: "small" (500/100), "medium" (1000/200), "large" (2000/400) |
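To make the interaction between max_chars and max_overlap concrete, here is a minimal sliding-window splitter. It is a plain character-based sketch written for illustration; Kreuzberg's own chunker may split on different boundaries, and `chunk_text` is a hypothetical helper, not a library function.

```python
from typing import List

# Preset sizes (max_chars, max_overlap) as listed in the table above.
PRESETS = {"small": (500, 100), "medium": (1000, 200), "large": (2000, 400)}


def chunk_text(text: str, max_chars: int = 1000, max_overlap: int = 200) -> List[str]:
    """Split text into windows of up to max_chars that share max_overlap characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= max_overlap < max_chars:
        raise ValueError("max_overlap must be in [0, max_chars)")
    step = max_chars - max_overlap  # how far each window advances
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks
```

With the defaults, each chunk starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next. A preset can be applied with `chunk_text(text, *PRESETS["small"])`.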
Example¶
using Kreuzberg;

class Program
{
    static async Task Main()
    {
        var config = new ExtractionConfig
        {
            Chunking = new ChunkingConfig
            {
                MaxChars = 1000,
                MaxOverlap = 200,
                Embedding = new EmbeddingConfig
                {
                    Model = EmbeddingModelType.Preset("all-minilm-l6-v2"),
                    Normalize = true,
                    BatchSize = 32
                }
            }
        };

        try
        {
            var result = await KreuzbergClient.ExtractFileAsync(
                "document.pdf",
                config
            ).ConfigureAwait(false);
            Console.WriteLine($"Chunks: {result.Chunks.Count}");
            foreach (var chunk in result.Chunks)
            {
                Console.WriteLine($"Content length: {chunk.Content.Length}");
                if (chunk.Embedding != null)
                {
                    Console.WriteLine($"Embedding dimensions: {chunk.Embedding.Length}");
                }
            }
        }
        catch (KreuzbergException ex)
        {
            Console.WriteLine($"Error: {ex.Message}");
        }
    }
}
package main
import (
"fmt"
"github.com/kreuzberg-dev/kreuzberg/packages/go/kreuzberg"
)
func main() {
maxChars := 1000
maxOverlap := 200
config := &kreuzberg.ExtractionConfig{
Chunking: &kreuzberg.ChunkingConfig{
MaxChars: &maxChars,
MaxOverlap: &maxOverlap,
},
}
fmt.Printf("Config: MaxChars=%d, MaxOverlap=%d\n", *config.Chunking.MaxChars, *config.Chunking.MaxOverlap)
}
import asyncio
from kreuzberg import ExtractionConfig, ChunkingConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        chunking=ChunkingConfig(
            max_chars=1000,
            max_overlap=200,
            separator="sentence"
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(f"Chunks: {len(result.chunks or [])}")
    for chunk in result.chunks or []:
        print(f"Length: {len(chunk.content)}")

asyncio.run(main())
LanguageDetectionConfig¶
Configuration for automatic language detection in extracted text.
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable language detection |
| min_confidence | float | 0.8 | Minimum confidence threshold (0.0-1.0) for reporting detected languages |
| detect_multiple | bool | false | Detect multiple languages (vs. dominant language only) |
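The effect of min_confidence and detect_multiple can be pictured as a filter over scored candidates. The sketch below is an illustration only: `select_languages` is a hypothetical helper, the input scores are made up, and treating the threshold as inclusive is an assumption.

```python
from typing import List, Tuple


def select_languages(
    candidates: List[Tuple[str, float]],
    min_confidence: float = 0.8,
    detect_multiple: bool = False,
) -> List[str]:
    """Filter (language, confidence) pairs the way the options above describe."""
    # Keep candidates at or above the threshold, highest confidence first.
    kept = sorted(
        (c for c in candidates if c[1] >= min_confidence),
        key=lambda c: c[1],
        reverse=True,
    )
    if not detect_multiple:
        kept = kept[:1]  # dominant language only
    return [lang for lang, _ in kept]
```

Given candidates `[("eng", 0.95), ("deu", 0.85), ("fra", 0.40)]`, the defaults report only the dominant language, while `detect_multiple=True` reports every language that clears the threshold.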
Example¶
using Kreuzberg;
var config = new ExtractionConfig
{
LanguageDetection = new LanguageDetectionConfig
{
Enabled = true,
MinConfidence = 0.9m,
DetectMultiple = true
}
};
var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Languages: {string.Join(", ", result.DetectedLanguages ?? new List<string>())}");
package main
import (
"fmt"
"github.com/kreuzberg-dev/kreuzberg/packages/go/kreuzberg"
)
func main() {
minConfidence := 0.8
config := &kreuzberg.ExtractionConfig{
LanguageDetection: &kreuzberg.LanguageDetectionConfig{
Enabled: true,
MinConfidence: &minConfidence,
DetectMultiple: false,
},
}
fmt.Printf("Language detection enabled: %v\n", config.LanguageDetection.Enabled)
fmt.Printf("Min confidence: %f\n", *config.LanguageDetection.MinConfidence)
}
import asyncio
from kreuzberg import ExtractionConfig, LanguageDetectionConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        language_detection=LanguageDetectionConfig(
            enabled=True,
            min_confidence=0.85,
            detect_multiple=False
        )
    )
    result = await extract_file("document.pdf", config=config)
    if result.detected_languages:
        print(f"Primary language: {result.detected_languages[0]}")
    print(f"Content length: {len(result.content)} chars")

asyncio.run(main())
import { extractFile } from '@kreuzberg/node';
const config = {
languageDetection: {
enabled: true,
minConfidence: 0.8,
detectMultiple: false,
},
};
const result = await extractFile('document.pdf', null, config);
if (result.detectedLanguages) {
console.log(`Detected languages: ${result.detectedLanguages.join(', ')}`);
}
PdfConfig¶
PDF-specific extraction configuration.
| Field | Type | Default | Description |
|---|---|---|---|
| extract_images | bool | false | Extract embedded images from PDF pages |
| extract_metadata | bool | true | Extract PDF metadata (title, author, creation date, etc.) |
| passwords | list[str]? | None | List of passwords to try for encrypted PDFs (tries in order) |
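The "tries in order" behaviour of passwords amounts to a first-match loop. The following sketch shows that loop in isolation. `first_working_password` and the `try_open` callback are hypothetical stand-ins for whatever the PDF backend does internally.

```python
from typing import Callable, Iterable, Optional


def first_working_password(
    passwords: Iterable[str],
    try_open: Callable[[str], bool],
) -> Optional[str]:
    """Return the first password accepted by try_open, or None if all fail."""
    for password in passwords:
        if try_open(password):
            return password  # stop at the first success; later entries are not tried
    return None
```

One practical consequence of this ordering is that placing the most likely password first avoids unnecessary decryption attempts.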
Example¶
using Kreuzberg;
var config = new ExtractionConfig
{
PdfOptions = new PdfConfig
{
ExtractImages = true,
ExtractMetadata = true,
Passwords = new List<string> { "password1", "password2" }
}
};
var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Content: {result.Content[..Math.Min(100, result.Content.Length)]}");
package main
import (
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/kreuzberg"
)
func main() {
pw := []string{"password1", "password2"}
result, err := kreuzberg.ExtractFileSync("document.pdf", &kreuzberg.ExtractionConfig{
PdfOptions: &kreuzberg.PdfConfig{
ExtractImages: kreuzberg.BoolPtr(true),
ExtractMetadata: kreuzberg.BoolPtr(true),
Passwords: pw,
},
})
if err != nil {
log.Fatalf("extract failed: %v", err)
}
log.Println("content length:", len(result.Content))
}
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.PdfConfig;
import java.util.Arrays;
ExtractionConfig config = ExtractionConfig.builder()
.pdfOptions(PdfConfig.builder()
.extractImages(true)
.extractMetadata(true)
.passwords(Arrays.asList("password1", "password2"))
.build())
.build();
import asyncio
from kreuzberg import ExtractionConfig, PdfConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        pdf_options=PdfConfig(
            extract_images=True,
            extract_metadata=True,
            passwords=["password1", "password2"],
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(f"Content: {result.content[:100]}")

asyncio.run(main())
use kreuzberg::{ExtractionConfig, PdfConfig};
fn main() {
let config = ExtractionConfig {
pdf_options: Some(PdfConfig {
extract_images: Some(true),
extract_metadata: Some(true),
passwords: Some(vec!["password1".to_string(), "password2".to_string()]),
}),
..Default::default()
};
println!("{:?}", config.pdf_options);
}
PageConfig¶
Configuration for page extraction and tracking.
Controls whether to extract per-page content and how to mark page boundaries in the combined text output.
Configuration¶
| Field | Type | Default | Description |
|---|---|---|---|
| extract_pages | bool | false | Extract pages as separate array in results |
| insert_page_markers | bool | false | Insert page markers in combined content string |
| marker_format | str | "\n\n<!-- PAGE {page_num} -->\n\n" | Template for page markers (use {page_num} placeholder) |
Example¶
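A minimal sketch of how the marker template might be rendered into the combined content. It assumes page numbers start at 1 and that a marker precedes each page. `join_pages` is a hypothetical helper written for illustration, not part of the Kreuzberg API.

```python
from typing import List

# Default template from the table above.
DEFAULT_MARKER_FORMAT = "\n\n<!-- PAGE {page_num} -->\n\n"


def join_pages(pages: List[str], marker_format: str = DEFAULT_MARKER_FORMAT) -> str:
    """Concatenate page texts, placing a rendered marker before each page."""
    parts: List[str] = []
    for page_num, text in enumerate(pages, start=1):
        # str.replace is used instead of str.format so that other braces
        # in a custom template are left untouched.
        parts.append(marker_format.replace("{page_num}", str(page_num)))
        parts.append(text)
    return "".join(parts)
```

A custom template such as `"[page {page_num}]\n"` works the same way; only the `{page_num}` placeholder is substituted.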
Field Details¶
extract_pages: When true, populates ExtractionResult.pages with per-page content. Each page contains its text, tables, and images separately.
insert_page_markers: When true, inserts page markers into the combined content string at page boundaries. Useful for LLMs to understand document structure.
marker_format: Template string for page markers. Use {page_num} placeholder for the page number. Default HTML comment format is LLM-friendly.
Format Support¶
- PDF: Full byte-accurate page tracking with O(1) lookup performance
- PPTX: Slide boundary tracking with per-slide content
- DOCX: Best-effort page break detection using explicit page breaks
- Other formats: Page tracking not available (returns None/null)
ImageExtractionConfig¶
Configuration for extracting and processing images from documents.
| Field | Type | Default | Description |
|---|---|---|---|
| extract_images | bool | true | Extract images from documents |
| target_dpi | int | 300 | Target DPI for extracted/normalized images |
| max_image_dimension | int | 4096 | Maximum image dimension (width or height) in pixels |
| auto_adjust_dpi | bool | true | Automatically adjust DPI based on image size and content |
| min_dpi | int | 72 | Minimum DPI when auto-adjusting |
| max_dpi | int | 600 | Maximum DPI when auto-adjusting |
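One way to picture how target_dpi, max_image_dimension, min_dpi, and max_dpi could interact is a simple clamp. The formula below is an assumption made for illustration, not Kreuzberg's documented algorithm (which also takes content into account), and `adjust_dpi` is a hypothetical helper.

```python
def adjust_dpi(
    width_in: float,
    height_in: float,
    target_dpi: int = 300,
    max_image_dimension: int = 4096,
    min_dpi: int = 72,
    max_dpi: int = 600,
) -> int:
    """Pick a DPI near target_dpi that keeps the rendered image within bounds."""
    longest_side_in = max(width_in, height_in)
    if longest_side_in <= 0:
        raise ValueError("page dimensions must be positive")
    # Largest DPI at which the longest side still fits in max_image_dimension.
    dpi_limit = int(max_image_dimension / longest_side_in)
    dpi = min(target_dpi, dpi_limit)
    # Clamp into the [min_dpi, max_dpi] window; min_dpi wins for very large pages.
    return max(min_dpi, min(dpi, max_dpi))
```

Under these assumptions a US Letter page (8.5 x 11 in) keeps the 300 DPI target, while a 36 x 48 in poster is rendered at a lower DPI so that its longest side stays within 4096 pixels.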
Example¶
using Kreuzberg;
var config = new ExtractionConfig
{
Images = new ImageExtractionConfig
{
ExtractImages = true,
TargetDpi = 200,
MaxImageDimension = 2048,
AutoAdjustDpi = true
}
};
var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Extracted: {result.Content[..Math.Min(100, result.Content.Length)]}");
package main
import (
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/kreuzberg"
)
func main() {
targetDPI := 200
maxDim := 2048
result, err := kreuzberg.ExtractFileSync("document.pdf", &kreuzberg.ExtractionConfig{
ImageExtraction: &kreuzberg.ImageExtractionConfig{
ExtractImages: kreuzberg.BoolPtr(true),
TargetDPI: &targetDPI,
MaxImageDimension: &maxDim,
AutoAdjustDPI: kreuzberg.BoolPtr(true),
},
})
if err != nil {
log.Fatalf("extract failed: %v", err)
}
log.Println("content length:", len(result.Content))
}
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.ImageExtractionConfig;
ExtractionConfig config = ExtractionConfig.builder()
.imageExtraction(ImageExtractionConfig.builder()
.extractImages(true)
.targetDpi(200)
.maxImageDimension(2048)
.autoAdjustDpi(true)
.build())
.build();
import asyncio
from kreuzberg import ExtractionConfig, ImageExtractionConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        images=ImageExtractionConfig(
            extract_images=True,
            target_dpi=200,
            max_image_dimension=2048,
            auto_adjust_dpi=True,
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(f"Extracted: {result.content[:100]}")

asyncio.run(main())
use kreuzberg::{ExtractionConfig, ImageExtractionConfig};
fn main() {
let config = ExtractionConfig {
images: Some(ImageExtractionConfig {
extract_images: Some(true),
target_dpi: Some(200),
max_image_dimension: Some(2048),
auto_adjust_dpi: Some(true),
..Default::default()
}),
..Default::default()
};
println!("{:?}", config.images);
}
ImagePreprocessingConfig¶
Image preprocessing configuration for improving OCR quality on scanned documents.
| Field | Type | Default | Description |
|---|---|---|---|
| target_dpi | int | 300 | Target DPI for OCR processing (300 standard, 600 for small text) |
| auto_rotate | bool | true | Auto-detect and correct image rotation |
| deskew | bool | true | Correct skew (tilted images) |
| denoise | bool | false | Apply noise reduction filter |
| contrast_enhance | bool | false | Enhance image contrast for better text visibility |
| binarization_method | str | "otsu" | Binarization method: "otsu", "sauvola", "adaptive", "none" |
| invert_colors | bool | false | Invert colors (useful for white text on black background) |
Example¶
using Kreuzberg;
var config = new ExtractionConfig
{
Ocr = new OcrConfig
{
TesseractConfig = new TesseractConfig
{
Preprocessing = new ImagePreprocessingConfig
{
TargetDpi = 300,
Denoise = true,
Deskew = true,
ContrastEnhance = true,
BinarizationMethod = "otsu"
}
}
}
};
var result = await KreuzbergClient.ExtractFileAsync("scanned.pdf", config);
Console.WriteLine($"Content: {result.Content[..Math.Min(100, result.Content.Length)]}");
package main
import (
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/kreuzberg"
)
func main() {
targetDPI := 300
config := &kreuzberg.ExtractionConfig{
OCR: &kreuzberg.OCRConfig{
Tesseract: &kreuzberg.TesseractConfig{
Preprocessing: &kreuzberg.ImagePreprocessingConfig{
TargetDPI: &targetDPI,
Denoise: kreuzberg.BoolPtr(true),
Deskew: kreuzberg.BoolPtr(true),
ContrastEnhance: kreuzberg.BoolPtr(true),
BinarizationMode: kreuzberg.StringPtr("otsu"),
},
},
},
}
result, err := kreuzberg.ExtractFileSync("document.pdf", config)
if err != nil {
log.Fatalf("extract failed: %v", err)
}
log.Println("content length:", len(result.Content))
}
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.ImagePreprocessingConfig;
import dev.kreuzberg.config.OcrConfig;
import dev.kreuzberg.config.TesseractConfig;
ExtractionConfig config = ExtractionConfig.builder()
.ocr(OcrConfig.builder()
.tesseractConfig(TesseractConfig.builder()
.preprocessing(ImagePreprocessingConfig.builder()
.targetDpi(300)
.denoise(true)
.deskew(true)
.contrastEnhance(true)
.binarizationMethod("otsu")
.build())
.build())
.build())
.build();
import asyncio
from kreuzberg import (
    ExtractionConfig,
    OcrConfig,
    TesseractConfig,
    ImagePreprocessingConfig,
    extract_file,
)

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        ocr=OcrConfig(
            tesseract_config=TesseractConfig(
                preprocessing=ImagePreprocessingConfig(
                    target_dpi=300,
                    denoise=True,
                    deskew=True,
                    contrast_enhance=True,
                    binarization_method="otsu",
                )
            )
        )
    )
    result = await extract_file("scanned.pdf", config=config)
    print(f"Content: {result.content[:100]}")

asyncio.run(main())
require 'kreuzberg'
config = Kreuzberg::Config::Extraction.new(
ocr: Kreuzberg::Config::OCR.new(
tesseract_config: Kreuzberg::Config::Tesseract.new(
preprocessing: Kreuzberg::Config::ImagePreprocessing.new(
target_dpi: 300,
denoise: true,
deskew: true,
contrast_enhance: true,
binarization_method: 'otsu'
)
)
)
)
use kreuzberg::{ExtractionConfig, ImagePreprocessingConfig, OcrConfig, TesseractConfig};
fn main() {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
tesseract_config: Some(TesseractConfig {
preprocessing: Some(ImagePreprocessingConfig {
target_dpi: Some(300),
denoise: Some(true),
deskew: Some(true),
contrast_enhance: Some(true),
binarization_method: Some("otsu".to_string()),
..Default::default()
}),
..Default::default()
}),
..Default::default()
}),
..Default::default()
};
println!("{:?}", config.ocr);
}
PostProcessorConfig¶
Configuration for the post-processing pipeline that runs after extraction.
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable post-processing pipeline |
| enabled_processors | list[str]? | None | Specific processors to enable (if None, all enabled by default) |
| disabled_processors | list[str]? | None | Specific processors to disable (takes precedence over enabled_processors) |
Built-in post-processors include:
- deduplication - Remove duplicate text blocks
- whitespace_normalization - Normalize whitespace and line breaks
- mojibake_fix - Fix mojibake (encoding corruption)
- quality_scoring - Score and filter low-quality text
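The precedence rules in the table (None means all processors, and disabled_processors wins over enabled_processors) can be expressed as a small resolution function. This is a sketch of the documented semantics only. `resolve_processors` is a hypothetical helper, and the real pipeline may order or register processors differently.

```python
from typing import List, Optional, Sequence

# Built-in processor names as listed above.
BUILT_IN = ("deduplication", "whitespace_normalization", "mojibake_fix", "quality_scoring")


def resolve_processors(
    enabled: bool = True,
    enabled_processors: Optional[Sequence[str]] = None,
    disabled_processors: Optional[Sequence[str]] = None,
    available: Sequence[str] = BUILT_IN,
) -> List[str]:
    """Return the processors that would run for a given configuration."""
    if not enabled:
        return []  # the whole pipeline is switched off
    # None means "all available processors"; otherwise keep only the listed ones.
    if enabled_processors is None:
        selected = list(available)
    else:
        selected = [name for name in available if name in enabled_processors]
    # disabled_processors takes precedence over enabled_processors.
    blocked = set(disabled_processors or ())
    return [name for name in selected if name not in blocked]
```

For instance, enabling `["deduplication", "mojibake_fix"]` while disabling `["mojibake_fix"]` leaves only deduplication active.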
Example¶
using Kreuzberg;
var config = new ExtractionConfig
{
Postprocessor = new PostProcessorConfig
{
Enabled = true,
EnabledProcessors = new List<string> { "deduplication" }
}
};
var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Content: {result.Content[..Math.Min(100, result.Content.Length)]}");
package main
import "github.com/kreuzberg-dev/kreuzberg/packages/go/kreuzberg"
func main() {
enabled := true
cfg := &kreuzberg.ExtractionConfig{
Postprocessor: &kreuzberg.PostProcessorConfig{
Enabled: &enabled,
EnabledProcessors: []string{"deduplication", "whitespace_normalization"},
DisabledProcessors: []string{"mojibake_fix"},
},
}
_ = cfg
}
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.PostProcessorConfig;
import java.util.Arrays;
ExtractionConfig config = ExtractionConfig.builder()
.postprocessor(PostProcessorConfig.builder()
.enabled(true)
.enabledProcessors(Arrays.asList("deduplication", "whitespace_normalization"))
.disabledProcessors(Arrays.asList("mojibake_fix"))
.build())
.build();
import asyncio
from kreuzberg import ExtractionConfig, PostProcessorConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        postprocessor=PostProcessorConfig(
            enabled=True,
            enabled_processors=["deduplication"],
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(f"Content: {result.content[:100]}")

asyncio.run(main())
use kreuzberg::{ExtractionConfig, PostProcessorConfig};
fn main() {
let config = ExtractionConfig {
postprocessor: Some(PostProcessorConfig {
enabled: Some(true),
enabled_processors: Some(vec![
"deduplication".to_string(),
"whitespace_normalization".to_string(),
]),
disabled_processors: Some(vec!["mojibake_fix".to_string()]),
}),
..Default::default()
};
println!("{:?}", config.postprocessor);
}
import { extractFile } from '@kreuzberg/node';
const config = {
postprocessor: {
enabled: true,
enabledProcessors: ['deduplication', 'whitespace_normalization'],
disabledProcessors: ['mojibake_fix'],
},
};
const result = await extractFile('document.pdf', null, config);
console.log(result.content);
TokenReductionConfig¶
Configuration for reducing token count in extracted text, useful for optimizing LLM context windows.
| Field | Type | Default | Description |
|---|---|---|---|
| mode | str | "off" | Reduction mode: "off", "light", "moderate", "aggressive", "maximum" |
| preserve_important_words | bool | true | Preserve important words (capitalized, technical terms) during reduction |
Reduction Modes¶
- off: No token reduction
- light: Remove redundant whitespace and line breaks (~5-10% reduction)
- moderate: Light + remove stopwords in low-information contexts (~15-25% reduction)
- aggressive: Moderate + abbreviate common phrases (~30-40% reduction)
- maximum: Aggressive + remove all stopwords (~50-60% reduction, may impact quality)
Example¶
package main
import (
"fmt"
"github.com/kreuzberg-dev/kreuzberg/packages/go/kreuzberg"
)
func main() {
config := &kreuzberg.ExtractionConfig{
TokenReduction: &kreuzberg.TokenReductionConfig{
Mode: "moderate",
PreserveImportantWords: kreuzberg.BoolPtr(true),
},
}
fmt.Printf("Mode: %s, Preserve Important Words: %v\n",
config.TokenReduction.Mode,
*config.TokenReduction.PreserveImportantWords)
}
use kreuzberg::{ExtractionConfig, TokenReductionConfig};
fn main() {
    // Fields follow the TokenReductionConfig table above.
    let config = ExtractionConfig {
        token_reduction: Some(TokenReductionConfig {
            mode: "moderate".to_string(),
            preserve_important_words: true,
            ..Default::default()
        }),
        ..Default::default()
    };
    println!("{:?}", config.token_reduction);
}
Configuration File Examples¶
TOML Format¶
use_cache = true
enable_quality_processing = true
force_ocr = false
[ocr]
backend = "tesseract"
language = "eng+fra"
[ocr.tesseract_config]
psm = 6
oem = 1
min_confidence = 0.8
enable_table_detection = true
[ocr.tesseract_config.preprocessing]
target_dpi = 300
denoise = true
deskew = true
contrast_enhance = true
binarization_method = "otsu"
[pdf_options]
extract_images = true
extract_metadata = true
passwords = ["password1", "password2"]
[images]
extract_images = true
target_dpi = 200
max_image_dimension = 4096
[chunking]
max_chars = 1000
max_overlap = 200
[language_detection]
enabled = true
min_confidence = 0.8
detect_multiple = false
[token_reduction]
mode = "moderate"
preserve_important_words = true
[postprocessor]
enabled = true
YAML Format¶
# kreuzberg.yaml
use_cache: true
enable_quality_processing: true
force_ocr: false
ocr:
backend: tesseract
language: eng+fra
tesseract_config:
psm: 6
oem: 1
min_confidence: 0.8
enable_table_detection: true
preprocessing:
target_dpi: 300
denoise: true
deskew: true
contrast_enhance: true
binarization_method: otsu
pdf_options:
extract_images: true
extract_metadata: true
passwords:
- password1
- password2
images:
extract_images: true
target_dpi: 200
max_image_dimension: 4096
chunking:
max_chars: 1000
max_overlap: 200
language_detection:
enabled: true
min_confidence: 0.8
detect_multiple: false
token_reduction:
mode: moderate
preserve_important_words: true
postprocessor:
enabled: true
JSON Format¶
{
"use_cache": true,
"enable_quality_processing": true,
"force_ocr": false,
"ocr": {
"backend": "tesseract",
"language": "eng+fra",
"tesseract_config": {
"psm": 6,
"oem": 1,
"min_confidence": 0.8,
"enable_table_detection": true,
"preprocessing": {
"target_dpi": 300,
"denoise": true,
"deskew": true,
"contrast_enhance": true,
"binarization_method": "otsu"
}
}
},
"pdf_options": {
"extract_images": true,
"extract_metadata": true,
"passwords": ["password1", "password2"]
},
"images": {
"extract_images": true,
"target_dpi": 200,
"max_image_dimension": 4096
},
"chunking": {
"max_chars": 1000,
"max_overlap": 200
},
"language_detection": {
"enabled": true,
"min_confidence": 0.8,
"detect_multiple": false
},
"token_reduction": {
"mode": "moderate",
"preserve_important_words": true
},
"postprocessor": {
"enabled": true
}
}
For complete working examples, see the examples directory.
Best Practices¶
When to Use Config Files vs Programmatic Config¶
Use config files when:
- Settings are shared across multiple scripts/applications
- Configuration needs to be version controlled
- Non-developers need to modify settings
- Deploying to multiple environments (dev/staging/prod)
Use programmatic config when:
- Settings vary per execution or are computed dynamically
- Configuration depends on runtime conditions
- Building SDKs or libraries that wrap Kreuzberg
- Rapid prototyping and experimentation
Performance Considerations¶
Caching:
- Keep use_cache=true for repeated processing of the same files
- Cache is automatically invalidated when files change
- Cache location: ~/.cache/kreuzberg/ (configurable via environment)
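Invalidation on file change is commonly implemented by deriving the cache key from file metadata. The sketch below shows that general technique. It is an assumption about how such a cache could work, not a description of Kreuzberg's actual key scheme, and `cache_key` is a hypothetical helper.

```python
import hashlib
import os


def cache_key(path: str) -> str:
    """Derive a key that changes whenever the file's size or mtime changes."""
    stat = os.stat(path)
    # Absolute path + size + nanosecond mtime identify one version of the file.
    fingerprint = f"{os.path.abspath(path)}:{stat.st_size}:{stat.st_mtime_ns}"
    return hashlib.sha256(fingerprint.encode("utf-8")).hexdigest()
```

With a key like this, an edited file simply maps to a new cache entry, so stale results are never returned for it.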
OCR Settings:
- Lower target_dpi (e.g., 150-200) for faster processing of low-quality scans
- Higher target_dpi (e.g., 400-600) for small text or high-quality documents
- Disable enable_table_detection if tables aren't needed (10-20% speedup)
- Use psm=6 for clean single-column documents (faster than psm=3)
Batch Processing:
- Set max_concurrent_extractions to balance speed and memory usage
- Default (num_cpus * 2) works well for most systems
- Reduce for memory-constrained environments
- Increase for I/O-bound workloads on systems with fast storage
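The trade-off behind max_concurrent_extractions is the usual bounded-concurrency pattern. The sketch below reproduces it with an asyncio semaphore and a caller-supplied coroutine. `run_limited` and `default_concurrency` are hypothetical helpers, and the `extract` argument stands in for a real extraction call.

```python
import asyncio
import os
from typing import Awaitable, Callable, List, Optional


def default_concurrency() -> int:
    """Mirror the documented default of num_cpus * 2."""
    return (os.cpu_count() or 1) * 2


async def run_limited(
    paths: List[str],
    extract: Callable[[str], Awaitable[str]],
    max_concurrent: Optional[int] = None,
) -> List[str]:
    """Run extract over paths with at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent or default_concurrency())

    async def worker(path: str) -> str:
        async with semaphore:  # blocks while the limit is reached
            return await extract(path)

    # gather returns results in the same order as the input paths.
    return list(await asyncio.gather(*(worker(p) for p in paths)))
```

Lowering `max_concurrent` caps how many documents are held in memory at once, and raising it helps when most of the time is spent waiting on I/O.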
Token Reduction:
- Use "light" or "moderate" modes for minimal quality impact
- "aggressive" and "maximum" modes may affect semantic meaning
- Benchmark with your specific LLM to measure quality vs. cost tradeoff
Security Considerations¶
API Keys and Secrets:
- Never commit config files containing API keys or passwords to version control
- Use environment variables for sensitive data
- Add kreuzberg.toml to .gitignore if it contains secrets
- Use separate config files for development vs. production
PDF Passwords:
- The passwords field attempts passwords in order until one succeeds
- Passwords are not logged or cached
- Use environment variables for sensitive passwords
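A minimal sketch of reading PDF passwords from the environment instead of a config file. The variable name `KREUZBERG_PDF_PASSWORDS` and the comma separator are assumptions chosen for this example; Kreuzberg does not necessarily read such a variable itself, so the resulting list would be passed to the passwords field programmatically.

```python
import os
from typing import List, Mapping


def passwords_from_env(
    var_name: str = "KREUZBERG_PDF_PASSWORDS",  # hypothetical variable name
    environ: Mapping[str, str] = os.environ,
    separator: str = ",",
) -> List[str]:
    """Read a separator-delimited password list, preserving its order."""
    raw = environ.get(var_name, "")
    # Drop empty entries (e.g. from a trailing separator) but keep the order,
    # since passwords are tried in the order given.
    return [item for item in raw.split(separator) if item]
```

The returned list can be supplied as `PdfConfig(passwords=passwords_from_env())`, which keeps the secrets out of any file that might be committed.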
File System Access:
- Kreuzberg only reads files you explicitly pass to extraction functions
- Cache directory permissions should be restricted to the running user
- Temporary files are automatically cleaned up after extraction
Data Privacy:
- Extraction results are never sent to external services (except explicit OCR backends)
- Tesseract OCR runs locally with no network access
- EasyOCR and PaddleOCR may download models on first run (cached locally)
- Consider disabling cache for sensitive documents requiring ephemeral processing
Related Documentation¶
- Configuration Guide - Usage guide with examples
- OCR Guide - OCR-specific configuration and troubleshooting
- Examples Directory - Complete working examples