Configuration Reference

This page provides a comprehensive reference for all Kreuzberg configuration types. For usage guides and examples, see the Configuration Guide.

Overview

Kreuzberg supports multiple configuration methods:

  1. TOML files - Preferred format, clear syntax
  2. YAML files - Alternative format
  3. JSON files - For programmatic generation
  4. Programmatic - Direct object instantiation
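
As a rough illustration of generating a JSON configuration programmatically, the sketch below builds one from plain dictionaries. The key names are assumed to mirror the field names in the tables on this page; the exact file schema is not confirmed here, and `build_config` is a hypothetical helper, not part of Kreuzberg.

```python
import json


def build_config(languages, max_chars=1000, max_overlap=200):
    """Build a config dictionary; key names are assumed to match the reference tables."""
    return {
        "use_cache": True,
        # Multiple OCR languages are joined with "+", e.g. "eng+fra".
        "ocr": {"backend": "tesseract", "language": "+".join(languages)},
        "chunking": {"max_chars": max_chars, "max_overlap": max_overlap},
    }


# Serialize to JSON text that could be written to a kreuzberg.json file.
config_text = json.dumps(build_config(["eng", "fra"]), indent=2)
print(config_text)
```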

Configuration Discovery

Kreuzberg automatically discovers configuration files in this order:

  1. Current directory: ./kreuzberg.{toml,yaml,yml,json}
  2. User config: ~/.config/kreuzberg/config.{toml,yaml,yml,json}
  3. System config: /etc/kreuzberg/config.{toml,yaml,yml,json}

For complete examples, see the examples directory.
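
The search order above can be sketched as a small lookup routine. This is only an illustration of the documented order, not Kreuzberg's implementation: `candidate_paths` and `discover_config` are hypothetical names, and the precedence between extensions within one directory (toml, then yaml, yml, json) is an assumption.

```python
from pathlib import Path
from typing import List, Optional

# Assumed precedence between extensions inside a single directory.
EXTENSIONS = ("toml", "yaml", "yml", "json")


def candidate_paths(cwd: Path, home: Path, etc: Path = Path("/etc")) -> List[Path]:
    """List candidate files in the documented search order."""
    bases = [
        cwd / "kreuzberg",                          # ./kreuzberg.{ext}
        home / ".config" / "kreuzberg" / "config",  # ~/.config/kreuzberg/config.{ext}
        etc / "kreuzberg" / "config",               # /etc/kreuzberg/config.{ext}
    ]
    return [base.with_suffix("." + ext) for base in bases for ext in EXTENSIONS]


def discover_config(cwd: Path, home: Path, etc: Path = Path("/etc")) -> Optional[Path]:
    """Return the first candidate that exists as a file, or None."""
    for path in candidate_paths(cwd, home, etc):
        if path.is_file():
            return path
    return None


# Prints the first matching file on this machine, or None if there is none.
print(discover_config(Path.cwd(), Path.home()))
```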


ExtractionConfig

Main extraction configuration controlling all aspects of document processing.

| Field | Type | Default | Description |
|---|---|---|---|
| use_cache | bool | true | Enable caching of extraction results for faster re-processing |
| enable_quality_processing | bool | true | Enable quality post-processing (deduplication, mojibake fixing, etc.) |
| force_ocr | bool | false | Force OCR even for searchable PDFs with text layers |
| ocr | OcrConfig? | None | OCR configuration (if None, OCR disabled) |
| pdf_options | PdfConfig? | None | PDF-specific configuration options |
| images | ImageExtractionConfig? | None | Image extraction configuration |
| chunking | ChunkingConfig? | None | Text chunking configuration for splitting into chunks |
| token_reduction | TokenReductionConfig? | None | Token reduction configuration for optimizing LLM context |
| language_detection | LanguageDetectionConfig? | None | Automatic language detection configuration |
| postprocessor | PostProcessorConfig? | None | Post-processing pipeline configuration |
| pages | PageConfig? | None | Page extraction and tracking configuration |
| max_concurrent_extractions | int? | None | Maximum concurrent batch extractions (defaults to num_cpus * 2) |

Example

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    UseCache = true,
    EnableQualityProcessing = true,
    ForceOcr = false,
};

var result = KreuzbergClient.ExtractFileSync("document.pdf", config);
Go
package main

import (
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/kreuzberg"
)

func main() {
    useCache := true
    enableQP := true

    result, err := kreuzberg.ExtractFileSync("document.pdf", &kreuzberg.ExtractionConfig{
        UseCache:                &useCache,
        EnableQualityProcessing: &enableQP,
    })
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }

    log.Println("content length:", len(result.Content))
}
Java
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .useCache(true)
    .enableQualityProcessing(true)
    .build();
ExtractionResult result = Kreuzberg.extractFile("document.pdf", config);
Python
import asyncio
from kreuzberg import extract_file, ExtractionConfig

async def main() -> None:
    config = ExtractionConfig(
        use_cache=True,
        enable_quality_processing=True
    )
    result = await extract_file("document.pdf", config=config)
    print(result.content)

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  use_cache: true,
  enable_quality_processing: true
)

result = Kreuzberg.extract_file_sync('document.pdf', config: config)
Rust
use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        use_cache: true,
        enable_quality_processing: true,
        ..Default::default()
    };

    let result = extract_file("document.pdf", None, &config).await?;
    println!("{}", result.content);
    Ok(())
}
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    useCache: true,
    enableQualityProcessing: true,
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

OcrConfig

Configuration for OCR (Optical Character Recognition) processing on images and scanned PDFs.

| Field | Type | Default | Description |
|---|---|---|---|
| backend | str | "tesseract" | OCR backend to use: "tesseract", "easyocr", "paddleocr" |
| language | str | "eng" | Language code(s) for OCR, e.g., "eng", "eng+fra", "eng+deu+fra" |
| tesseract_config | TesseractConfig? | None | Tesseract-specific configuration options |

Example

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    Ocr = new OcrConfig
    {
        Backend = "tesseract",
        Language = "eng+fra",
        TesseractConfig = new TesseractConfig { Psm = 3 }
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine(result.Content);
Go
package main

import "github.com/kreuzberg-dev/kreuzberg/packages/go/kreuzberg"

func main() {
    language := "eng+fra"
    psm := 3

    _ = &kreuzberg.ExtractionConfig{
        OCR: &kreuzberg.OCRConfig{
            Backend:  "tesseract",
            Language: &language,
            Tesseract: &kreuzberg.TesseractConfig{
                PSM: &psm,
            },
        },
    }
}
Java
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;
import dev.kreuzberg.config.TesseractConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .ocr(OcrConfig.builder()
        .backend("tesseract")
        .language("eng+fra")
        .tesseractConfig(TesseractConfig.builder()
            .psm(3)
            .build())
        .build())
    .build();
Python
import asyncio
from kreuzberg import ExtractionConfig, OcrConfig, TesseractConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        ocr=OcrConfig(
            backend="tesseract", language="eng+fra",
            tesseract_config=TesseractConfig(psm=3)
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(result.content)

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  ocr: Kreuzberg::Config::OCR.new(
    backend: 'tesseract',
    language: 'eng+fra',
    tesseract_config: Kreuzberg::Config::Tesseract.new(psm: 3)
  )
)
Rust
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "tesseract".to_string(),
            language: Some("eng+deu+fra".to_string()),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("multilingual.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    ocr: {
        backend: 'tesseract',
        language: 'eng+fra',
        tesseractConfig: {
            psm: 3,
        },
    },
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

TesseractConfig

Tesseract OCR engine configuration with fine-grained control over recognition parameters.

| Field | Type | Default | Description |
|---|---|---|---|
| language | str | "eng" | Language code(s), e.g., "eng", "eng+fra" |
| psm | int | 3 | Page Segmentation Mode (0-13, see below) |
| output_format | str | "markdown" | Output format: "text", "markdown", "hocr" |
| oem | int | 3 | OCR Engine Mode (0-3, see below) |
| min_confidence | float | 0.0 | Minimum confidence threshold (0.0-100.0) |
| preprocessing | ImagePreprocessingConfig? | None | Image preprocessing configuration |
| enable_table_detection | bool | true | Enable automatic table detection and reconstruction |
| table_min_confidence | float | 0.0 | Minimum confidence for table cell recognition (0.0-1.0) |
| table_column_threshold | int | 50 | Pixel threshold for detecting table columns |
| table_row_threshold_ratio | float | 0.5 | Row threshold ratio for table detection (0.0-1.0) |
| use_cache | bool | true | Enable OCR result caching for faster re-processing |
| classify_use_pre_adapted_templates | bool | true | Use pre-adapted templates for character classification |
| language_model_ngram_on | bool | false | Enable N-gram language model for better word recognition |
| tessedit_dont_blkrej_good_wds | bool | true | Don't reject good words during block-level processing |
| tessedit_dont_rowrej_good_wds | bool | true | Don't reject good words during row-level processing |
| tessedit_enable_dict_correction | bool | true | Enable dictionary-based word correction |
| tessedit_char_whitelist | str | "" | Allowed characters (empty = all allowed) |
| tessedit_char_blacklist | str | "" | Forbidden characters (empty = none forbidden) |
| tessedit_use_primary_params_model | bool | true | Use primary language params model |
| textord_space_size_is_variable | bool | true | Enable variable-width space detection |
| thresholding_method | bool | false | Use adaptive thresholding method |

Page Segmentation Modes (PSM)

  • 0: Orientation and script detection only (no OCR)
  • 1: Automatic page segmentation with OSD (Orientation and Script Detection)
  • 2: Automatic page segmentation (no OSD, no OCR)
  • 3: Fully automatic page segmentation (default, best for most documents)
  • 4: Single column of text of variable sizes
  • 5: Single uniform block of vertically aligned text
  • 6: Single uniform block of text (best for clean documents)
  • 7: Single text line
  • 8: Single word
  • 9: Single word in a circle
  • 10: Single character
  • 11: Sparse text with no particular order (best for forms, invoices)
  • 12: Sparse text with OSD
  • 13: Raw line (bypass Tesseract's layout analysis)

OCR Engine Modes (OEM)

  • 0: Legacy Tesseract engine only (pre-2016)
  • 1: Neural nets LSTM engine only (recommended for best quality)
  • 2: Legacy + LSTM engines combined
  • 3: Default based on what's available (recommended for compatibility)
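
The documented ranges for both modes (PSM 0-13, OEM 0-3) can be checked before a configuration is built. The following is a minimal sketch of such a check; `validate_tesseract_modes` is a hypothetical helper and not part of the Kreuzberg API.

```python
# Valid values taken from the two lists above.
PSM_RANGE = range(0, 14)  # 0 through 13
OEM_RANGE = range(0, 4)   # 0 through 3


def validate_tesseract_modes(psm: int = 3, oem: int = 3) -> None:
    """Raise ValueError if psm or oem falls outside the documented range."""
    if psm not in PSM_RANGE:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    if oem not in OEM_RANGE:
        raise ValueError(f"oem must be between 0 and 3, got {oem}")


# Single uniform block of text, LSTM engine only: passes silently.
validate_tesseract_modes(psm=6, oem=1)
```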

Example

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    Ocr = new OcrConfig
    {
        Language = "eng+fra+deu",
        TesseractConfig = new TesseractConfig
        {
            Psm = 6,
            Oem = 1,
            MinConfidence = 0.8m,
            EnableTableDetection = true
        }
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Content: {result.Content[..Math.Min(100, result.Content.Length)]}");
Go
package main

import (
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/kreuzberg"
)

func main() {
    psm := 6
    oem := 1
    minConf := 0.8
    lang := "eng+fra+deu"
    whitelist := "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?"

    config := &kreuzberg.ExtractionConfig{
        OCR: &kreuzberg.OCRConfig{
            Backend:  "tesseract",
            Language: &lang,
            Tesseract: &kreuzberg.TesseractConfig{
                PSM:              &psm,
                OEM:              &oem,
                MinConfidence:    &minConf,
                EnableTableDetection: kreuzberg.BoolPtr(true),
                TesseditCharWhitelist: whitelist,
            },
        },
    }

    result, err := kreuzberg.ExtractFileSync("document.pdf", config)
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }

    log.Println("content length:", len(result.Content))
}
Java
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.OcrConfig;
import dev.kreuzberg.config.TesseractConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .ocr(OcrConfig.builder()
        .language("eng+fra+deu")
        .tesseractConfig(TesseractConfig.builder()
            .psm(6)
            .oem(1)
            .minConfidence(0.8)
            .tesseditCharWhitelist("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?")
            .enableTableDetection(true)
            .build())
        .build())
    .build();
Python
import asyncio
from kreuzberg import ExtractionConfig, OcrConfig, TesseractConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        ocr=OcrConfig(
            language="eng+fra+deu",
            tesseract_config=TesseractConfig(
                psm=6,
                oem=1,
                min_confidence=0.8,
                enable_table_detection=True,
            ),
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(f"Content: {result.content[:100]}")

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  ocr: Kreuzberg::Config::OCR.new(
    language: 'eng+fra+deu',
    tesseract_config: Kreuzberg::Config::Tesseract.new(
      psm: 6,
      oem: 1,
      min_confidence: 0.8,
      tessedit_char_whitelist: 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?',
      enable_table_detection: true
    )
  )
)
Rust
use kreuzberg::{ExtractionConfig, OcrConfig, TesseractConfig};

fn main() {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            language: Some("eng+fra+deu".to_string()),
            tesseract_config: Some(TesseractConfig {
                psm: Some(6),
                oem: Some(1),
                min_confidence: Some(0.8),
                tessedit_char_whitelist: Some("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?".to_string()),
                enable_table_detection: Some(true),
                ..Default::default()
            }),
            ..Default::default()
        }),
        ..Default::default()
    };
    println!("{:?}", config.ocr);
}
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    ocr: {
        backend: 'tesseract',
        language: 'eng+fra+deu',
        tesseractConfig: {
            psm: 6,
            tesseditCharWhitelist: 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?',
            enableTableDetection: true,
        },
    },
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

ChunkingConfig

Configuration for splitting extracted text into overlapping chunks, useful for vector databases and LLM processing.

| Field | Type | Default | Description |
|---|---|---|---|
| max_chars | int | 1000 | Maximum characters per chunk |
| max_overlap | int | 200 | Overlap between consecutive chunks in characters |
| embedding | EmbeddingConfig? | None | Optional embedding generation for each chunk |
| preset | str? | None | Chunking preset: "small" (500/100), "medium" (1000/200), "large" (2000/400) |
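
To make the interaction between max_chars and max_overlap concrete, here is a simplified sliding-window sketch. It splits purely on character counts; Kreuzberg's actual chunker may respect sentence or paragraph boundaries, so treat this as an illustration of the two parameters only. `chunk_text` and `PRESETS` are hypothetical names, with the preset values copied from the table above.

```python
from typing import List

# (max_chars, max_overlap) pairs from the preset column above.
PRESETS = {"small": (500, 100), "medium": (1000, 200), "large": (2000, 400)}


def chunk_text(text: str, max_chars: int = 1000, max_overlap: int = 200) -> List[str]:
    """Split text into windows of at most max_chars characters, where each
    window starts max_overlap characters before the previous one ends."""
    if max_overlap >= max_chars:
        raise ValueError("max_overlap must be smaller than max_chars")
    step = max_chars - max_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


max_chars, max_overlap = PRESETS["small"]
sample = "".join(str(i % 10) for i in range(1200))
pieces = chunk_text(sample, max_chars, max_overlap)
print([len(p) for p in pieces])  # prints [500, 500, 400]
```

With the "small" preset, each chunk advances by 400 characters, so the last 100 characters of one chunk reappear at the start of the next.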

Example

C#
using Kreuzberg;

class Program
{
    static async Task Main()
    {
        var config = new ExtractionConfig
        {
            Chunking = new ChunkingConfig
            {
                MaxChars = 1000,
                MaxOverlap = 200,
                Embedding = new EmbeddingConfig
                {
                    Model = EmbeddingModelType.Preset("all-minilm-l6-v2"),
                    Normalize = true,
                    BatchSize = 32
                }
            }
        };

        try
        {
            var result = await KreuzbergClient.ExtractFileAsync(
                "document.pdf",
                config
            ).ConfigureAwait(false);

            Console.WriteLine($"Chunks: {result.Chunks.Count}");
            foreach (var chunk in result.Chunks)
            {
                Console.WriteLine($"Content length: {chunk.Content.Length}");
                if (chunk.Embedding != null)
                {
                    Console.WriteLine($"Embedding dimensions: {chunk.Embedding.Length}");
                }
            }
        }
        catch (KreuzbergException ex)
        {
            Console.WriteLine($"Error: {ex.Message}");
        }
    }
}

Go
package main

import (
    "fmt"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/kreuzberg"
)

func main() {
    maxChars := 1000
    maxOverlap := 200
    config := &kreuzberg.ExtractionConfig{
        Chunking: &kreuzberg.ChunkingConfig{
            MaxChars:   &maxChars,
            MaxOverlap: &maxOverlap,
        },
    }

    fmt.Printf("Config: MaxChars=%d, MaxOverlap=%d\n", *config.Chunking.MaxChars, *config.Chunking.MaxOverlap)
}
Java
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.ChunkingConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .chunking(ChunkingConfig.builder()
        .maxChars(1000)
        .maxOverlap(200)
        .build())
    .build();
Python
import asyncio
from kreuzberg import ExtractionConfig, ChunkingConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        chunking=ChunkingConfig(
            max_chars=1000,
            max_overlap=200,
            separator="sentence"
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(f"Chunks: {len(result.chunks or [])}")
    for chunk in result.chunks or []:
        print(f"Length: {len(chunk.content)}")

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  chunking: Kreuzberg::Config::Chunking.new(
    max_chars: 1000,
    max_overlap: 200
  )
)
Rust
use kreuzberg::{ExtractionConfig, ChunkingConfig};

let config = ExtractionConfig {
    chunking: Some(ChunkingConfig {
        max_chars: 1000,
        max_overlap: 200,
        embedding: None,
    }),
    ..Default::default()
};
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    chunking: {
        maxChars: 1000,
        maxOverlap: 200,
    },
};

const result = await extractFile('document.pdf', null, config);
console.log(`Total chunks: ${result.chunks?.length ?? 0}`);

LanguageDetectionConfig

Configuration for automatic language detection in extracted text.

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable language detection |
| min_confidence | float | 0.8 | Minimum confidence threshold (0.0-1.0) for reporting detected languages |
| detect_multiple | bool | false | Detect multiple languages (vs. dominant language only) |
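
One plausible reading of how min_confidence and detect_multiple combine is sketched below. The input format (a list of language code and confidence pairs) and the function `select_languages` are assumptions made for illustration; the detector's real output and filtering logic are not described on this page.

```python
from typing import List, Tuple


def select_languages(
    candidates: List[Tuple[str, float]],
    min_confidence: float = 0.8,
    detect_multiple: bool = False,
) -> List[str]:
    """Keep languages at or above min_confidence, highest confidence first.
    When detect_multiple is False, only the dominant language is returned."""
    kept = sorted(
        (c for c in candidates if c[1] >= min_confidence),
        key=lambda c: c[1],
        reverse=True,
    )
    codes = [code for code, _ in kept]
    return codes if detect_multiple else codes[:1]


# Hypothetical detector output for a mostly English document.
scores = [("fra", 0.85), ("eng", 0.95), ("deu", 0.40)]
print(select_languages(scores))                        # prints ['eng']
print(select_languages(scores, detect_multiple=True))  # prints ['eng', 'fra']
```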

Example

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    LanguageDetection = new LanguageDetectionConfig
    {
        Enabled = true,
        MinConfidence = 0.9m,
        DetectMultiple = true
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Languages: {string.Join(", ", result.DetectedLanguages ?? new List<string>())}");
Go
package main

import (
    "fmt"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/kreuzberg"
)

func main() {
    minConfidence := 0.8
    config := &kreuzberg.ExtractionConfig{
        LanguageDetection: &kreuzberg.LanguageDetectionConfig{
            Enabled:        true,
            MinConfidence:  &minConfidence,
            DetectMultiple: false,
        },
    }

    fmt.Printf("Language detection enabled: %v\n", config.LanguageDetection.Enabled)
    fmt.Printf("Min confidence: %f\n", *config.LanguageDetection.MinConfidence)
}
Java
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.LanguageDetectionConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .languageDetection(LanguageDetectionConfig.builder()
        .enabled(true)
        .minConfidence(0.8)
        .build())
    .build();
Python
import asyncio
from kreuzberg import ExtractionConfig, LanguageDetectionConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        language_detection=LanguageDetectionConfig(
            enabled=True,
            min_confidence=0.85,
            detect_multiple=False
        )
    )
    result = await extract_file("document.pdf", config=config)
    if result.detected_languages:
        print(f"Primary language: {result.detected_languages[0]}")
    print(f"Content length: {len(result.content)} chars")

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  language_detection: Kreuzberg::Config::LanguageDetection.new(
    enabled: true,
    min_confidence: 0.8,
    detect_multiple: false
  )
)
Rust
use kreuzberg::{ExtractionConfig, LanguageDetectionConfig};

let config = ExtractionConfig {
    language_detection: Some(LanguageDetectionConfig {
        enabled: true,
        min_confidence: 0.8,
        detect_multiple: false,
    }),
    ..Default::default()
};
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    languageDetection: {
        enabled: true,
        minConfidence: 0.8,
        detectMultiple: false,
    },
};

const result = await extractFile('document.pdf', null, config);
if (result.detectedLanguages) {
    console.log(`Detected languages: ${result.detectedLanguages.join(', ')}`);
}

PdfConfig

PDF-specific extraction configuration.

| Field | Type | Default | Description |
|---|---|---|---|
| extract_images | bool | false | Extract embedded images from PDF pages |
| extract_metadata | bool | true | Extract PDF metadata (title, author, creation date, etc.) |
| passwords | list[str]? | None | List of passwords to try for encrypted PDFs (tries in order) |
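
The "tries in order" behavior of passwords can be pictured as a simple loop that stops at the first success. In this sketch, `try_open` is a hypothetical callback standing in for whatever decryption attempt the PDF backend performs; it is not a Kreuzberg function.

```python
from typing import Callable, Iterable, Optional


def first_working_password(
    passwords: Iterable[str],
    try_open: Callable[[str], bool],
) -> Optional[str]:
    """Try each password in list order and return the first one that opens
    the document, or None if none of them work."""
    for password in passwords:
        if try_open(password):
            return password
    return None


# Stand-in for a real decryption attempt: only "password2" succeeds.
attempts = []

def fake_open(password: str) -> bool:
    attempts.append(password)
    return password == "password2"


found = first_working_password(["password1", "password2", "password3"], fake_open)
print(found, attempts)
```

Because the loop stops at the first match, placing the most likely password first avoids unnecessary decryption attempts.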

Example

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        ExtractImages = true,
        ExtractMetadata = true,
        Passwords = new List<string> { "password1", "password2" }
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Content: {result.Content[..Math.Min(100, result.Content.Length)]}");
Go
package main

import (
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/kreuzberg"
)

func main() {
    pw := []string{"password1", "password2"}
    result, err := kreuzberg.ExtractFileSync("document.pdf", &kreuzberg.ExtractionConfig{
        PdfOptions: &kreuzberg.PdfConfig{
            ExtractImages:   kreuzberg.BoolPtr(true),
            ExtractMetadata: kreuzberg.BoolPtr(true),
            Passwords:       pw,
        },
    })
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }

    log.Println("content length:", len(result.Content))
}
Java
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.PdfConfig;
import java.util.Arrays;

ExtractionConfig config = ExtractionConfig.builder()
    .pdfOptions(PdfConfig.builder()
        .extractImages(true)
        .extractMetadata(true)
        .passwords(Arrays.asList("password1", "password2"))
        .build())
    .build();
Python
import asyncio
from kreuzberg import ExtractionConfig, PdfConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        pdf_options=PdfConfig(
            extract_images=True,
            extract_metadata=True,
            passwords=["password1", "password2"],
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(f"Content: {result.content[:100]}")

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  pdf_options: Kreuzberg::Config::PDF.new(
    extract_images: true,
    extract_metadata: true,
    passwords: ['password1', 'password2']
  )
)
Rust
use kreuzberg::{ExtractionConfig, PdfConfig};

fn main() {
    let config = ExtractionConfig {
        pdf_options: Some(PdfConfig {
            extract_images: Some(true),
            extract_metadata: Some(true),
            passwords: Some(vec!["password1".to_string(), "password2".to_string()]),
        }),
        ..Default::default()
    };
    println!("{:?}", config.pdf_options);
}
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    pdfOptions: {
        extractImages: true,
        extractMetadata: true,
        passwords: ['password1', 'password2'],
    },
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

PageConfig

Configuration for page extraction and tracking.

Controls whether to extract per-page content and how to mark page boundaries in the combined text output.

Configuration

| Field | Type | Default | Description |
|---|---|---|---|
| extract_pages | bool | false | Extract pages as separate array in results |
| insert_page_markers | bool | false | Insert page markers in combined content string |
| marker_format | str | "\n\n<!-- PAGE {page_num} -->\n\n" | Template for page markers (use {page_num} placeholder) |

Example

page_config.cs
var config = new ExtractionConfig
{
    Pages = new PageConfig
    {
        ExtractPages = true,
        InsertPageMarkers = true,
        MarkerFormat = "\n\n--- Page {page_num} ---\n\n"
    }
};
page_config.go
config := &ExtractionConfig{
    Pages: &PageConfig{
        ExtractPages:      true,
        InsertPageMarkers: true,
        MarkerFormat:      "\n\n--- Page {page_num} ---\n\n",
    },
}
PageConfig.java
var config = ExtractionConfig.builder()
    .pages(PageConfig.builder()
        .extractPages(true)
        .insertPageMarkers(true)
        .markerFormat("\n\n--- Page {page_num} ---\n\n")
        .build())
    .build();
page_config.py
config = ExtractionConfig(
    pages=PageConfig(
        extract_pages=True,
        insert_page_markers=True,
        marker_format="\n\n--- Page {page_num} ---\n\n"
    )
)
page_config.rb
config = ExtractionConfig.new(
  pages: PageConfig.new(
    extract_pages: true,
    insert_page_markers: true,
    marker_format: "\n\n--- Page {page_num} ---\n\n"
  )
)
page_config.rs
let config = ExtractionConfig {
    pages: Some(PageConfig {
        extract_pages: true,
        insert_page_markers: true,
        marker_format: "\n\n--- Page {page_num} ---\n\n".to_string(),
    }),
    ..Default::default()
};
page_config.ts
const config: ExtractionConfig = {
  pages: {
    extractPages: true,
    insertPageMarkers: true,
    markerFormat: "\n\n--- Page {page_num} ---\n\n"
  }
};

Field Details

extract_pages: When true, populates ExtractionResult.pages with per-page content. Each page contains its text, tables, and images separately.

insert_page_markers: When true, inserts page markers into the combined content string at page boundaries. Useful for LLMs to understand document structure.

marker_format: Template string for page markers. Use the {page_num} placeholder for the page number. The default HTML comment format is LLM-friendly.
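
The effect of marker_format can be approximated with plain string substitution. The sketch below assumes 1-based page numbers and a marker placed before every page, including the first; both are assumptions, and `join_pages` is a hypothetical helper rather than Kreuzberg's implementation.

```python
from typing import List

# Default template from the configuration table above.
DEFAULT_MARKER = "\n\n<!-- PAGE {page_num} -->\n\n"


def join_pages(pages: List[str], marker_format: str = DEFAULT_MARKER) -> str:
    """Concatenate page texts, inserting a rendered marker before each page."""
    parts = []
    for number, text in enumerate(pages, start=1):
        # Substitute the page number into the template.
        parts.append(marker_format.replace("{page_num}", str(number)))
        parts.append(text)
    return "".join(parts)


combined = join_pages(["First page text.", "Second page text."], "\n\n--- Page {page_num} ---\n\n")
print(combined)
```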

Format Support

  • PDF: Full byte-accurate page tracking with O(1) lookup performance
  • PPTX: Slide boundary tracking with per-slide content
  • DOCX: Best-effort page break detection using explicit page breaks
  • Other formats: Page tracking not available (returns None/null)

ImageExtractionConfig

Configuration for extracting and processing images from documents.

| Field | Type | Default | Description |
|---|---|---|---|
| extract_images | bool | true | Extract images from documents |
| target_dpi | int | 300 | Target DPI for extracted/normalized images |
| max_image_dimension | int | 4096 | Maximum image dimension (width or height) in pixels |
| auto_adjust_dpi | bool | true | Automatically adjust DPI based on image size and content |
| min_dpi | int | 72 | Minimum DPI when auto-adjusting |
| max_dpi | int | 600 | Maximum DPI when auto-adjusting |
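
How these limits might interact is sketched below: start from target_dpi, lower it if the rendered image would exceed max_image_dimension, then clamp to the min_dpi and max_dpi bounds. This is an assumed model for illustration only; the page does not specify the actual adjustment algorithm, and `effective_dpi` is a hypothetical function.

```python
def effective_dpi(
    width_in: float,
    height_in: float,
    target_dpi: int = 300,
    max_image_dimension: int = 4096,
    min_dpi: int = 72,
    max_dpi: int = 600,
) -> int:
    """Pick a DPI for a page measured in inches (assumed adjustment model)."""
    longest_in = max(width_in, height_in)
    dpi = target_dpi
    # Scale down when the longest side would exceed the pixel limit.
    if longest_in * dpi > max_image_dimension:
        dpi = int(max_image_dimension / longest_in)
    # Clamp to the configured bounds.
    return max(min_dpi, min(max_dpi, dpi))


print(effective_dpi(8.5, 11))  # US Letter: 3300 px at 300 DPI fits, prints 300
print(effective_dpi(24, 36))   # poster: scaled down, prints 113
```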

Example

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    Images = new ImageExtractionConfig
    {
        ExtractImages = true,
        TargetDpi = 200,
        MaxImageDimension = 2048,
        AutoAdjustDpi = true
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Extracted: {result.Content[..Math.Min(100, result.Content.Length)]}");
Go
package main

import (
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/kreuzberg"
)

func main() {
    targetDPI := 200
    maxDim := 2048
    result, err := kreuzberg.ExtractFileSync("document.pdf", &kreuzberg.ExtractionConfig{
        ImageExtraction: &kreuzberg.ImageExtractionConfig{
            ExtractImages:     kreuzberg.BoolPtr(true),
            TargetDPI:         &targetDPI,
            MaxImageDimension: &maxDim,
            AutoAdjustDPI:     kreuzberg.BoolPtr(true),
        },
    })
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }

    log.Println("content length:", len(result.Content))
}
Java
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.ImageExtractionConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .imageExtraction(ImageExtractionConfig.builder()
        .extractImages(true)
        .targetDpi(200)
        .maxImageDimension(2048)
        .autoAdjustDpi(true)
        .build())
    .build();
Python
import asyncio
from kreuzberg import ExtractionConfig, ImageExtractionConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        images=ImageExtractionConfig(
            extract_images=True,
            target_dpi=200,
            max_image_dimension=2048,
            auto_adjust_dpi=True,
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(f"Extracted: {result.content[:100]}")

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  images: Kreuzberg::Config::ImageExtraction.new(
    extract_images: true,
    target_dpi: 200,
    max_image_dimension: 2048,
    auto_adjust_dpi: true
  )
)
Rust
use kreuzberg::{ExtractionConfig, ImageExtractionConfig};

fn main() {
    let config = ExtractionConfig {
        images: Some(ImageExtractionConfig {
            extract_images: Some(true),
            target_dpi: Some(200),
            max_image_dimension: Some(2048),
            auto_adjust_dpi: Some(true),
            ..Default::default()
        }),
        ..Default::default()
    };
    println!("{:?}", config.images);
}
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    images: {
        extractImages: true,
        targetDpi: 200,
        maxImageDimension: 2048,
        autoAdjustDpi: true,
    },
};

const result = await extractFile('document.pdf', null, config);
console.log(`Extracted ${result.images?.length ?? 0} images`);

ImagePreprocessingConfig

Image preprocessing configuration for improving OCR quality on scanned documents.

| Field | Type | Default | Description |
|---|---|---|---|
| target_dpi | int | 300 | Target DPI for OCR processing (300 standard, 600 for small text) |
| auto_rotate | bool | true | Auto-detect and correct image rotation |
| deskew | bool | true | Correct skew (tilted images) |
| denoise | bool | false | Apply noise reduction filter |
| contrast_enhance | bool | false | Enhance image contrast for better text visibility |
| binarization_method | str | "otsu" | Binarization method: "otsu", "sauvola", "adaptive", "none" |
| invert_colors | bool | false | Invert colors (useful for white text on black background) |

Example

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    Ocr = new OcrConfig
    {
        TesseractConfig = new TesseractConfig
        {
            Preprocessing = new ImagePreprocessingConfig
            {
                TargetDpi = 300,
                Denoise = true,
                Deskew = true,
                ContrastEnhance = true,
                BinarizationMethod = "otsu"
            }
        }
    }
};

var result = await KreuzbergClient.ExtractFileAsync("scanned.pdf", config);
Console.WriteLine($"Content: {result.Content[..Math.Min(100, result.Content.Length)]}");
Go
package main

import (
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/kreuzberg"
)

func main() {
    targetDPI := 300
    config := &kreuzberg.ExtractionConfig{
        OCR: &kreuzberg.OCRConfig{
            Tesseract: &kreuzberg.TesseractConfig{
                Preprocessing: &kreuzberg.ImagePreprocessingConfig{
                    TargetDPI:         &targetDPI,
                    Denoise:           kreuzberg.BoolPtr(true),
                    Deskew:            kreuzberg.BoolPtr(true),
                    ContrastEnhance:   kreuzberg.BoolPtr(true),
                    BinarizationMode:  kreuzberg.StringPtr("otsu"),
                },
            },
        },
    }

    result, err := kreuzberg.ExtractFileSync("document.pdf", config)
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }

    log.Println("content length:", len(result.Content))
}
Java
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.ImagePreprocessingConfig;
import dev.kreuzberg.config.OcrConfig;
import dev.kreuzberg.config.TesseractConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .ocr(OcrConfig.builder()
        .tesseractConfig(TesseractConfig.builder()
            .preprocessing(ImagePreprocessingConfig.builder()
                .targetDpi(300)
                .denoise(true)
                .deskew(true)
                .contrastEnhance(true)
                .binarizationMethod("otsu")
                .build())
            .build())
        .build())
    .build();
Python
import asyncio
from kreuzberg import (
    ExtractionConfig,
    OcrConfig,
    TesseractConfig,
    ImagePreprocessingConfig,
    extract_file,
)

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        ocr=OcrConfig(
            tesseract_config=TesseractConfig(
                preprocessing=ImagePreprocessingConfig(
                    target_dpi=300,
                    denoise=True,
                    deskew=True,
                    contrast_enhance=True,
                    binarization_method="otsu",
                )
            )
        )
    )
    result = await extract_file("scanned.pdf", config=config)
    print(f"Content: {result.content[:100]}")

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  ocr: Kreuzberg::Config::OCR.new(
    tesseract_config: Kreuzberg::Config::Tesseract.new(
      preprocessing: Kreuzberg::Config::ImagePreprocessing.new(
        target_dpi: 300,
        denoise: true,
        deskew: true,
        contrast_enhance: true,
        binarization_method: 'otsu'
      )
    )
  )
)
Rust
use kreuzberg::{ExtractionConfig, ImagePreprocessingConfig, OcrConfig, TesseractConfig};

fn main() {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            tesseract_config: Some(TesseractConfig {
                preprocessing: Some(ImagePreprocessingConfig {
                    target_dpi: Some(300),
                    denoise: Some(true),
                    deskew: Some(true),
                    contrast_enhance: Some(true),
                    binarization_method: Some("otsu".to_string()),
                    ..Default::default()
                }),
                ..Default::default()
            }),
            ..Default::default()
        }),
        ..Default::default()
    };

    println!("{:?}", config.ocr);
}
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    ocr: {
        backend: 'tesseract',
        tesseractConfig: {
            psm: 6,
            enableTableDetection: true,
        },
    },
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

PostProcessorConfig

Configuration for the post-processing pipeline that runs after extraction.

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable post-processing pipeline |
| enabled_processors | list[str]? | None | Specific processors to enable (if None, all enabled by default) |
| disabled_processors | list[str]? | None | Specific processors to disable (takes precedence over enabled_processors) |

Built-in post-processors include:

  • deduplication - Remove duplicate text blocks
  • whitespace_normalization - Normalize whitespace and line breaks
  • mojibake_fix - Fix mojibake (encoding corruption)
  • quality_scoring - Score and filter low-quality text
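The interaction between enabled, enabled_processors, and disabled_processors can be sketched in a few lines of Python. This is an illustration of the rules stated in the table, not Kreuzberg's implementation: `resolve_processors` and `BUILT_IN` are hypothetical names, and the real pipeline may treat unknown or custom processor names differently.

```python
from typing import Optional

# Built-in processor names as listed in this reference.
BUILT_IN = [
    "deduplication",
    "whitespace_normalization",
    "mojibake_fix",
    "quality_scoring",
]


def resolve_processors(
    enabled: bool = True,
    enabled_processors: Optional[list[str]] = None,
    disabled_processors: Optional[list[str]] = None,
) -> list[str]:
    """Hypothetical helper: return the processors that would run."""
    if not enabled:
        return []  # the whole pipeline is switched off
    # None means "all processors enabled by default".
    candidates = BUILT_IN if enabled_processors is None else enabled_processors
    blocked = set(disabled_processors or [])
    # disabled_processors takes precedence over enabled_processors.
    return [name for name in candidates if name not in blocked]


print(resolve_processors(
    enabled_processors=["deduplication", "mojibake_fix"],
    disabled_processors=["mojibake_fix"],
))
```

Listing a processor in both fields leaves it disabled, which is why the multi-language examples below can name mojibake_fix in disabled_processors without contradiction.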

Example

C#
using Kreuzberg;
using System.Collections.Generic; // for List<string>; redundant if implicit usings are enabled

var config = new ExtractionConfig
{
    Postprocessor = new PostProcessorConfig
    {
        Enabled = true,
        EnabledProcessors = new List<string> { "deduplication" }
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Content: {result.Content[..Math.Min(100, result.Content.Length)]}");
Go
package main

import "github.com/kreuzberg-dev/kreuzberg/packages/go/kreuzberg"

func main() {
    enabled := true
    cfg := &kreuzberg.ExtractionConfig{
        Postprocessor: &kreuzberg.PostProcessorConfig{
            Enabled:            &enabled,
            EnabledProcessors:  []string{"deduplication", "whitespace_normalization"},
            DisabledProcessors: []string{"mojibake_fix"},
        },
    }

    _ = cfg
}
Java
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.PostProcessorConfig;
import java.util.Arrays;

ExtractionConfig config = ExtractionConfig.builder()
    .postprocessor(PostProcessorConfig.builder()
        .enabled(true)
        .enabledProcessors(Arrays.asList("deduplication", "whitespace_normalization"))
        .disabledProcessors(Arrays.asList("mojibake_fix"))
        .build())
    .build();
Python
import asyncio
from kreuzberg import ExtractionConfig, PostProcessorConfig, extract_file

async def main() -> None:
    config: ExtractionConfig = ExtractionConfig(
        postprocessor=PostProcessorConfig(
            enabled=True,
            enabled_processors=["deduplication"],
        )
    )
    result = await extract_file("document.pdf", config=config)
    print(f"Content: {result.content[:100]}")

asyncio.run(main())
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  postprocessor: Kreuzberg::Config::PostProcessor.new(
    enabled: true,
    enabled_processors: ['deduplication', 'whitespace_normalization'],
    disabled_processors: ['mojibake_fix']
  )
)
Rust
use kreuzberg::{ExtractionConfig, PostProcessorConfig};

fn main() {
    let config = ExtractionConfig {
        postprocessor: Some(PostProcessorConfig {
            enabled: Some(true),
            enabled_processors: Some(vec![
                "deduplication".to_string(),
                "whitespace_normalization".to_string(),
            ]),
            disabled_processors: Some(vec!["mojibake_fix".to_string()]),
        }),
        ..Default::default()
    };
    println!("{:?}", config.postprocessor);
}
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    postprocessor: {
        enabled: true,
        enabledProcessors: ['deduplication', 'whitespace_normalization'],
        disabledProcessors: ['mojibake_fix'],
    },
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

TokenReductionConfig

Configuration for reducing token count in extracted text, useful for optimizing LLM context windows.

| Field | Type | Default | Description |
|---|---|---|---|
| mode | str | "off" | Reduction mode: "off", "light", "moderate", "aggressive", "maximum" |
| preserve_important_words | bool | true | Preserve important words (capitalized, technical terms) during reduction |

Reduction Modes

  • off: No token reduction
  • light: Remove redundant whitespace and line breaks (~5-10% reduction)
  • moderate: Light + remove stopwords in low-information contexts (~15-25% reduction)
  • aggressive: Moderate + abbreviate common phrases (~30-40% reduction)
  • maximum: Aggressive + remove all stopwords (~50-60% reduction, may impact quality)
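To get a feel for what these percentages mean for a context budget, the ranges above can be turned into a rough estimate. The sketch below only restates the approximate figures from the list; `estimate_remaining_tokens` is a hypothetical helper, and real savings depend on the document and the tokenizer.

```python
# Approximate reduction ranges (percent) copied from the mode list above.
REDUCTION_RANGE_PCT = {
    "off": (0, 0),
    "light": (5, 10),
    "moderate": (15, 25),
    "aggressive": (30, 40),
    "maximum": (50, 60),
}


def estimate_remaining_tokens(token_count: int, mode: str) -> tuple[int, int]:
    """Hypothetical helper: (low, high) estimate of tokens left after reduction."""
    if mode not in REDUCTION_RANGE_PCT:
        raise ValueError(f"unknown mode: {mode!r}")
    min_pct, max_pct = REDUCTION_RANGE_PCT[mode]
    # Integer arithmetic avoids floating-point rounding surprises.
    low = token_count * (100 - max_pct) // 100
    high = token_count * (100 - min_pct) // 100
    return low, high


for mode in REDUCTION_RANGE_PCT:
    print(mode, estimate_remaining_tokens(10_000, mode))
```

For a 10,000-token document, "moderate" yields an estimate of 7,500 to 8,500 remaining tokens.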

Example

C#
using Kreuzberg;

var config = new ExtractionConfig
{
    TokenReduction = new TokenReductionConfig
    {
        Mode = "moderate",
        PreserveImportantWords = true
    }
};

var result = await KreuzbergClient.ExtractFileAsync("document.pdf", config);
Console.WriteLine($"Content length: {result.Content.Length}");
Go
package main

import (
    "fmt"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/kreuzberg"
)

func main() {
    config := &kreuzberg.ExtractionConfig{
        TokenReduction: &kreuzberg.TokenReductionConfig{
            Mode:                   "moderate",
            PreserveImportantWords: kreuzberg.BoolPtr(true),
        },
    }

    fmt.Printf("Mode: %s, Preserve Important Words: %v\n",
        config.TokenReduction.Mode,
        *config.TokenReduction.PreserveImportantWords)
}
Java
import dev.kreuzberg.config.ExtractionConfig;
import dev.kreuzberg.config.TokenReductionConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .tokenReduction(TokenReductionConfig.builder()
        .mode("moderate")
        .preserveImportantWords(true)
        .build())
    .build();
Python
from kreuzberg import ExtractionConfig, TokenReductionConfig

config: ExtractionConfig = ExtractionConfig(
    token_reduction=TokenReductionConfig(
        mode="moderate",
        preserve_important_words=True,
    )
)
Ruby
require 'kreuzberg'

config = Kreuzberg::Config::Extraction.new(
  token_reduction: Kreuzberg::Config::TokenReduction.new(
    mode: 'moderate',
    preserve_important_words: true
  )
)
Rust
use kreuzberg::{ExtractionConfig, TokenReductionConfig};

fn main() {
    let config = ExtractionConfig {
        token_reduction: Some(TokenReductionConfig {
            mode: "moderate".to_string(),
            preserve_important_words: true,
            ..Default::default()
        }),
        ..Default::default()
    };
    println!("{:?}", config.token_reduction);
}
TypeScript
import { extractFile } from '@kreuzberg/node';

const config = {
    tokenReduction: {
        mode: 'moderate',
        preserveImportantWords: true,
    },
};

const result = await extractFile('document.pdf', null, config);
console.log(result.content);

Configuration File Examples

TOML Format

kreuzberg.toml
use_cache = true
enable_quality_processing = true
force_ocr = false

[ocr]
backend = "tesseract"
language = "eng+fra"

[ocr.tesseract_config]
psm = 6
oem = 1
min_confidence = 0.8
enable_table_detection = true

[ocr.tesseract_config.preprocessing]
target_dpi = 300
denoise = true
deskew = true
contrast_enhance = true
binarization_method = "otsu"

[pdf_options]
extract_images = true
extract_metadata = true
passwords = ["password1", "password2"]

[images]
extract_images = true
target_dpi = 200
max_image_dimension = 4096

[chunking]
max_chars = 1000
max_overlap = 200

[language_detection]
enabled = true
min_confidence = 0.8
detect_multiple = false

[token_reduction]
mode = "moderate"
preserve_important_words = true

[postprocessor]
enabled = true

YAML Format

kreuzberg.yaml
# kreuzberg.yaml
use_cache: true
enable_quality_processing: true
force_ocr: false

ocr:
  backend: tesseract
  language: eng+fra
  tesseract_config:
    psm: 6
    oem: 1
    min_confidence: 0.8
    enable_table_detection: true
    preprocessing:
      target_dpi: 300
      denoise: true
      deskew: true
      contrast_enhance: true
      binarization_method: otsu

pdf_options:
  extract_images: true
  extract_metadata: true
  passwords:
    - password1
    - password2

images:
  extract_images: true
  target_dpi: 200
  max_image_dimension: 4096

chunking:
  max_chars: 1000
  max_overlap: 200

language_detection:
  enabled: true
  min_confidence: 0.8
  detect_multiple: false

token_reduction:
  mode: moderate
  preserve_important_words: true

postprocessor:
  enabled: true

JSON Format

kreuzberg.json
{
  "use_cache": true,
  "enable_quality_processing": true,
  "force_ocr": false,
  "ocr": {
    "backend": "tesseract",
    "language": "eng+fra",
    "tesseract_config": {
      "psm": 6,
      "oem": 1,
      "min_confidence": 0.8,
      "enable_table_detection": true,
      "preprocessing": {
        "target_dpi": 300,
        "denoise": true,
        "deskew": true,
        "contrast_enhance": true,
        "binarization_method": "otsu"
      }
    }
  },
  "pdf_options": {
    "extract_images": true,
    "extract_metadata": true,
    "passwords": ["password1", "password2"]
  },
  "images": {
    "extract_images": true,
    "target_dpi": 200,
    "max_image_dimension": 4096
  },
  "chunking": {
    "max_chars": 1000,
    "max_overlap": 200
  },
  "language_detection": {
    "enabled": true,
    "min_confidence": 0.8,
    "detect_multiple": false
  },
  "token_reduction": {
    "mode": "moderate",
    "preserve_important_words": true
  },
  "postprocessor": {
    "enabled": true
  }
}

For complete working examples, see the examples directory.


Best Practices

When to Use Config Files vs Programmatic Config

Use config files when:

  • Settings are shared across multiple scripts/applications
  • Configuration needs to be version controlled
  • Non-developers need to modify settings
  • Deploying to multiple environments (dev/staging/prod)

Use programmatic config when:

  • Settings vary per execution or are computed dynamically
  • Configuration depends on runtime conditions
  • Building SDKs or libraries that wrap Kreuzberg
  • Rapid prototyping and experimentation
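The two approaches can also be combined: shared defaults live in a file, and values computed at run time are layered on top. The sketch below shows that layering with plain dictionaries and the standard library only. `deep_merge` is a hypothetical helper, and the step that turns the merged dictionary into an ExtractionConfig is deliberately left out because it depends on the language binding.

```python
import json


def deep_merge(base: dict, overrides: dict) -> dict:
    """Return a new dict where nested keys in `overrides` win over `base`."""
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Shared, version-controlled settings (normally read from kreuzberg.json).
FILE_SETTINGS = json.loads("""
{
  "use_cache": true,
  "ocr": {"backend": "tesseract", "language": "eng+fra"}
}
""")

# Settings computed at run time, e.g. from a command-line flag.
runtime_overrides = {"force_ocr": True, "ocr": {"language": "deu"}}

settings = deep_merge(FILE_SETTINGS, runtime_overrides)
print(settings)
```

Only ocr.language is replaced; ocr.backend and use_cache still come from the file, and the file settings themselves are not mutated.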

Performance Considerations

Caching:

  • Keep use_cache=true for repeated processing of the same files
  • Cache is automatically invalidated when files change
  • Cache location: ~/.cache/kreuzberg/ (configurable via environment)

OCR Settings:

  • Lower target_dpi (e.g., 150-200) for faster processing of low-quality scans
  • Higher target_dpi (e.g., 400-600) for small text or high-quality documents
  • Disable enable_table_detection if tables aren't needed (10-20% speedup)
  • Use psm=6 for clean single-column documents (faster than psm=3)

Batch Processing:

  • Set max_concurrent_extractions to balance speed and memory usage
  • Default (num_cpus * 2) works well for most systems
  • Reduce for memory-constrained environments
  • Increase for I/O-bound workloads on systems with fast storage
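One way to pick a value for memory-constrained environments is to cap the documented default by a memory budget. The heuristic below is an assumption for illustration, not something Kreuzberg computes: `pick_max_concurrent` is a hypothetical helper, and the 512 MB per-extraction figure is a placeholder that should be replaced with a measurement from your own workload.

```python
import os
from typing import Optional


def pick_max_concurrent(
    cpu_count: int,
    memory_budget_mb: Optional[int] = None,
    per_extraction_mb: int = 512,  # placeholder estimate, measure your own
) -> int:
    """Hypothetical heuristic for max_concurrent_extractions."""
    default = cpu_count * 2  # the documented default: num_cpus * 2
    if memory_budget_mb is None:
        return default
    # Cap concurrency so the estimated peak memory stays within the budget.
    memory_cap = max(1, memory_budget_mb // per_extraction_mb)
    return min(default, memory_cap)


cpus = os.cpu_count() or 1
print(pick_max_concurrent(cpus))
print(pick_max_concurrent(cpus, memory_budget_mb=2048))
```

The result can then be passed as max_concurrent_extractions in ExtractionConfig.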

Token Reduction:

  • Use "light" or "moderate" modes for minimal quality impact
  • "aggressive" and "maximum" modes may affect semantic meaning
  • Benchmark with your specific LLM to measure quality vs. cost tradeoff

Security Considerations

API Keys and Secrets:

  • Never commit config files containing API keys or passwords to version control
  • Use environment variables for sensitive data:
    Terminal
    export KREUZBERG_OCR_API_KEY="your-key-here"
    
  • Add kreuzberg.toml to .gitignore if it contains secrets
  • Use separate config files for development vs. production
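On the application side, reading a secret from the environment is best done in a way that fails early when the variable is missing. The sketch below assumes the variable name from the export example above. `require_secret` is a hypothetical helper, and whether Kreuzberg itself reads this variable is not stated here.

```python
import os
from typing import Mapping


def require_secret(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the value of an environment variable or fail loudly."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value


# Example with an explicit mapping so the snippet runs without a real key.
api_key = require_secret(
    "KREUZBERG_OCR_API_KEY",
    env={"KREUZBERG_OCR_API_KEY": "your-key-here"},
)
print(f"key loaded ({len(api_key)} characters)")
```

Printing only the length, never the value, keeps the secret out of logs.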

PDF Passwords:

  • The passwords field lists candidate passwords, which are tried in order until one succeeds
  • Passwords are not logged or cached
  • Use environment variables for sensitive passwords:
    secure_config.py
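    import os
    from kreuzberg import PdfConfig
    # os.environ[...] raises KeyError if PDF_PASSWORD is unset, instead of passing None
    config = PdfConfig(passwords=[os.environ["PDF_PASSWORD"]])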
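
The "try in order until one succeeds" behaviour, together with collecting several candidate passwords from the environment, can be sketched as follows. Both helpers and the extra variable names are hypothetical, and the `try_open` callback stands in for whatever actually opens the PDF. Kreuzberg performs this loop internally when given a passwords list.

```python
from typing import Callable, Mapping, Optional, Sequence


def passwords_from_env(names: Sequence[str], env: Mapping[str, str]) -> list[str]:
    """Collect the passwords that are actually set, preserving order."""
    return [env[name] for name in names if env.get(name)]


def first_working_password(
    passwords: Sequence[str],
    try_open: Callable[[str], bool],
) -> Optional[str]:
    """Try each password in order and return the first that succeeds."""
    for password in passwords:
        if try_open(password):
            return password
    return None


# Stand-in environment and "PDF" for demonstration purposes only.
env = {"PDF_PASSWORD": "hunter2", "PDF_PASSWORD_OLD": "letmein"}
candidates = passwords_from_env(
    ["PDF_PASSWORD", "PDF_PASSWORD_MISSING", "PDF_PASSWORD_OLD"], env
)
match = first_working_password(candidates, lambda pw: pw == "letmein")
print(candidates, match)
```

Unset variables are skipped, so the list passed to the passwords field never contains empty or missing entries.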
    

File System Access:

  • Kreuzberg only reads files you explicitly pass to extraction functions
  • Cache directory permissions should be restricted to the running user
  • Temporary files are automatically cleaned up after extraction
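Restricting the cache directory to the running user can be done with standard library calls on POSIX systems. This is a minimal sketch: `restrict_to_owner` is a hypothetical helper, and it is demonstrated on a temporary directory rather than the real cache location. Permission bits behave differently on Windows.

```python
import os
import stat
import tempfile
from pathlib import Path


def restrict_to_owner(directory: Path) -> int:
    """Create `directory` if needed and limit access to the owning user."""
    directory.mkdir(parents=True, exist_ok=True)
    os.chmod(directory, stat.S_IRWXU)  # 0o700: rwx for the owner only
    return stat.S_IMODE(directory.stat().st_mode)


# Demonstrated on a temporary directory; in practice the target would be
# the cache location, e.g. Path.home() / ".cache" / "kreuzberg".
with tempfile.TemporaryDirectory() as tmp:
    mode = restrict_to_owner(Path(tmp) / "kreuzberg")
    print(oct(mode))
```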

Data Privacy:

  • Extraction results are never sent to external services, except to OCR backends you explicitly configure
  • Tesseract OCR runs locally with no network access
  • EasyOCR and PaddleOCR may download models on first run (cached locally)
  • Consider disabling cache for sensitive documents requiring ephemeral processing