Configuration Reference¶

This page provides a comprehensive reference for all Kreuzberg configuration types. For usage guides and examples, see the Configuration Guide.

Overview¶

Kreuzberg supports multiple configuration methods:

TOML files - Preferred format, clear syntax
YAML files - Alternative format
JSON files - For programmatic generation
Programmatic - Direct object instantiation

Configuration Discovery¶

Kreuzberg automatically discovers configuration files in this order:

Current directory: ./kreuzberg.{toml,yaml,yml,json}
User config: ~/.config/kreuzberg/config.{toml,yaml,yml,json}
System config: /etc/kreuzberg/config.{toml,yaml,yml,json}

For complete examples, see the examples directory.

ExtractionConfig¶

Main extraction configuration controlling all aspects of document processing.

Field	Type	Default	Description
`use_cache`	`bool`	`true`	Enable caching of extraction results for faster re-processing
`enable_quality_processing`	`bool`	`true`	Enable quality post-processing (deduplication, mojibake fixing, etc.)
`force_ocr`	`bool`	`false`	Force OCR even for searchable PDFs with text layers
`ocr`	`OcrConfig?`	`None`	OCR configuration (if None, OCR disabled)
`pdf_options`	`PdfConfig?`	`None`	PDF-specific configuration options
`images`	`ImageExtractionConfig?`	`None`	Image extraction configuration
`chunking`	`ChunkingConfig?`	`None`	Text chunking configuration for splitting into chunks
`token_reduction`	`TokenReductionConfig?`	`None`	Token reduction configuration for optimizing LLM context
`language_detection`	`LanguageDetectionConfig?`	`None`	Automatic language detection configuration
`postprocessor`	`PostProcessorConfig?`	`None`	Post-processing pipeline configuration
`max_concurrent_extractions`	`int?`	`None`	Maximum concurrent batch extractions (defaults to num_cpus * 2)

Example¶

PythonTypeScriptRust

from kreuzberg import extract_file, ExtractionConfig

config = ExtractionConfig(
    use_cache=True,
    enable_quality_processing=True,
    force_ocr=False
)

result = extract_file("document.pdf", config=config)

import { extractFile, ExtractionConfig } from '@kreuzberg/sdk';

const config: ExtractionConfig = {
  useCache: true,
  enableQualityProcessing: true,
  forceOcr: false
};

const result = await extractFile('document.pdf', { config });

use kreuzberg::{extract_file, ExtractionConfig};

let config = ExtractionConfig {
    use_cache: true,
    enable_quality_processing: true,
    force_ocr: false,
    ..Default::default()
};

let result = extract_file("document.pdf", None, &config).await?;

OcrConfig¶

Configuration for OCR (Optical Character Recognition) processing on images and scanned PDFs.

Field	Type	Default	Description
`backend`	`str`	`"tesseract"`	OCR backend to use: `"tesseract"`, `"easyocr"`, `"paddleocr"`
`language`	`str`	`"eng"`	Language code(s) for OCR, e.g., `"eng"`, `"eng+fra"`, `"eng+deu+fra"`
`tesseract_config`	`TesseractConfig?`	`None`	Tesseract-specific configuration options

Example¶

PythonTypeScriptRust

from kreuzberg import ExtractionConfig, OcrConfig, TesseractConfig

config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        language="eng+fra",
        tesseract_config=TesseractConfig(
            psm=3,
            oem=1,
            min_confidence=0.8
        )
    )
)

import { ExtractionConfig, OcrConfig, TesseractConfig } from '@kreuzberg/sdk';

const config: ExtractionConfig = {
  ocr: {
    backend: 'tesseract',
    language: 'eng+fra',
    tesseractConfig: {
      psm: 3,
      oem: 1,
      minConfidence: 0.8
    }
  }
};

use kreuzberg::{ExtractionConfig, OcrConfig, TesseractConfig};

let config = ExtractionConfig {
    ocr: Some(OcrConfig {
        backend: "tesseract".to_string(),
        language: "eng+fra".to_string(),
        tesseract_config: Some(TesseractConfig {
            psm: 3,
            oem: 1,
            min_confidence: 0.8,
            ..Default::default()
        }),
    }),
    ..Default::default()
};

TesseractConfig¶

Tesseract OCR engine configuration with fine-grained control over recognition parameters.

Field	Type	Default	Description
`language`	`str`	`"eng"`	Language code(s), e.g., `"eng"`, `"eng+fra"`
`psm`	`int`	`3`	Page Segmentation Mode (0-13, see below)
`output_format`	`str`	`"markdown"`	Output format: `"text"`, `"markdown"`, `"hocr"`
`oem`	`int`	`3`	OCR Engine Mode (0-3, see below)
`min_confidence`	`float`	`0.0`	Minimum confidence threshold (0.0-100.0)
`preprocessing`	`ImagePreprocessingConfig?`	`None`	Image preprocessing configuration
`enable_table_detection`	`bool`	`true`	Enable automatic table detection and reconstruction
`table_min_confidence`	`float`	`0.0`	Minimum confidence for table cell recognition (0.0-1.0)
`table_column_threshold`	`int`	`50`	Pixel threshold for detecting table columns
`table_row_threshold_ratio`	`float`	`0.5`	Row threshold ratio for table detection (0.0-1.0)
`use_cache`	`bool`	`true`	Enable OCR result caching for faster re-processing
`classify_use_pre_adapted_templates`	`bool`	`true`	Use pre-adapted templates for character classification
`language_model_ngram_on`	`bool`	`false`	Enable N-gram language model for better word recognition
`tessedit_dont_blkrej_good_wds`	`bool`	`true`	Don't reject good words during block-level processing
`tessedit_dont_rowrej_good_wds`	`bool`	`true`	Don't reject good words during row-level processing
`tessedit_enable_dict_correction`	`bool`	`true`	Enable dictionary-based word correction
`tessedit_char_whitelist`	`str`	`""`	Allowed characters (empty = all allowed)
`tessedit_char_blacklist`	`str`	`""`	Forbidden characters (empty = none forbidden)
`tessedit_use_primary_params_model`	`bool`	`true`	Use primary language params model
`textord_space_size_is_variable`	`bool`	`true`	Enable variable-width space detection
`thresholding_method`	`bool`	`false`	Use adaptive thresholding method

Page Segmentation Modes (PSM)¶

0: Orientation and script detection only (no OCR)
1: Automatic page segmentation with OSD (Orientation and Script Detection)
2: Automatic page segmentation (no OSD, no OCR)
3: Fully automatic page segmentation (default, best for most documents)
4: Single column of text of variable sizes
5: Single uniform block of vertically aligned text
6: Single uniform block of text (best for clean documents)
7: Single text line
8: Single word
9: Single word in a circle
10: Single character
11: Sparse text with no particular order (best for forms, invoices)
12: Sparse text with OSD
13: Raw line (bypass Tesseract's layout analysis)

OCR Engine Modes (OEM)¶

0: Legacy Tesseract engine only (pre-2016)
1: Neural nets LSTM engine only (recommended for best quality)
2: Legacy + LSTM engines combined
3: Default based on what's available (recommended for compatibility)

Example¶

PythonTypeScriptRust

from kreuzberg import ExtractionConfig, OcrConfig, TesseractConfig

config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        language="eng+fra+deu",
        tesseract_config=TesseractConfig(
            psm=6,
            oem=1,
            min_confidence=0.8,
            enable_table_detection=True,
            tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?-"
        )
    )
)

import { ExtractionConfig, OcrConfig, TesseractConfig } from '@kreuzberg/sdk';

const config: ExtractionConfig = {
  ocr: {
    backend: 'tesseract',
    language: 'eng+fra+deu',
    tesseractConfig: {
      psm: 6,
      oem: 1,
      minConfidence: 0.8,
      enableTableDetection: true,
      tesseditCharWhitelist: 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?-'
    }
  }
};

use kreuzberg::{ExtractionConfig, OcrConfig, TesseractConfig};

let config = ExtractionConfig {
    ocr: Some(OcrConfig {
        backend: "tesseract".to_string(),
        language: "eng+fra+deu".to_string(),
        tesseract_config: Some(TesseractConfig {
            psm: 6,
            oem: 1,
            min_confidence: 0.8,
            enable_table_detection: true,
            tessedit_char_whitelist: "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?-".to_string(),
            ..Default::default()
        }),
    }),
    ..Default::default()
};

ChunkingConfig¶

Configuration for splitting extracted text into overlapping chunks, useful for vector databases and LLM processing.

Field	Type	Default	Description
`max_chars`	`int`	`1000`	Maximum characters per chunk
`max_overlap`	`int`	`200`	Overlap between consecutive chunks in characters
`embedding`	`EmbeddingConfig?`	`None`	Optional embedding generation for each chunk
`preset`	`str?`	`None`	Chunking preset: `"small"` (500/100), `"medium"` (1000/200), `"large"` (2000/400)

Example¶

PythonTypeScriptRust

from kreuzberg import ExtractionConfig, ChunkingConfig

config = ExtractionConfig(
    chunking=ChunkingConfig(
        max_chars=1000,
        max_overlap=200
    )
)

import { ExtractionConfig, ChunkingConfig } from '@kreuzberg/sdk';

const config: ExtractionConfig = {
  chunking: {
    maxChars: 1000,
    maxOverlap: 200
  }
};

use kreuzberg::{ExtractionConfig, ChunkingConfig};

let config = ExtractionConfig {
    chunking: Some(ChunkingConfig {
        max_chars: 1000,
        max_overlap: 200,
        ..Default::default()
    }),
    ..Default::default()
};

LanguageDetectionConfig¶

Configuration for automatic language detection in extracted text.

Field	Type	Default	Description
`enabled`	`bool`	`true`	Enable language detection
`min_confidence`	`float`	`0.8`	Minimum confidence threshold (0.0-1.0) for reporting detected languages
`detect_multiple`	`bool`	`false`	Detect multiple languages (vs. dominant language only)

Example¶

PythonTypeScriptRust

from kreuzberg import ExtractionConfig, LanguageDetectionConfig

config = ExtractionConfig(
    language_detection=LanguageDetectionConfig(
        enabled=True,
        min_confidence=0.9,
        detect_multiple=True
    )
)

import { ExtractionConfig, LanguageDetectionConfig } from '@kreuzberg/sdk';

const config: ExtractionConfig = {
  languageDetection: {
    enabled: true,
    minConfidence: 0.9,
    detectMultiple: true
  }
};

use kreuzberg::{ExtractionConfig, LanguageDetectionConfig};

let config = ExtractionConfig {
    language_detection: Some(LanguageDetectionConfig {
        enabled: true,
        min_confidence: 0.9,
        detect_multiple: true,
    }),
    ..Default::default()
};

PdfConfig¶

PDF-specific extraction configuration.

Field	Type	Default	Description
`extract_images`	`bool`	`false`	Extract embedded images from PDF pages
`extract_metadata`	`bool`	`true`	Extract PDF metadata (title, author, creation date, etc.)
`passwords`	`list[str]?`	`None`	List of passwords to try for encrypted PDFs (tries in order)

Example¶

PythonTypeScriptRust

from kreuzberg import ExtractionConfig, PdfConfig

config = ExtractionConfig(
    pdf_options=PdfConfig(
        extract_images=True,
        extract_metadata=True,
        passwords=["password1", "password2", "secret123"]
    )
)

import { ExtractionConfig, PdfConfig } from '@kreuzberg/sdk';

const config: ExtractionConfig = {
  pdfOptions: {
    extractImages: true,
    extractMetadata: true,
    passwords: ['password1', 'password2', 'secret123']
  }
};

use kreuzberg::{ExtractionConfig, PdfConfig};

let config = ExtractionConfig {
    pdf_options: Some(PdfConfig {
        extract_images: true,
        extract_metadata: true,
        passwords: Some(vec![
            "password1".to_string(),
            "password2".to_string(),
            "secret123".to_string()
        ]),
    }),
    ..Default::default()
};

ImageExtractionConfig¶

Configuration for extracting and processing images from documents.

Field	Type	Default	Description
`extract_images`	`bool`	`true`	Extract images from documents
`target_dpi`	`int`	`300`	Target DPI for extracted/normalized images
`max_image_dimension`	`int`	`4096`	Maximum image dimension (width or height) in pixels
`auto_adjust_dpi`	`bool`	`true`	Automatically adjust DPI based on image size and content
`min_dpi`	`int`	`72`	Minimum DPI when auto-adjusting
`max_dpi`	`int`	`600`	Maximum DPI when auto-adjusting

Example¶

PythonTypeScriptRust

from kreuzberg import ExtractionConfig, ImageExtractionConfig

config = ExtractionConfig(
    images=ImageExtractionConfig(
        extract_images=True,
        target_dpi=200,
        max_image_dimension=2048,
        auto_adjust_dpi=True,
        min_dpi=100,
        max_dpi=400
    )
)

import { ExtractionConfig, ImageExtractionConfig } from '@kreuzberg/sdk';

const config: ExtractionConfig = {
  images: {
    extractImages: true,
    targetDpi: 200,
    maxImageDimension: 2048,
    autoAdjustDpi: true,
    minDpi: 100,
    maxDpi: 400
  }
};

use kreuzberg::{ExtractionConfig, ImageExtractionConfig};

let config = ExtractionConfig {
    images: Some(ImageExtractionConfig {
        extract_images: true,
        target_dpi: 200,
        max_image_dimension: 2048,
        auto_adjust_dpi: true,
        min_dpi: 100,
        max_dpi: 400,
    }),
    ..Default::default()
};

ImagePreprocessingConfig¶

Image preprocessing configuration for improving OCR quality on scanned documents.

Field	Type	Default	Description
`target_dpi`	`int`	`300`	Target DPI for OCR processing (300 standard, 600 for small text)
`auto_rotate`	`bool`	`true`	Auto-detect and correct image rotation
`deskew`	`bool`	`true`	Correct skew (tilted images)
`denoise`	`bool`	`false`	Apply noise reduction filter
`contrast_enhance`	`bool`	`false`	Enhance image contrast for better text visibility
`binarization_method`	`str`	`"otsu"`	Binarization method: `"otsu"`, `"sauvola"`, `"adaptive"`, `"none"`
`invert_colors`	`bool`	`false`	Invert colors (useful for white text on black background)

Example¶

PythonTypeScriptRust

from kreuzberg import ExtractionConfig, OcrConfig, TesseractConfig, ImagePreprocessingConfig

config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        tesseract_config=TesseractConfig(
            preprocessing=ImagePreprocessingConfig(
                target_dpi=300,
                auto_rotate=True,
                deskew=True,
                denoise=True,
                contrast_enhance=True,
                binarization_method="otsu",
                invert_colors=False
            )
        )
    )
)

import { ExtractionConfig, OcrConfig, TesseractConfig, ImagePreprocessingConfig } from '@kreuzberg/sdk';

const config: ExtractionConfig = {
  ocr: {
    backend: 'tesseract',
    tesseractConfig: {
      preprocessing: {
        targetDpi: 300,
        autoRotate: true,
        deskew: true,
        denoise: true,
        contrastEnhance: true,
        binarizationMethod: 'otsu',
        invertColors: false
      }
    }
  }
};

use kreuzberg::{ExtractionConfig, OcrConfig, TesseractConfig, ImagePreprocessingConfig};

let config = ExtractionConfig {
    ocr: Some(OcrConfig {
        backend: "tesseract".to_string(),
        tesseract_config: Some(TesseractConfig {
            preprocessing: Some(ImagePreprocessingConfig {
                target_dpi: 300,
                auto_rotate: true,
                deskew: true,
                denoise: true,
                contrast_enhance: true,
                binarization_method: "otsu".to_string(),
                invert_colors: false,
            }),
            ..Default::default()
        }),
        ..Default::default()
    }),
    ..Default::default()
};

PostProcessorConfig¶

Configuration for the post-processing pipeline that runs after extraction.

Field	Type	Default	Description
`enabled`	`bool`	`true`	Enable post-processing pipeline
`enabled_processors`	`list[str]?`	`None`	Specific processors to enable (if None, all enabled by default)
`disabled_processors`	`list[str]?`	`None`	Specific processors to disable (takes precedence over enabled_processors)

Built-in post-processors include:

deduplication - Remove duplicate text blocks
whitespace_normalization - Normalize whitespace and line breaks
mojibake_fix - Fix mojibake (encoding corruption)
quality_scoring - Score and filter low-quality text

Example¶

PythonTypeScriptRust

from kreuzberg import ExtractionConfig, PostProcessorConfig

config = ExtractionConfig(
    postprocessor=PostProcessorConfig(
        enabled=True,
        enabled_processors=["deduplication", "whitespace_normalization"],
        disabled_processors=["mojibake_fix"]
    )
)

import { ExtractionConfig, PostProcessorConfig } from '@kreuzberg/sdk';

const config: ExtractionConfig = {
  postprocessor: {
    enabled: true,
    enabledProcessors: ['deduplication', 'whitespace_normalization'],
    disabledProcessors: ['mojibake_fix']
  }
};

use kreuzberg::{ExtractionConfig, PostProcessorConfig};

let config = ExtractionConfig {
    postprocessor: Some(PostProcessorConfig {
        enabled: true,
        enabled_processors: Some(vec![
            "deduplication".to_string(),
            "whitespace_normalization".to_string()
        ]),
        disabled_processors: Some(vec!["mojibake_fix".to_string()]),
    }),
    ..Default::default()
};

TokenReductionConfig¶

Configuration for reducing token count in extracted text, useful for optimizing LLM context windows.

Field	Type	Default	Description
`mode`	`str`	`"off"`	Reduction mode: `"off"`, `"light"`, `"moderate"`, `"aggressive"`, `"maximum"`
`preserve_important_words`	`bool`	`true`	Preserve important words (capitalized, technical terms) during reduction

Reduction Modes¶

off: No token reduction
light: Remove redundant whitespace and line breaks (~5-10% reduction)
moderate: Light + remove stopwords in low-information contexts (~15-25% reduction)
aggressive: Moderate + abbreviate common phrases (~30-40% reduction)
maximum: Aggressive + remove all stopwords (~50-60% reduction, may impact quality)

Example¶

PythonTypeScriptRust

from kreuzberg import ExtractionConfig, TokenReductionConfig

config = ExtractionConfig(
    token_reduction=TokenReductionConfig(
        mode="moderate",
        preserve_important_words=True
    )
)

import { ExtractionConfig, TokenReductionConfig } from '@kreuzberg/sdk';

const config: ExtractionConfig = {
  tokenReduction: {
    mode: 'moderate',
    preserveImportantWords: true
  }
};

use kreuzberg::{ExtractionConfig, TokenReductionConfig};

let config = ExtractionConfig {
    token_reduction: Some(TokenReductionConfig {
        mode: "moderate".to_string(),
        preserve_important_words: true,
    }),
    ..Default::default()
};

Configuration File Examples¶

TOML Format¶

# kreuzberg.toml
use_cache = true
enable_quality_processing = true
force_ocr = false

[ocr]
backend = "tesseract"
language = "eng+fra"

[ocr.tesseract_config]
psm = 6
oem = 1
min_confidence = 0.8
enable_table_detection = true

[ocr.tesseract_config.preprocessing]
target_dpi = 300
denoise = true
deskew = true
contrast_enhance = true
binarization_method = "otsu"

[pdf_options]
extract_images = true
extract_metadata = true
passwords = ["password1", "password2"]

[images]
extract_images = true
target_dpi = 200
max_image_dimension = 4096

[chunking]
max_chars = 1000
max_overlap = 200

[language_detection]
enabled = true
min_confidence = 0.8
detect_multiple = false

[token_reduction]
mode = "moderate"
preserve_important_words = true

[postprocessor]
enabled = true

YAML Format¶

# kreuzberg.yaml
use_cache: true
enable_quality_processing: true
force_ocr: false

ocr:
  backend: tesseract
  language: eng+fra
  tesseract_config:
    psm: 6
    oem: 1
    min_confidence: 0.8
    enable_table_detection: true
    preprocessing:
      target_dpi: 300
      denoise: true
      deskew: true
      contrast_enhance: true
      binarization_method: otsu

pdf_options:
  extract_images: true
  extract_metadata: true
  passwords:
    - password1
    - password2

images:
  extract_images: true
  target_dpi: 200
  max_image_dimension: 4096

chunking:
  max_chars: 1000
  max_overlap: 200

language_detection:
  enabled: true
  min_confidence: 0.8
  detect_multiple: false

token_reduction:
  mode: moderate
  preserve_important_words: true

postprocessor:
  enabled: true

JSON Format¶

{
  "use_cache": true,
  "enable_quality_processing": true,
  "force_ocr": false,
  "ocr": {
    "backend": "tesseract",
    "language": "eng+fra",
    "tesseract_config": {
      "psm": 6,
      "oem": 1,
      "min_confidence": 0.8,
      "enable_table_detection": true,
      "preprocessing": {
        "target_dpi": 300,
        "denoise": true,
        "deskew": true,
        "contrast_enhance": true,
        "binarization_method": "otsu"
      }
    }
  },
  "pdf_options": {
    "extract_images": true,
    "extract_metadata": true,
    "passwords": ["password1", "password2"]
  },
  "images": {
    "extract_images": true,
    "target_dpi": 200,
    "max_image_dimension": 4096
  },
  "chunking": {
    "max_chars": 1000,
    "max_overlap": 200
  },
  "language_detection": {
    "enabled": true,
    "min_confidence": 0.8,
    "detect_multiple": false
  },
  "token_reduction": {
    "mode": "moderate",
    "preserve_important_words": true
  },
  "postprocessor": {
    "enabled": true
  }
}

For complete working examples, see the examples directory.

Best Practices¶

When to Use Config Files vs Programmatic Config¶

Use config files when:

Settings are shared across multiple scripts/applications
Configuration needs to be version controlled
Non-developers need to modify settings
Deploying to multiple environments (dev/staging/prod)

Use programmatic config when:

Settings vary per execution or are computed dynamically
Configuration depends on runtime conditions
Building SDKs or libraries that wrap Kreuzberg
Rapid prototyping and experimentation

Performance Considerations¶

Caching:

Keep use_cache=true for repeated processing of the same files
Cache is automatically invalidated when files change
Cache location: ~/.cache/kreuzberg/ (configurable via environment)

OCR Settings:

Lower target_dpi (e.g., 150-200) for faster processing of low-quality scans
Higher target_dpi (e.g., 400-600) for small text or high-quality documents
Disable enable_table_detection if tables aren't needed (10-20% speedup)
Use psm=6 for clean single-column documents (faster than psm=3)

Batch Processing:

Set max_concurrent_extractions to balance speed and memory usage
Default (num_cpus * 2) works well for most systems
Reduce for memory-constrained environments
Increase for I/O-bound workloads on systems with fast storage

Token Reduction:

Use "light" or "moderate" modes for minimal quality impact
"aggressive" and "maximum" modes may affect semantic meaning
Benchmark with your specific LLM to measure quality vs. cost tradeoff

Security Considerations¶

API Keys and Secrets:

Never commit config files containing API keys or passwords to version control

Use environment variables for sensitive data:

export KREUZBERG_OCR_API_KEY="your-key-here"

Add kreuzberg.toml to .gitignore if it contains secrets
Use separate config files for development vs. production

PDF Passwords:

passwords field attempts passwords in order until one succeeds
Passwords are not logged or cached

Use environment variables for sensitive passwords:

import os
config = PdfConfig(passwords=[os.getenv("PDF_PASSWORD")])

File System Access:

Kreuzberg only reads files you explicitly pass to extraction functions
Cache directory permissions should be restricted to the running user
Temporary files are automatically cleaned up after extraction

Data Privacy:

Extraction results are never sent to external services (except explicit OCR backends)
Tesseract OCR runs locally with no network access
EasyOCR and PaddleOCR may download models on first run (cached locally)
Consider disabling cache for sensitive documents requiring ephemeral processing

Configuration Guide - Usage guide with examples
OCR Guide - OCR-specific configuration and troubleshooting
Examples Directory - Complete working examples

Configuration Reference¶

Overview¶

Configuration Discovery¶

ExtractionConfig¶

Example¶

OcrConfig¶

Example¶

TesseractConfig¶

Page Segmentation Modes (PSM)¶

OCR Engine Modes (OEM)¶

Example¶

ChunkingConfig¶

Example¶

LanguageDetectionConfig¶

Example¶

PdfConfig¶

Example¶

ImageExtractionConfig¶

Example¶

ImagePreprocessingConfig¶

Example¶

PostProcessorConfig¶

Example¶

TokenReductionConfig¶

Reduction Modes¶

Example¶

Configuration File Examples¶

TOML Format¶

YAML Format¶

JSON Format¶

Best Practices¶

When to Use Config Files vs Programmatic Config¶

Performance Considerations¶

Security Considerations¶

Related Documentation¶