Configuration

    graph TD
        ExtractionConfig[ExtractionConfig<br/>Main Configuration]
        ExtractionConfig --> OCR[OcrConfig<br/>OCR Backend Settings]
        ExtractionConfig --> PDF[PdfConfig<br/>PDF Options]
        ExtractionConfig --> Images[ImageExtractionConfig<br/>Image Settings]
        ExtractionConfig --> Chunking[ChunkingConfig<br/>Text Chunking]
        ExtractionConfig --> TokenRed[TokenReductionConfig<br/>Token Optimization]
        ExtractionConfig --> LangDet[LanguageDetectionConfig<br/>Language Detection]
        ExtractionConfig --> PostProc[PostProcessorConfig<br/>Post-Processing]
        OCR --> Tesseract[TesseractConfig<br/>Tesseract Options]
        Tesseract --> ImgPreproc[ImagePreprocessingConfig<br/>Image Enhancement]
        Chunking --> Embedding[EmbeddingConfig<br/>Vector Embeddings]
        Embedding --> Model[EmbeddingModelType<br/>Model Selection]
        style ExtractionConfig fill:#4CAF50,color:#fff
        style OCR fill:#87CEEB
        style Chunking fill:#FFD700
        style Embedding fill:#FFB6C1

Kreuzberg's behavior is controlled through configuration objects. All settings are optional with sensible defaults, allowing you to configure only what you need.
Configuration Methods
Kreuzberg supports four ways to configure extraction: programmatically in code, or through a TOML, YAML, or JSON configuration file.
Configuration Discovery

    flowchart TD
        Start[ExtractionConfig.discover] --> Current{Check Current Directory}
        Current -->|Found| LoadCurrent[Load ./kreuzberg.*]
        Current -->|Not Found| User{Check User Config}
        User -->|Found| LoadUser[Load ~/.config/kreuzberg/config.*]
        User -->|Not Found| System{Check System Config}
        System -->|Found| LoadSystem[Load /etc/kreuzberg/config.*]
        System -->|Not Found| Default[Use Default Config]
        LoadCurrent --> Merge[Merge with Defaults]
        LoadUser --> Merge
        LoadSystem --> Merge
        Default --> Return[Return Config]
        Merge --> Return
        style LoadCurrent fill:#90EE90
        style LoadUser fill:#87CEEB
        style LoadSystem fill:#FFD700
        style Default fill:#FFB6C1

Kreuzberg automatically discovers configuration files in the following locations (in order):
- Current directory: ./kreuzberg.{toml,yaml,yml,json}
- User config: ~/.config/kreuzberg/config.{toml,yaml,yml,json}
- System config: /etc/kreuzberg/config.{toml,yaml,yml,json}
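
The search order above can be sketched in a few lines of Python. This only illustrates the lookup order and is not Kreuzberg's implementation: `find_config_file` is a hypothetical helper, and trying the extensions in the listed order within one directory is an assumption.

```python
from pathlib import Path
from typing import Optional

# Extensions named in the discovery locations above.
EXTENSIONS = ("toml", "yaml", "yml", "json")


def find_config_file(cwd: Path, home: Path, system_dir: Path) -> Optional[Path]:
    """Return the first existing config file in discovery order, or None.

    Hypothetical helper for illustration; not part of Kreuzberg's API.
    """
    candidates = [
        (cwd, "kreuzberg"),                          # ./kreuzberg.*
        (home / ".config" / "kreuzberg", "config"),  # ~/.config/kreuzberg/config.*
        (system_dir, "config"),                      # /etc/kreuzberg/config.*
    ]
    for directory, stem in candidates:
        for ext in EXTENSIONS:  # assumed precedence within one directory
            path = directory / f"{stem}.{ext}"
            if path.is_file():
                return path
    return None  # the caller would fall back to the default configuration
```

A call such as `find_config_file(Path.cwd(), Path.home(), Path("/etc/kreuzberg"))` mirrors the three locations listed above.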
ExtractionConfig
The main configuration object controlling extraction behavior.
| Field | Type | Default | Description |
|---|---|---|---|
| use_cache | bool | true | Enable caching of extraction results |
| enable_quality_processing | bool | true | Enable quality post-processing |
| force_ocr | bool | false | Force OCR even for text-based PDFs |
| ocr | OcrConfig? | None | OCR configuration (if None, OCR disabled) |
| pdf_options | PdfConfig? | None | PDF-specific configuration |
| images | ImageExtractionConfig? | None | Image extraction configuration |
| chunking | ChunkingConfig? | None | Text chunking configuration |
| token_reduction | TokenReductionConfig? | None | Token reduction configuration |
| language_detection | LanguageDetectionConfig? | None | Language detection configuration |
| keywords | KeywordConfig? | None | Keyword extraction configuration (requires keywords-yake or keywords-rake feature flag) |
| postprocessor | PostProcessorConfig? | None | Post-processing pipeline configuration |
Basic Example
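
Because every field is optional, a configuration only needs to name the values that differ from the defaults. The sketch below models that idea with plain dictionaries so it runs without Kreuzberg installed. The default values are copied from the table above, while `with_defaults` is a hypothetical helper and not a Kreuzberg function.

```python
# Top-level defaults copied from the ExtractionConfig table.
DEFAULTS = {
    "use_cache": True,
    "enable_quality_processing": True,
    "force_ocr": False,
    "ocr": None,
    "pdf_options": None,
    "images": None,
    "chunking": None,
    "token_reduction": None,
    "language_detection": None,
    "keywords": None,
    "postprocessor": None,
}


def with_defaults(overrides: dict) -> dict:
    """Overlay user-supplied fields on the defaults (hypothetical helper)."""
    unknown = sorted(set(overrides) - set(DEFAULTS))
    if unknown:
        raise KeyError(f"unknown ExtractionConfig field(s): {', '.join(unknown)}")
    return {**DEFAULTS, **overrides}


# Only two fields are set; everything else keeps its default.
settings = with_defaults({"force_ocr": True, "ocr": {"backend": "tesseract"}})
```

Rejecting unknown keys is a design choice of this sketch; it catches typos such as `cache` instead of `use_cache` early.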
OcrConfig
Configuration for OCR processing. Set this object to enable OCR on images and scanned PDFs; when it is left unset, OCR is disabled.
| Field | Type | Default | Description |
|---|---|---|---|
| backend | str | "tesseract" | OCR backend: "tesseract", "easyocr", "paddleocr" |
| language | str | "eng" | Language code(s), e.g., "eng", "eng+fra" |
| tesseract_config | TesseractConfig? | None | Tesseract-specific configuration |
Example
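
The `language` field combines several codes with `+`. As a small illustration of that convention, the following sketch validates a backend name against the values in the table and splits a combined language string. Both functions are hypothetical helpers written for this page, not Kreuzberg APIs.

```python
# Backend names taken from the OcrConfig table above.
SUPPORTED_BACKENDS = ("tesseract", "easyocr", "paddleocr")


def split_languages(language: str) -> list[str]:
    """Split a combined code such as "eng+fra" into its parts."""
    codes = [code.strip() for code in language.split("+")]
    if not all(codes):
        raise ValueError(f"empty language code in {language!r}")
    return codes


def check_ocr_settings(backend: str = "tesseract", language: str = "eng") -> dict:
    """Validate the two OcrConfig string fields (illustrative only)."""
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported OCR backend: {backend!r}")
    return {"backend": backend, "languages": split_languages(language)}
```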
TesseractConfig
Tesseract OCR engine configuration.
| Field | Type | Default | Description |
|---|---|---|---|
| language | str | "eng" | Language code(s), e.g., "eng", "eng+fra" |
| psm | int | 3 | Page segmentation mode (0-13) |
| output_format | str | "text" | Output format: "text", "hocr" |
| oem | int | 3 | OCR engine mode (0-3) |
| min_confidence | float | 0.0 | Minimum confidence threshold (0.0-1.0) |
| preprocessing | ImagePreprocessingConfig? | None | Image preprocessing configuration |
| enable_table_detection | bool | false | Enable table detection and extraction |
| table_min_confidence | float | 0.5 | Minimum confidence for table cells |
| table_column_threshold | int | 50 | Pixel threshold for column detection |
| table_row_threshold_ratio | float | 0.5 | Row threshold ratio |
| use_cache | bool | true | Enable OCR result caching |
| classify_use_pre_adapted_templates | bool | false | Tesseract variable |
| language_model_ngram_on | bool | false | Tesseract variable |
| tessedit_dont_blkrej_good_wds | bool | false | Tesseract variable |
| tessedit_dont_rowrej_good_wds | bool | false | Tesseract variable |
| tessedit_enable_dict_correction | bool | false | Tesseract variable |
| tessedit_char_whitelist | str | "" | Allowed characters |
| tessedit_char_blacklist | str | "" | Disallowed characters |
| tessedit_use_primary_params_model | bool | false | Tesseract variable |
| textord_space_size_is_variable | bool | false | Tesseract variable |
| thresholding_method | bool | false | Tesseract variable |
Page Segmentation Modes (PSM)
- 0: Orientation and script detection only
- 1: Automatic page segmentation with OSD
- 2: Automatic page segmentation (no OSD, no OCR)
- 3: Fully automatic page segmentation (default)
- 4: Single column of text
- 5: Single uniform block of vertically aligned text
- 6: Single uniform block of text
- 7: Single text line
- 8: Single word
- 9: Single word in a circle
- 10: Single character
- 11: Sparse text, no particular order
- 12: Sparse text with OSD
- 13: Raw line (no assumptions about text layout)
OCR Engine Modes (OEM)
- 0: Legacy engine only
- 1: Neural nets LSTM engine only
- 2: Legacy + LSTM engines
- 3: Default based on what's available (default)
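
Since `psm` and `oem` are plain integers, it can help to give the values names and reject out-of-range numbers before building a configuration. The enums below are a sketch for that purpose; the member names are chosen for this example and are not defined by Kreuzberg.

```python
from enum import IntEnum


class PageSegMode(IntEnum):
    """Page segmentation modes 0-13, as listed above (names are illustrative)."""
    OSD_ONLY = 0
    AUTO_OSD = 1
    AUTO_ONLY = 2
    AUTO = 3
    SINGLE_COLUMN = 4
    SINGLE_BLOCK_VERT_TEXT = 5
    SINGLE_BLOCK = 6
    SINGLE_LINE = 7
    SINGLE_WORD = 8
    CIRCLE_WORD = 9
    SINGLE_CHAR = 10
    SPARSE_TEXT = 11
    SPARSE_TEXT_OSD = 12
    RAW_LINE = 13


class OcrEngineMode(IntEnum):
    """OCR engine modes 0-3, as listed above (names are illustrative)."""
    LEGACY_ONLY = 0
    LSTM_ONLY = 1
    LEGACY_AND_LSTM = 2
    DEFAULT = 3


def tesseract_modes(psm: int = PageSegMode.AUTO, oem: int = OcrEngineMode.DEFAULT) -> dict:
    """Return validated integer values; raises ValueError for unknown modes."""
    return {"psm": int(PageSegMode(psm)), "oem": int(OcrEngineMode(oem))}
```

The returned integers can be passed as the `psm` and `oem` fields of `TesseractConfig`.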
Example

Python:

    from kreuzberg import ExtractionConfig, OcrConfig, TesseractConfig

    config = ExtractionConfig(
        ocr=OcrConfig(
            language="eng+fra+deu",
            tesseract_config=TesseractConfig(
                psm=6,
                oem=1,
                min_confidence=0.8,
                tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?",
                enable_table_detection=True
            )
        )
    )

TypeScript:

    import { ExtractionConfig, OcrConfig, TesseractConfig } from '@kreuzberg/sdk';

    const config = new ExtractionConfig({
      ocr: new OcrConfig({
        language: 'eng+fra+deu',
        tesseractConfig: new TesseractConfig({
          psm: 6,
          oem: 1,
          minConfidence: 0.8,
          tesseditCharWhitelist: 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?',
          enableTableDetection: true
        })
      })
    });

Rust:

    use kreuzberg::{ExtractionConfig, OcrConfig, TesseractConfig};

    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            language: "eng+fra+deu".to_string(),
            tesseract_config: Some(TesseractConfig {
                psm: 6,
                oem: 1,
                min_confidence: 0.8,
                tessedit_char_whitelist: "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?".to_string(),
                enable_table_detection: true,
                ..Default::default()
            }),
            ..Default::default()
        }),
        ..Default::default()
    };

Ruby:

    require 'kreuzberg'

    config = Kreuzberg::ExtractionConfig.new(
      ocr: Kreuzberg::OcrConfig.new(
        language: 'eng+fra+deu',
        tesseract_config: Kreuzberg::TesseractConfig.new(
          psm: 6,
          oem: 1,
          min_confidence: 0.8,
          tessedit_char_whitelist: 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?',
          enable_table_detection: true
        )
      )
    )

ImagePreprocessingConfig
Image preprocessing configuration for OCR.
| Field | Type | Default | Description |
|---|---|---|---|
| target_dpi | int | 300 | Target DPI for OCR processing |
| auto_rotate | bool | true | Automatically rotate images based on orientation |
| deskew | bool | true | Apply deskewing to straighten tilted text |
| denoise | bool | true | Apply denoising filter |
| contrast_enhance | bool | true | Enhance image contrast |
| binarization_method | str | "otsu" | Binarization method: "otsu", "adaptive", "none" |
| invert_colors | bool | false | Invert image colors (useful for white-on-black text) |
Example

Python:

    from kreuzberg import ExtractionConfig, OcrConfig, TesseractConfig, ImagePreprocessingConfig

    config = ExtractionConfig(
        ocr=OcrConfig(
            tesseract_config=TesseractConfig(
                preprocessing=ImagePreprocessingConfig(
                    target_dpi=300,
                    denoise=True,
                    deskew=True,
                    contrast_enhance=True,
                    binarization_method="otsu"
                )
            )
        )
    )

TypeScript:

    import { ExtractionConfig, OcrConfig, TesseractConfig, ImagePreprocessingConfig } from '@kreuzberg/sdk';

    const config = new ExtractionConfig({
      ocr: new OcrConfig({
        tesseractConfig: new TesseractConfig({
          preprocessing: new ImagePreprocessingConfig({
            targetDpi: 300,
            denoise: true,
            deskew: true,
            contrastEnhance: true,
            binarizationMethod: 'otsu'
          })
        })
      })
    });

Rust:

    use kreuzberg::{ExtractionConfig, OcrConfig, TesseractConfig, ImagePreprocessingConfig};

    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            tesseract_config: Some(TesseractConfig {
                preprocessing: Some(ImagePreprocessingConfig {
                    target_dpi: 300,
                    denoise: true,
                    deskew: true,
                    contrast_enhance: true,
                    binarization_method: "otsu".to_string(),
                    ..Default::default()
                }),
                ..Default::default()
            }),
            ..Default::default()
        }),
        ..Default::default()
    };

Ruby:

    require 'kreuzberg'

    config = Kreuzberg::ExtractionConfig.new(
      ocr: Kreuzberg::OcrConfig.new(
        tesseract_config: Kreuzberg::TesseractConfig.new(
          preprocessing: Kreuzberg::ImagePreprocessingConfig.new(
            target_dpi: 300,
            denoise: true,
            deskew: true,
            contrast_enhance: true,
            binarization_method: 'otsu'
          )
        )
      )
    )

PdfConfig
PDF-specific extraction configuration.
| Field | Type | Default | Description |
|---|---|---|---|
| extract_images | bool | true | Extract embedded images from PDF |
| extract_metadata | bool | true | Extract PDF metadata (title, author, etc.) |
| passwords | list[str]? | None | List of passwords to try for encrypted PDFs |
Example
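
The `passwords` field is described as a list of passwords to try. One plausible reading is that candidates are attempted in list order until one opens the document. The sketch below shows that loop with a caller-supplied `try_open` callback standing in for a real PDF library; both names are hypothetical, and Kreuzberg's actual behavior may differ.

```python
from typing import Callable, Iterable, Optional


def first_working_password(
    passwords: Iterable[str],
    try_open: Callable[[str], bool],
) -> Optional[str]:
    """Return the first password accepted by try_open, or None.

    try_open is a stand-in for "attempt to decrypt the PDF with this
    password"; it is supplied by the caller in this sketch.
    """
    for password in passwords:
        if try_open(password):
            return password
    return None  # no candidate worked
```

Under this reading, placing the most likely password first avoids unnecessary decryption attempts.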
ImageExtractionConfig
Configuration for extracting images from documents.
| Field | Type | Default | Description |
|---|---|---|---|
| extract_images | bool | true | Extract images from documents |
| target_dpi | int | 300 | Target DPI for extracted images |
| max_image_dimension | int | 4096 | Maximum image dimension (width or height) in pixels |
| auto_adjust_dpi | bool | true | Automatically adjust DPI based on image size |
| min_dpi | int | 72 | Minimum DPI when auto-adjusting |
| max_dpi | int | 600 | Maximum DPI when auto-adjusting |
Example
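
The fields above interact: `target_dpi` is the preferred resolution, `max_image_dimension` caps the pixel size, and `min_dpi`/`max_dpi` bound any automatic adjustment. The following sketch shows one way such an adjustment could work for a page measured in PDF points (1 point = 1/72 inch). The formula is an assumption made for illustration, and `adjusted_dpi` is a hypothetical function, not Kreuzberg's algorithm.

```python
def adjusted_dpi(
    width_pt: float,
    height_pt: float,
    target_dpi: int = 300,
    max_image_dimension: int = 4096,
    min_dpi: int = 72,
    max_dpi: int = 600,
) -> int:
    """Pick a DPI so the longest rendered side fits max_image_dimension.

    Assumed behavior: start from target_dpi, lower it if the page would
    render larger than the pixel limit, then clamp to [min_dpi, max_dpi].
    """
    longest_inches = max(width_pt, height_pt) / 72.0
    if longest_inches <= 0:
        raise ValueError("page dimensions must be positive")
    # Largest whole DPI at which the longest side still fits the limit.
    fitting_dpi = int(max_image_dimension / longest_inches)
    dpi = min(target_dpi, fitting_dpi)
    # In this sketch min_dpi wins even if the image then exceeds the limit.
    return max(min_dpi, min(max_dpi, dpi))
```

With the defaults, a US Letter page (612 x 792 points) keeps the 300 DPI target, while a very large page is rendered at a lower DPI.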
ChunkingConfig
Text chunking configuration for splitting extracted text into chunks.
| Field | Type | Default | Description |
|---|---|---|---|
| max_chars | int | 1000 | Maximum chunk size in characters |
| max_overlap | int | 200 | Overlap between chunks in characters |
| embedding | EmbeddingConfig? | None | Embedding configuration for chunks |
| preset | str? | None | Chunking preset: "small", "medium", "large" |
Example
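
To make the relationship between `max_chars` and `max_overlap` concrete, here is a minimal fixed-window chunker. It is a simplified sketch: the library may well split on sentence or paragraph boundaries rather than at fixed offsets, and `chunk_text` is a hypothetical function written for this page.

```python
def chunk_text(text: str, max_chars: int = 1000, max_overlap: int = 200) -> list[str]:
    """Split text into windows of max_chars that share max_overlap characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= max_overlap < max_chars:
        raise ValueError("max_overlap must be in [0, max_chars)")

    step = max_chars - max_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks
```

With the defaults, each chunk after the first repeats the final 200 characters of the previous one, which helps keep context across chunk boundaries.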
EmbeddingConfig
Configuration for generating embeddings from extracted text or chunks.
| Field | Type | Default | Description |
|---|---|---|---|
| model | EmbeddingModelType | preset("all-MiniLM-L6-v2") | Embedding model configuration |
| normalize | bool | true | Normalize embeddings to unit length |
| batch_size | int | 32 | Batch size for embedding generation |
| show_download_progress | bool | true | Show download progress for models |
| cache_dir | str? | None | Custom cache directory for models |
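
Two of these fields describe generic operations that are easy to show in isolation: `batch_size` groups inputs before they are embedded, and `normalize` scales each vector to unit (L2) length. The helpers below are a standard-library sketch of both ideas; they are hypothetical and do not call any embedding model.

```python
import math
from typing import Iterator, Sequence


def batched(items: Sequence[str], batch_size: int = 32) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def l2_normalize(vector: Sequence[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]
```

Unit-length vectors make cosine similarity equal to a plain dot product, which is a common reason to leave normalization enabled.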
EmbeddingModelType
Create embedding models using these factory methods:
- EmbeddingModelType.preset(name): Use a preset model
  - "all-MiniLM-L6-v2": Fast, 384-dimensional embeddings (default)
  - "all-mpnet-base-v2": High quality, 768-dimensional embeddings
  - "paraphrase-multilingual-MiniLM-L12-v2": Multilingual support
- EmbeddingModelType.fastembed(model, dimensions): Use a FastEmbed model
  - Example: fastembed("BAAI/bge-small-en-v1.5", 384)
- EmbeddingModelType.custom(model_id, dimensions): Use a custom model
  - Example: custom("sentence-transformers/all-MiniLM-L6-v2", 384)
Example

Python:

    from kreuzberg import ExtractionConfig, ChunkingConfig, EmbeddingConfig, EmbeddingModelType

    config = ExtractionConfig(
        chunking=ChunkingConfig(
            max_chars=1000,
            embedding=EmbeddingConfig(
                model=EmbeddingModelType.preset("all-mpnet-base-v2"),
                batch_size=16,
                normalize=True,
                show_download_progress=True
            )
        )
    )

TypeScript:

    import { ExtractionConfig, ChunkingConfig, EmbeddingConfig, EmbeddingModelType } from '@kreuzberg/sdk';

    const config = new ExtractionConfig({
      chunking: new ChunkingConfig({
        maxChars: 1000,
        embedding: new EmbeddingConfig({
          model: EmbeddingModelType.preset('all-mpnet-base-v2'),
          batchSize: 16,
          normalize: true,
          showDownloadProgress: true
        })
      })
    });

Rust:

    use kreuzberg::{ExtractionConfig, ChunkingConfig, EmbeddingConfig, EmbeddingModelType};

    let config = ExtractionConfig {
        chunking: Some(ChunkingConfig {
            max_chars: 1000,
            embedding: Some(EmbeddingConfig {
                model: EmbeddingModelType::preset("all-mpnet-base-v2"),
                batch_size: 16,
                normalize: true,
                show_download_progress: true,
                ..Default::default()
            }),
            ..Default::default()
        }),
        ..Default::default()
    };

TokenReductionConfig
Configuration for reducing token count in extracted text.
| Field | Type | Default | Description |
|---|---|---|---|
| mode | str | "off" | Reduction mode: "off", "moderate", "aggressive" |
| preserve_important_words | bool | true | Preserve important words during reduction |
Example
LanguageDetectionConfig
Configuration for automatic language detection.
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable language detection |
| min_confidence | float | 0.8 | Minimum confidence threshold (0.0-1.0) |
| detect_multiple | bool | false | Detect multiple languages (vs. dominant only) |
Example
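
The interplay of `min_confidence` and `detect_multiple` can be pictured as a filter over scored candidates. In the sketch below, `scores` is a made-up mapping from language code to detector confidence, and `select_languages` is a hypothetical function; how Kreuzberg's detector actually produces and ranks candidates is not specified here.

```python
def select_languages(
    scores: dict[str, float],
    min_confidence: float = 0.8,
    detect_multiple: bool = False,
) -> list[str]:
    """Keep candidates at or above min_confidence, best first.

    With detect_multiple=False only the dominant language is returned.
    """
    ranked = sorted(
        (item for item in scores.items() if item[1] >= min_confidence),
        key=lambda item: item[1],
        reverse=True,
    )
    codes = [code for code, _ in ranked]
    return codes if detect_multiple else codes[:1]
```

An empty result means no candidate reached the threshold, so lowering `min_confidence` trades precision for coverage.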
PostProcessorConfig
Configuration for post-processing pipeline.
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable post-processing pipeline |
| enabled_processors | list[str]? | None | Specific processors to enable (if None, all enabled) |
| disabled_processors | list[str]? | None | Specific processors to disable |
Example

Rust:

    use kreuzberg::{ExtractionConfig, PostProcessorConfig};

    let config = ExtractionConfig {
        postprocessor: Some(PostProcessorConfig {
            enabled: true,
            enabled_processors: Some(vec![
                "deduplication".to_string(),
                "whitespace_normalization".to_string()
            ]),
            disabled_processors: Some(vec!["mojibake_fix".to_string()]),
        }),
        ..Default::default()
    };

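
The three fields combine into a single decision about which processors run. The sketch below spells out one consistent interpretation: a disabled pipeline runs nothing, `enabled_processors=None` means all processors, and `disabled_processors` is applied last. The precedence of the disable list over the enable list is an assumption, and `resolve_processors` is a hypothetical helper.

```python
from typing import Optional, Sequence


def resolve_processors(
    available: Sequence[str],
    enabled: bool = True,
    enabled_processors: Optional[Sequence[str]] = None,
    disabled_processors: Optional[Sequence[str]] = None,
) -> list[str]:
    """Return the processors that would run, in their registered order."""
    if not enabled:
        return []  # the whole pipeline is switched off
    # None means "no allow-list": every available processor is a candidate.
    allowed = set(available) if enabled_processors is None else set(enabled_processors)
    blocked = set(disabled_processors or [])
    return [name for name in available if name in allowed and name not in blocked]
```

Applied to the processor names used in the Rust example, this yields `deduplication` and `whitespace_normalization` and skips `mojibake_fix`.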
Complete Example
Here's a complete example that combines the main configuration sections:

Python:

    from kreuzberg import (
        extract_file,
        ExtractionConfig,
        OcrConfig,
        TesseractConfig,
        ImagePreprocessingConfig,
        PdfConfig,
        ImageExtractionConfig,
        ChunkingConfig,
        EmbeddingConfig,
        EmbeddingModelType,
        TokenReductionConfig,
        LanguageDetectionConfig,
        PostProcessorConfig,
    )

    config = ExtractionConfig(
        use_cache=True,
        enable_quality_processing=True,
        force_ocr=False,
        ocr=OcrConfig(
            backend="tesseract",
            language="eng+fra",
            tesseract_config=TesseractConfig(
                psm=3,
                oem=3,
                min_confidence=0.8,
                preprocessing=ImagePreprocessingConfig(
                    target_dpi=300,
                    denoise=True,
                    deskew=True,
                    contrast_enhance=True,
                ),
                enable_table_detection=True,
            ),
        ),
        pdf_options=PdfConfig(
            extract_images=True,
            extract_metadata=True,
        ),
        images=ImageExtractionConfig(
            extract_images=True,
            target_dpi=150,
            max_image_dimension=4096,
        ),
        chunking=ChunkingConfig(
            max_chars=1000,
            max_overlap=200,
            embedding=EmbeddingConfig(
                model=EmbeddingModelType.preset("all-MiniLM-L6-v2"),
                batch_size=32,
            ),
        ),
        token_reduction=TokenReductionConfig(
            mode="moderate",
            preserve_important_words=True,
        ),
        language_detection=LanguageDetectionConfig(
            enabled=True,
            min_confidence=0.8,
            detect_multiple=False,
        ),
        postprocessor=PostProcessorConfig(
            enabled=True,
        ),
    )

    result = extract_file("document.pdf", config=config)

TOML:

    # kreuzberg.toml
    use_cache = true
    enable_quality_processing = true
    force_ocr = false

    [ocr]
    backend = "tesseract"
    language = "eng+fra"

    [ocr.tesseract_config]
    psm = 3
    oem = 3
    min_confidence = 0.8
    enable_table_detection = true

    [ocr.tesseract_config.preprocessing]
    target_dpi = 300
    denoise = true
    deskew = true
    contrast_enhance = true

    [pdf_options]
    extract_images = true
    extract_metadata = true

    [images]
    extract_images = true
    target_dpi = 150
    max_image_dimension = 4096

    [chunking]
    max_chars = 1000
    max_overlap = 200

    [chunking.embedding]
    batch_size = 32

    [token_reduction]
    mode = "moderate"
    preserve_important_words = true

    [language_detection]
    enabled = true
    min_confidence = 0.8
    detect_multiple = false

    [postprocessor]
    enabled = true