Types¶
Core data structures for extraction results, configuration, and metadata.
ExtractionResult¶
The result of a file extraction, containing the extracted text, MIME type, metadata, and table data:
kreuzberg.ExtractionResult
dataclass
¶
Source code in kreuzberg/_types.py
Attributes¶
chunks: list[str] = field(default_factory=list)
class-attribute
instance-attribute
¶
The extracted content chunks. This is an empty list if 'chunk_content' is not set to True in the ExtractionConfig.
content: str
instance-attribute
¶
The extracted content.
detected_languages: list[str] | None = None
class-attribute
instance-attribute
¶
Languages detected in the extracted content, if language detection is enabled.
document_type: str | None = None
class-attribute
instance-attribute
¶
Detected document type, if document type detection is enabled.
document_type_confidence: float | None = None
class-attribute
instance-attribute
¶
Confidence of the detected document type.
entities: list[Entity] | None = None
class-attribute
instance-attribute
¶
Extracted entities, if entity extraction is enabled.
image_ocr_results: list[ImageOCRResult] = field(default_factory=list)
class-attribute
instance-attribute
¶
OCR results from extracted images. Empty list if disabled or none processed.
images: list[ExtractedImage] = field(default_factory=list)
class-attribute
instance-attribute
¶
Extracted images. Empty list if 'extract_images' is not enabled.
keywords: list[tuple[str, float]] | None = None
class-attribute
instance-attribute
¶
Extracted keywords and their scores, if keyword extraction is enabled.
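Because keywords is a plain list of (keyword, score) tuples, ranking it needs no special API. A minimal sketch, assuming that a higher score means a more relevant keyword (the scoring scale is not stated here); the sample data is made up for illustration.

```python
# Hypothetical sample data shaped like ExtractionResult.keywords.
keywords: list[tuple[str, float]] = [
    ("invoice", 0.42),
    ("payment terms", 0.87),
    ("due date", 0.65),
]

# Sort by score, highest first (assumption: higher means more relevant).
ranked = sorted(keywords, key=lambda pair: pair[1], reverse=True)

# Keep only the keyword strings of the two best-scoring entries.
top_two = [word for word, _score in ranked[:2]]
print(top_two)
```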
layout: DataFrame | None = field(default=None, repr=False, hash=False)
class-attribute
instance-attribute
¶
Internal layout data from OCR, not for public use.
metadata: Metadata = field(default_factory=(lambda: Metadata()))
class-attribute
instance-attribute
¶
The metadata of the content.
mime_type: str
instance-attribute
¶
The MIME type of the extracted content: either text/plain or text/markdown.
tables: list[TableData] = field(default_factory=list)
class-attribute
instance-attribute
¶
Extracted tables. This is an empty list unless 'extract_tables' is set to True in the ExtractionConfig.
ExtractionConfig¶
Configuration options for extraction functions:
kreuzberg.ExtractionConfig
dataclass
¶
Bases: ConfigDict
Source code in kreuzberg/_types.py
Attributes¶
auto_adjust_dpi: bool = True
class-attribute
instance-attribute
¶
Whether to automatically adjust DPI based on image dimensions to stay within max_image_dimension limits.
auto_detect_document_type: bool = False
class-attribute
instance-attribute
¶
Whether to automatically detect the document type.
auto_detect_language: bool = False
class-attribute
instance-attribute
¶
Whether to automatically detect language and configure OCR accordingly.
chunk_content: bool = False
class-attribute
instance-attribute
¶
Whether to chunk the content into smaller chunks.
custom_entity_patterns: frozenset[tuple[str, str]] | None = None
class-attribute
instance-attribute
¶
Custom entity patterns as a frozenset of (entity_type, regex_pattern) tuples.
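The shape of this value can be illustrated with the standard re module. This is only a sketch of how (entity_type, regex_pattern) pairs might be applied to text; the pattern names, the sample text, and the find_custom_entities helper are hypothetical, and the library's own matching logic may differ.

```python
import re

# Hypothetical patterns in the documented shape: (entity_type, regex_pattern).
custom_entity_patterns: frozenset[tuple[str, str]] = frozenset({
    ("INVOICE_ID", r"INV-\d{4}"),
    ("EMAIL", r"[\w.+-]+@[\w-]+\.[\w.]+"),
})


def find_custom_entities(
    text: str, patterns: frozenset[tuple[str, str]]
) -> list[tuple[str, str]]:
    """Return (entity_type, matched_text) pairs, sorted for stable output."""
    found = []
    for entity_type, pattern in patterns:
        for match in re.finditer(pattern, text):
            found.append((entity_type, match.group(0)))
    return sorted(found)


matches = find_custom_entities(
    "Send INV-0042 to billing@example.com", custom_entity_patterns
)
print(matches)
```

A frozenset of tuples is hashable, unlike a dict or list, which keeps the configuration object itself hashable.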
deduplicate_images: bool = True
class-attribute
instance-attribute
¶
Whether to remove duplicate images using CRC32 checksums.
document_classification_mode: Literal['text', 'vision'] = 'text'
class-attribute
instance-attribute
¶
The mode to use for document classification.
document_type_confidence_threshold: float = 0.5
class-attribute
instance-attribute
¶
Confidence threshold for document type detection.
enable_quality_processing: bool = True
class-attribute
instance-attribute
¶
Whether to apply quality post-processing to improve extraction results.
extract_entities: bool = False
class-attribute
instance-attribute
¶
Whether to extract named entities from the content.
extract_images: bool = False
class-attribute
instance-attribute
¶
Whether to extract images from documents.
extract_keywords: bool = False
class-attribute
instance-attribute
¶
Whether to extract keywords from the content.
extract_tables: bool = False
class-attribute
instance-attribute
¶
Whether to extract tables from the content. This requires the 'gmft' dependency.
extract_tables_from_ocr: bool = False
class-attribute
instance-attribute
¶
Extract tables from OCR output using TSV format (Tesseract only).
force_ocr: bool = False
class-attribute
instance-attribute
¶
Whether to force OCR.
gmft_config: GMFTConfig | None = None
class-attribute
instance-attribute
¶
GMFT configuration.
html_to_markdown_config: HTMLToMarkdownConfig | None = None
class-attribute
instance-attribute
¶
Configuration for HTML to Markdown conversion. If None, uses default settings.
image_ocr_backend: OcrBackendType | None = None
class-attribute
instance-attribute
¶
Deprecated: Use image_ocr_config.backend instead.
image_ocr_config: ImageOCRConfig | None = None
class-attribute
instance-attribute
¶
Configuration for OCR processing of extracted images.
image_ocr_formats: frozenset[str] = frozenset({'jpg', 'jpeg', 'png', 'gif', 'bmp', 'tiff', 'tif', 'webp', 'jp2', 'jpx', 'jpm', 'mj2', 'pnm', 'pbm', 'pgm', 'ppm'})
class-attribute
instance-attribute
¶
Deprecated: Use image_ocr_config.allowed_formats instead.
image_ocr_max_dimensions: tuple[int, int] = (10000, 10000)
class-attribute
instance-attribute
¶
Deprecated: Use image_ocr_config.max_dimensions instead.
image_ocr_min_dimensions: tuple[int, int] = (50, 50)
class-attribute
instance-attribute
¶
Deprecated: Use image_ocr_config.min_dimensions instead.
json_config: JSONExtractionConfig | None = None
class-attribute
instance-attribute
¶
Configuration for enhanced JSON extraction features. If None, uses standard JSON processing.
keyword_count: int = 10
class-attribute
instance-attribute
¶
Number of keywords to extract if extract_keywords is True.
language_detection_config: LanguageDetectionConfig | None = None
class-attribute
instance-attribute
¶
Configuration for language detection. If None, uses default settings with language_detection_model.
language_detection_model: Literal['lite', 'full', 'auto'] = 'auto'
class-attribute
instance-attribute
¶
Language detection model to use when auto_detect_language is True:
- 'lite': Smaller, faster model with good accuracy
- 'full': Larger model with highest accuracy
- 'auto': Automatically choose based on memory availability (default)
max_chars: int = DEFAULT_MAX_CHARACTERS
class-attribute
instance-attribute
¶
The size of each chunk in characters.
max_dpi: int = 600
class-attribute
instance-attribute
¶
Maximum DPI threshold when auto-adjusting DPI.
max_image_dimension: int = 25000
class-attribute
instance-attribute
¶
Maximum allowed pixel dimension (width or height) for processed images to prevent memory issues.
max_overlap: int = DEFAULT_MAX_OVERLAP
class-attribute
instance-attribute
¶
The overlap between chunks in characters.
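The interaction between max_chars and max_overlap can be pictured as a sliding window. The sketch below is a naive character-based splitter written for illustration only; it is not the library's splitter, which may respect sentence or Markdown boundaries, and chunk_text is a hypothetical helper.

```python
def chunk_text(text: str, max_chars: int, max_overlap: int) -> list[str]:
    """Split text into windows of at most max_chars characters.

    Each window starts max_chars - max_overlap characters after the previous
    one, so consecutive chunks share max_overlap characters.
    """
    if max_overlap >= max_chars:
        raise ValueError("max_overlap must be smaller than max_chars")
    step = max_chars - max_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once a window has reached the end of the text.
        if start + max_chars >= len(text):
            break
    return chunks


chunks = chunk_text("abcdefghij", max_chars=4, max_overlap=1)
print(chunks)
```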
min_dpi: int = 72
class-attribute
instance-attribute
¶
Minimum DPI threshold when auto-adjusting DPI.
ocr_backend: OcrBackendType | None = 'tesseract'
class-attribute
instance-attribute
¶
The OCR backend to use.
Notes
- If set to None, OCR will not be performed.
ocr_config: TesseractConfig | PaddleOCRConfig | EasyOCRConfig | None = None
class-attribute
instance-attribute
¶
Configuration to pass to the OCR backend.
ocr_extracted_images: bool = False
class-attribute
instance-attribute
¶
Deprecated: Use image_ocr_config.enabled instead.
pdf_password: str | list[str] = ''
class-attribute
instance-attribute
¶
Password(s) for encrypted PDF files. Can be a single password or list of passwords to try in sequence. Only used when crypto extra is installed.
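Since the value may be a single string or a list, a caller-side view of "try in sequence" looks roughly like this. The sketch assumes a hypothetical try_open callback that reports whether a password unlocks the file; the actual decryption code in the library is not shown on this page.

```python
from typing import Callable, Optional, Union


def password_candidates(pdf_password: Union[str, list[str]]) -> list[str]:
    """Normalize the documented str-or-list value to a list of candidates."""
    return [pdf_password] if isinstance(pdf_password, str) else list(pdf_password)


def first_working_password(
    pdf_password: Union[str, list[str]],
    try_open: Callable[[str], bool],
) -> Optional[str]:
    """Try each candidate in order and return the first one that works."""
    for candidate in password_candidates(pdf_password):
        if try_open(candidate):
            return candidate
    return None


# Hypothetical opener that records attempts and accepts one password.
attempts: list[str] = []


def fake_open(password: str) -> bool:
    attempts.append(password)
    return password == "s3cret"


result = first_working_password(["guess", "s3cret", "unused"], fake_open)
print(result, attempts)
```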
post_processing_hooks: list[PostProcessingHook] | None = None
class-attribute
instance-attribute
¶
Post processing hooks to call after processing is done and before the final result is returned.
spacy_entity_extraction_config: SpacyEntityExtractionConfig | None = None
class-attribute
instance-attribute
¶
Configuration for spaCy entity extraction. If None, uses default settings.
target_dpi: int = 150
class-attribute
instance-attribute
¶
Target DPI for OCR processing. Images and PDF pages will be scaled to this DPI for optimal OCR results.
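The DPI settings (target_dpi, min_dpi, max_dpi, max_image_dimension, auto_adjust_dpi) work together. The following is a sketch of one plausible adjustment rule, not the library's actual algorithm: it assumes page sizes in PDF points (72 per inch), lowers the DPI when the rendered long side would exceed max_image_dimension, and then clamps to the min/max range. The adjusted_dpi helper is hypothetical.

```python
def adjusted_dpi(
    width_pt: float,
    height_pt: float,
    target_dpi: int = 150,
    min_dpi: int = 72,
    max_dpi: int = 600,
    max_image_dimension: int = 25000,
) -> int:
    """Pick a DPI that keeps the rendered long side within the pixel limit."""
    longest_inches = max(width_pt, height_pt) / 72.0  # 72 points per inch
    dpi = target_dpi
    if longest_inches * dpi > max_image_dimension:
        # Scale down so the long side fits exactly within the limit.
        dpi = int(max_image_dimension / longest_inches)
    # Clamp to the configured bounds.
    return max(min_dpi, min(max_dpi, dpi))


letter_page = adjusted_dpi(612, 792)       # US Letter: 8.5 x 11 inches
huge_page = adjusted_dpi(14400, 14400)     # 200 x 200 inches
print(letter_page, huge_page)
```

Note that under this rule, clamping to min_dpi can still leave an extremely large page above the pixel limit; how the library resolves that case is not stated here.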
token_reduction: TokenReductionConfig | None = None
class-attribute
instance-attribute
¶
Configuration for token reduction to optimize output size while preserving meaning.
use_cache: bool = True
class-attribute
instance-attribute
¶
Whether to use caching for extraction results. Set to False to disable all caching.
validators: list[ValidationHook] | None = None
class-attribute
instance-attribute
¶
Validation hooks to call after processing is done and before post-processing and result return.
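The ordering described for validators and post_processing_hooks (validation first, then post-processing, then return) can be sketched with plain callables. Everything below is illustrative: Result is a stand-in for ExtractionResult, the hook signatures are assumptions, and finalize is a hypothetical helper rather than a library function.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class Result:
    """Stand-in for ExtractionResult with only the field used here."""
    content: str


def require_content(result: Result) -> None:
    """Validator: raise if nothing was extracted."""
    if not result.content.strip():
        raise ValueError("empty extraction result")


def collapse_whitespace(result: Result) -> Result:
    """Post-processing hook: return a new result with normalized whitespace."""
    return replace(result, content=" ".join(result.content.split()))


def finalize(
    result: Result,
    validators: Optional[list[Callable[[Result], None]]] = None,
    post_processing_hooks: Optional[list[Callable[[Result], Result]]] = None,
) -> Result:
    """Run validators first, then post-processing hooks, in list order."""
    for validate in validators or []:
        validate(result)
    for hook in post_processing_hooks or []:
        result = hook(result)
    return result


final = finalize(Result("  hello   world \n"), [require_content], [collapse_whitespace])
print(final.content)
```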
TableData¶
A TypedDict that contains data extracted from tables in documents:
kreuzberg.TableData
¶
Bases: TypedDict
Source code in kreuzberg/_types.py
Attributes¶
cropped_image: Image
instance-attribute
¶
The cropped image of the table.
df: DataFrame | None
instance-attribute
¶
The table data as a polars DataFrame.
page_number: int
instance-attribute
¶
The page number of the table.
text: str
instance-attribute
¶
The table text as a markdown string.
Image Extraction Types¶
ExtractedImage¶
Represents an image extracted from a document:
kreuzberg.ExtractedImage
dataclass
¶
Source code in kreuzberg/_types.py
ImageOCRResult¶
Contains the result of running OCR on an extracted image:
kreuzberg.ImageOCRResult
dataclass
¶
ImageOCRConfig¶
Configuration for OCR processing of extracted images:
kreuzberg.ImageOCRConfig
dataclass
¶
Bases: ConfigDict
Configuration for OCR processing of extracted images.
Source code in kreuzberg/_types.py
Attributes¶
allowed_formats: frozenset[str] = frozenset({'jpg', 'jpeg', 'png', 'gif', 'bmp', 'tiff', 'tif', 'webp', 'jp2', 'jpx', 'jpm', 'mj2', 'pnm', 'pbm', 'pgm', 'ppm'})
class-attribute
instance-attribute
¶
Allowed image formats for OCR processing (lowercase, without dot).
backend: OcrBackendType | None = None
class-attribute
instance-attribute
¶
OCR backend for image OCR. Falls back to main ocr_backend when None.
backend_config: TesseractConfig | PaddleOCRConfig | EasyOCRConfig | None = None
class-attribute
instance-attribute
¶
Backend-specific configuration for image OCR.
batch_size: int = 4
class-attribute
instance-attribute
¶
Number of images to process in parallel for OCR.
enabled: bool = False
class-attribute
instance-attribute
¶
Whether to perform OCR on extracted images.
max_dimensions: tuple[int, int] = (10000, 10000)
class-attribute
instance-attribute
¶
Maximum (width, height) in pixels for image OCR eligibility.
min_dimensions: tuple[int, int] = (50, 50)
class-attribute
instance-attribute
¶
Minimum (width, height) in pixels for image OCR eligibility.
timeout_seconds: int = 30
class-attribute
instance-attribute
¶
Maximum time in seconds for OCR processing per image.
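Taken together, allowed_formats, min_dimensions, and max_dimensions describe an eligibility filter. A minimal sketch of such a filter follows, assuming that both bounds are inclusive and that width and height must each fall within range; is_ocr_eligible is a hypothetical helper, not part of the library.

```python
# Default values copied from the attribute list above.
ALLOWED_FORMATS = frozenset({
    "jpg", "jpeg", "png", "gif", "bmp", "tiff", "tif", "webp",
    "jp2", "jpx", "jpm", "mj2", "pnm", "pbm", "pgm", "ppm",
})


def is_ocr_eligible(
    image_format: str,
    width: int,
    height: int,
    allowed_formats: frozenset[str] = ALLOWED_FORMATS,
    min_dimensions: tuple[int, int] = (50, 50),
    max_dimensions: tuple[int, int] = (10000, 10000),
) -> bool:
    """Check format and pixel size against the configured limits."""
    # Formats are documented as lowercase and without a leading dot.
    fmt = image_format.lower().lstrip(".")
    if fmt not in allowed_formats:
        return False
    min_w, min_h = min_dimensions
    max_w, max_h = max_dimensions
    return min_w <= width <= max_w and min_h <= height <= max_h


print(is_ocr_eligible("PNG", 800, 600))   # normal image
print(is_ocr_eligible("png", 20, 600))    # too narrow
```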
OCR Configuration¶
TesseractConfig¶
kreuzberg.TesseractConfig
dataclass
¶
Bases: ConfigDict
Source code in kreuzberg/_types.py
Attributes¶
classify_use_pre_adapted_templates: bool = True
class-attribute
instance-attribute
¶
Whether to use pre-adapted templates during classification to improve recognition accuracy.
enable_table_detection: bool = False
class-attribute
instance-attribute
¶
Enable table structure detection from TSV output.
language: str = 'eng'
class-attribute
instance-attribute
¶
Language code to use for OCR. Examples:
- 'eng' for English
- 'deu' for German
- multiple languages combined with '+', e.g. 'eng+deu'
language_model_ngram_on: bool = False
class-attribute
instance-attribute
¶
Enable or disable the use of n-gram-based language models for improved text recognition. Default is False for optimal performance on modern documents. Enable for degraded or historical text.
output_format: OutputFormatType = 'markdown'
class-attribute
instance-attribute
¶
Output format: 'markdown' (default), 'text', 'tsv' (for structured data), or 'hocr' (HTML-based).
psm: PSMMode = PSMMode.AUTO
class-attribute
instance-attribute
¶
Page segmentation mode (PSM) to guide Tesseract on how to segment the image (e.g., single block, single line).
table_column_threshold: int = 20
class-attribute
instance-attribute
¶
Pixel threshold for column clustering in table detection.
table_min_confidence: float = 30.0
class-attribute
instance-attribute
¶
Minimum confidence score to include a word in table extraction.
table_row_threshold_ratio: float = 0.5
class-attribute
instance-attribute
¶
Row threshold as ratio of mean text height for table detection.
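To give a feel for what a pixel threshold for column clustering and a minimum word confidence do, here is a simplified sketch. It is not the library's table detection code: the word tuples, the cluster_columns helper, and the rule of merging sorted left edges that lie within the threshold of the previous one are all assumptions made for illustration.

```python
def cluster_columns(lefts: list[int], table_column_threshold: int = 20) -> list[list[int]]:
    """Group sorted x positions; a gap above the threshold starts a new column."""
    clusters: list[list[int]] = []
    for x in sorted(lefts):
        if clusters and x - clusters[-1][-1] <= table_column_threshold:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    return clusters


# Hypothetical OCR words: (text, left edge in pixels, confidence).
words = [
    ("Item", 10, 96.0),
    ("Qty", 200, 91.5),
    ("Apple", 12, 88.0),
    ("3", 205, 25.0),   # below table_min_confidence, so it is dropped
    ("Pear", 14, 93.0),
]

table_min_confidence = 30.0
confident = [w for w in words if w[2] >= table_min_confidence]
columns = cluster_columns([left for _text, left, _conf in confident])
print(columns)
```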
tessedit_char_whitelist: str = ''
class-attribute
instance-attribute
¶
Whitelist of characters that Tesseract is allowed to recognize. Empty string means no restriction.
tessedit_dont_blkrej_good_wds: bool = True
class-attribute
instance-attribute
¶
If True, prevents block rejection of words identified as good, improving text output quality.
tessedit_dont_rowrej_good_wds: bool = True
class-attribute
instance-attribute
¶
If True, prevents row rejection of words identified as good, avoiding unnecessary omissions.
tessedit_enable_dict_correction: bool = True
class-attribute
instance-attribute
¶
Enable or disable dictionary-based correction for recognized text to improve word accuracy.
tessedit_use_primary_params_model: bool = True
class-attribute
instance-attribute
¶
If True, forces the use of the primary parameters model for text recognition.
textord_space_size_is_variable: bool = True
class-attribute
instance-attribute
¶
Allow variable spacing between words, useful for text with irregular spacing.
thresholding_method: bool = False
class-attribute
instance-attribute
¶
Enable or disable specific thresholding methods during image preprocessing for better OCR accuracy.
EasyOCRConfig¶
kreuzberg.EasyOCRConfig
dataclass
¶
Bases: ConfigDict
Source code in kreuzberg/_types.py
Attributes¶
add_margin: float = 0.1
class-attribute
instance-attribute
¶
Extend bounding boxes in all directions.
adjust_contrast: float = 0.5
class-attribute
instance-attribute
¶
Target contrast level for low contrast text.
beam_width: int = 5
class-attribute
instance-attribute
¶
Beam width for beam search in recognition.
canvas_size: int = 2560
class-attribute
instance-attribute
¶
Maximum image dimension for detection.
contrast_ths: float = 0.1
class-attribute
instance-attribute
¶
Contrast threshold for preprocessing.
decoder: Literal['greedy', 'beamsearch', 'wordbeamsearch'] = 'greedy'
class-attribute
instance-attribute
¶
Decoder method. Options: 'greedy', 'beamsearch', 'wordbeamsearch'.
device: DeviceType = 'auto'
class-attribute
instance-attribute
¶
Device to use for inference. Options: 'cpu', 'cuda', 'mps', 'auto'.
fallback_to_cpu: bool = True
class-attribute
instance-attribute
¶
Whether to fall back to CPU if the requested device is unavailable.
gpu_memory_limit: float | None = None
class-attribute
instance-attribute
¶
Maximum GPU memory to use in GB. None for no limit.
height_ths: float = 0.5
class-attribute
instance-attribute
¶
Maximum difference in box height for merging.
language: str | list[str] = 'en'
class-attribute
instance-attribute
¶
Language or languages to use for OCR. Can be a single language code (e.g., 'en'), a comma-separated string of language codes (e.g., 'en,ch_sim'), or a list of language codes.
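The three accepted forms (single code, comma-separated string, list) can be reduced to one list. A small sketch of such a normalization, with normalize_languages as a hypothetical helper; the library may normalize differently.

```python
from typing import Union


def normalize_languages(language: Union[str, list[str]]) -> list[str]:
    """Turn 'en', 'en,ch_sim', or ['en', 'ch_sim'] into a clean list of codes."""
    parts = language.split(",") if isinstance(language, str) else language
    # Strip surrounding whitespace and drop empty entries.
    return [code.strip() for code in parts if code.strip()]


print(normalize_languages("en,ch_sim"))
```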
link_threshold: float = 0.4
class-attribute
instance-attribute
¶
Link confidence threshold.
low_text: float = 0.4
class-attribute
instance-attribute
¶
Text low-bound score.
mag_ratio: float = 1.0
class-attribute
instance-attribute
¶
Image magnification ratio.
min_size: int = 10
class-attribute
instance-attribute
¶
Minimum text box size in pixels.
rotation_info: list[int] | None = None
class-attribute
instance-attribute
¶
List of angles to try for detection.
slope_ths: float = 0.1
class-attribute
instance-attribute
¶
Maximum slope for merging text boxes.
text_threshold: float = 0.7
class-attribute
instance-attribute
¶
Text confidence threshold.
use_gpu: bool = False
class-attribute
instance-attribute
¶
Whether to use GPU for inference. DEPRECATED: Use 'device' parameter instead.
width_ths: float = 0.5
class-attribute
instance-attribute
¶
Maximum horizontal distance for merging boxes.
x_ths: float = 1.0
class-attribute
instance-attribute
¶
Maximum horizontal distance for paragraph merging.
y_ths: float = 0.5
class-attribute
instance-attribute
¶
Maximum vertical distance for paragraph merging.
ycenter_ths: float = 0.5
class-attribute
instance-attribute
¶
Maximum shift in y direction for merging.
PaddleOCRConfig¶
kreuzberg.PaddleOCRConfig
dataclass
¶
Bases: ConfigDict
Source code in kreuzberg/_types.py
Attributes¶
cls_image_shape: str = '3,48,192'
class-attribute
instance-attribute
¶
Image shape for classification algorithm in format 'channels,height,width'.
det_algorithm: Literal['DB', 'EAST', 'SAST', 'PSE', 'FCE', 'PAN', 'CT', 'DB++', 'Layout'] = 'DB'
class-attribute
instance-attribute
¶
Detection algorithm.
det_db_box_thresh: float = 0.5
class-attribute
instance-attribute
¶
DEPRECATED in PaddleOCR 3.2.0+: Use 'text_det_box_thresh' instead. Score threshold for detected boxes.
det_db_thresh: float = 0.3
class-attribute
instance-attribute
¶
DEPRECATED in PaddleOCR 3.2.0+: Use 'text_det_thresh' instead. Binarization threshold for DB output map.
det_db_unclip_ratio: float = 2.0
class-attribute
instance-attribute
¶
DEPRECATED in PaddleOCR 3.2.0+: Use 'text_det_unclip_ratio' instead. Expansion ratio for detected text boxes.
det_east_cover_thresh: float = 0.1
class-attribute
instance-attribute
¶
Score threshold for EAST output boxes.
det_east_nms_thresh: float = 0.2
class-attribute
instance-attribute
¶
NMS threshold for EAST model output boxes.
det_east_score_thresh: float = 0.8
class-attribute
instance-attribute
¶
Binarization threshold for EAST output map.
det_max_side_len: int = 960
class-attribute
instance-attribute
¶
Maximum size of image long side. Images exceeding this will be proportionally resized.
det_model_dir: str | None = None
class-attribute
instance-attribute
¶
Directory for detection model. If None, uses default model location.
device: DeviceType = 'auto'
class-attribute
instance-attribute
¶
Device to use for inference. Options: 'cpu', 'cuda', 'auto'. Note: MPS not supported by PaddlePaddle.
drop_score: float = 0.5
class-attribute
instance-attribute
¶
Filter recognition results by confidence score. Results below this are discarded.
enable_mkldnn: bool = False
class-attribute
instance-attribute
¶
Whether to enable MKL-DNN acceleration (Intel CPU only).
fallback_to_cpu: bool = True
class-attribute
instance-attribute
¶
Whether to fall back to CPU if the requested device is unavailable.
gpu_mem: int = 8000
class-attribute
instance-attribute
¶
DEPRECATED in PaddleOCR 3.2.0+: Parameter no longer supported. GPU memory size (in MB) to use for initialization.
gpu_memory_limit: float | None = None
class-attribute
instance-attribute
¶
DEPRECATED in PaddleOCR 3.2.0+: Parameter no longer supported. Maximum GPU memory to use in GB.
language: str = 'en'
class-attribute
instance-attribute
¶
Language to use for OCR.
max_text_length: int = 25
class-attribute
instance-attribute
¶
Maximum text length that the recognition algorithm can recognize.
rec: bool = True
class-attribute
instance-attribute
¶
Enable text recognition when using the ocr() function.
rec_algorithm: Literal['CRNN', 'SRN', 'NRTR', 'SAR', 'SEED', 'SVTR', 'SVTR_LCNet', 'ViTSTR', 'ABINet', 'VisionLAN', 'SPIN', 'RobustScanner', 'RFL'] = 'CRNN'
class-attribute
instance-attribute
¶
Recognition algorithm.
rec_image_shape: str = '3,32,320'
class-attribute
instance-attribute
¶
Image shape for recognition algorithm in format 'channels,height,width'.
rec_model_dir: str | None = None
class-attribute
instance-attribute
¶
Directory for recognition model. If None, uses default model location.
table: bool = True
class-attribute
instance-attribute
¶
Whether to enable table recognition.
text_det_box_thresh: float = 0.5
class-attribute
instance-attribute
¶
Score threshold for detected text boxes (replaces det_db_box_thresh).
text_det_thresh: float = 0.3
class-attribute
instance-attribute
¶
Binarization threshold for text detection output map (replaces det_db_thresh).
text_det_unclip_ratio: float = 2.0
class-attribute
instance-attribute
¶
Expansion ratio for detected text boxes (replaces det_db_unclip_ratio).
use_angle_cls: bool = True
class-attribute
instance-attribute
¶
DEPRECATED in PaddleOCR 3.2.0+: Use 'use_textline_orientation' instead. Whether to use text orientation classification model.
use_gpu: bool = False
class-attribute
instance-attribute
¶
DEPRECATED in PaddleOCR 3.2.0+: Parameter no longer supported. Use hardware acceleration flags instead.
use_space_char: bool = True
class-attribute
instance-attribute
¶
Whether to recognize spaces.
use_textline_orientation: bool = True
class-attribute
instance-attribute
¶
Whether to use text line orientation classification model (replaces use_angle_cls).
use_zero_copy_run: bool = False
class-attribute
instance-attribute
¶
Whether to enable zero_copy_run for inference optimization.
GMFT Configuration¶
Deprecated
GMFTConfig is deprecated and scheduled for removal in Kreuzberg v4.0. Use the new TATR-based TableExtractionConfig when migrating to v4.
Configuration options for the GMFT table extraction engine (legacy):
kreuzberg.GMFTConfig
dataclass
¶
Bases: ConfigDict
Source code in kreuzberg/_types.py
Attributes¶
cell_required_confidence: dict[Literal[0, 1, 2, 3, 4, 5, 6], float] = field(default_factory=(lambda: {0: 0.3, 1: 0.3, 2: 0.3, 3: 0.3, 4: 0.5, 5: 0.5, 6: 99}), hash=False)
class-attribute
instance-attribute
¶
Confidences required (>=) for a row/column feature to be considered good. See TATRFormattedTable.id2label.
Note that low confidence requirements may work better than overly high ones (see formatter_base_threshold).
detector_base_threshold: float = 0.9
class-attribute
instance-attribute
¶
Minimum confidence score required for a table.
enable_multi_header: bool = False
class-attribute
instance-attribute
¶
Enable multi-indices in the dataframe.
If false, then multiple headers will be merged column-wise.
force_large_table_assumption: bool | None = None
class-attribute
instance-attribute
¶
Force the large table assumption to be applied, regardless of the number of rows and overlap.
formatter_base_threshold: float = 0.3
class-attribute
instance-attribute
¶
Base threshold for the confidence demanded of a table feature (row/column).
Note that a low threshold is actually better: overzealous row detection generally leaves numbers aligned and merely adds empty rows, whereas detecting fewer rows than expected merges cells, which is worse.
iob_reject_threshold: float = 0.05
class-attribute
instance-attribute
¶
Reject if iob between textbox and cell is < 5%.
iob_warn_threshold: float = 0.5
class-attribute
instance-attribute
¶
Warn if iob between textbox and cell is < 50%.
large_table_if_n_rows_removed: int = 8
class-attribute
instance-attribute
¶
If >= n rows are removed due to non-maxima suppression (NMS), then this table is classified as a large table.
large_table_maximum_rows: int = 1000
class-attribute
instance-attribute
¶
Maximum number of rows allowed for a large table.
large_table_row_overlap_threshold: float = 0.2
class-attribute
instance-attribute
¶
With large tables, table transformer struggles with placing too many overlapping rows. Luckily, with more rows, we have more info on the usual size of text, which we can use to make a guess on the height such that no rows are merged or overlapping.
Large table assumption is only applied when (# of rows > large_table_threshold) AND (total overlap > large_table_row_overlap_threshold).
large_table_threshold: int = 10
class-attribute
instance-attribute
¶
With large tables, table transformer struggles with placing too many overlapping rows. Luckily, with more rows, we have more info on the usual size of text, which we can use to make a guess on the height such that no rows are merged or overlapping.
Large table assumption is only applied when (# of rows > large_table_threshold) AND (total overlap > large_table_row_overlap_threshold). Set 9999 to disable; set 0 to force large table assumption to run every time.
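The stated condition translates directly into a boolean check. The sketch below encodes only the rule quoted above, plus one assumption: that force_large_table_assumption acts as a tri-state override (True forces it on, False forces it off, None uses the rule). The function name is hypothetical and the real decision logic may include more inputs, such as rows removed by NMS.

```python
from typing import Optional


def applies_large_table_assumption(
    n_rows: int,
    total_overlap: float,
    large_table_threshold: int = 10,
    large_table_row_overlap_threshold: float = 0.2,
    force_large_table_assumption: Optional[bool] = None,
) -> bool:
    """Decide whether the large table assumption applies to a table."""
    # Assumption: an explicit True/False overrides the automatic rule.
    if force_large_table_assumption is not None:
        return force_large_table_assumption
    # Documented rule: both conditions must hold.
    return (
        n_rows > large_table_threshold
        and total_overlap > large_table_row_overlap_threshold
    )


print(applies_large_table_assumption(12, 0.3))
print(applies_large_table_assumption(12, 0.1))
```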
nms_warn_threshold: int = 5
class-attribute
instance-attribute
¶
Warn if non-maxima suppression (NMS) removes > 5 rows.
remove_null_rows: bool = True
class-attribute
instance-attribute
¶
Flag to remove rows with no text.
semantic_hierarchical_left_fill: Literal['algorithm', 'deep'] | None = 'algorithm'
class-attribute
instance-attribute
¶
[Experimental] When semantic spanning cells are enabled and a left header is detected that might represent a group of rows, that value is duplicated for each row.
Possible values: 'algorithm', 'deep', None.
semantic_spanning_cells: bool = False
class-attribute
instance-attribute
¶
[Experimental] Enable semantic spanning cells, which often encode hierarchical multi-level indices.
total_overlap_reject_threshold: float = 0.9
class-attribute
instance-attribute
¶
Reject if total overlap is > 90% of table area.
total_overlap_warn_threshold: float = 0.1
class-attribute
instance-attribute
¶
Warn if total overlap is > 10% of table area.
verbosity: int = 0
class-attribute
instance-attribute
¶
Verbosity level for logging.
- 0: errors only
- 1: print warnings
- 2: print warnings and info
- 3: print warnings, info, and debug
Entity Extraction Configuration¶
Configuration options for spaCy-based entity extraction:
kreuzberg.SpacyEntityExtractionConfig
dataclass
¶
Bases: ConfigDict
Source code in kreuzberg/_types.py
Attributes¶
batch_size: int = 1000
class-attribute
instance-attribute
¶
Batch size for processing multiple texts.
fallback_to_multilingual: bool = True
class-attribute
instance-attribute
¶
If True and language-specific model fails, try xx_ent_wiki_sm (multilingual).
language_models: dict[str, str] | tuple[tuple[str, str], ...] | None = None
class-attribute
instance-attribute
¶
Mapping of language codes to spaCy model names.
If None, uses default mappings:
- en: en_core_web_sm
- de: de_core_news_sm
- fr: fr_core_news_sm
- es: es_core_news_sm
- pt: pt_core_news_sm
- it: it_core_news_sm
- nl: nl_core_news_sm
- zh: zh_core_web_sm
- ja: ja_core_news_sm
max_doc_length: int = 1000000
class-attribute
instance-attribute
¶
Maximum document length for spaCy processing.
model_cache_dir: str | Path | None = None
class-attribute
instance-attribute
¶
Directory to cache spaCy models. If None, uses spaCy's default.
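How language_models and fallback_to_multilingual might combine when choosing a model name can be sketched as a lookup with a fallback. This is an illustration under assumptions: resolve_model is a hypothetical helper, and it treats a missing mapping as the failure case, whereas the library may also fall back when a model fails to load.

```python
from typing import Optional, Union

# Default mappings as listed in the language_models description.
DEFAULT_MODELS = {
    "en": "en_core_web_sm", "de": "de_core_news_sm", "fr": "fr_core_news_sm",
    "es": "es_core_news_sm", "pt": "pt_core_news_sm", "it": "it_core_news_sm",
    "nl": "nl_core_news_sm", "zh": "zh_core_web_sm", "ja": "ja_core_news_sm",
}
MULTILINGUAL_MODEL = "xx_ent_wiki_sm"


def resolve_model(
    language: str,
    language_models: Union[dict[str, str], tuple[tuple[str, str], ...], None] = None,
    fallback_to_multilingual: bool = True,
) -> Optional[str]:
    """Pick a spaCy model name for a language code."""
    # dict() accepts both a mapping and a tuple of (code, model) pairs.
    mapping = dict(language_models) if language_models is not None else DEFAULT_MODELS
    model = mapping.get(language)
    if model is None and fallback_to_multilingual:
        return MULTILINGUAL_MODEL
    return model


print(resolve_model("de"), resolve_model("sv"))
```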
Language Detection Configuration¶
Configuration options for automatic language detection:
kreuzberg.LanguageDetectionConfig
dataclass
¶
Bases: ConfigDict
Source code in kreuzberg/_types.py
Attributes¶
cache_dir: str | None = None
class-attribute
instance-attribute
¶
Custom directory for model cache. If None, uses system default.
low_memory: bool = True
class-attribute
instance-attribute
¶
Deprecated. Use 'model' parameter instead. If True, uses 'lite' model.
model: Literal['lite', 'full', 'auto'] = 'auto'
class-attribute
instance-attribute
¶
Language detection model to use:
- 'lite': Smaller, faster model with good accuracy
- 'full': Larger model with highest accuracy
- 'auto': Automatically choose based on memory availability (default)
multilingual: bool = False
class-attribute
instance-attribute
¶
If True, uses multilingual detection to handle mixed-language text. If False, uses single language detection.
top_k: int = 3
class-attribute
instance-attribute
¶
Maximum number of languages to return for multilingual detection.
JSON Extraction Configuration¶
Configuration for enhanced JSON document processing:
kreuzberg.JSONExtractionConfig
dataclass
¶
Bases: ConfigDict
Source code in kreuzberg/_types.py
Attributes¶
array_item_limit: int = 1000
class-attribute
instance-attribute
¶
Maximum number of array items to process to prevent memory issues.
custom_text_field_patterns: frozenset[str] | None = None
class-attribute
instance-attribute
¶
Custom patterns to identify text fields beyond default keywords.
extract_schema: bool = False
class-attribute
instance-attribute
¶
Extract and include JSON schema information in metadata.
flatten_nested_objects: bool = True
class-attribute
instance-attribute
¶
Flatten nested objects using dot notation for better text extraction.
include_type_info: bool = False
class-attribute
instance-attribute
¶
Include data type information in extracted content.
max_depth: int = 10
class-attribute
instance-attribute
¶
Maximum nesting depth to process in JSON structures.
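The effect of flatten_nested_objects, max_depth, and array_item_limit can be illustrated with a small recursive function. This is a sketch of dot-notation flattening in general, not the library's implementation; the flatten helper and the choice to keep over-deep values unflattened are assumptions.

```python
from typing import Any


def flatten(
    value: Any,
    prefix: str = "",
    depth: int = 0,
    max_depth: int = 10,
    array_item_limit: int = 1000,
) -> dict[str, Any]:
    """Flatten nested dicts and lists into {'a.b.0': leaf} form."""
    # Leaves, and anything at or beyond max_depth, are stored as-is.
    if depth >= max_depth or not isinstance(value, (dict, list)):
        return {prefix: value}
    if isinstance(value, dict):
        items = value.items()
    else:
        # Only the first array_item_limit items of a list are visited.
        items = enumerate(value[:array_item_limit])
    flat: dict[str, Any] = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(child, path, depth + 1, max_depth, array_item_limit))
    return flat


flat = flatten({"a": {"b": 1}, "c": [1, 2, 3]}, array_item_limit=2)
print(flat)
```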
HTML to Markdown Configuration¶
Configuration options for converting HTML content to Markdown:
kreuzberg.HTMLToMarkdownConfig
dataclass
¶
Source code in kreuzberg/_types.py
Attributes¶
autolinks: bool = True
class-attribute
instance-attribute
¶
Automatically convert valid URLs to Markdown links.
br_in_tables: bool = False
class-attribute
instance-attribute
¶
Use <br> tags for line breaks in table cells instead of spaces.
bullets: str = '*+-'
class-attribute
instance-attribute
¶
Characters to use for unordered list bullets.
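A multi-character bullets string is commonly interpreted as one bullet per nesting level, cycling when the list is deeper than the string is long. The sketch below assumes that interpretation, which is not confirmed on this page; render_list and bullet_for_depth are hypothetical helpers, and the four-space indent is chosen only for the example.

```python
def bullet_for_depth(depth: int, bullets: str = "*+-") -> str:
    """Pick the bullet for a nesting level, cycling through the string."""
    return bullets[depth % len(bullets)]


def render_list(items: list, depth: int = 0, bullets: str = "*+-", indent: str = "    ") -> list[str]:
    """Render nested Python lists as Markdown list lines."""
    lines: list[str] = []
    for item in items:
        if isinstance(item, list):
            # A nested list is rendered one level deeper.
            lines.extend(render_list(item, depth + 1, bullets, indent))
        else:
            lines.append(f"{indent * depth}{bullet_for_depth(depth, bullets)} {item}")
    return lines


rendered = render_list(["a", ["b", ["c", ["d"]]]])
print("\n".join(rendered))
```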
code_block_style: Literal['indented', 'backticks', 'tildes'] = 'backticks'
class-attribute
instance-attribute
¶
Style for fenced code blocks.
code_language: str = ''
class-attribute
instance-attribute
¶
Default language identifier for fenced code blocks.
code_language_callback: Callable[[Any], str] | None = field(default=None, compare=False, hash=False)
class-attribute
instance-attribute
¶
Legacy language callback (no longer used by v2 converter).
convert: tuple[str, ...] | None = None
class-attribute
instance-attribute
¶
Legacy list of tags to convert (no longer used by v2 converter).
convert_as_inline: bool = False
class-attribute
instance-attribute
¶
Treat content as inline elements only.
custom_converters: Mapping[str, Callable[..., str]] | None = field(default=None, compare=False, hash=False)
class-attribute
instance-attribute
¶
Legacy mapping of custom converters (ignored by v2 converter).
debug: bool = False
class-attribute
instance-attribute
¶
Enable debug diagnostics in the converter.
default_title: bool = False
class-attribute
instance-attribute
¶
Use default titles for elements like links.
encoding: str = 'utf-8'
class-attribute
instance-attribute
¶
Expected character encoding for the HTML input.
escape_ascii: bool = False
class-attribute
instance-attribute
¶
Escape all ASCII punctuation.
escape_asterisks: bool = False
class-attribute
instance-attribute
¶
Escape * characters to prevent unintended formatting.
escape_misc: bool = False
class-attribute
instance-attribute
¶
Escape miscellaneous characters to prevent Markdown conflicts.
escape_underscores: bool = False
class-attribute
instance-attribute
¶
Escape _ characters to prevent unintended formatting.
extract_metadata: bool = True
class-attribute
instance-attribute
¶
Extract document metadata as comment header.
heading_style: Literal['underlined', 'atx', 'atx_closed'] = 'atx'
class-attribute
instance-attribute
¶
Style for markdown headings.
highlight_style: Literal['double-equal', 'html', 'bold', 'none'] = 'double-equal'
class-attribute
instance-attribute
¶
Style for highlighting text.
keep_inline_images_in: tuple[str, ...] | None = None
class-attribute
instance-attribute
¶
Tags where inline images should be preserved.
list_indent_type: Literal['spaces', 'tabs'] = 'spaces'
class-attribute
instance-attribute
¶
Type of indentation to use for lists.
list_indent_width: int = 4
class-attribute
instance-attribute
¶
Number of spaces per indentation level (use 2 for Discord/Slack).
newline_style: Literal['spaces', 'backslash'] = 'spaces'
class-attribute
instance-attribute
¶
Style for line breaks in markdown.
preprocess_html: bool = False
class-attribute
instance-attribute
¶
Enable HTML preprocessing to clean messy HTML.
preprocessing_preset: Literal['minimal', 'standard', 'aggressive'] = 'standard'
class-attribute
instance-attribute
¶
Preprocessing level for cleaning HTML.
remove_forms: bool = True
class-attribute
instance-attribute
¶
Remove form elements during preprocessing.
remove_navigation: bool = True
class-attribute
instance-attribute
¶
Remove navigation elements during preprocessing.
strip_newlines: bool = False
class-attribute
instance-attribute
¶
Remove newlines from HTML input before processing.
strip_tags: tuple[str, ...] | None = None
class-attribute
instance-attribute
¶
List of HTML tags to remove from output.
strong_em_symbol: Literal['*', '_'] = '*'
class-attribute
instance-attribute
¶
Symbol to use for strong/emphasis formatting.
sub_symbol: str = ''
class-attribute
instance-attribute
¶
Symbol to use for subscript text.
sup_symbol: str = ''
class-attribute
instance-attribute
¶
Symbol to use for superscript text.
whitespace_mode: Literal['normalized', 'strict'] = 'normalized'
class-attribute
instance-attribute
¶
Whitespace handling mode.
wrap: bool = False
class-attribute
instance-attribute
¶
Enable text wrapping.
wrap_width: int = 80
class-attribute
instance-attribute
¶
Width for text wrapping.
Functions¶
to_options() -> tuple[HTMLToMarkdownConversionOptions, HTMLToMarkdownPreprocessingOptions]
¶
Build html_to_markdown ConversionOptions and PreprocessingOptions instances.
Source code in kreuzberg/_types.py
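To make the three heading_style values concrete, the sketch below renders one heading in each style using the usual Markdown conventions: ATX (`# Title`), closed ATX (`# Title #`), and underlined (Setext). `render_heading` is a hypothetical helper written for this illustration. It does not call kreuzberg or html_to_markdown, and the converter's exact output may differ. Falling back to ATX for underlined headings below level 2 is an assumption.

```python
from typing import Literal

HeadingStyle = Literal["underlined", "atx", "atx_closed"]


def render_heading(text: str, level: int, style: HeadingStyle = "atx") -> str:
    """Render a heading in one of the three documented heading styles."""
    hashes = "#" * level
    if style == "underlined" and level <= 2:
        # Setext headings exist only for level 1 (=) and level 2 (-).
        underline = ("=" if level == 1 else "-") * len(text)
        return f"{text}\n{underline}"
    if style == "atx_closed":
        return f"{hashes} {text} {hashes}"
    # "atx", and the assumed fallback for underlined headings deeper than level 2.
    return f"{hashes} {text}"


for style in ("atx", "atx_closed", "underlined"):
    print(render_heading("Title", 1, style))
```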
Token Reduction Configuration¶
Configuration options for token reduction and text optimization:
kreuzberg.TokenReductionConfig
dataclass
¶
Source code in kreuzberg/_types.py
PSMMode (Page Segmentation Mode)¶
kreuzberg.PSMMode
¶
Bases: Enum
Source code in kreuzberg/_types.py
Attributes¶
AUTO = 3
class-attribute
instance-attribute
¶
Fully automatic page segmentation (default).
AUTO_ONLY = 2
class-attribute
instance-attribute
¶
Automatic page segmentation without OSD.
AUTO_OSD = 1
class-attribute
instance-attribute
¶
Automatic page segmentation with orientation and script detection.
CIRCLE_WORD = 9
class-attribute
instance-attribute
¶
Treat the image as a single word in a circle.
OSD_ONLY = 0
class-attribute
instance-attribute
¶
Orientation and script detection only.
SINGLE_BLOCK = 6
class-attribute
instance-attribute
¶
Assume a single uniform block of text.
SINGLE_BLOCK_VERTICAL = 5
class-attribute
instance-attribute
¶
Assume a single uniform block of vertically aligned text.
SINGLE_CHAR = 10
class-attribute
instance-attribute
¶
Treat the image as a single character.
SINGLE_COLUMN = 4
class-attribute
instance-attribute
¶
Assume a single column of text.
SINGLE_LINE = 7
class-attribute
instance-attribute
¶
Treat the image as a single text line.
SINGLE_WORD = 8
class-attribute
instance-attribute
¶
Treat the image as a single word.
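The member values above match the numbers accepted by Tesseract's `--psm` command-line option. The sketch below mirrors the enum locally (it is not imported from kreuzberg) and shows how a member could be turned into command-line arguments. The `psm_args` helper is hypothetical, and how kreuzberg itself passes the mode to Tesseract is not shown here.

```python
from enum import Enum


class PSMMode(Enum):
    """Local mirror of the values documented above (not imported from kreuzberg)."""

    OSD_ONLY = 0
    AUTO_OSD = 1
    AUTO_ONLY = 2
    AUTO = 3
    SINGLE_COLUMN = 4
    SINGLE_BLOCK_VERTICAL = 5
    SINGLE_BLOCK = 6
    SINGLE_LINE = 7
    SINGLE_WORD = 8
    CIRCLE_WORD = 9
    SINGLE_CHAR = 10


def psm_args(mode: PSMMode) -> list[str]:
    """Build the Tesseract command-line arguments that select this mode."""
    return ["--psm", str(mode.value)]


# A cropped image holding one line of text is a typical case for SINGLE_LINE.
args = psm_args(PSMMode.SINGLE_LINE)
```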
Entity¶
Represents an extracted named entity:
kreuzberg.Entity
dataclass
¶
Source code in kreuzberg/_types.py
Metadata¶
A TypedDict that contains optional metadata fields extracted from documents:
kreuzberg.Metadata
¶
Bases: TypedDict
Source code in kreuzberg/_types.py
Attributes¶
abstract: NotRequired[str]
instance-attribute
¶
Document abstract or summary.
attachments: NotRequired[list[str]]
instance-attribute
¶
List of attachment names.
attributes: NotRequired[dict[str, Any]]
instance-attribute
¶
Additional attributes extracted from structured data (e.g., custom text fields with dotted keys).
authors: NotRequired[list[str]]
instance-attribute
¶
List of document authors.
body: NotRequired[str]
instance-attribute
¶
Body text content.
categories: NotRequired[list[str]]
instance-attribute
¶
Categories or classifications.
citations: NotRequired[list[str]]
instance-attribute
¶
Citation identifiers.
comments: NotRequired[str]
instance-attribute
¶
General comments.
content: NotRequired[str]
instance-attribute
¶
Content metadata field.
copyright: NotRequired[str]
instance-attribute
¶
Copyright information.
created_at: NotRequired[str]
instance-attribute
¶
Creation timestamp in ISO format.
created_by: NotRequired[str]
instance-attribute
¶
Document creator.
date: NotRequired[str]
instance-attribute
¶
Email date or document date.
description: NotRequired[str]
instance-attribute
¶
Document description.
email_bcc: NotRequired[str]
instance-attribute
¶
Email blind carbon copy recipients.
email_cc: NotRequired[str]
instance-attribute
¶
Email carbon copy recipients.
email_from: NotRequired[str]
instance-attribute
¶
Email sender (from field).
email_to: NotRequired[str]
instance-attribute
¶
Email recipient (to field).
error: NotRequired[str]
instance-attribute
¶
Error message if extraction failed.
error_context: NotRequired[dict[str, Any]]
instance-attribute
¶
Error context information for debugging.
extraction_error: NotRequired[dict[str, Any]]
instance-attribute
¶
Error information for critical extraction failures.
fonts: NotRequired[list[str]]
instance-attribute
¶
List of fonts used in the document.
height: NotRequired[int]
instance-attribute
¶
Height of the document page/slide/image, if applicable.
identifier: NotRequired[str]
instance-attribute
¶
Unique document identifier.
image_preprocessing: NotRequired[ImagePreprocessingMetadata]
instance-attribute
¶
Metadata about image preprocessing operations (DPI adjustments, scaling, etc.).
json_schema: NotRequired[dict[str, Any]]
instance-attribute
¶
JSON schema information extracted from structured data.
keywords: NotRequired[list[str]]
instance-attribute
¶
Keywords or tags.
languages: NotRequired[list[str]]
instance-attribute
¶
Document language codes.
license: NotRequired[str]
instance-attribute
¶
License information.
message: NotRequired[str]
instance-attribute
¶
Message or communication content.
modified_at: NotRequired[str]
instance-attribute
¶
Last modification timestamp in ISO format.
modified_by: NotRequired[str]
instance-attribute
¶
Username of last modifier.
name: NotRequired[str]
instance-attribute
¶
Name field from structured data.
note: NotRequired[str]
instance-attribute
¶
Single note or annotation.
notes: NotRequired[list[str]]
instance-attribute
¶
Notes or additional information extracted from documents.
organization: NotRequired[str | list[str]]
instance-attribute
¶
Organizational affiliation.
parse_error: NotRequired[str]
instance-attribute
¶
Parse error information.
processing_errors: NotRequired[list[ProcessingErrorDict]]
instance-attribute
¶
List of processing errors that occurred during extraction.
publisher: NotRequired[str]
instance-attribute
¶
Publisher or organization name.
quality_score: NotRequired[float]
instance-attribute
¶
Quality score for extracted content (0.0-1.0).
references: NotRequired[list[str]]
instance-attribute
¶
Reference entries.
source_format: NotRequired[str]
instance-attribute
¶
Source format of the extracted content.
status: NotRequired[str]
instance-attribute
¶
Document status (e.g., draft, final).
subject: NotRequired[str]
instance-attribute
¶
Document subject or topic.
subtitle: NotRequired[str]
instance-attribute
¶
Document subtitle.
summary: NotRequired[str]
instance-attribute
¶
Document summary.
table_count: NotRequired[int]
instance-attribute
¶
Number of tables extracted from the document.
tables_detected: NotRequired[int]
instance-attribute
¶
Number of tables detected in the document.
tables_summary: NotRequired[str]
instance-attribute
¶
Summary of table extraction results.
text: NotRequired[str]
instance-attribute
¶
Generic text content.
title: NotRequired[str]
instance-attribute
¶
Document title.
token_reduction: NotRequired[dict[str, float]]
instance-attribute
¶
Token reduction statistics including reduction ratios and counts.
version: NotRequired[str]
instance-attribute
¶
Version identifier or revision number.
warning: NotRequired[str]
instance-attribute
¶
Warning messages.
width: NotRequired[int]
instance-attribute
¶
Width of the document page/slide/image, if applicable.
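Because every field is NotRequired, any key may be absent from a given result, so reads should go through `.get()` or an `in` check rather than plain indexing. The sketch below uses a small local stand-in with a few of the fields listed above. It is declared with `total=False`, which also makes every key optional, so that it runs without importing kreuzberg. The `describe` helper is hypothetical.

```python
from typing import TypedDict


class Metadata(TypedDict, total=False):
    """Small stand-in with a few of the fields listed above."""

    title: str
    authors: list[str]
    languages: list[str]
    quality_score: float


def describe(metadata: Metadata) -> str:
    """Build a one-line summary, tolerating any missing key."""
    title = metadata.get("title", "(untitled)")
    authors = ", ".join(metadata.get("authors", [])) or "unknown"
    line = f"{title} by {authors}"
    if "quality_score" in metadata:
        line += f" (quality {metadata['quality_score']:.2f})"
    return line


print(describe({"title": "Report", "authors": ["A", "B"], "quality_score": 0.9}))
print(describe({}))
```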
OutputFormatType¶
The output format for Tesseract OCR processing:
markdown (default): Structured markdown output with preserved formatting
text: Plain text, fastest option
tsv: Tab-separated values with word positions and confidence scores
hocr: HTML-based OCR format with detailed position information