OCR Configuration¶
Configuration classes for the supported OCR engines.
TesseractConfig¶
Default OCR engine configuration:
kreuzberg.TesseractConfig
dataclass
¶
Bases: ConfigDict
Source code in kreuzberg/_types.py
Attributes¶
classify_use_pre_adapted_templates: bool = True
class-attribute
instance-attribute
¶
Whether to use pre-adapted templates during classification to improve recognition accuracy.
enable_table_detection: bool = False
class-attribute
instance-attribute
¶
Enable table structure detection from TSV output.
language: str = 'eng'
class-attribute
instance-attribute
¶
Language code to use for OCR. Examples:
- 'eng' for English
- 'deu' for German
- multiple languages combined with '+', e.g. 'eng+deu'
language_model_ngram_on: bool = False
class-attribute
instance-attribute
¶
Enable or disable the use of n-gram-based language models for improved text recognition. Default is False for optimal performance on modern documents. Enable for degraded or historical text.
output_format: OutputFormatType = 'markdown'
class-attribute
instance-attribute
¶
Output format: 'markdown' (default), 'text', 'tsv' (for structured data), or 'hocr' (HTML-based).
psm: PSMMode = PSMMode.AUTO
class-attribute
instance-attribute
¶
Page segmentation mode (PSM) to guide Tesseract on how to segment the image (e.g., single block, single line).
table_column_threshold: int = 20
class-attribute
instance-attribute
¶
Pixel threshold for column clustering in table detection.
table_min_confidence: float = 30.0
class-attribute
instance-attribute
¶
Minimum confidence score to include a word in table extraction.
table_row_threshold_ratio: float = 0.5
class-attribute
instance-attribute
¶
Row threshold as ratio of mean text height for table detection.
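The three table thresholds above are easiest to read together. The following is a minimal sketch of how a confidence cutoff and a row threshold derived from mean text height could group words into rows. The `Word` class and `group_rows` helper are hypothetical names introduced for illustration, and the grouping rule is an assumption, not Kreuzberg's actual table-detection code.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical stand-in for one word entry from Tesseract TSV output."""
    text: str
    top: int
    height: int
    conf: float


def group_rows(words, min_confidence=30.0, row_threshold_ratio=0.5):
    """Group words into rows by vertical position (illustrative only)."""
    # Drop low-confidence words first, in the spirit of table_min_confidence.
    kept = [w for w in words if w.conf >= min_confidence]
    if not kept:
        return []
    # The row threshold is a fraction of the mean text height.
    mean_height = sum(w.height for w in kept) / len(kept)
    threshold = mean_height * row_threshold_ratio
    rows = []
    for word in sorted(kept, key=lambda w: w.top):
        # Join the current row if the word is within the threshold of the
        # row's first word; otherwise start a new row.
        if rows and abs(word.top - rows[-1][0].top) <= threshold:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w.text for w in row] for row in rows]


words = [
    Word("Name", top=10, height=20, conf=95.0),
    Word("Age", top=12, height=20, conf=90.0),
    Word("Ada", top=50, height=20, conf=88.0),
    Word("smudge", top=51, height=20, conf=5.0),
]
rows = group_rows(words)
```

With a mean height of 20 pixels and a ratio of 0.5, words whose top edges differ by at most 10 pixels land in the same row, and the low-confidence word is discarded before grouping.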
tessedit_char_whitelist: str = ''
class-attribute
instance-attribute
¶
Whitelist of characters that Tesseract is allowed to recognize. Empty string means no restriction.
tessedit_dont_blkrej_good_wds: bool = True
class-attribute
instance-attribute
¶
If True, prevents block rejection of words identified as good, improving text output quality.
tessedit_dont_rowrej_good_wds: bool = True
class-attribute
instance-attribute
¶
If True, prevents row rejection of words identified as good, avoiding unnecessary omissions.
tessedit_enable_dict_correction: bool = True
class-attribute
instance-attribute
¶
Enable or disable dictionary-based correction for recognized text to improve word accuracy.
tessedit_use_primary_params_model: bool = True
class-attribute
instance-attribute
¶
If True, forces the use of the primary parameters model for text recognition.
textord_space_size_is_variable: bool = True
class-attribute
instance-attribute
¶
Allow variable spacing between words, useful for text with irregular spacing.
thresholding_method: bool = False
class-attribute
instance-attribute
¶
Enable or disable specific thresholding methods during image preprocessing for better OCR accuracy.
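Several of the attributes above correspond to options of the Tesseract command line: `-l` for the language, `--psm` for the page segmentation mode, and `-c name=value` for configuration variables. The sketch below shows one way such values could be turned into command-line arguments. The `build_tesseract_args` helper is hypothetical and does not reflect how Kreuzberg invokes Tesseract internally.

```python
def build_tesseract_args(language="eng", psm=3, variables=None):
    """Build Tesseract CLI options from config-like values (illustrative)."""
    args = ["-l", language, "--psm", str(psm)]
    for name, value in (variables or {}).items():
        # Render booleans as 1/0, which Tesseract accepts for boolean variables.
        if isinstance(value, bool):
            value = int(value)
        args += ["-c", f"{name}={value}"]
    return args


args = build_tesseract_args(
    language="eng+deu",
    psm=6,
    variables={
        "tessedit_char_whitelist": "0123456789",
        "textord_space_size_is_variable": True,
    },
)
```

The resulting list can be appended to a `tesseract <image> <outputbase>` invocation, for example through `subprocess.run`.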
PSMMode¶
Page Segmentation Mode options for Tesseract:
kreuzberg.PSMMode
¶
Bases: Enum
Source code in kreuzberg/_types.py
Attributes¶
AUTO = 3
class-attribute
instance-attribute
¶
Fully automatic page segmentation without OSD (default).
AUTO_ONLY = 2
class-attribute
instance-attribute
¶
Automatic page segmentation without OSD or OCR.
AUTO_OSD = 1
class-attribute
instance-attribute
¶
Automatic page segmentation with orientation and script detection.
CIRCLE_WORD = 9
class-attribute
instance-attribute
¶
Treat the image as a single word in a circle.
OSD_ONLY = 0
class-attribute
instance-attribute
¶
Orientation and script detection only.
SINGLE_BLOCK = 6
class-attribute
instance-attribute
¶
Assume a single uniform block of text.
SINGLE_BLOCK_VERTICAL = 5
class-attribute
instance-attribute
¶
Assume a single uniform block of vertically aligned text.
SINGLE_CHAR = 10
class-attribute
instance-attribute
¶
Treat the image as a single character.
SINGLE_COLUMN = 4
class-attribute
instance-attribute
¶
Assume a single column of text.
SINGLE_LINE = 7
class-attribute
instance-attribute
¶
Treat the image as a single text line.
SINGLE_WORD = 8
class-attribute
instance-attribute
¶
Treat the image as a single word.
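The member values listed above match the integers that Tesseract expects after its `--psm` option. As a small illustration, the sketch below mirrors the documented values in a local enum named `PSMModeSketch` (a stand-in so the example runs without Kreuzberg installed) and converts a member into a command-line flag with a hypothetical `psm_flag` helper.

```python
from enum import Enum


class PSMModeSketch(Enum):
    """Local stand-in that copies the values documented for PSMMode."""
    OSD_ONLY = 0
    AUTO_OSD = 1
    AUTO_ONLY = 2
    AUTO = 3
    SINGLE_COLUMN = 4
    SINGLE_BLOCK_VERTICAL = 5
    SINGLE_BLOCK = 6
    SINGLE_LINE = 7
    SINGLE_WORD = 8
    CIRCLE_WORD = 9
    SINGLE_CHAR = 10


def psm_flag(mode):
    """Return the Tesseract CLI arguments for a segmentation mode."""
    return ["--psm", str(mode.value)]


flag = psm_flag(PSMModeSketch.SINGLE_LINE)
by_value = PSMModeSketch(6)  # standard Enum lookup by value
```

In real code the stand-in would be replaced by `kreuzberg.PSMMode`, which is an `Enum` according to the reference above, so the same `.value` access and lookup by value should apply.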
EasyOCRConfig¶
Configuration for the EasyOCR engine:
kreuzberg.EasyOCRConfig
dataclass
¶
Bases: ConfigDict
Source code in kreuzberg/_types.py
Attributes¶
add_margin: float = 0.1
class-attribute
instance-attribute
¶
Extend bounding boxes in all directions.
adjust_contrast: float = 0.5
class-attribute
instance-attribute
¶
Target contrast level for low contrast text.
beam_width: int = 5
class-attribute
instance-attribute
¶
Beam width for beam search in recognition.
canvas_size: int = 2560
class-attribute
instance-attribute
¶
Maximum image dimension for detection.
contrast_ths: float = 0.1
class-attribute
instance-attribute
¶
Contrast threshold for preprocessing.
decoder: Literal['greedy', 'beamsearch', 'wordbeamsearch'] = 'greedy'
class-attribute
instance-attribute
¶
Decoder method. Options: 'greedy', 'beamsearch', 'wordbeamsearch'.
device: DeviceType = 'auto'
class-attribute
instance-attribute
¶
Device to use for inference. Options: 'cpu', 'cuda', 'mps', 'auto'.
fallback_to_cpu: bool = True
class-attribute
instance-attribute
¶
Whether to fall back to CPU if the requested device is unavailable.
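The interaction between `device` and `fallback_to_cpu` can be sketched as a small selection function. The `resolve_device` helper, the `available` argument, and the preference order used for 'auto' are all assumptions made for illustration; Kreuzberg's real device detection may differ.

```python
def resolve_device(requested, available, fallback_to_cpu=True):
    """Pick an inference device (illustrative; not Kreuzberg's logic)."""
    if requested == "auto":
        # Prefer accelerators in an assumed fixed order, then use the CPU.
        for candidate in ("cuda", "mps"):
            if candidate in available:
                return candidate
        return "cpu"
    if requested == "cpu" or requested in available:
        return requested
    if fallback_to_cpu:
        # The requested accelerator is missing, so degrade gracefully.
        return "cpu"
    raise RuntimeError(f"Device {requested!r} is not available")


on_mac = resolve_device("auto", {"mps"})
degraded = resolve_device("cuda", set())
```

With `fallback_to_cpu=False`, the same call for an unavailable device raises an error instead of silently running on the CPU.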
gpu_memory_limit: float | None = None
class-attribute
instance-attribute
¶
Maximum GPU memory to use in GB. None for no limit.
height_ths: float = 0.5
class-attribute
instance-attribute
¶
Maximum difference in box height for merging.
language: str | list[str] = 'en'
class-attribute
instance-attribute
¶
Language or languages to use for OCR. Can be a single language code (e.g., 'en'), a comma-separated string of language codes (e.g., 'en,ch_sim'), or a list of language codes.
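Because `language` accepts three shapes of input, code that consumes it typically normalizes the value to a list first. The following sketch shows one way to do that; `normalize_languages` is a hypothetical helper and not part of Kreuzberg's public API.

```python
def normalize_languages(language):
    """Return a list of language codes from a str or list input (illustrative)."""
    if isinstance(language, str):
        # A single code and a comma-separated string are handled the same way.
        parts = language.split(",")
    else:
        parts = list(language)
    # Strip whitespace and skip empty entries such as a trailing comma.
    return [part.strip() for part in parts if part.strip()]


single = normalize_languages("en")
combined = normalize_languages("en, ch_sim")
as_list = normalize_languages(["en", "fr"])
```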
link_threshold: float = 0.4
class-attribute
instance-attribute
¶
Link confidence threshold.
low_text: float = 0.4
class-attribute
instance-attribute
¶
Text low-bound score.
mag_ratio: float = 1.0
class-attribute
instance-attribute
¶
Image magnification ratio.
min_size: int = 10
class-attribute
instance-attribute
¶
Minimum text box size in pixels.
rotation_info: list[int] | None = None
class-attribute
instance-attribute
¶
List of angles to try for detection.
slope_ths: float = 0.1
class-attribute
instance-attribute
¶
Maximum slope for merging text boxes.
text_threshold: float = 0.7
class-attribute
instance-attribute
¶
Text confidence threshold.
use_gpu: bool = False
class-attribute
instance-attribute
¶
Whether to use GPU for inference. DEPRECATED: Use 'device' parameter instead.
width_ths: float = 0.5
class-attribute
instance-attribute
¶
Maximum horizontal distance for merging boxes.
x_ths: float = 1.0
class-attribute
instance-attribute
¶
Maximum horizontal distance for paragraph merging.
y_ths: float = 0.5
class-attribute
instance-attribute
¶
Maximum vertical distance for paragraph merging.
ycenter_ths: float = 0.5
class-attribute
instance-attribute
¶
Maximum shift in y direction for merging.
PaddleOCRConfig¶
Configuration for the PaddleOCR engine:
kreuzberg.PaddleOCRConfig
dataclass
¶
Bases: ConfigDict
Source code in kreuzberg/_types.py
Attributes¶
cls_image_shape: str = '3,48,192'
class-attribute
instance-attribute
¶
Image shape for classification algorithm in format 'channels,height,width'.
det_algorithm: Literal['DB', 'EAST', 'SAST', 'PSE', 'FCE', 'PAN', 'CT', 'DB++', 'Layout'] = 'DB'
class-attribute
instance-attribute
¶
Detection algorithm.
det_db_box_thresh: float = 0.5
class-attribute
instance-attribute
¶
DEPRECATED in PaddleOCR 3.2.0+: Use 'text_det_box_thresh' instead. Score threshold for detected boxes.
det_db_thresh: float = 0.3
class-attribute
instance-attribute
¶
DEPRECATED in PaddleOCR 3.2.0+: Use 'text_det_thresh' instead. Binarization threshold for DB output map.
det_db_unclip_ratio: float = 2.0
class-attribute
instance-attribute
¶
DEPRECATED in PaddleOCR 3.2.0+: Use 'text_det_unclip_ratio' instead. Expansion ratio for detected text boxes.
det_east_cover_thresh: float = 0.1
class-attribute
instance-attribute
¶
Score threshold for EAST output boxes.
det_east_nms_thresh: float = 0.2
class-attribute
instance-attribute
¶
NMS threshold for EAST model output boxes.
det_east_score_thresh: float = 0.8
class-attribute
instance-attribute
¶
Binarization threshold for EAST output map.
det_max_side_len: int = 960
class-attribute
instance-attribute
¶
Maximum size of image long side. Images exceeding this will be proportionally resized.
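Proportional resizing means that both dimensions are scaled by the same factor so that the longer side matches the limit. The sketch below illustrates the arithmetic with a hypothetical `limit_long_side` helper; it ignores any extra rounding or alignment that the engine itself may apply.

```python
def limit_long_side(width, height, max_side_len=960):
    """Scale (width, height) so the longer side is at most max_side_len."""
    long_side = max(width, height)
    if long_side <= max_side_len:
        # Small images are left untouched.
        return width, height
    scale = max_side_len / long_side
    return round(width * scale), round(height * scale)


resized = limit_long_side(1920, 1080)
unchanged = limit_long_side(800, 600)
```

A 1920x1080 image has a scale factor of 960 / 1920 = 0.5, which gives 960x540, while an 800x600 image is already within the limit.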
det_model_dir: str | None = None
class-attribute
instance-attribute
¶
Directory for detection model. If None, uses default model location.
device: DeviceType = 'auto'
class-attribute
instance-attribute
¶
Device to use for inference. Options: 'cpu', 'cuda', 'auto'. Note: MPS is not supported by PaddlePaddle.
drop_score: float = 0.5
class-attribute
instance-attribute
¶
Filter recognition results by confidence score. Results below this are discarded.
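As an illustration of the filtering described above, the sketch below keeps only results whose score reaches the threshold. The `(text, score)` tuple layout and the `filter_by_score` helper are assumptions made for the example, not the engine's actual result format.

```python
def filter_by_score(results, drop_score=0.5):
    """Keep (text, score) pairs whose score is at least drop_score."""
    return [(text, score) for text, score in results if score >= drop_score]


results = [("Invoice", 0.98), ("t0tal", 0.42), ("2024", 0.75)]
kept = filter_by_score(results)
```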
enable_mkldnn: bool = False
class-attribute
instance-attribute
¶
Whether to enable MKL-DNN acceleration (Intel CPU only).
fallback_to_cpu: bool = True
class-attribute
instance-attribute
¶
Whether to fall back to CPU if the requested device is unavailable.
gpu_mem: int = 8000
class-attribute
instance-attribute
¶
DEPRECATED in PaddleOCR 3.2.0+: Parameter no longer supported. GPU memory size (in MB) to use for initialization.
gpu_memory_limit: float | None = None
class-attribute
instance-attribute
¶
DEPRECATED in PaddleOCR 3.2.0+: Parameter no longer supported. Maximum GPU memory to use in GB.
language: str = 'en'
class-attribute
instance-attribute
¶
Language to use for OCR.
max_text_length: int = 25
class-attribute
instance-attribute
¶
Maximum text length that the recognition algorithm can recognize.
rec: bool = True
class-attribute
instance-attribute
¶
Enable text recognition when using the ocr() function.
rec_algorithm: Literal['CRNN', 'SRN', 'NRTR', 'SAR', 'SEED', 'SVTR', 'SVTR_LCNet', 'ViTSTR', 'ABINet', 'VisionLAN', 'SPIN', 'RobustScanner', 'RFL'] = 'CRNN'
class-attribute
instance-attribute
¶
Recognition algorithm.
rec_image_shape: str = '3,32,320'
class-attribute
instance-attribute
¶
Image shape for recognition algorithm in format 'channels,height,width'.
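The shape strings used by `cls_image_shape` and `rec_image_shape` are plain comma-separated integers. A minimal sketch for parsing them into a tuple follows; `parse_image_shape` is a hypothetical helper introduced for illustration.

```python
def parse_image_shape(shape):
    """Parse 'channels,height,width' into a tuple of three ints."""
    # Unpacking into three names raises ValueError for malformed input.
    channels, height, width = (int(part) for part in shape.split(","))
    return channels, height, width


rec_shape = parse_image_shape("3,32,320")
cls_shape = parse_image_shape("3,48,192")
```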
rec_model_dir: str | None = None
class-attribute
instance-attribute
¶
Directory for recognition model. If None, uses default model location.
table: bool = True
class-attribute
instance-attribute
¶
Whether to enable table recognition.
text_det_box_thresh: float = 0.5
class-attribute
instance-attribute
¶
Score threshold for detected text boxes (replaces det_db_box_thresh).
text_det_thresh: float = 0.3
class-attribute
instance-attribute
¶
Binarization threshold for text detection output map (replaces det_db_thresh).
text_det_unclip_ratio: float = 2.0
class-attribute
instance-attribute
¶
Expansion ratio for detected text boxes (replaces det_db_unclip_ratio).
use_angle_cls: bool = True
class-attribute
instance-attribute
¶
DEPRECATED in PaddleOCR 3.2.0+: Use 'use_textline_orientation' instead. Whether to use text orientation classification model.
use_gpu: bool = False
class-attribute
instance-attribute
¶
DEPRECATED in PaddleOCR 3.2.0+: Parameter no longer supported. Use hardware acceleration flags instead.
use_space_char: bool = True
class-attribute
instance-attribute
¶
Whether to recognize spaces.
use_textline_orientation: bool = True
class-attribute
instance-attribute
¶
Whether to use text line orientation classification model (replaces use_angle_cls).
use_zero_copy_run: bool = False
class-attribute
instance-attribute
¶
Whether to enable zero_copy_run for inference optimization.
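The deprecation notes above describe four renamed parameters and three that are no longer supported. The sketch below collects those notes into a small migration helper for a plain options dictionary. The `migrate_options` function and the rule that a new-style name takes precedence over its deprecated alias are assumptions made for illustration, not behavior confirmed for Kreuzberg or PaddleOCR.

```python
# Renames taken from the deprecation notes in the attribute list above.
RENAMED = {
    "det_db_box_thresh": "text_det_box_thresh",
    "det_db_thresh": "text_det_thresh",
    "det_db_unclip_ratio": "text_det_unclip_ratio",
    "use_angle_cls": "use_textline_orientation",
}
# Parameters documented above as no longer supported.
REMOVED = {"gpu_mem", "gpu_memory_limit", "use_gpu"}


def migrate_options(options):
    """Translate deprecated option names to their replacements (illustrative)."""
    migrated = {}
    for name, value in options.items():
        if name in REMOVED:
            continue
        new_name = RENAMED.get(name, name)
        # Assumed rule: an explicitly set new-style name wins over its alias.
        if new_name != name and new_name in options:
            continue
        migrated[new_name] = value
    return migrated


old_style = {
    "det_db_thresh": 0.25,
    "use_gpu": True,
    "use_angle_cls": False,
    "language": "en",
}
new_style = migrate_options(old_style)
```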