Automatic Document Classification
Kreuzberg can automatically classify documents into common types like invoices, contracts, and receipts. This allows you to build custom processing pipelines tailored to each document type.
Enabling Document Classification
To enable this feature, set auto_detect_document_type=True in your ExtractionConfig:
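A minimal sketch of enabling classification, assuming the synchronous entry point extract_file_sync and that it accepts a config keyword argument. The helper name classify_file is hypothetical and exists only for illustration.

```python
# Settings taken from the text above; everything else keeps the library defaults.
CLASSIFICATION_SETTINGS = {"auto_detect_document_type": True}


def classify_file(path):
    """Extract a file and return (document_type, type_confidence)."""
    # Imported lazily so this module loads even where kreuzberg is not installed.
    # Assumption: extract_file_sync accepts a `config` keyword argument.
    from kreuzberg import ExtractionConfig, extract_file_sync

    config = ExtractionConfig(**CLASSIFICATION_SETTINGS)
    result = extract_file_sync(path, config=config)
    return result.document_type, result.type_confidence
```

Calling classify_file on a document path returns the two classification fields described under Output; both may be None when no type is detected.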
Classification Modes
You can choose between two classification modes using the document_classification_mode parameter in ExtractionConfig:
"text" (default): This mode uses a rule-based classifier that analyzes the extracted text for keywords and patterns. It's fast and works well for text-based documents.
"vision": This mode uses layout information from OCR to identify document types. It's more accurate for scanned documents and images, but it requires the Tesseract OCR backend.
Here's how to use the vision-based classifier:
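One way to assemble the settings for either mode is sketched below. The helper classification_settings is hypothetical, and selecting Tesseract through ocr_backend="tesseract" is an assumption about the configuration API rather than something stated above.

```python
VALID_MODES = ("text", "vision")


def classification_settings(mode="text"):
    """Build ExtractionConfig keyword arguments for the chosen classifier."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown classification mode: {mode!r}")
    settings = {
        "auto_detect_document_type": True,
        "document_classification_mode": mode,
    }
    if mode == "vision":
        # The vision classifier needs the Tesseract OCR backend.
        # Assumption: the backend is selected with ocr_backend="tesseract".
        settings["ocr_backend"] = "tesseract"
    return settings
```

The resulting dictionary can be unpacked into the configuration, for example ExtractionConfig(**classification_settings("vision")).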
Confidence Threshold
Set type_confidence_threshold in ExtractionConfig to control the minimum confidence a classification must reach to be considered valid. The default value is 0.7.
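As an illustration, a stricter threshold could be prepared as follows. The helper threshold_settings is hypothetical; the range check simply mirrors the fact that confidence scores lie between 0.0 and 1.0.

```python
def threshold_settings(threshold=0.7):
    """Build ExtractionConfig keyword arguments with a custom threshold."""
    # Confidence scores are floats in [0.0, 1.0], so a threshold outside
    # that range could never be meaningful.
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"threshold must be in [0.0, 1.0], got {threshold}")
    return {
        "auto_detect_document_type": True,
        "type_confidence_threshold": threshold,
    }
```

Raising the value, for example threshold_settings(0.9), trades coverage for precision: fewer documents receive a type, but the ones that do are detected with higher confidence.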
Output
The classification results are available in the ExtractionResult object:
document_type: The detected document type (e.g., "invoice", "contract") or None if no type was detected with sufficient confidence.
type_confidence: The confidence score of the detection (a float between 0.0 and 1.0) or None.
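These two fields are enough to drive a per-type pipeline. The sketch below is one possible dispatch pattern, not part of the library: route_document, handlers, and fallback are hypothetical names, and the only thing assumed about the result is that it exposes a document_type attribute.

```python
def route_document(result, handlers, fallback):
    """Send a result to the handler registered for its document type.

    `handlers` maps type names such as "invoice" to callables. `fallback`
    handles results with no detected type or with an unregistered type.
    """
    doc_type = getattr(result, "document_type", None)
    if doc_type is None:
        # No type was detected with sufficient confidence.
        return fallback(result)
    handler = handlers.get(doc_type, fallback)
    return handler(result)
```

Keeping an explicit fallback means documents that miss the confidence threshold still go somewhere deliberate, such as a manual review queue, instead of being dropped silently.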