Extraction Pipeline¶
The extraction pipeline is the core workflow that transforms raw files into structured, processed content. Understanding this pipeline is essential for effectively using Kreuzberg and building custom extractors or processors.
Pipeline Overview¶
Every extraction request flows through the same multi-stage pipeline, regardless of file format or configuration:
flowchart TD
Start([File Path or Bytes]) --> Cache{Cache<br/>Enabled?}
Cache -->|Yes| CheckCache{Result<br/>in Cache?}
Cache -->|No| MIME[MIME Type Detection]
CheckCache -->|Found| Return([Return Cached Result])
CheckCache -->|Not Found| MIME
MIME --> Registry[Registry Lookup]
Registry --> Extractor[Format Extractor]
Extractor --> OCR{OCR<br/>Needed?}
OCR -->|Yes| RunOCR[OCR Processing]
OCR -->|No| Pipeline[Post-Processing Pipeline]
RunOCR --> Pipeline
Pipeline --> Validate[1. Validators]
Validate --> Quality[2. Quality Processing]
Quality --> Chunk[3. Chunking]
Chunk --> PostProc[4. Post-Processors]
PostProc --> StoreCache{Cache<br/>Enabled?}
StoreCache -->|Yes| Store[Store in Cache]
StoreCache -->|No| Final([Return ExtractionResult])
Store --> Final
style Start fill:#e1f5ff
style Final fill:#e1f5ff
style Return fill:#c8e6c9
style Extractor fill:#fff9c4
style RunOCR fill:#ffccbc
style Pipeline fill:#f3e5f5
Stage Details¶
1. Cache Check¶
If caching is enabled (cache=True in ExtractionConfig), Kreuzberg first checks for a cached result:
- Cache Key: Generated from file path/bytes + configuration hash
- Cache Hit: Returns the cached ExtractionResult immediately (bypasses all processing)
- Cache Miss: Proceeds to MIME detection
Caching significantly improves performance for repeated extractions of the same file.
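A minimal sketch of enabling the cache (the cache flag on ExtractionConfig is described above; the synchronous extract_file_sync entry point is assumed here):
# Repeated extractions of the same file with the same configuration hit the cache.
from kreuzberg import ExtractionConfig, extract_file_sync

config = ExtractionConfig(cache=True)

# First call runs the full pipeline; the second call returns the cached ExtractionResult.
first = extract_file_sync("report.pdf", config=config)
second = extract_file_sync("report.pdf", config=config)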
2. MIME Type Detection¶
MIME types determine which extractor handles the file. Detection happens in two ways:
- Explicit: If the mime_type parameter is provided, Kreuzberg validates that it is supported
- Automatic: Otherwise, the MIME type is detected from the file extension using an internal mapping (118+ extensions)
flowchart LR
Input[File Input] --> HasMIME{MIME Type<br/>Provided?}
HasMIME -->|Yes| Validate[Validate MIME Type]
HasMIME -->|No| Detect[Detect from Extension]
Detect --> Validate
Validate --> Supported{Supported?}
Supported -->|Yes| Continue[Continue Pipeline]
Supported -->|No| Error([UnsupportedFormat Error])
style Input fill:#e1f5ff
style Continue fill:#c8e6c9
style Error fill:#ffcdd2
Common MIME types include:
- PDFs: application/pdf
- Images: image/jpeg, image/png, etc.
- Office: application/vnd.openxmlformats-officedocument.wordprocessingml.document (DOCX)
- Text: text/plain, text/markdown, application/xml
See the Configuration Guide for MIME type configuration details.
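For files with misleading or missing extensions, the MIME type can be supplied explicitly. A minimal sketch, assuming the mime_type parameter on the extraction entry points described above:
# Extension-less export: skip detection and name the type explicitly.
from kreuzberg import extract_file_sync

result = extract_file_sync("export_without_extension", mime_type="application/pdf")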
3. Registry Lookup¶
The extractor registry maps MIME types to DocumentExtractor implementations:
// Registry maintains: MIME type → Extractor mapping
let registry = get_document_extractor_registry();
let extractor = registry.get("application/pdf")?;
If multiple extractors support the same MIME type, priority determines selection (higher priority wins).
4. Format Extraction¶
The selected extractor processes the file using format-specific logic:
- PDF: Extracts text using pdfium-render, optionally extracts images for OCR
- Excel: Parses sheets with calamine, converts to structured tables
- Images: Loads image data, routes to OCR backend
- XML/Text: Streaming parser for memory efficiency
- Email: Parses MIME structure, extracts body and attachments
- Office: Extracts from DOCX/PPTX using format libraries
Each extractor returns an ExtractionResult containing:
- content: Extracted text
- metadata: File metadata (format-specific), such as page_count, language, etc.
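A short sketch of reading those fields from a returned result (field names as listed above; extract_file_sync is assumed):
from kreuzberg import extract_file_sync

result = extract_file_sync("slides.pptx")
print(result.content[:200])   # extracted text
print(result.metadata)        # format-specific metadata (page count, language, ...)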
5. OCR Processing (Optional)¶
If the file contains images and OCR is enabled, Kreuzberg processes images through the configured OCR backend:
flowchart TD
Check{Has Images<br/>+ OCR Config?} -->|No| Skip[Skip OCR]
Check -->|Yes| ForceOCR{force_ocr<br/>= True?}
ForceOCR -->|Yes| RunOCR[Run OCR on All Pages]
ForceOCR -->|No| HasText{Extracted<br/>Text?}
HasText -->|Yes| Skip
HasText -->|No| RunOCR
RunOCR --> Backend[OCR Backend]
Backend --> Tesseract[Tesseract]
Backend --> EasyOCR[EasyOCR]
Backend --> PaddleOCR[PaddleOCR]
Tesseract --> Combine[Combine with Extracted Text]
EasyOCR --> Combine
PaddleOCR --> Combine
Skip --> Continue([Continue Pipeline])
Combine --> Continue
style Continue fill:#c8e6c9
OCR configuration controls:
- Backend selection: Tesseract (default), EasyOCR, PaddleOCR
- Language: OCR language models to use
- force_ocr: Always run OCR even if text exists
- Caching: OCR results cached separately for performance
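A hedged sketch of forcing OCR through the default Tesseract backend; the force_ocr flag is named above, while the ocr_backend field name is an assumption, so check the OCR Guide for the exact option set:
# Force OCR even when the document already has an extractable text layer.
from kreuzberg import ExtractionConfig, extract_file_sync

config = ExtractionConfig(ocr_backend="tesseract", force_ocr=True)
result = extract_file_sync("scanned_invoice.pdf", config=config)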
6. Post-Processing Pipeline¶
After extraction, results pass through a configurable post-processing pipeline:
6.1 Validators¶
Validators run first and can fail-fast if results don't meet requirements:
# Example: Minimum text length validator
# (ExtractionResult, ExtractionConfig, and ValidationError come from the
#  kreuzberg package; adjust the import to your installed version.)
from kreuzberg import ExtractionConfig, ExtractionResult, ValidationError

class MinLengthValidator:
    def validate(self, result: ExtractionResult, config: ExtractionConfig) -> None:
        # Fail fast when the extracted text is implausibly short
        if len(result.content) < 100:
            raise ValidationError("Extracted text too short")
Validator errors bubble up immediately and stop processing.
6.2 Quality Processing¶
If enable_quality_processing=True, Kreuzberg calculates a quality score based on:
- Text/non-text character ratio
- Word frequency distribution
- Formatting artifacts (repeated characters, etc.)
- Metadata consistency
The quality score is added to metadata.additional["quality_score"].
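A minimal sketch of reading the score back, using the enable_quality_processing flag and the metadata field named above (extract_file_sync is assumed):
from kreuzberg import ExtractionConfig, extract_file_sync

config = ExtractionConfig(enable_quality_processing=True)
result = extract_file_sync("newsletter.pdf", config=config)
print(result.metadata.additional["quality_score"])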
6.3 Chunking¶
If a chunking configuration is provided, the text is split into overlapping chunks. The chunks are added to result.chunks with their start/end offsets.
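A sketch of enabling chunking and iterating the result; the exact option names live in the Configuration Guide, so the fields used below (chunk_content, max_chars, max_overlap) are assumptions:
# Option names are illustrative; check the Configuration Guide for your version.
from kreuzberg import ExtractionConfig, extract_file_sync

config = ExtractionConfig(chunk_content=True, max_chars=1000, max_overlap=100)
result = extract_file_sync("manual.pdf", config=config)

for chunk in result.chunks:
    print(chunk)  # each chunk carries its text and start/end offsets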
6.4 Post-Processors¶
Post-processors run in order by stage (Early → Middle → Late):
# Example: Custom post-processor that masks email addresses
import re

class RedactionProcessor:
    def process(self, result: ExtractionResult, config: ExtractionConfig) -> ExtractionResult:
        # Replace anything that looks like an email address with a redaction marker
        result.content = re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[REDACTED]", result.content)
        return result
Post-processor errors are logged but don't stop the pipeline.
7. Cache Storage¶
If caching is enabled and extraction succeeded, the result is stored in cache for future requests.
8. Result Return¶
The final ExtractionResult is returned to the caller with:
- Extracted and processed content
- Complete metadata
- Optional chunks and embeddings
- Processing history
Performance Optimizations¶
The pipeline includes several optimizations:
- Early cache returns: Cached results skip all processing
- Lazy OCR: Only runs when necessary (no extracted text, or force_ocr is set)
- Streaming parsers: XML and text files stream instead of loading into memory
- Concurrent batch extraction: Multiple files processed in parallel (see the sketch after this list)
- Global Tokio runtime: Shared async runtime eliminates initialization overhead
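Batch extraction fans several files out across the shared runtime. A minimal sketch, assuming the async batch_extract_file helper:
# Files are extracted concurrently on the shared async runtime.
import asyncio

from kreuzberg import batch_extract_file

results = asyncio.run(batch_extract_file(["a.pdf", "b.docx", "c.xlsx"]))
for result in results:
    print(len(result.content))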
Error Handling¶
Errors at different stages are handled differently:
- Validation errors: Fail immediately (invalid file path, unsupported format)
- Extraction errors: Bubble up as ParsingError (corrupted file, format-specific issues)
- System errors: Always bubble up (OSError, RuntimeError, MemoryError)
- Post-processor errors: Logged but don't fail extraction
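A short sketch of handling the first two categories at the call site (exception names as above; import paths may vary by version):
from kreuzberg import ParsingError, ValidationError, extract_file_sync

try:
    result = extract_file_sync("maybe_corrupted.pdf")
except ValidationError:
    print("Invalid input or unsupported format")
except ParsingError:
    print("File could not be parsed (possibly corrupted)")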
See Error Handling for complete error documentation.
Related Documentation¶
- Architecture - Overall system design
- Configuration Guide - How to configure the pipeline
- OCR Guide - Detailed OCR configuration
- Creating Plugins - How to build custom extractors and processors