MIME Type Detection

MIME (Multipurpose Internet Mail Extensions) type detection is the foundation of Kreuzberg's extraction pipeline. The MIME type determines which extractor processes a file, making accurate detection critical for successful extraction.

How MIME Detection Works

Kreuzberg detects MIME types through a two-phase approach:

flowchart TD
    Input[File Input] --> Explicit{MIME Type<br/>Provided?}

    Explicit -->|Yes| Validate[Validate Against<br/>Supported Types]
    Explicit -->|No| Extension[Extract File Extension]

    Extension --> Normalize[Normalize Extension<br/>lowercase, trim]
    Normalize --> Lookup[Lookup in Extension Map]

    Lookup --> Found{Extension<br/>Found?}
    Found -->|Yes| MapMIME[Get MIME Type]
    Found -->|No| Default[Default to<br/>application/octet-stream]

    MapMIME --> Validate
    Default --> Validate

    Validate --> Supported{Supported?}
    Supported -->|Yes| Success([Return MIME Type])
    Supported -->|No| Error([UnsupportedFormat Error])

    style Input fill:#e1f5ff
    style Success fill:#c8e6c9
    style Error fill:#ffcdd2

Phase 1: Extension to MIME Mapping

When no explicit MIME type is provided, Kreuzberg extracts the file extension and looks it up in an internal mapping table:

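// Extract the last extension and normalize it to lowercase
let extension = path.extension()
    .and_then(|e| e.to_str())
    .unwrap_or("")
    .to_lowercase();

// Look up the MIME type; unknown or missing extensions fall back to the
// generic binary type, which Phase 2 validation then accepts or rejects
// (assumes EXT_TO_MIME maps &str keys to &'static str values)
let mime_type = EXT_TO_MIME.get(extension.as_str())
    .copied()
    .unwrap_or("application/octet-stream");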

Extension Normalization:

  • Converted to lowercase (PDF → pdf)
  • Leading dots removed (.txt → txt)
  • Only last extension used (file.tar.gz → gz)
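
The normalization rules above can be sketched in a few lines of Python. This is a minimal illustration rather than Kreuzberg's implementation: `normalize_extension`, `lookup_mime_type`, `DEFAULT_MIME`, and the three-entry `EXT_TO_MIME` subset are hypothetical names chosen for the example, and the fallback to `application/octet-stream` follows the flowchart at the top of this page.

```python
from pathlib import PurePath

# Hypothetical subset of the extension map, taken from the tables on this page.
EXT_TO_MIME = {
    "pdf": "application/pdf",
    "txt": "text/plain",
    "gz": "application/gzip",
}

# Generic binary type used when the extension is missing or unknown.
DEFAULT_MIME = "application/octet-stream"


def normalize_extension(path: str) -> str:
    """Return the last extension, lowercased and without the leading dot."""
    # PurePath.suffix keeps only the final suffix (".gz" for "file.tar.gz").
    return PurePath(path).suffix.lstrip(".").lower()


def lookup_mime_type(path: str) -> str:
    """Map a path to a MIME type, defaulting to the generic binary type."""
    return EXT_TO_MIME.get(normalize_extension(path), DEFAULT_MIME)
```

With this sketch, `lookup_mime_type("Report.PDF")` and `lookup_mime_type("report.pdf")` resolve to the same entry, and a path without an extension falls through to the default.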

Phase 2: Validation

Whether the MIME type was detected or explicitly provided, it must be supported:

pub fn validate_mime_type(mime_type: &str) -> Result<()> {
    if SUPPORTED_TYPES.contains(mime_type) {
        Ok(())
    } else {
        Err(KreuzbergError::UnsupportedFormat {
            mime_type: mime_type.to_string(),
        })
    }
}
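
To show how the two phases fit together, here is a small Python sketch of the detect-or-validate flow. It is an assumption-laden illustration, not the library's code: the `UnsupportedFormat` class, the abbreviated `EXT_TO_MIME` and `SUPPORTED_TYPES` tables, and the function bodies are all stand-ins written for this example.

```python
from pathlib import PurePath
from typing import Optional

# Hypothetical, abbreviated stand-ins for the real lookup tables.
EXT_TO_MIME = {
    "pdf": "application/pdf",
    "md": "text/markdown",
    "txt": "text/plain",
}
SUPPORTED_TYPES = frozenset(EXT_TO_MIME.values())


class UnsupportedFormat(Exception):
    """Illustrative error type; not Kreuzberg's actual exception class."""


def validate_mime_type(mime_type: str) -> str:
    """Phase 2: reject any MIME type that has no registered extractor."""
    if mime_type not in SUPPORTED_TYPES:
        raise UnsupportedFormat(mime_type)
    return mime_type


def detect_or_validate(path: str, mime_type: Optional[str] = None) -> str:
    """Run Phase 1 only when no explicit type is given, then always validate."""
    if mime_type is None:
        # Phase 1: derive the type from the last extension, lowercased.
        extension = PurePath(path).suffix.lstrip(".").lower()
        mime_type = EXT_TO_MIME.get(extension, "application/octet-stream")
    # Detected and explicit types pass through the same check.
    return validate_mime_type(mime_type)
```

Because the explicit type skips Phase 1 entirely, `detect_or_validate("notes.txt", "text/markdown")` returns `text/markdown` even though the extension alone would map to `text/plain`.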

Supported MIME Types

Kreuzberg supports 118+ file extensions across multiple categories:

Documents

| Extension | MIME Type |
|-----------|-----------|
| .pdf | application/pdf |
| .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document |
| .doc | application/msword |
| .odt | application/vnd.oasis.opendocument.text |
| .rtf | application/rtf |

Spreadsheets

| Extension | MIME Type |
|-----------|-----------|
| .xlsx | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet |
| .xls | application/vnd.ms-excel |
| .xlsm | application/vnd.ms-excel.sheet.macroEnabled.12 |
| .xlsb | application/vnd.ms-excel.sheet.binary.macroEnabled.12 |
| .ods | application/vnd.oasis.opendocument.spreadsheet |
| .csv | text/csv |
| .tsv | text/tab-separated-values |

Presentations

| Extension | MIME Type |
|-----------|-----------|
| .pptx | application/vnd.openxmlformats-officedocument.presentationml.presentation |
| .ppt | application/vnd.ms-powerpoint |
| .odp | application/vnd.oasis.opendocument.presentation |

Images

| Extension | MIME Type |
|-----------|-----------|
| .jpg, .jpeg | image/jpeg |
| .png | image/png |
| .gif | image/gif |
| .bmp | image/bmp |
| .tiff, .tif | image/tiff |
| .webp | image/webp |
| .svg | image/svg+xml |

Text and Markup

| Extension | MIME Type |
|-----------|-----------|
| .txt | text/plain |
| .md, .markdown | text/markdown |
| .html, .htm | text/html |
| .xml | application/xml |
| .json | application/json |
| .yaml, .yml | application/x-yaml |
| .toml | application/toml |

Email

| Extension | MIME Type |
|-----------|-----------|
| .eml | message/rfc822 |
| .msg | application/vnd.ms-outlook |

Archives

| Extension | MIME Type |
|-----------|-----------|
| .zip | application/zip |
| .tar | application/x-tar |
| .gz | application/gzip |
| .7z | application/x-7z-compressed |

Ebooks

| Extension | MIME Type |
|-----------|-----------|
| .epub | application/epub+zip |
| .mobi | application/x-mobipocket-ebook |

Fallback Mechanisms

When MIME detection fails or the format is unsupported, Kreuzberg provides fallback options:

flowchart TD
    Unsupported[Unsupported Format] --> Pandoc{Pandoc<br/>Installed?}

    Pandoc -->|Yes| TryConvert[Try Pandoc Conversion]
    Pandoc -->|No| Error1([UnsupportedFormat Error])

    TryConvert --> ConvertOK{Conversion<br/>Successful?}
    ConvertOK -->|Yes| Extract[Extract Converted Content]
    ConvertOK -->|No| Error2([ParsingError])

    Extract --> Success([Return Result])

    style Success fill:#c8e6c9
    style Error1 fill:#ffcdd2
    style Error2 fill:#ffcdd2

Pandoc Fallback:

If Pandoc is installed, Kreuzberg attempts to convert unsupported formats to supported ones:

  • Input: .odt, .rtf, .epub, .rst, and 30+ other formats
  • Conversion: Pandoc converts to HTML or Markdown
  • Extraction: Converted content extracted normally
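
The fallback decision in the flowchart can be approximated as follows. This is a sketch under stated assumptions, not Kreuzberg's internal code: `convert_with_pandoc` and both exception classes are hypothetical, the `which` parameter exists only so the lookup can be swapped out in tests, and the only external behavior relied on is that `pandoc --to markdown <file>` writes the converted document to standard output.

```python
import shutil
import subprocess
from typing import Callable, Optional


class UnsupportedFormat(Exception):
    """Illustrative error; raised when no fallback is available."""


class ParsingError(Exception):
    """Illustrative error; raised when Pandoc fails to convert the file."""


def convert_with_pandoc(
    path: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Convert an otherwise unsupported file to Markdown via the pandoc CLI."""
    # Step 1: is Pandoc installed? shutil.which returns None when it is not.
    pandoc = which("pandoc")
    if pandoc is None:
        raise UnsupportedFormat(f"{path}: unsupported and pandoc is not installed")

    # Step 2: attempt the conversion; pandoc infers the input format
    # from the file extension and prints Markdown to stdout.
    proc = subprocess.run(
        [pandoc, "--to", "markdown", path],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        raise ParsingError(proc.stderr.strip())

    # Step 3: the Markdown text can now go through normal extraction.
    return proc.stdout
```

The two error paths correspond to the two red terminals in the diagram: a missing Pandoc binary surfaces as an unsupported-format error, while a failed conversion surfaces as a parsing error.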

Explicit MIME Type Override:

Users can override auto-detection by providing an explicit MIME type:

# Force treating .txt file as markdown
result = extract_file("notes.txt", mime_type="text/markdown", config=config)

Common MIME Type Constants

Kreuzberg exports commonly used MIME types as constants:

pub const PDF_MIME_TYPE: &str = "application/pdf";
pub const HTML_MIME_TYPE: &str = "text/html";
pub const MARKDOWN_MIME_TYPE: &str = "text/markdown";
pub const PLAIN_TEXT_MIME_TYPE: &str = "text/plain";
pub const EXCEL_MIME_TYPE: &str = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
pub const POWER_POINT_MIME_TYPE: &str = "application/vnd.openxmlformats-officedocument.presentationml.presentation";
pub const DOCX_MIME_TYPE: &str = "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
pub const JSON_MIME_TYPE: &str = "application/json";
pub const XML_MIME_TYPE: &str = "application/xml";

Usage:

from kreuzberg import extract_file, PDF_MIME_TYPE

result = extract_file("document.pdf", mime_type=PDF_MIME_TYPE, config=config)

Detection API

Kreuzberg provides utility functions for MIME type operations:

// Detect MIME type from file path
pub fn detect_mime_type(path: impl AsRef<Path>) -> Result<String>

// Validate MIME type is supported
pub fn validate_mime_type(mime_type: &str) -> Result<()>

// Detect or validate (if provided)
pub fn detect_or_validate(
    path: impl AsRef<Path>,
    mime_type: Option<&str>
) -> Result<String>

Python Example:

from kreuzberg import detect_mime_type, validate_mime_type

# Auto-detect from path
mime = detect_mime_type("document.pdf")
print(mime)  # "application/pdf"

# Validate MIME type
validate_mime_type("application/pdf")  # OK
validate_mime_type("invalid/type")     # Raises UnsupportedFormat

Edge Cases

Multiple Extensions

For files with multiple extensions (e.g., archive.tar.gz), only the last extension is used:

detect_mime_type("file.tar.gz")  # Returns "application/gzip" (from .gz)
detect_mime_type("file.json.txt") # Returns "text/plain" (from .txt)
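
Python's standard `pathlib` module draws the same distinction between the last suffix and the full suffix chain, which makes the behavior easy to inspect. The helper `is_gzipped_tarball` below is a hypothetical caller-side check, shown only as one way to recognize a compound extension before deciding how to handle the file.

```python
from pathlib import PurePath

path = PurePath("archive.tar.gz")

# Every suffix, in order of appearance.
all_suffixes = path.suffixes
# Only the final suffix, which is what extension-based detection keys on.
last_suffix = path.suffix


def is_gzipped_tarball(name: str) -> bool:
    """Hypothetical check for the compound .tar.gz extension (case-insensitive)."""
    return [s.lower() for s in PurePath(name).suffixes[-2:]] == [".tar", ".gz"]
```

Here `all_suffixes` is `[".tar", ".gz"]` while `last_suffix` is `".gz"`, so a lookup keyed on the last suffix never sees the `.tar` part.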

No Extension

Files without extensions default to application/octet-stream (binary data):

detect_mime_type("Makefile")  # Returns "application/octet-stream"

Users must provide an explicit MIME type for extensionless files:

result = extract_file("Makefile", mime_type="text/plain", config=config)
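
When many extensionless files are processed, the explicit type can be chosen programmatically. The sketch below is one possible caller-side approach and is not part of Kreuzberg: `KNOWN_FILENAMES` and `explicit_mime_for` are hypothetical, and the file names in the table are merely common examples.

```python
from pathlib import PurePath
from typing import Optional

# Hypothetical caller-side table of well-known extensionless file names.
KNOWN_FILENAMES = {
    "makefile": "text/plain",
    "dockerfile": "text/plain",
    "license": "text/plain",
    "readme": "text/plain",
}


def explicit_mime_for(path: str) -> Optional[str]:
    """Return a MIME type to pass explicitly, or None to allow auto-detection."""
    p = PurePath(path)
    if p.suffix:
        # The file has an extension; extension-based detection can handle it.
        return None
    # Match on the bare file name, ignoring case and any leading directories.
    return KNOWN_FILENAMES.get(p.name.lower())
```

The returned value can be passed as the `mime_type` argument shown above; a result of `None` means either that auto-detection should run or that the file name is unknown and needs manual handling.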

Case Sensitivity

Extensions are case-insensitive:

detect_mime_type("file.PDF")  # Returns "application/pdf"
detect_mime_type("file.Pdf")  # Returns "application/pdf"
detect_mime_type("file.pdf")  # Returns "application/pdf"