Rust API Reference
Complete reference for the Kreuzberg Rust API.
Installation
Add to your Cargo.toml:
With specific features:
Available features:
- pdf: PDF extraction support (enabled by default)
- ocr: OCR support with Tesseract
- chunking: Text chunking algorithms
- language-detection: Language detection
- keywords-yake: YAKE keyword extraction
- keywords-rake: RAKE keyword extraction
- api: HTTP API server support
- mcp: Model Context Protocol server support
Core Functions
extract_file_sync()
Extract content from a file (synchronous, blocking).
Signature:
pub fn extract_file_sync(
    file_path: impl AsRef<Path>,
    mime_type: Option<&str>,
    config: &ExtractionConfig
) -> Result<ExtractionResult>
Parameters:
- file_path (impl AsRef<Path>): Path to the file to extract
- mime_type (Option<&str>): Optional MIME type hint. If None, the MIME type is auto-detected
- config (&ExtractionConfig): Extraction configuration reference
Returns:
Result<ExtractionResult>: Result containing extraction result or error
Errors:
- KreuzbergError::Io: File system errors (file not found, permission denied, etc.)
- KreuzbergError::Validation: Invalid configuration or file path
- KreuzbergError::Parsing: Document parsing failure
- KreuzbergError::Ocr: OCR processing failure
- KreuzbergError::MissingDependency: Required system dependency not found
Examples:
use kreuzberg::{extract_file_sync, ExtractionConfig};

// Basic extraction with the default configuration.
fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("Content: {}", result.content);
    println!("Pages: {}", result.metadata.page_count.unwrap_or(0));
    Ok(())
}

use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};

// Extraction with OCR configured. With force_ocr set to false,
// OCR is not forced on text-based PDFs.
fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig::default()),
        force_ocr: false,
        ..Default::default()
    };
    let result = extract_file_sync("scanned.pdf", None, &config)?;
    println!("Extracted: {}", result.content);
    Ok(())
}
extract_file()
Extract content from a file (asynchronous).
Signature:
pub async fn extract_file(
    file_path: impl AsRef<Path>,
    mime_type: Option<&str>,
    config: &ExtractionConfig
) -> Result<ExtractionResult>
Parameters:
Same as extract_file_sync().
Returns:
Result<ExtractionResult>: Result containing extraction result or error
Examples:
use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file("document.pdf", None, &config).await?;
    println!("Content: {}", result.content);
    Ok(())
}
extract_bytes_sync()
Extract content from bytes (synchronous, blocking).
Signature:
pub fn extract_bytes_sync(
    data: &[u8],
    mime_type: &str,
    config: &ExtractionConfig
) -> Result<ExtractionResult>
Parameters:
- data (&[u8]): File content as a byte slice
- mime_type (&str): MIME type of the data (required for format detection)
- config (&ExtractionConfig): Extraction configuration reference
Returns:
Result<ExtractionResult>: Result containing extraction result or error
Examples:
use kreuzberg::{extract_bytes_sync, ExtractionConfig};
use std::fs;

fn main() -> kreuzberg::Result<()> {
    // An std::io::Error from fs::read converts into KreuzbergError::Io via `?`.
    let data = fs::read("document.pdf")?;
    let config = ExtractionConfig::default();
    let result = extract_bytes_sync(&data, "application/pdf", &config)?;
    println!("Content: {}", result.content);
    Ok(())
}
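Because extract_bytes_sync() requires an explicit MIME type, callers holding raw bytes need some way to pick one. The following is a minimal sketch, assuming a hypothetical helper named sniff_mime_type that inspects well-known leading signatures ("%PDF-" for PDF, "PK\x03\x04" for ZIP containers). It is an illustration only and not Kreuzberg's own detection logic.

```rust
/// Guess a MIME type from the leading "magic" bytes of a buffer.
/// Hypothetical helper: only two well-known signatures are checked.
fn sniff_mime_type(data: &[u8]) -> Option<&'static str> {
    if data.starts_with(b"%PDF-") {
        Some("application/pdf")
    } else if data.starts_with(b"PK\x03\x04") {
        // ZIP container: DOCX, XLSX, and other formats share this
        // signature, so the result is deliberately generic.
        Some("application/zip")
    } else {
        None
    }
}

fn main() {
    let data: &[u8] = b"%PDF-1.7\n...";
    match sniff_mime_type(data) {
        Some(mime) => println!("detected: {mime}"),
        None => println!("unknown format; supply the MIME type explicitly"),
    }
}
```

The returned string could then be passed as the mime_type argument. ZIP-based office formats would still need further inspection to tell them apart.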
extract_bytes()
Extract content from bytes (asynchronous).
Signature:
pub async fn extract_bytes(
    data: &[u8],
    mime_type: &str,
    config: &ExtractionConfig
) -> Result<ExtractionResult>
Parameters:
Same as extract_bytes_sync().
Returns:
Result<ExtractionResult>: Result containing extraction result or error
batch_extract_file_sync()
Extract content from multiple files in parallel (synchronous, blocking).
Signature:
pub fn batch_extract_file_sync(
    paths: &[impl AsRef<Path>],
    mime_types: Option<&[&str]>,
    config: &ExtractionConfig
) -> Result<Vec<ExtractionResult>>
Parameters:
- paths (&[impl AsRef<Path>]): Slice of file paths to extract
- mime_types (Option<&[&str]>): Optional MIME type hints (must match the length of paths if provided)
- config (&ExtractionConfig): Extraction configuration applied to all files
Returns:
Result<Vec<ExtractionResult>>: Result containing vector of extraction results
Examples:
use kreuzberg::{batch_extract_file_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {
    let paths = ["doc1.pdf", "doc2.docx", "doc3.xlsx"];
    let config = ExtractionConfig::default();
    let results = batch_extract_file_sync(&paths, None, &config)?;
    for (i, result) in results.iter().enumerate() {
        // String::len() counts bytes, so count chars explicitly.
        println!("{}: {} characters", paths[i], result.content.chars().count());
    }
    Ok(())
}
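The requirement that mime_types match the length of paths can be checked before a batch call. The sketch below assumes a hypothetical helper, check_mime_hints, that is not part of Kreuzberg; it only illustrates the length rule stated in the parameter description.

```rust
/// Hypothetical pre-flight check: when hints are supplied, there must be
/// exactly one hint per path.
fn check_mime_hints(paths: &[&str], mime_types: Option<&[&str]>) -> Result<(), String> {
    match mime_types {
        Some(hints) if hints.len() != paths.len() => Err(format!(
            "expected {} MIME type hints, got {}",
            paths.len(),
            hints.len()
        )),
        // Either no hints were given or the lengths agree.
        _ => Ok(()),
    }
}

fn main() {
    let paths = ["doc1.pdf", "doc2.docx"];
    let hints = ["application/pdf"];
    println!("{:?}", check_mime_hints(&paths, None));
    println!("{:?}", check_mime_hints(&paths, Some(&hints[..])));
}
```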
batch_extract_file()
Extract content from multiple files in parallel (asynchronous).
Signature:
pub async fn batch_extract_file(
    paths: &[impl AsRef<Path>],
    mime_types: Option<&[&str]>,
    config: &ExtractionConfig
) -> Result<Vec<ExtractionResult>>
Parameters:
Same as batch_extract_file_sync().
Returns:
Result<Vec<ExtractionResult>>: Result containing vector of extraction results
Examples:
use kreuzberg::{batch_extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"];
    let config = ExtractionConfig::default();
    let results = batch_extract_file(&files, None, &config).await?;
    for result in results {
        println!("{}", result.content);
    }
    Ok(())
}
batch_extract_bytes_sync()
Extract content from multiple byte arrays in parallel (synchronous, blocking).
Signature:
pub fn batch_extract_bytes_sync(
    data_list: &[&[u8]],
    mime_types: &[&str],
    config: &ExtractionConfig
) -> Result<Vec<ExtractionResult>>
Parameters:
- data_list (&[&[u8]]): Slice of file contents as byte slices
- mime_types (&[&str]): Slice of MIME types (must match the length of data_list)
- config (&ExtractionConfig): Extraction configuration applied to all items
Returns:
Result<Vec<ExtractionResult>>: Result containing vector of extraction results
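Callers usually hold owned buffers (Vec<Vec<u8>>), while the signature expects &[&[u8]]. The sketch below shows one way to borrow the buffers into that shape. describe_batch is a hypothetical stand-in with the same parameter shape as batch_extract_bytes_sync(); it only reports sizes and does not call Kreuzberg.

```rust
/// Hypothetical stand-in with the same parameter shape as
/// `batch_extract_bytes_sync`; it reports sizes instead of extracting.
fn describe_batch(data_list: &[&[u8]], mime_types: &[&str]) -> Vec<String> {
    data_list
        .iter()
        .zip(mime_types)
        .map(|(data, mime)| format!("{mime}: {} bytes", data.len()))
        .collect()
}

fn main() {
    // Owned buffers, e.g. read from disk or received over the network.
    let buffers: Vec<Vec<u8>> = vec![b"%PDF-1.7".to_vec(), b"plain text".to_vec()];
    // Borrow each buffer as a byte slice to obtain a `Vec<&[u8]>`.
    let data_list: Vec<&[u8]> = buffers.iter().map(|b| b.as_slice()).collect();
    let mime_types = ["application/pdf", "text/plain"];

    // `&data_list` and `&mime_types` coerce to the slice types expected.
    for line in describe_batch(&data_list, &mime_types) {
        println!("{line}");
    }
}
```

The owned buffers must outlive the borrowed slice list, which is why `buffers` is kept in scope for the duration of the call.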
batch_extract_bytes()
Extract content from multiple byte arrays in parallel (asynchronous).
Signature:
pub async fn batch_extract_bytes(
    data_list: &[&[u8]],
    mime_types: &[&str],
    config: &ExtractionConfig
) -> Result<Vec<ExtractionResult>>
Parameters:
Same as batch_extract_bytes_sync().
Returns:
Result<Vec<ExtractionResult>>: Result containing vector of extraction results
Configuration
ExtractionConfig
Main configuration struct for extraction operations.
Definition:
#[derive(Debug, Clone, Default)]
pub struct ExtractionConfig {
    pub ocr: Option<OcrConfig>,
    pub force_ocr: bool,
    pub pdf_options: Option<PdfConfig>,
    pub chunking: Option<ChunkingConfig>,
    pub language_detection: Option<LanguageDetectionConfig>,
    pub token_reduction: Option<TokenReductionConfig>,
    pub image_extraction: Option<ImageExtractionConfig>,
    pub post_processor: Option<PostProcessorConfig>,
}
Fields:
- ocr (Option<OcrConfig>): OCR configuration. Default: None (no OCR)
- force_ocr (bool): Force OCR even for text-based PDFs. Default: false
- pdf_options (Option<PdfConfig>): PDF-specific configuration. Default: None
- chunking (Option<ChunkingConfig>): Text chunking configuration. Default: None
- language_detection (Option<LanguageDetectionConfig>): Language detection configuration. Default: None
- token_reduction (Option<TokenReductionConfig>): Token reduction configuration. Default: None
- image_extraction (Option<ImageExtractionConfig>): Image extraction from documents. Default: None
- post_processor (Option<PostProcessorConfig>): Post-processing configuration. Default: None
Example:
use kreuzberg::{ExtractionConfig, OcrConfig, PdfConfig};

let config = ExtractionConfig {
    ocr: Some(OcrConfig::default()),
    force_ocr: false,
    pdf_options: Some(PdfConfig {
        passwords: Some(vec!["password1".to_string(), "password2".to_string()]),
        extract_images: true,
        image_dpi: 300,
    }),
    ..Default::default()
};
OcrConfig
OCR processing configuration.
Definition:
#[derive(Debug, Clone)]
pub struct OcrConfig {
    pub backend: String,
    pub language: String,
    pub tesseract_config: Option<TesseractConfig>,
}
Fields:
- backend (String): OCR backend to use. Options: "tesseract". Default: "tesseract"
- language (String): Language code for OCR (ISO 639-3). Default: "eng"
- tesseract_config (Option<TesseractConfig>): Tesseract-specific configuration. Default: None
Example:
use kreuzberg::OcrConfig;

let ocr_config = OcrConfig {
    backend: "tesseract".to_string(),
    language: "eng".to_string(),
    tesseract_config: None,
};
TesseractConfig
Tesseract OCR backend configuration.
Definition:
#[derive(Debug, Clone)]
pub struct TesseractConfig {
    pub psm: i32,
    pub oem: i32,
    pub enable_table_detection: bool,
    pub tessedit_char_whitelist: Option<String>,
    pub tessedit_char_blacklist: Option<String>,
}
Fields:
- psm (i32): Page segmentation mode (0-13). Default: 3 (auto)
- oem (i32): OCR engine mode (0-3). Default: 3 (Tesseract's default mode, which selects the engine based on what is available; LSTM only is mode 1)
- enable_table_detection (bool): Enable table detection and extraction. Default: false
- tessedit_char_whitelist (Option<String>): Character whitelist. Default: None
- tessedit_char_blacklist (Option<String>): Character blacklist. Default: None
Example:
use kreuzberg::{ExtractionConfig, OcrConfig, TesseractConfig};

let config = ExtractionConfig {
    ocr: Some(OcrConfig {
        backend: "tesseract".to_string(),
        language: "eng".to_string(),
        tesseract_config: Some(TesseractConfig {
            psm: 6,
            oem: 3,
            enable_table_detection: true,
            tessedit_char_whitelist: Some("0123456789".to_string()),
            tessedit_char_blacklist: None,
        }),
    }),
    ..Default::default()
};
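Since psm and oem are plain i32 fields, out-of-range values can be constructed freely. The sketch below shows a range check against the documented intervals (0-13 and 0-3). TesseractModes and validate_modes are hypothetical names introduced for illustration; whether Kreuzberg performs an equivalent check internally is not stated here.

```rust
/// Local stand-in mirroring the two mode fields of `TesseractConfig`.
struct TesseractModes {
    psm: i32,
    oem: i32,
}

/// Hypothetical validation against the documented ranges.
fn validate_modes(modes: &TesseractModes) -> Result<(), String> {
    if !(0..=13).contains(&modes.psm) {
        return Err(format!("psm must be in 0..=13, got {}", modes.psm));
    }
    if !(0..=3).contains(&modes.oem) {
        return Err(format!("oem must be in 0..=3, got {}", modes.oem));
    }
    Ok(())
}

fn main() {
    let good = TesseractModes { psm: 6, oem: 3 };
    let bad = TesseractModes { psm: 14, oem: 3 };
    println!("{:?}", validate_modes(&good));
    println!("{:?}", validate_modes(&bad));
}
```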
PdfConfig
PDF-specific configuration.
Definition:
#[derive(Debug, Clone, Default)]
pub struct PdfConfig {
    pub passwords: Option<Vec<String>>,
    pub extract_images: bool,
    pub image_dpi: u32,
}
Fields:
- passwords (Option<Vec<String>>): List of passwords to try for encrypted PDFs. Default: None
- extract_images (bool): Extract images from PDF. Default: false
- image_dpi (u32): DPI for image extraction. Default: 300
Example:
use kreuzberg::PdfConfig;

let pdf_config = PdfConfig {
    passwords: Some(vec!["password1".to_string(), "password2".to_string()]),
    extract_images: true,
    image_dpi: 300,
};
ChunkingConfig
Text chunking configuration for splitting long documents.
Definition:
#[derive(Debug, Clone)]
pub struct ChunkingConfig {
    pub chunk_size: usize,
    pub chunk_overlap: usize,
    pub chunking_strategy: String,
}
Fields:
- chunk_size (usize): Maximum chunk size in tokens. Default: 512
- chunk_overlap (usize): Overlap between chunks in tokens. Default: 50
- chunking_strategy (String): Chunking strategy. Options: "fixed", "semantic". Default: "fixed"
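To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-size chunking with overlap. It assumes that tokens can be approximated by whitespace-separated words and that "fixed" means a sliding window advancing by chunk_size minus chunk_overlap; chunk_fixed is a hypothetical function, and Kreuzberg's actual tokenization and strategy may differ.

```rust
/// Split `tokens` into windows of at most `chunk_size` tokens, where each
/// window repeats the last `chunk_overlap` tokens of the previous one.
fn chunk_fixed(tokens: &[&str], chunk_size: usize, chunk_overlap: usize) -> Vec<Vec<String>> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    assert!(chunk_overlap < chunk_size, "overlap must be smaller than chunk size");
    let step = chunk_size - chunk_overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < tokens.len() {
        let end = (start + chunk_size).min(tokens.len());
        chunks.push(tokens[start..end].iter().map(|t| t.to_string()).collect());
        if end == tokens.len() {
            break; // the final window reached the end of the input
        }
        start += step;
    }
    chunks
}

fn main() {
    let text = "a b c d e f g h i j";
    // Assumption: whitespace-separated words stand in for tokens.
    let tokens: Vec<&str> = text.split_whitespace().collect();
    for (i, chunk) in chunk_fixed(&tokens, 4, 1).iter().enumerate() {
        println!("chunk {i}: {}", chunk.join(" "));
    }
}
```

With ten tokens, a chunk size of 4, and an overlap of 1, the windows start at token indices 0, 3, and 6, so each chunk begins with the last token of the previous one.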
LanguageDetectionConfig
Language detection configuration.
Definition:
#[derive(Debug, Clone)]
pub struct LanguageDetectionConfig {
    pub enabled: bool,
    pub confidence_threshold: f64,
}
Fields:
- enabled (bool): Enable language detection. Default: true
- confidence_threshold (f64): Minimum confidence threshold (0.0-1.0). Default: 0.5
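The effect of confidence_threshold can be pictured as a filter over scored candidates. The sketch below assumes detection yields (language, confidence) pairs and that the threshold is inclusive; both are assumptions made for illustration, and filter_by_confidence is a hypothetical function rather than part of the Kreuzberg API.

```rust
/// Keep only the candidate languages whose confidence reaches the threshold.
/// Assumption: the comparison is inclusive (`>=`).
fn filter_by_confidence(candidates: &[(&str, f64)], threshold: f64) -> Vec<String> {
    candidates
        .iter()
        .filter(|(_, confidence)| *confidence >= threshold)
        .map(|(lang, _)| lang.to_string())
        .collect()
}

fn main() {
    // Hypothetical detector output for a mostly English document.
    let candidates = [("en", 0.92), ("de", 0.31), ("nl", 0.55)];
    let kept = filter_by_confidence(&candidates, 0.5);
    println!("kept: {}", kept.join(", "));
}
```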
Results & Types
ExtractionResult
Result struct returned by all extraction functions.
Definition:
#[derive(Debug, Clone)]
pub struct ExtractionResult {
    pub content: String,
    pub mime_type: String,
    pub metadata: Metadata,
    pub tables: Vec<Table>,
    pub detected_languages: Option<Vec<String>>,
}
Fields:
- content (String): Extracted text content
- mime_type (String): MIME type of the processed document
- metadata (Metadata): Document metadata (format-specific fields)
- tables (Vec<Table>): Vector of extracted tables
- detected_languages (Option<Vec<String>>): Vector of detected language codes (ISO 639-1) if language detection is enabled
Example:
use kreuzberg::{extract_file_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("Content: {}", result.content);
    println!("MIME type: {}", result.mime_type);
    println!("Tables: {}", result.tables.len());
    if let Some(langs) = result.detected_languages {
        println!("Languages: {}", langs.join(", "));
    }
    Ok(())
}
Metadata
Document metadata with format-specific fields.
Definition:
#[derive(Debug, Clone, Default)]
pub struct Metadata {
    // Common fields
    pub language: Option<String>,
    pub date: Option<String>,
    pub subject: Option<String>,
    pub format_type: Option<String>,
    // PDF-specific fields
    pub title: Option<String>,
    pub author: Option<String>,
    pub page_count: Option<usize>,
    pub creation_date: Option<String>,
    pub modification_date: Option<String>,
    pub creator: Option<String>,
    pub producer: Option<String>,
    pub keywords: Option<String>,
    // Additional fields via HashMap
    pub extra: HashMap<String, serde_json::Value>,
}
Example:
let result = extract_file_sync("document.pdf", None, &config)?;
let metadata = &result.metadata;
if metadata.format_type.as_deref() == Some("pdf") {
    if let Some(title) = &metadata.title {
        println!("Title: {}", title);
    }
    if let Some(pages) = metadata.page_count {
        println!("Pages: {}", pages);
    }
}
See the Types Reference for complete metadata field documentation.
Table
Extracted table structure.
Definition:
#[derive(Debug, Clone)]
pub struct Table {
    pub cells: Vec<Vec<String>>,
    pub markdown: String,
    pub page_number: usize,
}
Fields:
- cells (Vec<Vec<String>>): 2D vector of table cells (rows x columns)
- markdown (String): Table rendered as markdown
- page_number (usize): Page number where the table was found
Example:
let result = extract_file_sync("invoice.pdf", None, &config)?;
for table in &result.tables {
    println!("Table on page {}:", table.page_number);
    println!("{}", table.markdown);
    println!();
}
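The relationship between cells and markdown can be illustrated by rendering a row-major grid as a pipe table. The sketch below assumes the first row is treated as the header; cells_to_markdown is a hypothetical function, and the exact markdown Kreuzberg emits may be formatted differently.

```rust
/// Render a row-major grid of cells as a markdown pipe table.
/// Assumption: the first row serves as the header row.
fn cells_to_markdown(cells: &[Vec<String>]) -> String {
    let mut lines = Vec::new();
    for (i, row) in cells.iter().enumerate() {
        lines.push(format!("| {} |", row.join(" | ")));
        if i == 0 {
            // Separator between the header and the body rows.
            let separator = vec!["---"; row.len()];
            lines.push(format!("| {} |", separator.join(" | ")));
        }
    }
    lines.join("\n")
}

fn main() {
    let cells = vec![
        vec!["Item".to_string(), "Qty".to_string()],
        vec!["Pen".to_string(), "2".to_string()],
    ];
    println!("{}", cells_to_markdown(&cells));
}
```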
Error Handling
KreuzbergError
All errors are returned as variants of the KreuzbergError enum.
Definition:
#[derive(Debug, thiserror::Error)]
pub enum KreuzbergError {
    #[error("IO error: {0}")]
    Io(#[from] std::io::Error),
    #[error("Validation error: {0}")]
    Validation(String),
    #[error("Parsing error: {0}")]
    Parsing(String),
    #[error("OCR error: {0}")]
    Ocr(String),
    #[error("Missing dependency: {0}")]
    MissingDependency(String),
    // ... additional variants
}
Error Handling:
use kreuzberg::{extract_file_sync, ExtractionConfig, KreuzbergError};

fn process_file(path: &str) -> kreuzberg::Result<String> {
    let config = ExtractionConfig::default();
    match extract_file_sync(path, None, &config) {
        Ok(result) => Ok(result.content),
        Err(KreuzbergError::Io(e)) => {
            eprintln!("File system error: {}", e);
            Err(KreuzbergError::Io(e))
        }
        Err(KreuzbergError::Validation(msg)) => {
            eprintln!("Invalid input: {}", msg);
            Err(KreuzbergError::Validation(msg))
        }
        Err(KreuzbergError::Parsing(msg)) => {
            eprintln!("Failed to parse document: {}", msg);
            Err(KreuzbergError::Parsing(msg))
        }
        // All remaining variants are passed through unchanged.
        Err(e) => Err(e),
    }
}
Using the ? operator:
use kreuzberg::{extract_file_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}
See Error Handling Reference for detailed error documentation.
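The #[from] attribute on the Io variant is what lets the ? operator turn a std::io::Error into a KreuzbergError automatically. The sketch below reproduces that mechanism with the standard library only: ExtractError is a local stand-in enum with a hand-written From impl, roughly what thiserror would generate, and read_non_empty is a hypothetical function used purely for demonstration.

```rust
use std::{fmt, fs, io};

/// Local stand-in for an error enum with an IO variant.
#[derive(Debug)]
enum ExtractError {
    Io(io::Error),
    Validation(String),
}

impl fmt::Display for ExtractError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ExtractError::Io(e) => write!(f, "IO error: {e}"),
            ExtractError::Validation(msg) => write!(f, "Validation error: {msg}"),
        }
    }
}

impl std::error::Error for ExtractError {}

// Written by hand here; `#[from]` generates an equivalent conversion.
impl From<io::Error> for ExtractError {
    fn from(e: io::Error) -> Self {
        ExtractError::Io(e)
    }
}

/// Hypothetical helper: read a file and reject empty content.
fn read_non_empty(path: &str) -> Result<Vec<u8>, ExtractError> {
    let data = fs::read(path)?; // io::Error becomes ExtractError::Io via From
    if data.is_empty() {
        return Err(ExtractError::Validation(format!("{path} is empty")));
    }
    Ok(data)
}

fn main() {
    match read_non_empty("definitely-missing-file.pdf") {
        Ok(data) => println!("read {} bytes", data.len()),
        Err(ExtractError::Io(e)) => eprintln!("file system error: {e}"),
        Err(e) => eprintln!("{e}"),
    }
}
```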
Plugin System
Document Extractors
Register custom document extractors for new file formats.
Trait:
#[async_trait]
pub trait DocumentExtractor: Send + Sync {
    fn name(&self) -> &str;
    fn mime_types(&self) -> &[&str];
    fn priority(&self) -> i32;
    async fn extract(
        &self,
        data: &[u8],
        mime_type: &str,
        config: &ExtractionConfig
    ) -> Result<ExtractionResult>;
}
Registration:
use kreuzberg::plugins::registry::get_document_extractor_registry;
use std::sync::Arc;

let registry = get_document_extractor_registry();
registry.register("custom", Arc::new(MyCustomExtractor))?;
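How a registry might use mime_types() and priority() to choose an extractor can be sketched with the standard library alone. Everything below is a simplified stand-in: the Extractor trait omits the async extract method, Registry is a hypothetical type, and the rule that the highest priority wins is an assumption, since the selection order is not specified here.

```rust
use std::sync::Arc;

/// Simplified, synchronous stand-in for `DocumentExtractor`.
trait Extractor: Send + Sync {
    fn name(&self) -> &str;
    fn mime_types(&self) -> &[&str];
    fn priority(&self) -> i32;
}

struct CsvExtractor;
impl Extractor for CsvExtractor {
    fn name(&self) -> &str { "csv" }
    fn mime_types(&self) -> &[&str] { &["text/csv"] }
    fn priority(&self) -> i32 { 50 }
}

struct FastCsvExtractor;
impl Extractor for FastCsvExtractor {
    fn name(&self) -> &str { "fast-csv" }
    fn mime_types(&self) -> &[&str] { &["text/csv"] }
    fn priority(&self) -> i32 { 100 }
}

/// Hypothetical registry holding shared extractor instances.
#[derive(Default)]
struct Registry {
    extractors: Vec<Arc<dyn Extractor>>,
}

impl Registry {
    fn register(&mut self, extractor: Arc<dyn Extractor>) {
        self.extractors.push(extractor);
    }

    /// Assumption: among extractors supporting `mime`, the highest
    /// priority value is preferred.
    fn find(&self, mime: &str) -> Option<&Arc<dyn Extractor>> {
        self.extractors
            .iter()
            .filter(|e| e.mime_types().contains(&mime))
            .max_by_key(|e| e.priority())
    }
}

fn main() {
    let mut registry = Registry::default();
    registry.register(Arc::new(CsvExtractor));
    registry.register(Arc::new(FastCsvExtractor));
    match registry.find("text/csv") {
        Some(e) => println!("selected: {}", e.name()),
        None => println!("no extractor registered"),
    }
}
```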
MIME Type Detection
detect_mime_type()
Detect MIME type from file path.
Signature:
Example:
use kreuzberg::detect_mime_type;

let mime_type = detect_mime_type("document.pdf")?;
println!("MIME type: {}", mime_type); // "application/pdf"
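Path-based detection is commonly driven by the file extension. The following sketch shows that idea with a small lookup table; guess_mime_from_extension is a hypothetical function covering only a few formats, and it makes no claim about how detect_mime_type() is implemented.

```rust
use std::path::Path;

/// Map a file extension (case-insensitive) to a MIME type.
/// Hypothetical helper covering only a handful of formats.
fn guess_mime_from_extension(path: &str) -> Option<&'static str> {
    let ext = Path::new(path).extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "pdf" => Some("application/pdf"),
        "txt" => Some("text/plain"),
        "html" | "htm" => Some("text/html"),
        "docx" => {
            Some("application/vnd.openxmlformats-officedocument.wordprocessingml.document")
        }
        _ => None,
    }
}

fn main() {
    for path in ["document.pdf", "notes.TXT", "archive.xyz", "README"] {
        match guess_mime_from_extension(path) {
            Some(mime) => println!("{path}: {mime}"),
            None => println!("{path}: unknown"),
        }
    }
}
```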
validate_mime_type()
Validate if a MIME type is supported.
Signature:
Example:
use kreuzberg::validate_mime_type;

if validate_mime_type("application/pdf") {
    println!("PDF is supported");
}
Complete Documentation
For complete Rust API documentation with all types, traits, and functions:
Or visit docs.rs/kreuzberg