C# Bindings for Kreuzberg¶
High-performance document intelligence for .NET applications. Extract text, metadata, and structured information from PDFs, Office documents, images, and 50+ formats.
Powered by a Rust core – Native performance for document extraction with P/Invoke interoperability.
Version 4.0.0 Release Candidate This is a pre-release version. We invite you to test the library and report any issues you encounter.
Installation¶
Install the Kreuzberg package (published under the kreuzberg.dev organization):
System Requirements¶
- .NET 6.0 or higher
- Windows, macOS, or Linux
Optional System Dependencies¶
Tesseract OCR (Required for OCR functionality):
# macOS
brew install tesseract
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
# Windows
# Download from https://github.com/tesseract-ocr/tesseract/wiki/Downloads
LibreOffice (Optional, for legacy Office formats .doc, .ppt):
Quick Start¶
Simple File Extraction¶
using Kreuzberg;
var result = KreuzbergClient.ExtractFileSync("document.pdf");
Console.WriteLine(result.Content);
Extract with Configuration¶
using Kreuzberg;
var config = new ExtractionConfig
{
UseCache = true,
EnableQualityProcessing = true
};
var result = KreuzbergClient.ExtractFileSync("document.pdf", config);
Console.WriteLine($"Content: {result.Content}");
Console.WriteLine($"MIME Type: {result.MimeType}");
Console.WriteLine($"Success: {result.Success}");
Async Extraction¶
using Kreuzberg;
var result = await KreuzbergClient.ExtractFileAsync("document.pdf");
Console.WriteLine(result.Content);
Batch Processing¶
using Kreuzberg;
var files = new[] { "doc1.pdf", "doc2.docx", "doc3.xlsx" };
var results = KreuzbergClient.BatchExtractFilesSync(files);
foreach (var result in results)
{
Console.WriteLine($"{result.MimeType}: {result.Content.Length} characters");
}
Configuration¶
Basic Configuration¶
using Kreuzberg;
var config = new ExtractionConfig
{
UseCache = true,
EnableQualityProcessing = true,
ForceOcr = false
};
var result = KreuzbergClient.ExtractFileSync("document.pdf", config);
OCR Configuration¶
using Kreuzberg;
var config = new ExtractionConfig
{
Ocr = new OcrConfig
{
Backend = "tesseract",
Language = "eng",
TesseractConfig = new TesseractConfig
{
Psm = 6,
EnableTableDetection = true,
MinConfidence = 50.0
}
}
};
var result = KreuzbergClient.ExtractFileSync("scanned.pdf", config);
Table Extraction with Tesseract¶
using Kreuzberg;
var config = new ExtractionConfig
{
Ocr = new OcrConfig
{
Backend = "tesseract",
TesseractConfig = new TesseractConfig
{
EnableTableDetection = true
}
}
};
var result = KreuzbergClient.ExtractFileSync("invoice.pdf", config);
foreach (var table in result.Tables)
{
Console.WriteLine(table.Markdown);
Console.WriteLine($"Page: {table.PageNumber}");
}
Complete Configuration Example¶
using Kreuzberg;
var config = new ExtractionConfig
{
UseCache = true,
EnableQualityProcessing = true,
Ocr = new OcrConfig
{
Backend = "tesseract",
Language = "eng",
TesseractConfig = new TesseractConfig
{
Psm = 6,
EnableTableDetection = true,
MinConfidence = 50.0
}
},
ForceOcr = false,
Chunking = new ChunkingConfig
{
MaxChars = 1000,
MaxOverlap = 200
},
Images = new ImageExtractionConfig
{
ExtractImages = true,
TargetDpi = 300,
MaxImageDimension = 4096,
AutoAdjustDpi = true
},
PdfOptions = new PdfConfig
{
ExtractImages = true,
Passwords = new List<string> { "password1", "password2" },
ExtractMetadata = true
},
TokenReduction = new TokenReductionConfig
{
Mode = "moderate",
PreserveImportantWords = true
},
LanguageDetection = new LanguageDetectionConfig
{
Enabled = true,
MinConfidence = 0.8,
DetectMultiple = false
}
};
var result = KreuzbergClient.ExtractFileSync("document.pdf", config);
Loading Configuration from File¶
using Kreuzberg;
// Discover configuration by walking up directory tree
var discoveredConfig = KreuzbergClient.DiscoverExtractionConfig();
// Load from explicit path
var config = KreuzbergClient.LoadExtractionConfigFromFile("config/extraction.toml");
var result = KreuzbergClient.ExtractFileSync("document.pdf", config);
Extract from Bytes¶
using Kreuzberg;
var data = File.ReadAllBytes("document.pdf");
var result = KreuzbergClient.ExtractBytesSync(data, "application/pdf");
Console.WriteLine(result.Content);
Batch Extract from Bytes¶
using Kreuzberg;
var items = new[]
{
new BytesWithMime(File.ReadAllBytes("doc1.pdf"), "application/pdf"),
new BytesWithMime(File.ReadAllBytes("doc2.docx"), "application/vnd.openxmlformats-officedocument.wordprocessingml.document")
};
var results = KreuzbergClient.BatchExtractBytesSync(items);
MIME Type Detection¶
Detect from File¶
using Kreuzberg;
var mimeType = KreuzbergClient.DetectMimeTypeFromPath("document.pdf");
Console.WriteLine(mimeType);
// Reverse lookup: get file extensions for a MIME type
var extensions = KreuzbergClient.GetExtensionsForMime("application/pdf");
Console.WriteLine(string.Join(", ", extensions));
Detect from Bytes¶
using Kreuzberg;
var data = File.ReadAllBytes("document");
var mimeType = KreuzbergClient.DetectMimeType(data);
Console.WriteLine(mimeType);
Metadata Extraction¶
using Kreuzberg;
var result = KreuzbergClient.ExtractFileSync("document.pdf");
// Access detected language
if (result.Metadata.Language != null)
{
Console.WriteLine($"Language: {result.Metadata.Language}");
}
// Access PDF-specific metadata
if (result.Metadata.Format?.Pdf != null)
{
var pdf = result.Metadata.Format.Pdf;
Console.WriteLine($"Title: {pdf.Title}");
Console.WriteLine($"Author: {pdf.Author}");
Console.WriteLine($"Pages: {pdf.PageCount}");
Console.WriteLine($"Created: {pdf.CreationDate}");
}
// Access detected format type
Console.WriteLine($"Format: {result.Metadata.FormatType}");
Password-Protected PDFs¶
using Kreuzberg;
var config = new ExtractionConfig
{
PdfOptions = new PdfConfig
{
Passwords = new List<string> { "password1", "password2", "password3" }
}
};
var result = KreuzbergClient.ExtractFileSync("protected.pdf", config);
Language Detection¶
using Kreuzberg;
var config = new ExtractionConfig
{
LanguageDetection = new LanguageDetectionConfig
{
Enabled = true,
MinConfidence = 0.8
}
};
var result = KreuzbergClient.ExtractFileSync("multilingual.pdf", config);
if (result.DetectedLanguages != null)
{
foreach (var lang in result.DetectedLanguages)
{
Console.WriteLine(lang);
}
}
Text Chunking¶
using Kreuzberg;
var config = new ExtractionConfig
{
Chunking = new ChunkingConfig
{
MaxChars = 1000,
MaxOverlap = 200
}
};
var result = KreuzbergClient.ExtractFileSync("long_document.pdf", config);
if (result.Chunks != null)
{
foreach (var chunk in result.Chunks)
{
Console.WriteLine($"Chunk {chunk.Metadata.ChunkIndex + 1}/{chunk.Metadata.TotalChunks}");
Console.WriteLine($"Tokens: {chunk.Metadata.TokenCount ?? 0}");
Console.WriteLine(chunk.Content);
}
}
Image Extraction¶
using Kreuzberg;
var result = KreuzbergClient.ExtractFileSync("document.pdf");
if (result.Images != null && result.Images.Count > 0)
{
Console.WriteLine($"Extracted {result.Images.Count} images");
foreach (var image in result.Images)
{
Console.WriteLine($"Format: {image.Format}, Page: {image.PageNumber}");
File.WriteAllBytes($"image_{image.ImageIndex}.{image.Format.ToLower()}", image.Data);
}
}
Embedding Presets¶
List Available Presets¶
using Kreuzberg;
var presets = KreuzbergClient.ListEmbeddingPresets();
Console.WriteLine($"Available presets: {string.Join(", ", presets)}");
Get Specific Preset¶
using Kreuzberg;
var preset = KreuzbergClient.GetEmbeddingPreset("default");
if (preset != null)
{
Console.WriteLine($"Preset: {preset.Name}");
Console.WriteLine($"Chunk Size: {preset.ChunkSize}");
Console.WriteLine($"Model: {preset.ModelName}");
Console.WriteLine($"Dimensions: {preset.Dimensions}");
}
Custom Post-Processors¶
using Kreuzberg;
public class CustomProcessor : IPostProcessor
{
public string Name => "custom_processor";
public int Priority => 10;
public ExtractionResult Process(ExtractionResult result)
{
// Transform extracted content
result.Content = result.Content.ToUpper();
return result;
}
}
// Register the post-processor
KreuzbergClient.RegisterPostProcessor(new CustomProcessor());
// Processor runs automatically on extraction
var result = KreuzbergClient.ExtractFileSync("document.pdf");
// Query registered processors
var processors = KreuzbergClient.ListPostProcessors();
Console.WriteLine($"Registered: {string.Join(", ", processors)}");
// Remove processor when no longer needed
KreuzbergClient.UnregisterPostProcessor("custom_processor");
Custom Validators¶
using Kreuzberg;
public class CustomValidator : IValidator
{
public string Name => "custom_validator";
public int Priority => 10;
public void Validate(ExtractionResult result)
{
if (string.IsNullOrWhiteSpace(result.Content))
{
throw new InvalidOperationException("Content cannot be empty");
}
}
}
// Register the validator
KreuzbergClient.RegisterValidator(new CustomValidator());
// Validator runs automatically during extraction
var result = KreuzbergClient.ExtractFileSync("document.pdf");
Custom OCR Backends¶
using Kreuzberg;
public class CustomOcrBackend : IOcrBackend
{
public string Name => "custom_ocr";
public string Process(ReadOnlySpan<byte> imageBytes, OcrConfig? config)
{
// Implement OCR processing and return JSON-formatted result
return "{}";
}
}
// Register the custom OCR backend
KreuzbergClient.RegisterOcrBackend(new CustomOcrBackend());
// Configure extraction to use custom backend
var extractConfig = new ExtractionConfig
{
Ocr = new OcrConfig { Backend = "custom_ocr" }
};
var result = KreuzbergClient.ExtractFileSync("document.pdf", extractConfig);
Exception Handling¶
using Kreuzberg;
using System;
try
{
var result = KreuzbergClient.ExtractFileSync("document.pdf");
}
catch (KreuzbergValidationException ex)
{
Console.WriteLine($"Configuration validation error: {ex.Message}");
}
catch (KreuzbergParsingException ex)
{
Console.WriteLine($"Document parsing failed: {ex.Message}");
}
catch (KreuzbergOcrException ex)
{
Console.WriteLine($"OCR processing failed: {ex.Message}");
}
catch (KreuzbergMissingDependencyException ex)
{
Console.WriteLine($"Missing optional dependency: {ex.Message}");
}
catch (KreuzbergException ex)
{
Console.WriteLine($"General Kreuzberg error: {ex.Message}");
}
Exception Hierarchy¶
- KreuzbergException - Base exception for all Kreuzberg errors
- KreuzbergValidationException - Invalid configuration or input
- KreuzbergParsingException - Document parsing failure
- KreuzbergOcrException - OCR processing failure
- KreuzbergMissingDependencyException - Missing optional dependency
- KreuzbergSerializationException - JSON serialization failure
API Reference¶
Extraction Methods¶
ExtractFileSync(string path, ExtractionConfig? config = null)- Synchronous file extractionExtractFileAsync(string path, ExtractionConfig? config = null, CancellationToken cancellationToken = default)- Asynchronous file extractionExtractBytesSync(ReadOnlySpan<byte> data, string mimeType, ExtractionConfig? config = null)- Synchronous bytes extractionExtractBytesAsync(byte[] data, string mimeType, ExtractionConfig? config = null, CancellationToken cancellationToken = default)- Asynchronous bytes extractionBatchExtractFilesSync(IReadOnlyList<string> paths, ExtractionConfig? config = null)- Batch file extractionBatchExtractFilesAsync(IReadOnlyList<string> paths, ExtractionConfig? config = null, CancellationToken cancellationToken = default)- Async batch file extractionBatchExtractBytesSync(IReadOnlyList<BytesWithMime> items, ExtractionConfig? config = null)- Batch bytes extractionBatchExtractBytesAsync(IReadOnlyList<BytesWithMime> items, ExtractionConfig? config = null, CancellationToken cancellationToken = default)- Async batch bytes extraction
MIME Type Detection¶
DetectMimeType(ReadOnlySpan<byte> data)- Detect MIME from bytesDetectMimeTypeFromPath(string path)- Detect MIME from file pathGetExtensionsForMime(string mimeType)- Get file extensions for MIME type
Configuration Discovery¶
DiscoverExtractionConfig()- Discover config by walking parent directories for kreuzberg.toml/yaml/jsonLoadExtractionConfigFromFile(string path)- Load config from specific file
Plugin Management¶
RegisterPostProcessor(IPostProcessor processor)- Register custom post-processorUnregisterPostProcessor(string name)- Unregister post-processorClearPostProcessors()- Clear all post-processorsListPostProcessors()- List registered post-processorsRegisterValidator(IValidator validator)- Register custom validatorUnregisterValidator(string name)- Unregister validatorClearValidators()- Clear all validatorsListValidators()- List registered validatorsRegisterOcrBackend(IOcrBackend backend)- Register custom OCR backendUnregisterOcrBackend(string name)- Unregister OCR backendClearOcrBackends()- Clear all OCR backendsListOcrBackends()- List registered OCR backendsListDocumentExtractors()- List document extractorsUnregisterDocumentExtractor(string name)- Unregister extractorClearDocumentExtractors()- Clear all extractors
Embedding Presets¶
ListEmbeddingPresets()- Get all available embedding presetsGetEmbeddingPreset(string name)- Get specific preset by name
Utility¶
GetVersion()- Get native library version string
Result Types¶
ExtractionResult¶
public sealed class ExtractionResult
{
public string Content { get; set; }
public string MimeType { get; set; }
public Metadata Metadata { get; set; }
public List<Table> Tables { get; set; }
public List<string>? DetectedLanguages { get; set; }
public List<Chunk>? Chunks { get; set; }
public List<ExtractedImage>? Images { get; set; }
public bool Success { get; set; }
}
Metadata¶
Contains language, date, subject, and format-specific metadata (PDF, Excel, Email, PPTX, Archive, Image, XML, Text, HTML, OCR).
Table¶
public sealed class Table
{
public List<List<string>> Cells { get; set; }
public string Markdown { get; set; }
public int PageNumber { get; set; }
}
Chunk¶
public sealed class Chunk
{
public string Content { get; set; }
public float[]? Embedding { get; set; }
public ChunkMetadata Metadata { get; set; }
}
Thread Safety¶
The Kreuzberg C# binding is thread-safe at the API level:
- KreuzbergClient static methods are safe to call from multiple threads concurrently
- Configuration objects (ExtractionConfig, OcrConfig, etc.) are thread-safe for reading
- Post-Processors, Validators, OCR Backends registrations use thread-safe collections
- No synchronization needed for concurrent extraction calls
Note: Individual ExtractionResult objects should not be modified after creation if accessed from multiple threads.
P/Invoke Interoperability¶
The C# binding uses P/Invoke to call native Rust code:
NativeMethods Pattern¶
- NativeMethods - Pinvoke declarations mapping to kreuzberg-ffi C library
- InteropUtilities - Helper functions for UTF-8 string marshaling
- Serialization - JSON serialization wrapper using System.Text.Json
Memory Management¶
- AllocUtf8 - Allocate UTF-8 string in native memory
- FreeUtf8 - Free allocated UTF-8 strings
- FreeString - Free native strings from library
- FreeResult - Free ExtractionResult structures
- FreeBatchResult - Free batch result arrays
All memory allocation/deallocation is handled automatically by try/finally blocks.
Error Handling¶
- ErrorMapper - Converts native error strings to C# exceptions
- ThrowLastError - Retrieves and throws last error from native library
- All FFI boundaries validate pointer returns (check for IntPtr.Zero)
Examples¶
Process Multiple Files with Error Handling¶
using Kreuzberg;
using System;
using System.IO;
var files = Directory.GetFiles("documents", "*.pdf");
foreach (var file in files)
{
try
{
Console.Write($"Processing {Path.GetFileName(file)}...");
var result = KreuzbergClient.ExtractFileSync(file);
if (result.Success)
{
Console.WriteLine($" OK ({result.Content.Length} chars)");
File.WriteAllText($"{file}.txt", result.Content);
}
else
{
Console.WriteLine(" FAILED");
}
}
catch (Exception ex)
{
Console.WriteLine($" ERROR: {ex.Message}");
}
}
Extract and Save Metadata¶
using Kreuzberg;
using System.Text.Json;
var result = KreuzbergClient.ExtractFileSync("document.pdf");
var metadata = new
{
language = result.Metadata.Language,
format = result.Metadata.FormatType,
tables = result.Tables.Count,
images = result.Images?.Count ?? 0,
success = result.Success
};
var json = JsonSerializer.Serialize(metadata, new JsonSerializerOptions { WriteIndented = true });
File.WriteAllText("metadata.json", json);
Concurrent Batch Processing¶
using Kreuzberg;
using System.Threading.Tasks;
var files = new[] { "doc1.pdf", "doc2.pdf", "doc3.pdf", "doc4.pdf" };
var tasks = files.Select(file => Task.Run(() =>
{
try
{
return KreuzbergClient.ExtractFileSync(file);
}
catch (Exception ex)
{
Console.WriteLine($"Error extracting {file}: {ex.Message}");
return null;
}
})).ToList();
var results = await Task.WhenAll(tasks);
var successful = results.Count(r => r?.Success == true);
Console.WriteLine($"Successfully processed {successful}/{files.Length} files");
Troubleshooting¶
DLL Not Found¶
If you get "DLL not found" errors, ensure the native library is in your runtime directory:
The library should be located in runtimes/{rid}/native/ in the package.
P/Invoke Errors¶
If P/Invoke calls fail, verify: 1. Native library is properly installed 2. Architecture matches (x64, arm64) 3. Dependencies are available (Tesseract, LibreOffice if needed)
OCR Not Working¶
Ensure Tesseract is installed and in PATH:
Memory Issues¶
For large documents, consider: 1. Enabling chunking to process in smaller pieces 2. Using batch extraction for memory efficiency 3. Calling GC.Collect() after processing large batches
Complete Documentation¶
For more information, see:
License¶
MIT License - see the LICENSE file in the repository for details.