C# Bindings for Kreuzberg¶

High-performance document intelligence for .NET applications. Extract text, metadata, and structured information from PDFs, Office documents, images, and 50+ formats.

Powered by a Rust core – Native performance for document extraction with P/Invoke interoperability.

Version 4.0.0 Release Candidate This is a pre-release version. We invite you to test the library and report any issues you encounter.

Installation¶

Install the Kreuzberg package (published under the kreuzberg.dev organization):

Terminal

dotnet add package Kreuzberg

System Requirements¶

.NET 6.0 or higher
Windows, macOS, or Linux

Optional System Dependencies¶

Tesseract OCR (Required for OCR functionality):

Terminal

# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# Windows
# Download from https://github.com/tesseract-ocr/tesseract/wiki/Downloads

LibreOffice (Optional, for legacy Office formats .doc, .ppt):

Terminal

# macOS
brew install libreoffice

# Ubuntu/Debian
sudo apt-get install libreoffice

Quick Start¶

Simple File Extraction¶

SimpleExtraction.cs

using Kreuzberg;

var result = KreuzbergClient.ExtractFileSync("document.pdf");
Console.WriteLine(result.Content);

Extract with Configuration¶

ConfiguredExtraction.cs

using Kreuzberg;

var config = new ExtractionConfig
{
    UseCache = true,
    EnableQualityProcessing = true
};

var result = KreuzbergClient.ExtractFileSync("document.pdf", config);
Console.WriteLine($"Content: {result.Content}");
Console.WriteLine($"MIME Type: {result.MimeType}");
Console.WriteLine($"Success: {result.Success}");

Async Extraction¶

AsyncExtraction.cs

using Kreuzberg;

var result = await KreuzbergClient.ExtractFileAsync("document.pdf");
Console.WriteLine(result.Content);

Batch Processing¶

BatchProcessing.cs

using Kreuzberg;

var files = new[] { "doc1.pdf", "doc2.docx", "doc3.xlsx" };
var results = KreuzbergClient.BatchExtractFilesSync(files);

foreach (var result in results)
{
    Console.WriteLine($"{result.MimeType}: {result.Content.Length} characters");
}

Configuration¶

Basic Configuration¶

BasicConfiguration.cs

using Kreuzberg;

var config = new ExtractionConfig
{
    UseCache = true,
    EnableQualityProcessing = true,
    ForceOcr = false
};

var result = KreuzbergClient.ExtractFileSync("document.pdf", config);

OCR Configuration¶

OcrConfiguration.cs

using Kreuzberg;

var config = new ExtractionConfig
{
    Ocr = new OcrConfig
    {
        Backend = "tesseract",
        Language = "eng",
        TesseractConfig = new TesseractConfig
        {
            Psm = 6,
            EnableTableDetection = true,
            MinConfidence = 50.0
        }
    }
};

var result = KreuzbergClient.ExtractFileSync("scanned.pdf", config);

Table Extraction with Tesseract¶

TableExtraction.cs

using Kreuzberg;

var config = new ExtractionConfig
{
    Ocr = new OcrConfig
    {
        Backend = "tesseract",
        TesseractConfig = new TesseractConfig
        {
            EnableTableDetection = true
        }
    }
};

var result = KreuzbergClient.ExtractFileSync("invoice.pdf", config);

foreach (var table in result.Tables)
{
    Console.WriteLine(table.Markdown);
    Console.WriteLine($"Page: {table.PageNumber}");
}

Complete Configuration Example¶

CompleteConfiguration.cs

using Kreuzberg;

var config = new ExtractionConfig
{
    UseCache = true,
    EnableQualityProcessing = true,
    Ocr = new OcrConfig
    {
        Backend = "tesseract",
        Language = "eng",
        TesseractConfig = new TesseractConfig
        {
            Psm = 6,
            EnableTableDetection = true,
            MinConfidence = 50.0
        }
    },
    ForceOcr = false,
    Chunking = new ChunkingConfig
    {
        MaxChars = 1000,
        MaxOverlap = 200
    },
    Images = new ImageExtractionConfig
    {
        ExtractImages = true,
        TargetDpi = 300,
        MaxImageDimension = 4096,
        AutoAdjustDpi = true
    },
    PdfOptions = new PdfConfig
    {
        ExtractImages = true,
        Passwords = new List<string> { "password1", "password2" },
        ExtractMetadata = true
    },
    TokenReduction = new TokenReductionConfig
    {
        Mode = "moderate",
        PreserveImportantWords = true
    },
    LanguageDetection = new LanguageDetectionConfig
    {
        Enabled = true,
        MinConfidence = 0.8,
        DetectMultiple = false
    }
};

var result = KreuzbergClient.ExtractFileSync("document.pdf", config);

Loading Configuration from File¶

LoadConfiguration.cs

using Kreuzberg;

// Discover configuration by walking up directory tree
var discoveredConfig = KreuzbergClient.DiscoverExtractionConfig();

// Load from explicit path
var config = KreuzbergClient.LoadExtractionConfigFromFile("config/extraction.toml");
var result = KreuzbergClient.ExtractFileSync("document.pdf", config);

Extract from Bytes¶

ExtractBytes.cs

using Kreuzberg;

var data = File.ReadAllBytes("document.pdf");
var result = KreuzbergClient.ExtractBytesSync(data, "application/pdf");
Console.WriteLine(result.Content);

Batch Extract from Bytes¶

BatchExtractBytes.cs

using Kreuzberg;

var items = new[]
{
    new BytesWithMime(File.ReadAllBytes("doc1.pdf"), "application/pdf"),
    new BytesWithMime(File.ReadAllBytes("doc2.docx"), "application/vnd.openxmlformats-officedocument.wordprocessingml.document")
};

var results = KreuzbergClient.BatchExtractBytesSync(items);

MIME Type Detection¶

Detect from File¶

MimeDetectionFromPath.cs

using Kreuzberg;

var mimeType = KreuzbergClient.DetectMimeTypeFromPath("document.pdf");
Console.WriteLine(mimeType);

// Reverse lookup: get file extensions for a MIME type
var extensions = KreuzbergClient.GetExtensionsForMime("application/pdf");
Console.WriteLine(string.Join(", ", extensions));

Detect from Bytes¶

MimeDetectionFromBytes.cs

using Kreuzberg;

var data = File.ReadAllBytes("document");
var mimeType = KreuzbergClient.DetectMimeType(data);
Console.WriteLine(mimeType);

Metadata Extraction¶

MetadataExtraction.cs

using Kreuzberg;

var result = KreuzbergClient.ExtractFileSync("document.pdf");

// Access detected language
if (result.Metadata.Language != null)
{
    Console.WriteLine($"Language: {result.Metadata.Language}");
}

// Access PDF-specific metadata
if (result.Metadata.Format?.Pdf != null)
{
    var pdf = result.Metadata.Format.Pdf;
    Console.WriteLine($"Title: {pdf.Title}");
    Console.WriteLine($"Author: {pdf.Author}");
    Console.WriteLine($"Pages: {pdf.PageCount}");
    Console.WriteLine($"Created: {pdf.CreationDate}");
}

// Access detected format type
Console.WriteLine($"Format: {result.Metadata.FormatType}");

Password-Protected PDFs¶

PasswordProtectedPdf.cs

using Kreuzberg;

var config = new ExtractionConfig
{
    PdfOptions = new PdfConfig
    {
        Passwords = new List<string> { "password1", "password2", "password3" }
    }
};

var result = KreuzbergClient.ExtractFileSync("protected.pdf", config);

Language Detection¶

LanguageDetection.cs

using Kreuzberg;

var config = new ExtractionConfig
{
    LanguageDetection = new LanguageDetectionConfig
    {
        Enabled = true,
        MinConfidence = 0.8
    }
};

var result = KreuzbergClient.ExtractFileSync("multilingual.pdf", config);

if (result.DetectedLanguages != null)
{
    foreach (var lang in result.DetectedLanguages)
    {
        Console.WriteLine(lang);
    }
}

Text Chunking¶

TextChunking.cs

using Kreuzberg;

var config = new ExtractionConfig
{
    Chunking = new ChunkingConfig
    {
        MaxChars = 1000,
        MaxOverlap = 200
    }
};

var result = KreuzbergClient.ExtractFileSync("long_document.pdf", config);

if (result.Chunks != null)
{
    foreach (var chunk in result.Chunks)
    {
        Console.WriteLine($"Chunk {chunk.Metadata.ChunkIndex + 1}/{chunk.Metadata.TotalChunks}");
        Console.WriteLine($"Tokens: {chunk.Metadata.TokenCount ?? 0}");
        Console.WriteLine(chunk.Content);
    }
}

Image Extraction¶

ImageExtraction.cs

using Kreuzberg;

var result = KreuzbergClient.ExtractFileSync("document.pdf");

if (result.Images != null && result.Images.Count > 0)
{
    Console.WriteLine($"Extracted {result.Images.Count} images");

    foreach (var image in result.Images)
    {
        Console.WriteLine($"Format: {image.Format}, Page: {image.PageNumber}");
        File.WriteAllBytes($"image_{image.ImageIndex}.{image.Format.ToLower()}", image.Data);
    }
}

Embedding Presets¶

List Available Presets¶

ListEmbeddingPresets.cs

using Kreuzberg;

var presets = KreuzbergClient.ListEmbeddingPresets();
Console.WriteLine($"Available presets: {string.Join(", ", presets)}");

Get Specific Preset¶

GetEmbeddingPreset.cs

using Kreuzberg;

var preset = KreuzbergClient.GetEmbeddingPreset("default");
if (preset != null)
{
    Console.WriteLine($"Preset: {preset.Name}");
    Console.WriteLine($"Chunk Size: {preset.ChunkSize}");
    Console.WriteLine($"Model: {preset.ModelName}");
    Console.WriteLine($"Dimensions: {preset.Dimensions}");
}

Custom Post-Processors¶

CustomPostProcessor.cs

using Kreuzberg;

public class CustomProcessor : IPostProcessor
{
    public string Name => "custom_processor";
    public int Priority => 10;

    public ExtractionResult Process(ExtractionResult result)
    {
        // Transform extracted content
        result.Content = result.Content.ToUpper();
        return result;
    }
}

// Register the post-processor
KreuzbergClient.RegisterPostProcessor(new CustomProcessor());

// Processor runs automatically on extraction
var result = KreuzbergClient.ExtractFileSync("document.pdf");

// Query registered processors
var processors = KreuzbergClient.ListPostProcessors();
Console.WriteLine($"Registered: {string.Join(", ", processors)}");

// Remove processor when no longer needed
KreuzbergClient.UnregisterPostProcessor("custom_processor");

Custom Validators¶

CustomValidator.cs

using Kreuzberg;

public class CustomValidator : IValidator
{
    public string Name => "custom_validator";
    public int Priority => 10;

    public void Validate(ExtractionResult result)
    {
        if (string.IsNullOrWhiteSpace(result.Content))
        {
            throw new InvalidOperationException("Content cannot be empty");
        }
    }
}

// Register the validator
KreuzbergClient.RegisterValidator(new CustomValidator());

// Validator runs automatically during extraction
var result = KreuzbergClient.ExtractFileSync("document.pdf");

Custom OCR Backends¶

CustomOcrBackend.cs

using Kreuzberg;

public class CustomOcrBackend : IOcrBackend
{
    public string Name => "custom_ocr";

    public string Process(ReadOnlySpan<byte> imageBytes, OcrConfig? config)
    {
        // Implement OCR processing and return JSON-formatted result
        return "{}";
    }
}

// Register the custom OCR backend
KreuzbergClient.RegisterOcrBackend(new CustomOcrBackend());

// Configure extraction to use custom backend
var extractConfig = new ExtractionConfig
{
    Ocr = new OcrConfig { Backend = "custom_ocr" }
};
var result = KreuzbergClient.ExtractFileSync("document.pdf", extractConfig);

Exception Handling¶

ExceptionHandling.cs

using Kreuzberg;
using System;

try
{
    var result = KreuzbergClient.ExtractFileSync("document.pdf");
}
catch (KreuzbergValidationException ex)
{
    Console.WriteLine($"Configuration validation error: {ex.Message}");
}
catch (KreuzbergParsingException ex)
{
    Console.WriteLine($"Document parsing failed: {ex.Message}");
}
catch (KreuzbergOcrException ex)
{
    Console.WriteLine($"OCR processing failed: {ex.Message}");
}
catch (KreuzbergMissingDependencyException ex)
{
    Console.WriteLine($"Missing optional dependency: {ex.Message}");
}
catch (KreuzbergException ex)
{
    Console.WriteLine($"General Kreuzberg error: {ex.Message}");
}

Exception Hierarchy¶

KreuzbergException - Base exception for all Kreuzberg errors
KreuzbergValidationException - Invalid configuration or input
KreuzbergParsingException - Document parsing failure
KreuzbergOcrException - OCR processing failure
KreuzbergMissingDependencyException - Missing optional dependency
KreuzbergSerializationException - JSON serialization failure

API Reference¶

Extraction Methods¶

ExtractFileSync(string path, ExtractionConfig? config = null) - Synchronous file extraction
ExtractFileAsync(string path, ExtractionConfig? config = null, CancellationToken cancellationToken = default) - Asynchronous file extraction
ExtractBytesSync(ReadOnlySpan<byte> data, string mimeType, ExtractionConfig? config = null) - Synchronous bytes extraction
ExtractBytesAsync(byte[] data, string mimeType, ExtractionConfig? config = null, CancellationToken cancellationToken = default) - Asynchronous bytes extraction
BatchExtractFilesSync(IReadOnlyList<string> paths, ExtractionConfig? config = null) - Batch file extraction
BatchExtractFilesAsync(IReadOnlyList<string> paths, ExtractionConfig? config = null, CancellationToken cancellationToken = default) - Async batch file extraction
BatchExtractBytesSync(IReadOnlyList<BytesWithMime> items, ExtractionConfig? config = null) - Batch bytes extraction
BatchExtractBytesAsync(IReadOnlyList<BytesWithMime> items, ExtractionConfig? config = null, CancellationToken cancellationToken = default) - Async batch bytes extraction

MIME Type Detection¶

DetectMimeType(ReadOnlySpan<byte> data) - Detect MIME from bytes
DetectMimeTypeFromPath(string path) - Detect MIME from file path
GetExtensionsForMime(string mimeType) - Get file extensions for MIME type

Configuration Discovery¶

DiscoverExtractionConfig() - Discover config by walking parent directories for kreuzberg.toml/yaml/json
LoadExtractionConfigFromFile(string path) - Load config from specific file

Plugin Management¶

RegisterPostProcessor(IPostProcessor processor) - Register custom post-processor
UnregisterPostProcessor(string name) - Unregister post-processor
ClearPostProcessors() - Clear all post-processors
ListPostProcessors() - List registered post-processors
RegisterValidator(IValidator validator) - Register custom validator
UnregisterValidator(string name) - Unregister validator
ClearValidators() - Clear all validators
ListValidators() - List registered validators
RegisterOcrBackend(IOcrBackend backend) - Register custom OCR backend
UnregisterOcrBackend(string name) - Unregister OCR backend
ClearOcrBackends() - Clear all OCR backends
ListOcrBackends() - List registered OCR backends
ListDocumentExtractors() - List document extractors
UnregisterDocumentExtractor(string name) - Unregister extractor
ClearDocumentExtractors() - Clear all extractors

Embedding Presets¶

ListEmbeddingPresets() - Get all available embedding presets
GetEmbeddingPreset(string name) - Get specific preset by name

Utility¶

GetVersion() - Get native library version string

Result Types¶

ExtractionResult¶

ExtractionResult.cs

public sealed class ExtractionResult
{
    public string Content { get; set; }
    public string MimeType { get; set; }
    public Metadata Metadata { get; set; }
    public List<Table> Tables { get; set; }
    public List<string>? DetectedLanguages { get; set; }
    public List<Chunk>? Chunks { get; set; }
    public List<ExtractedImage>? Images { get; set; }
    public bool Success { get; set; }
}

Metadata¶

Contains language, date, subject, and format-specific metadata (PDF, Excel, Email, PPTX, Archive, Image, XML, Text, HTML, OCR).

Table¶

Table.cs

public sealed class Table
{
    public List<List<string>> Cells { get; set; }
    public string Markdown { get; set; }
    public int PageNumber { get; set; }
}

Chunk¶

Chunk.cs

public sealed class Chunk
{
    public string Content { get; set; }
    public float[]? Embedding { get; set; }
    public ChunkMetadata Metadata { get; set; }
}

Thread Safety¶

The Kreuzberg C# binding is thread-safe at the API level:

KreuzbergClient static methods are safe to call from multiple threads concurrently
Configuration objects (ExtractionConfig, OcrConfig, etc.) are thread-safe for reading
Post-Processors, Validators, OCR Backends registrations use thread-safe collections
No synchronization needed for concurrent extraction calls

Note: Individual ExtractionResult objects should not be modified after creation if accessed from multiple threads.

P/Invoke Interoperability¶

The C# binding uses P/Invoke to call native Rust code:

NativeMethods Pattern¶

NativeMethods - Pinvoke declarations mapping to kreuzberg-ffi C library
InteropUtilities - Helper functions for UTF-8 string marshaling
Serialization - JSON serialization wrapper using System.Text.Json

Memory Management¶

AllocUtf8 - Allocate UTF-8 string in native memory
FreeUtf8 - Free allocated UTF-8 strings
FreeString - Free native strings from library
FreeResult - Free ExtractionResult structures
FreeBatchResult - Free batch result arrays

All memory allocation/deallocation is handled automatically by try/finally blocks.

Error Handling¶

ErrorMapper - Converts native error strings to C# exceptions
ThrowLastError - Retrieves and throws last error from native library
All FFI boundaries validate pointer returns (check for IntPtr.Zero)

Examples¶

Process Multiple Files with Error Handling¶

ProcessMultipleFiles.cs

using Kreuzberg;
using System;
using System.IO;

var files = Directory.GetFiles("documents", "*.pdf");

foreach (var file in files)
{
    try
    {
        Console.Write($"Processing {Path.GetFileName(file)}...");
        var result = KreuzbergClient.ExtractFileSync(file);

        if (result.Success)
        {
            Console.WriteLine($" OK ({result.Content.Length} chars)");
            File.WriteAllText($"{file}.txt", result.Content);
        }
        else
        {
            Console.WriteLine(" FAILED");
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine($" ERROR: {ex.Message}");
    }
}

Extract and Save Metadata¶

SaveMetadata.cs

using Kreuzberg;
using System.Text.Json;

var result = KreuzbergClient.ExtractFileSync("document.pdf");

var metadata = new
{
    language = result.Metadata.Language,
    format = result.Metadata.FormatType,
    tables = result.Tables.Count,
    images = result.Images?.Count ?? 0,
    success = result.Success
};

var json = JsonSerializer.Serialize(metadata, new JsonSerializerOptions { WriteIndented = true });
File.WriteAllText("metadata.json", json);

Concurrent Batch Processing¶

ConcurrentBatchProcessing.cs

using Kreuzberg;
using System.Threading.Tasks;

var files = new[] { "doc1.pdf", "doc2.pdf", "doc3.pdf", "doc4.pdf" };

var tasks = files.Select(file => Task.Run(() =>
{
    try
    {
        return KreuzbergClient.ExtractFileSync(file);
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Error extracting {file}: {ex.Message}");
        return null;
    }
})).ToList();

var results = await Task.WhenAll(tasks);

var successful = results.Count(r => r?.Success == true);
Console.WriteLine($"Successfully processed {successful}/{files.Length} files");

Troubleshooting¶

DLL Not Found¶

If you get "DLL not found" errors, ensure the native library is in your runtime directory:

Terminal

# Check library path
echo $LD_LIBRARY_PATH     # Linux
echo $DYLD_LIBRARY_PATH   # macOS

The library should be located in runtimes/{rid}/native/ in the package.

P/Invoke Errors¶

If P/Invoke calls fail, verify: 1. Native library is properly installed 2. Architecture matches (x64, arm64) 3. Dependencies are available (Tesseract, LibreOffice if needed)

OCR Not Working¶

Ensure Tesseract is installed and in PATH:

Terminal

tesseract --version

Memory Issues¶

For large documents, consider: 1. Enabling chunking to process in smaller pieces 2. Using batch extraction for memory efficiency 3. Calling GC.Collect() after processing large batches

Complete Documentation¶

For more information, see:

License¶

MIT License - see the LICENSE file in the repository for details.