Go API Reference¶

Complete reference for the Kreuzberg Go bindings using cgo to access the Rust-powered extraction pipeline.

The Go binding exposes the same extraction capabilities as the other language bindings through C FFI bindings to kreuzberg-ffi. You get identical metadata extraction, OCR processing, chunking, embeddings, and plugin support, with both synchronous and context-aware asynchronous APIs.

Requirements¶

Go 1.25+ (with cgo support)
Rust toolchain (builds kreuzberg-ffi)
C compiler (gcc/clang for cgo compilation)
libkreuzberg_ffi native library (staged in target/release)
libpdfium runtime (auto-discovered via target/release)
Tesseract/EasyOCR/PaddleOCR (optional, for OCR functionality)

Installation¶

Add the package to your go.mod:

go get github.com/Goldziher/kreuzberg/packages/go/kreuzberg@latest

Build the FFI library and set library paths:

# Build the FFI crate
cargo build -p kreuzberg-ffi --release

# Configure library path for your platform
# Linux
export LD_LIBRARY_PATH=$PWD/target/release:$LD_LIBRARY_PATH

# macOS
export DYLD_FALLBACK_LIBRARY_PATH=$PWD/target/release:$DYLD_FALLBACK_LIBRARY_PATH

# Windows
# Add target\release to PATH environment variable
set PATH=%CD%\target\release;%PATH%

Quickstart¶

Basic file extraction (synchronous)¶

package main

import (
    "fmt"
    "log"

    "github.com/Goldziher/kreuzberg/packages/go/kreuzberg"
)

func main() {
    result, err := kreuzberg.ExtractFileSync("document.pdf", nil)
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }

    fmt.Printf("Format: %s\n", result.MimeType)
    fmt.Printf("Content length: %d\n", len(result.Content))
    fmt.Printf("Success: %v\n", result.Success)
}

Async extraction with timeout¶

package main

import (
    "context"
    "errors"
    "log"
    "time"

    "github.com/Goldziher/kreuzberg/packages/go/kreuzberg"
)

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
    defer cancel()

    result, err := kreuzberg.ExtractFile(ctx, "large-document.pdf", nil)
    if errors.Is(err, context.DeadlineExceeded) {
        log.Println("extraction timed out")
        return
    }
    if err != nil {
        log.Fatalf("extraction failed: %v", err)
    }

    log.Printf("Extracted %d characters\n", len(result.Content))
}

Core Functions¶

ExtractFileSync¶

Extract content and metadata from a file synchronously.

Signature:

func ExtractFileSync(path string, config *ExtractionConfig) (*ExtractionResult, error)

Parameters:

path (string): Path to the file to extract (absolute or relative)
config (*ExtractionConfig): Optional extraction configuration; uses defaults if nil

Returns:

*ExtractionResult: Populated result containing content, metadata, tables, chunks, and images
error: KreuzbergError or standard Go error (see Error Handling section)

Error Handling:

ValidationError: If path is empty
IOError: If file not found or not readable
ParsingError: If document parsing fails
MissingDependencyError: If required OCR/processing library is missing
UnsupportedFormatError: If MIME type is not supported

Example - Extract PDF:

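result, err := kreuzberg.ExtractFileSync("report.pdf", nil)
if err != nil {
    log.Fatalf("extraction failed: %v", err)
}

// PdfMetadata returns (metadata, ok); Title and PageCount are pointers and may be nil
if pdfMeta, ok := result.Metadata.PdfMetadata(); ok {
    if pdfMeta.Title != nil {
        fmt.Printf("Title: %s\n", *pdfMeta.Title)
    }
    if pdfMeta.PageCount != nil {
        fmt.Printf("Page count: %d\n", *pdfMeta.PageCount)
    }
}
// Clamp the slice bound so content shorter than 100 bytes does not panic
fmt.Printf("Content preview: %s...\n", result.Content[:min(100, len(result.Content))])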

Example - Extract with configuration:

cfg := &kreuzberg.ExtractionConfig{
    UseCache: boolPtr(true),
    OCR: &kreuzberg.OCRConfig{
        Backend:  "tesseract",
        Language: stringPtr("eng"),
    },
}

result, err := kreuzberg.ExtractFileSync("scanned.pdf", cfg)
if err != nil {
    log.Fatalf("extraction failed: %v", err)
}

ExtractFile¶

Extract content from a file asynchronously with context support.

Signature:

func ExtractFile(ctx context.Context, path string, config *ExtractionConfig) (*ExtractionResult, error)

Parameters:

ctx (context.Context): Context for cancellation and timeout
path (string): Path to the file
config (*ExtractionConfig): Optional configuration

Returns:

*ExtractionResult: Extraction result
error: May include context errors (context.DeadlineExceeded, context.Canceled)

Note: Context cancellation is best-effort. The underlying C call cannot be interrupted, but the function returns immediately with ctx.Err() when the context deadline is exceeded or cancelled.

Example - With deadline:

ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(30*time.Second))
defer cancel()

result, err := kreuzberg.ExtractFile(ctx, "large.docx", nil)
if errors.Is(err, context.DeadlineExceeded) {
    log.Println("extraction took too long")
    return
}
if err != nil {
    log.Fatalf("extraction failed: %v", err)
}

ExtractBytesSync¶

Extract content from an in-memory byte slice with specified MIME type.

Signature:

func ExtractBytesSync(data []byte, mimeType string, config *ExtractionConfig) (*ExtractionResult, error)

Parameters:

data ([]byte): Document bytes
mimeType (string): MIME type (e.g., "application/pdf", "text/plain")
config (*ExtractionConfig): Optional configuration

Returns:

*ExtractionResult: Extraction result
error: KreuzbergError on extraction failure

Example - Extract from downloaded PDF:

httpResp, err := http.Get("https://example.com/document.pdf")
if err != nil {
    log.Fatal(err)
}
defer httpResp.Body.Close()

data, err := io.ReadAll(httpResp.Body)
if err != nil {
    log.Fatal(err)
}

result, err := kreuzberg.ExtractBytesSync(data, "application/pdf", nil)
if err != nil {
    log.Fatalf("extraction failed: %v", err)
}

fmt.Printf("Extracted %d words\n", len(strings.Fields(result.Content)))

ExtractBytes¶

Extract content from in-memory bytes asynchronously.

Signature:

func ExtractBytes(ctx context.Context, data []byte, mimeType string, config *ExtractionConfig) (*ExtractionResult, error)

Parameters:

ctx (context.Context): Context for cancellation and timeout
data ([]byte): Document bytes
mimeType (string): MIME type
config (*ExtractionConfig): Optional configuration

Returns:

*ExtractionResult: Extraction result
error: KreuzbergError or context error

Example:

ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()

result, err := kreuzberg.ExtractBytes(ctx, data, "text/html", nil)
if err != nil {
    log.Fatalf("extraction failed: %v", err)
}

BatchExtractFilesSync¶

Extract multiple files in a single blocking call using the optimized batch pipeline.

Signature:

func BatchExtractFilesSync(paths []string, config *ExtractionConfig) ([]*ExtractionResult, error)

Parameters:

paths ([]string): Slice of file paths
config (*ExtractionConfig): Configuration applied to all files

Returns:

[]*ExtractionResult: Slice of results (one per input file; may contain nils for failed extractions)
error: Returned only if batch setup fails; individual file errors are captured in ErrorMetadata

Example - Batch extract multiple PDFs:

files := []string{"doc1.pdf", "doc2.pdf", "doc3.pdf"}

results, err := kreuzberg.BatchExtractFilesSync(files, nil)
if err != nil {
    log.Fatalf("batch extraction setup failed: %v", err)
}

for i, result := range results {
    if result == nil {
        fmt.Printf("File %d: extraction failed\n", i)
        continue
    }

    if result.Metadata.Error != nil {
        fmt.Printf("File %d: %s (%s)\n", i, result.Metadata.Error.ErrorType, result.Metadata.Error.Message)
        continue
    }

    fmt.Printf("File %d: extracted %d chars\n", i, len(result.Content))
}

BatchExtractFiles¶

Batch extract multiple files asynchronously.

Signature:

func BatchExtractFiles(ctx context.Context, paths []string, config *ExtractionConfig) ([]*ExtractionResult, error)

Parameters:

ctx (context.Context): Context for cancellation
paths ([]string): File paths
config (*ExtractionConfig): Configuration for all files

Returns:

[]*ExtractionResult: Results slice
error: Context or setup errors

BatchExtractBytesSync¶

Extract multiple in-memory documents in a single batch operation.

Signature:

func BatchExtractBytesSync(items []BytesWithMime, config *ExtractionConfig) ([]*ExtractionResult, error)

Parameters:

items ([]BytesWithMime): Slice of {Data, MimeType} pairs
config (*ExtractionConfig): Configuration applied to all items

Returns:

[]*ExtractionResult: Results slice
error: Setup error or validation error

BytesWithMime structure:

type BytesWithMime struct {
    Data     []byte
    MimeType string
}

Example - Batch extract multiple formats:

items := []kreuzberg.BytesWithMime{
    {Data: pdfData, MimeType: "application/pdf"},
    {Data: docxData, MimeType: "application/vnd.openxmlformats-officedocument.wordprocessingml.document"},
    {Data: htmlData, MimeType: "text/html"},
}

results, err := kreuzberg.BatchExtractBytesSync(items, nil)
if err != nil {
    log.Fatalf("batch extraction failed: %v", err)
}

for i, result := range results {
    if result == nil || !result.Success {
        log.Printf("Item %d extraction failed\n", i)
        continue
    }
    log.Printf("Item %d: %s format\n", i, result.MimeType)
}

BatchExtractBytes¶

Batch extract in-memory documents asynchronously.

Signature:

func BatchExtractBytes(ctx context.Context, items []BytesWithMime, config *ExtractionConfig) ([]*ExtractionResult, error)

Parameters:

ctx (context.Context): Context for cancellation
items ([]BytesWithMime): Document slice
config (*ExtractionConfig): Configuration

Returns:

[]*ExtractionResult: Results slice
error: Context or setup errors

LibraryVersion¶

Get the version of the underlying Rust library.

Signature:

func LibraryVersion() string

Returns:

string: Version string (e.g., "4.0.0-rc.1")

Example:

fmt.Printf("Kreuzberg version: %s\n", kreuzberg.LibraryVersion())
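
A version string like the one above can be checked at startup to catch a mismatched native library early. The following is a minimal sketch that assumes the string always begins with a numeric major component followed by a dot, as in the "4.0.0-rc.1" example; majorVersion is a hypothetical helper, not part of the binding.

```go
package main

import (
    "fmt"
    "strconv"
    "strings"
)

// majorVersion extracts the leading major number from a version string such
// as "4.0.0-rc.1". It returns an error if the first component is not numeric.
func majorVersion(version string) (int, error) {
    major, _, _ := strings.Cut(version, ".")
    n, err := strconv.Atoi(major)
    if err != nil {
        return 0, fmt.Errorf("unexpected version format %q: %w", version, err)
    }
    return n, nil
}

func main() {
    // In real code the argument would come from kreuzberg.LibraryVersion().
    major, err := majorVersion("4.0.0-rc.1")
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Println("major version:", major)
}
```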

Configuration¶

ExtractionConfig¶

Root configuration struct for all extraction operations. All fields are optional (pointers); omitted fields use Kreuzberg defaults.

Signature:

type ExtractionConfig struct {
    UseCache                 *bool                    // Enable result caching
    EnableQualityProcessing  *bool                    // Run quality improvements
    OCR                      *OCRConfig               // OCR backend and settings
    ForceOCR                 *bool                    // Force OCR even for text-extractable docs
    Chunking                 *ChunkingConfig          // Text chunking and embeddings
    Images                   *ImageExtractionConfig   // Image extraction from docs
    PdfOptions               *PdfConfig               // PDF-specific options
    TokenReduction           *TokenReductionConfig    // Token pruning before embeddings
    LanguageDetection        *LanguageDetectionConfig // Language detection settings
    Keywords                 *KeywordConfig           // Keyword extraction
    Postprocessor            *PostProcessorConfig     // Post-processor selection
    HTMLOptions              *HTMLConversionOptions   // HTML-to-Markdown conversion
    MaxConcurrentExtractions *int                     // Batch concurrency limit
}

OCRConfig¶

Configure OCR backend selection and language.

Signature:

type OCRConfig struct {
    Backend   string           // OCR backend name: "tesseract", "easyocr", "paddle", etc.
    Language  *string          // Language code (e.g., "eng", "deu", "fra")
    Tesseract *TesseractConfig // Tesseract-specific fine-tuning
}

Example:

cfg := &kreuzberg.ExtractionConfig{
    OCR: &kreuzberg.OCRConfig{
        Backend:  "tesseract",
        Language: stringPtr("eng"),
        Tesseract: &kreuzberg.TesseractConfig{
            PSM:           intPtr(3),
            MinConfidence: float64Ptr(0.5),
        },
    },
}

TesseractConfig¶

Fine-grained Tesseract OCR tuning.

Signature:

type TesseractConfig struct {
    Language                       string                    // Language code
    PSM                            *int                      // Page segmentation mode (0-13)
    OutputFormat                   string                    // Output format: "text", "pdf", "hocr"
    OEM                            *int                      // Engine mode (0-3)
    MinConfidence                  *float64                  // Confidence threshold (0.0-1.0)
    Preprocessing                  *ImagePreprocessingConfig // Image preprocessing
    EnableTableDetection           *bool                     // Detect and extract tables
    TableMinConfidence             *float64                  // Table detection confidence
    TableColumnThreshold           *int                      // Column separation threshold
    TableRowThresholdRatio         *float64                  // Row separation ratio
    UseCache                       *bool                     // Cache OCR results
    // Additional Tesseract parameters...
    TesseditCharWhitelist          string                    // Character whitelist
    TesseditCharBlacklist          string                    // Character blacklist
}

ImagePreprocessingConfig¶

Configure OCR image preprocessing (DPI normalization, rotation, denoising, etc.).

Signature:

type ImagePreprocessingConfig struct {
    TargetDPI        *int   // Target DPI for OCR (typically 300)
    AutoRotate       *bool  // Auto-detect and correct image rotation
    Deskew           *bool  // Correct skewed text
    Denoise          *bool  // Remove noise
    ContrastEnhance  *bool  // Enhance contrast
    BinarizationMode string // Binarization method: "otsu", "adaptive"
    InvertColors     *bool  // Invert black/white
}

ChunkingConfig¶

Configure text chunking for RAG and retrieval workloads.

Signature:

type ChunkingConfig struct {
    MaxChars     *int             // Maximum characters per chunk
    MaxOverlap   *int             // Overlap between chunks
    ChunkSize    *int             // Alias for MaxChars
    ChunkOverlap *int             // Alias for MaxOverlap
    Preset       *string          // Preset: "semantic", "sliding", "recursive"
    Embedding    *EmbeddingConfig // Embedding generation
    Enabled      *bool            // Enable chunking
}
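
To make the relationship between MaxChars and MaxOverlap concrete, here is a simplified sliding-window sketch. It is an illustration only: the real chunker runs in the Rust core and may split on different boundaries, and slidingChunks is a hypothetical function, not part of the binding.

```go
package main

import "fmt"

// slidingChunks splits text into windows of at most maxChars characters,
// where each window repeats the last `overlap` characters of the previous
// one. It returns nil for parameter combinations that cannot make progress.
func slidingChunks(text string, maxChars, overlap int) []string {
    if maxChars <= 0 || overlap < 0 || overlap >= maxChars {
        return nil
    }
    runes := []rune(text) // count characters, not bytes
    step := maxChars - overlap

    var chunks []string
    for start := 0; start < len(runes); start += step {
        end := start + maxChars
        if end > len(runes) {
            end = len(runes)
        }
        chunks = append(chunks, string(runes[start:end]))
        if end == len(runes) {
            break // the final window already reaches the end of the text
        }
    }
    return chunks
}

func main() {
    for i, c := range slidingChunks("abcdefghij", 4, 1) {
        fmt.Printf("chunk %d: %q\n", i, c)
    }
}
```

The overlap must stay smaller than the chunk size, otherwise the window never advances; the sketch rejects such input rather than looping forever.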

ImageExtractionConfig¶

Configure image extraction from documents.

Signature:

type ImageExtractionConfig struct {
    ExtractImages     *bool // Extract embedded images
    TargetDPI         *int  // Target DPI for extraction
    MaxImageDimension *int  // Maximum dimension (width/height)
    AutoAdjustDPI     *bool // Auto-adjust DPI for small images
    MinDPI            *int  // Minimum DPI threshold
    MaxDPI            *int  // Maximum DPI threshold
}

PdfConfig¶

PDF-specific extraction options.

Signature:

type PdfConfig struct {
    ExtractImages   *bool    // Extract embedded images
    Passwords       []string // List of passwords for encrypted PDFs
    ExtractMetadata *bool    // Extract document metadata
}

EmbeddingConfig¶

Configure embedding generation for chunks.

Signature:

type EmbeddingConfig struct {
    Model                *EmbeddingModelType // Model selection
    Normalize            *bool               // L2 normalization
    BatchSize            *int                // Batch size for inference
    ShowDownloadProgress *bool               // Show download progress
    CacheDir             *string             // Cache directory
}

type EmbeddingModelType struct {
    Type       string // "preset", "fastembed", "custom"
    Name       string // For preset models
    Model      string // For fastembed/custom
    ModelID    string // Alias for custom
    Dimensions *int   // Embedding dimensions
}

KeywordConfig¶

Configure keyword extraction.

Signature:

type KeywordConfig struct {
    Algorithm   string      // "yake" or "rake"
    MaxKeywords *int        // Maximum keywords to extract
    MinScore    *float64    // Minimum keyword score
    NgramRange  *[2]int     // N-gram range: [min, max]
    Language    *string     // Language code
    Yake        *YakeParams // YAKE-specific tuning
    Rake        *RakeParams // RAKE-specific tuning
}

type YakeParams struct {
    WindowSize *int
}

type RakeParams struct {
    MinWordLength     *int
    MaxWordsPerPhrase *int
}

PostProcessorConfig¶

Configure post-processing steps.

Signature:

type PostProcessorConfig struct {
    Enabled            *bool    // Enable post-processing
    EnabledProcessors  []string // Specific processors to run
    DisabledProcessors []string // Processors to skip
}

Results & Types¶

ExtractionResult¶

The main result struct containing all extracted data.

Signature:

type ExtractionResult struct {
    Content           string           // Extracted text content
    MimeType          string           // Detected MIME type
    Metadata          Metadata         // Document metadata
    Tables            []Table          // Extracted tables
    DetectedLanguages []string         // Detected languages
    Chunks            []Chunk          // Text chunks (if enabled)
    Images            []ExtractedImage // Embedded images (if enabled)
    Success           bool             // Extraction success flag
}

Example - Accessing results:

result, err := kreuzberg.ExtractFileSync("report.pdf", nil)
if err != nil {
    log.Fatalf("extraction failed: %v", err)
}
if !result.Success {
    log.Fatal("extraction reported failure")
}

fmt.Printf("Detected MIME type: %s\n", result.MimeType)
fmt.Printf("Content length: %d\n", len(result.Content))
fmt.Printf("Detected languages: %v\n", result.DetectedLanguages)
fmt.Printf("Number of tables: %d\n", len(result.Tables))
fmt.Printf("Number of chunks: %d\n", len(result.Chunks))
fmt.Printf("Number of images: %d\n", len(result.Images))

Metadata¶

Aggregated document metadata with format-specific fields.

Signature:

type Metadata struct {
    Language           *string                     // Detected language code
    Date               *string                     // Extracted document date
    Subject            *string                     // Document subject
    Format             FormatMetadata              // Format-specific metadata
    ImagePreprocessing *ImagePreprocessingMetadata // OCR preprocessing info
    JSONSchema         json.RawMessage             // JSON Schema if available
    Error              *ErrorMetadata              // Error info for batch operations
    Additional         map[string]json.RawMessage  // Custom/additional fields
}

Access format-specific metadata:

// Type discriminator
fmt.Println("Format type:", result.Metadata.FormatType())

// Type-safe accessors; pointer fields may be nil, so check before dereferencing
if pdfMeta, ok := result.Metadata.PdfMetadata(); ok {
    if pdfMeta.Title != nil {
        fmt.Printf("Title: %s\n", *pdfMeta.Title)
    }
    if pdfMeta.PageCount != nil {
        fmt.Printf("Pages: %d\n", *pdfMeta.PageCount)
    }
    if len(pdfMeta.Authors) > 0 {
        fmt.Printf("Author: %s\n", *pdfMeta.Authors[0])
    }
}

if excelMeta, ok := result.Metadata.ExcelMetadata(); ok {
    fmt.Printf("Sheets: %d\n", excelMeta.SheetCount)
    fmt.Printf("Sheet names: %v\n", excelMeta.SheetNames)
}

if htmlMeta, ok := result.Metadata.HTMLMetadata(); ok {
    if htmlMeta.Title != nil {
        fmt.Printf("Page title: %s\n", *htmlMeta.Title)
    }
    if htmlMeta.OGImage != nil {
        fmt.Printf("OG image: %s\n", *htmlMeta.OGImage)
    }
}

Table¶

Extracted table structure.

Signature:

type Table struct {
    Cells      [][]string // 2D cell array [row][col]
    Markdown   string     // Markdown representation
    PageNumber int        // Page number (PDF/Image documents)
}

Example:

for tableIdx, table := range result.Tables {
    fmt.Printf("Table %d (page %d):\n", tableIdx, table.PageNumber)
    for _, row := range table.Cells {
        fmt.Println(strings.Join(row, " | "))
    }
    fmt.Println("Markdown:", table.Markdown)
}
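
Because Cells is a plain [row][col] grid of strings, it can be handed directly to the standard encoding/csv package. The sketch below uses a literal grid in place of a real Table.Cells value; cellsToCSV is a hypothetical helper, not part of the binding.

```go
package main

import (
    "bytes"
    "encoding/csv"
    "fmt"
)

// cellsToCSV renders a [row][col] grid (the same shape as Table.Cells) as
// CSV text, quoting fields that contain commas or quotes.
func cellsToCSV(cells [][]string) (string, error) {
    var buf bytes.Buffer
    w := csv.NewWriter(&buf)
    // WriteAll writes every record and flushes the writer before returning.
    if err := w.WriteAll(cells); err != nil {
        return "", err
    }
    return buf.String(), nil
}

func main() {
    // Stand-in for table.Cells from an extraction result.
    cells := [][]string{
        {"Name", "Qty"},
        {"Widget, large", "3"},
    }
    out, err := cellsToCSV(cells)
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Print(out)
}
```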

Chunk¶

Text chunk with optional embeddings and metadata.

Signature:

type Chunk struct {
    Content   string        // Chunk text
    Embedding []float32     // Embedding vector (if enabled)
    Metadata  ChunkMetadata // Chunk positioning
}

type ChunkMetadata struct {
    CharStart   int  // Character offset in original content
    CharEnd     int  // End character offset
    TokenCount  *int // Token count (if available)
    ChunkIndex  int  // Index in chunk sequence
    TotalChunks int  // Total number of chunks
}

Example:

for _, chunk := range result.Chunks {
    fmt.Printf("Chunk %d/%d\n", chunk.Metadata.ChunkIndex, chunk.Metadata.TotalChunks)
    fmt.Printf("Content: %s...\n", chunk.Content[:min(50, len(chunk.Content))])
    // TokenCount is optional, so check for nil before dereferencing
    if chunk.Metadata.TokenCount != nil {
        fmt.Printf("Tokens: %d\n", *chunk.Metadata.TokenCount)
    }
    if len(chunk.Embedding) > 0 {
        fmt.Printf("Embedding dim: %d\n", len(chunk.Embedding))
        // Clamp the bound in case the vector has fewer than 5 values
        fmt.Printf("First 5 values: %v\n", chunk.Embedding[:min(5, len(chunk.Embedding))])
    }
}

ExtractedImage¶

Image extracted from document with optional OCR results.

Signature:

type ExtractedImage struct {
    Data             []byte            // Raw image bytes
    Format           string            // Image format: "jpeg", "png", "webp"
    ImageIndex       int               // Index in images list
    PageNumber       *int              // Page number (if applicable)
    Width            *uint32           // Image width in pixels
    Height           *uint32           // Image height in pixels
    Colorspace       *string           // Colorspace (sRGB, CMYK, etc.)
    BitsPerComponent *uint32           // Bits per color component
    IsMask           bool              // Is image a mask?
    Description      *string           // Image description/alt text
    OCRResult        *ExtractionResult // Nested OCR extraction
}

Example:

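for imgIdx, img := range result.Images {
    // Width and Height are optional pointers; print them only when both are set
    if img.Width != nil && img.Height != nil {
        fmt.Printf("Image %d: %s, %dx%d\n", imgIdx, img.Format, *img.Width, *img.Height)
    } else {
        fmt.Printf("Image %d: %s\n", imgIdx, img.Format)
    }

    // Save image
    filename := fmt.Sprintf("image_%d.%s", imgIdx, img.Format)
    if err := os.WriteFile(filename, img.Data, 0644); err != nil {
        log.Printf("failed to write %s: %v", filename, err)
    }

    // OCR if available
    if img.OCRResult != nil {
        fmt.Printf("Image %d OCR: %s\n", imgIdx, img.OCRResult.Content)
    }
}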

Error Handling¶

Error Types¶

Kreuzberg defines a type hierarchy of errors via the KreuzbergError interface:

type KreuzbergError interface {
    error
    Kind() ErrorKind
}

type ErrorKind string

const (
    ErrorKindUnknown           ErrorKind = "unknown"
    ErrorKindIO                ErrorKind = "io"
    ErrorKindValidation        ErrorKind = "validation"
    ErrorKindParsing           ErrorKind = "parsing"
    ErrorKindOCR               ErrorKind = "ocr"
    ErrorKindCache             ErrorKind = "cache"
    ErrorKindImageProcessing   ErrorKind = "image_processing"
    ErrorKindSerialization     ErrorKind = "serialization"
    ErrorKindMissingDependency ErrorKind = "missing_dependency"
    ErrorKindPlugin            ErrorKind = "plugin"
    ErrorKindUnsupportedFormat ErrorKind = "unsupported_format"
    ErrorKindRuntime           ErrorKind = "runtime"
)

Error type classes:

ValidationError: Input validation failed (empty paths, missing MIME types)
ParsingError: Document parsing failed (malformed file, unsupported format)
OCRError: OCR backend failure (library missing, invalid language)
CacheError: Cache operation failed
ImageProcessingError: Image manipulation failed
SerializationError: JSON encoding/decoding failed
MissingDependencyError: Required library not found (Tesseract, EasyOCR, etc.)
PluginError: Plugin registration or execution failed
UnsupportedFormatError: MIME type not supported
IOError: File I/O failure
RuntimeError: Unexpected runtime failure (lock poisoning, etc.)

Error Classification¶

Errors are automatically classified based on native error messages. Use errors.As() and errors.Is() to handle specific error types:

import (
    "errors"
    "log"

    "github.com/Goldziher/kreuzberg/packages/go/kreuzberg"
)

result, err := kreuzberg.ExtractFileSync("document.pdf", nil)
if err != nil {
    // Check specific error type
    var parsingErr *kreuzberg.ParsingError
    if errors.As(err, &parsingErr) {
        log.Printf("Parsing failed: %v\n", parsingErr)
        return
    }

    var missingDep *kreuzberg.MissingDependencyError
    if errors.As(err, &missingDep) {
        log.Printf("Missing dependency: %s\n", missingDep.Dependency)
        return
    }

    // Generic error handling
    log.Printf("Extraction failed: %v\n", err)
}

Error Unwrapping¶

All Kreuzberg errors support error unwrapping via errors.Unwrap():

result, err := kreuzberg.ExtractFileSync("doc.pdf", nil)
if err != nil {
    // errors.Unwrap returns the next error in the chain (nil if there is none)
    if cause := errors.Unwrap(err); cause != nil {
        log.Printf("Underlying cause: %v\n", cause)
    }

    // Check error kind; errors.As also matches when err has been wrapped
    var krErr kreuzberg.KreuzbergError
    if errors.As(err, &krErr) {
        log.Printf("Error kind: %v\n", krErr.Kind())
    }
}

Error Handling Examples¶

Handle file not found:

result, err := kreuzberg.ExtractFileSync("missing.pdf", nil)
if err != nil {
    var ioErr *kreuzberg.IOError
    if errors.As(err, &ioErr) {
        log.Println("File not found or unreadable")
        return
    }
    log.Fatalf("unexpected error: %v\n", err)
}

Handle missing OCR dependency:

cfg := &kreuzberg.ExtractionConfig{
    OCR: &kreuzberg.OCRConfig{
        Backend:  "tesseract",
        Language: stringPtr("eng"),
    },
}

result, err := kreuzberg.ExtractFileSync("scanned.pdf", cfg)
if err != nil {
    var missingDep *kreuzberg.MissingDependencyError
    if errors.As(err, &missingDep) {
        log.Printf("Install %s to use OCR\n", missingDep.Dependency)
        return
    }
    log.Fatalf("extraction failed: %v\n", err)
}

Batch error handling:

results, err := kreuzberg.BatchExtractFilesSync(files, nil)
if err != nil {
    log.Fatalf("batch setup failed: %v\n", err)
}

for i, result := range results {
    if result == nil {
        log.Printf("File %d: extraction failed (nil result)\n", i)
        continue
    }

    // Check for per-file errors
    if result.Metadata.Error != nil {
        log.Printf("File %d: %s - %s\n", i, result.Metadata.Error.ErrorType, result.Metadata.Error.Message)
        continue
    }

    if !result.Success {
        log.Printf("File %d: extraction unsuccessful\n", i)
        continue
    }

    log.Printf("File %d: success (%d chars)\n", i, len(result.Content))
}

Advanced Usage¶

MIME Type Detection¶

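The byte-based functions (ExtractBytesSync, ExtractBytes) take an explicit MIME type, so determine it from the file extension or content before calling them:

// Detection from a filename requires kreuzberg-ffi binding support.
// Otherwise use the standard library, system tools, or your own MIME database.
mimeType := mime.TypeByExtension(".pdf") // "application/pdf"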
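
One way to derive the MIME type with the standard library alone is to try the file extension first and fall back to sniffing the leading bytes. This is a sketch under that assumption; guessMimeType is a hypothetical helper, not part of the binding, and the types it returns are not guaranteed to match the set Kreuzberg supports.

```go
package main

import (
    "fmt"
    "mime"
    "net/http"
    "path/filepath"
)

// guessMimeType tries the file extension first and falls back to sniffing
// the leading bytes of the content. Parameters such as "; charset=utf-8"
// are stripped so only the bare media type is returned.
func guessMimeType(path string, head []byte) string {
    if byExt := mime.TypeByExtension(filepath.Ext(path)); byExt != "" {
        if mediaType, _, err := mime.ParseMediaType(byExt); err == nil {
            return mediaType
        }
    }
    // DetectContentType inspects at most the first 512 bytes and falls back
    // to "application/octet-stream" for unrecognized binary data.
    sniffed := http.DetectContentType(head)
    if mediaType, _, err := mime.ParseMediaType(sniffed); err == nil {
        return mediaType
    }
    return sniffed
}

func main() {
    fmt.Println(guessMimeType("report.pdf", nil))
    fmt.Println(guessMimeType("upload", []byte("%PDF-1.7\n")))
}
```

Extension lookup is cheap but trusts the filename; sniffing is more robust for uploads with missing or misleading extensions, at the cost of reading the first bytes of the content.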

CGO-Specific Patterns¶

Memory Management¶

cgo does not free C allocations automatically, so the binding does it for you. Kreuzberg copies results into Go memory and releases the C pointers internally via defer statements; callers never manage C memory directly:

// Safe: strings are copied to Go memory, C strings freed internally
result, err := kreuzberg.ExtractFileSync("doc.pdf", nil)

// Safe: byte slices are copied, C buffers freed internally
// (plain assignment, since result and err are already declared above)
result, err = kreuzberg.ExtractBytesSync(data, "application/pdf", nil)

Library Path Configuration¶

Set library paths before running your program:

Linux:

export LD_LIBRARY_PATH=$PWD/target/release:$LD_LIBRARY_PATH
go run main.go

macOS:

export DYLD_FALLBACK_LIBRARY_PATH=$PWD/target/release:$DYLD_FALLBACK_LIBRARY_PATH
go run main.go

Windows:

set PATH=%CD%\target\release;%PATH%
go run main.go

Configuration as JSON¶

Internally, ExtractionConfig is serialized to JSON and passed to the C FFI:

// This internally becomes:
// {
//   "use_cache": true,
//   "ocr": {
//     "backend": "tesseract",
//     "language": "eng"
//   }
// }

cfg := &kreuzberg.ExtractionConfig{
    UseCache: boolPtr(true),
    OCR: &kreuzberg.OCRConfig{
        Backend:  "tesseract",
        Language: stringPtr("eng"),
    },
}

result, err := kreuzberg.ExtractFileSync("doc.pdf", cfg)
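
The serialization step also shows why the config fields are pointers: a nil pointer can be omitted from the JSON so the core applies its default, while an explicit false is still sent. The sketch below uses local mirror structs; the struct names and JSON tags are assumptions inferred from the JSON shown above (force_ocr is a guess by analogy), not the binding's actual definitions.

```go
package main

import (
    "encoding/json"
    "fmt"
)

// ocrConfig and extractionConfig are local mirrors used for illustration.
// The JSON tags are assumptions, not the binding's real field tags.
type ocrConfig struct {
    Backend  string  `json:"backend"`
    Language *string `json:"language,omitempty"`
}

type extractionConfig struct {
    UseCache *bool      `json:"use_cache,omitempty"`
    ForceOCR *bool      `json:"force_ocr,omitempty"`
    OCR      *ocrConfig `json:"ocr,omitempty"`
}

// ptr returns a pointer to a copy of v.
func ptr[T any](v T) *T { return &v }

// encodeConfig serializes the config; nil pointer fields are left out.
func encodeConfig(cfg extractionConfig) (string, error) {
    data, err := json.Marshal(cfg)
    if err != nil {
        return "", err
    }
    return string(data), nil
}

func main() {
    cfg := extractionConfig{
        UseCache: ptr(true),
        OCR:      &ocrConfig{Backend: "tesseract", Language: ptr("eng")},
    }
    out, err := encodeConfig(cfg)
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Println(out)
}
```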

Custom Post-Processors¶

Register custom post-processing logic in Go:

package main

import (
    "C"
    "encoding/json"
    "log"
    "strings"

    "github.com/Goldziher/kreuzberg/packages/go/kreuzberg"
)

//export myCustomProcessor
func myCustomProcessor(resultJSON *C.char) *C.char {
    // Parse JSON result
    jsonStr := C.GoString(resultJSON)
    var result kreuzberg.ExtractionResult
    if err := json.Unmarshal([]byte(jsonStr), &result); err != nil {
        // Return error as C string (Rust will free it)
        errMsg := C.CString("failed to parse JSON")
        return errMsg
    }

    // Modify content
    result.Content = strings.ToUpper(result.Content)

    // Serialize back to JSON
    modified, _ := json.Marshal(result)
    return C.CString(string(modified))
}

func init() {
    err := kreuzberg.RegisterPostProcessor(
        "go-uppercase",
        100, // priority
        (C.PostProcessorCallback)(C.myCustomProcessor),
    )
    if err != nil {
        log.Fatalf("failed to register post-processor: %v\n", err)
    }
}

func main() {
    cfg := &kreuzberg.ExtractionConfig{
        Postprocessor: &kreuzberg.PostProcessorConfig{
            EnabledProcessors: []string{"go-uppercase"},
        },
    }

    result, err := kreuzberg.ExtractFileSync("doc.pdf", cfg)
    if err != nil {
        log.Fatalf("extraction failed: %v", err)
    }
    log.Println(result.Content) // Content is now uppercase
}

Custom Validators¶

Validate extraction results:

//export myValidator
func myValidator(resultJSON *C.char) *C.char {
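    jsonStr := C.GoString(resultJSON)
    var result kreuzberg.ExtractionResult
    if err := json.Unmarshal([]byte(jsonStr), &result); err != nil {
        // Treat an unparseable result as a validation failure
        return C.CString("failed to parse result JSON")
    }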

    // Validation logic
    if len(result.Content) == 0 {
        errMsg := C.CString("content is empty")
        return errMsg
    }

    // NULL means validation passed
    return nil
}

func init() {
    kreuzberg.RegisterValidator(
        "content-not-empty",
        50,
        (C.ValidatorCallback)(C.myValidator),
    )
}

Custom OCR Backends¶

Register a custom OCR backend:

//export customOCR
func customOCR(imageData *C.uint8_t, width C.uint32_t, height C.uint32_t, lang *C.char) *C.char {
    // Call your OCR library
    // Return JSON-encoded ExtractionResult
    result := kreuzberg.ExtractionResult{
        Content:  "extracted text from custom OCR",
        MimeType: "text/plain",
        Success:  true,
    }
    data, _ := json.Marshal(result)
    return C.CString(string(data))
}

func init() {
    kreuzberg.RegisterOCRBackend(
        "custom-ocr",
        (C.OcrBackendCallback)(C.customOCR),
    )
}

Plugin Management¶

List and manage registered plugins:

// List validators
validators, err := kreuzberg.ListValidators()
if err == nil {
    fmt.Printf("Validators: %v\n", validators)
}

// List post-processors
processors, err := kreuzberg.ListPostProcessors()
if err == nil {
    fmt.Printf("Post-processors: %v\n", processors)
}

// List OCR backends
backends, err := kreuzberg.ListOCRBackends()
if err == nil {
    fmt.Printf("OCR backends: %v\n", backends)
}

// Clear all validators
if err := kreuzberg.ClearValidators(); err != nil {
    log.Fatalf("failed to clear validators: %v\n", err)
}

// Unregister specific validator
if err := kreuzberg.UnregisterValidator("my-validator"); err != nil {
    log.Fatalf("failed to unregister: %v\n", err)
}

Performance Tips¶

Batch Processing: Use BatchExtractFilesSync() for multiple files to leverage internal optimizations
Context Timeouts: Set realistic timeouts; OCR can be slow on large documents
Caching: Enable UseCache: boolPtr(true) to cache frequently extracted documents
Library Paths: Ensure LD_LIBRARY_PATH/DYLD_FALLBACK_LIBRARY_PATH is set in the environment before the program starts
Configuration Reuse: Create and reuse ExtractionConfig objects across multiple calls
Goroutines: Use ExtractFile() / ExtractBytes() variants in goroutines for concurrency

Troubleshooting¶

Library Loading Errors¶

Error: cannot open shared object file: No such file or directory

Solution:

# Verify library exists
ls -la target/release/libkreuzberg_ffi.*

# Set library path
export LD_LIBRARY_PATH=$PWD/target/release:$LD_LIBRARY_PATH

# Test with ldd (Linux)
ldd target/release/libkreuzberg_ffi.so

CGO Compilation Errors¶

Error: error: kreuzberg.h: No such file or directory

Solution:

Ensure kreuzberg-ffi is built before building your Go module:

cargo build -p kreuzberg-ffi --release
go build ./...

Missing OCR Library¶

Error: MissingDependencyError: Missing dependency: tesseract

Solution:

Install Tesseract or use a different OCR backend:

# macOS
brew install tesseract

# Debian/Ubuntu
apt-get install tesseract-ocr

# Or use EasyOCR/PaddleOCR (Python packages)

Context Timeout on Large Documents¶

Issue: Extraction times out before completion

Solution:

Increase the timeout, or avoid forcing OCR on large documents that already contain extractable text:

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel()

cfg := &kreuzberg.ExtractionConfig{
    ForceOCR: boolPtr(false), // Do not force OCR when text can be extracted directly
}

result, err := kreuzberg.ExtractFile(ctx, "large.pdf", cfg)

Testing¶

Run the test suite:

# Unit tests (from packages/go)
task go:test

# Lint (gofmt + golangci-lint)
task go:lint

# E2E tests (from e2e/go, auto-generated from fixtures)
task e2e:go:verify

# Manual test with library path
export LD_LIBRARY_PATH=$PWD/target/release:$LD_LIBRARY_PATH
go test -v ./packages/go/kreuzberg

Helper Functions¶

Add these utility functions to your code:

func stringPtr(s string) *string {
    return &s
}

func boolPtr(b bool) *bool {
    return &b
}

func intPtr(i int) *int {
    return &i
}

func float64Ptr(f float64) *float64 {
    return &f
}

func uint32Ptr(u uint32) *uint32 {
    return &u
}
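
As an alternative to one helper per type, a single generic function covers every pointer field. This is a minimal sketch assuming Go 1.18 or later (satisfied by the Go 1.25+ requirement); ptr is a hypothetical name, not part of the binding.

```go
package main

import "fmt"

// ptr returns a pointer to a copy of v, for any type T.
func ptr[T any](v T) *T {
    return &v
}

func main() {
    useCache := ptr(true)     // *bool
    language := ptr("eng")    // *string
    psm := ptr(3)             // *int
    minConfidence := ptr(0.5) // *float64

    fmt.Println(*useCache, *language, *psm, *minConfidence)
}
```

Untyped constants take their default types, so ptr(3) yields *int and ptr(0.5) yields *float64; for a field such as *uint32, convert explicitly with ptr(uint32(3)).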

Related Resources¶

Source: packages/go/kreuzberg/ (Go binding implementation)
FFI Bridge: crates/kreuzberg-ffi/ (C FFI layer)
Rust Core: crates/kreuzberg/ (extraction logic)
E2E Tests: e2e/go/ (auto-generated test fixtures)
CI: .github/workflows/go-test.yml (test pipeline)
