Go API Reference¶
Complete reference for the Kreuzberg Go bindings using cgo to access the Rust-powered extraction pipeline.
The Go binding exposes the same extraction capabilities as the other language bindings through C FFI bindings to kreuzberg-ffi. You get identical metadata extraction, OCR processing, chunking, embeddings, and plugin support, with both synchronous and context-aware async APIs.
Requirements¶
- Go 1.25+ (with cgo support)
- Rust toolchain (builds kreuzberg-ffi)
- C compiler (gcc/clang for cgo compilation)
- libkreuzberg_ffi native library (staged in target/release)
- libpdfium runtime (auto-discovered via target/release)
- Tesseract/EasyOCR/PaddleOCR (optional, for OCR functionality)
Installation¶
Add the package to your go.mod:
Build the FFI library and set library paths:
# Build the FFI crate
cargo build -p kreuzberg-ffi --release
# Configure library path for your platform
# Linux
export LD_LIBRARY_PATH=$PWD/target/release:$LD_LIBRARY_PATH
# macOS
export DYLD_FALLBACK_LIBRARY_PATH=$PWD/target/release:$DYLD_FALLBACK_LIBRARY_PATH
# Windows
# Add target\release to PATH environment variable
set PATH=%CD%\target\release;%PATH%
Quickstart¶
Basic file extraction (synchronous)¶
package main
import (
"fmt"
"log"
"github.com/Goldziher/kreuzberg/packages/go/kreuzberg"
)
func main() {
result, err := kreuzberg.ExtractFileSync("document.pdf", nil)
if err != nil {
log.Fatalf("extract failed: %v", err)
}
fmt.Printf("Format: %s\n", result.MimeType)
fmt.Printf("Content length: %d\n", len(result.Content))
fmt.Printf("Success: %v\n", result.Success)
}
Async extraction with timeout¶
package main
import (
"context"
"errors"
"log"
"time"
"github.com/Goldziher/kreuzberg/packages/go/kreuzberg"
)
func main() {
ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
defer cancel()
result, err := kreuzberg.ExtractFile(ctx, "large-document.pdf", nil)
if errors.Is(err, context.DeadlineExceeded) {
log.Println("extraction timed out")
return
}
if err != nil {
log.Fatalf("extraction failed: %v", err)
}
log.Printf("Extracted %d characters\n", len(result.Content))
}
Core Functions¶
ExtractFileSync¶
Extract content and metadata from a file synchronously.
Signature:
func ExtractFileSync(path string, config *ExtractionConfig) (*ExtractionResult, error)
Parameters:
- path (string): Path to the file to extract (absolute or relative)
- config (*ExtractionConfig): Optional extraction configuration; uses defaults if nil
Returns:
- *ExtractionResult: Populated result containing content, metadata, tables, chunks, and images
- error: KreuzbergError or standard Go error (see Error Handling section)
Error Handling:
- ValidationError: If path is empty
- IOError: If file not found or not readable
- ParsingError: If document parsing fails
- MissingDependencyError: If required OCR/processing library is missing
- UnsupportedFormatError: If MIME type is not supported
Example - Extract PDF:
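result, err := kreuzberg.ExtractFileSync("report.pdf", nil)
if err != nil {
	log.Fatalf("extraction failed: %v", err)
}
// PdfMetadata reports ok == false for non-PDF documents; its fields are pointers and may be nil
if pdfMeta, ok := result.Metadata.PdfMetadata(); ok {
	if pdfMeta.Title != nil {
		fmt.Printf("Title: %s\n", *pdfMeta.Title)
	}
	if pdfMeta.PageCount != nil {
		fmt.Printf("Page count: %d\n", *pdfMeta.PageCount)
	}
}
fmt.Printf("Content preview: %s...\n", result.Content[:min(100, len(result.Content))])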
Example - Extract with configuration:
cfg := &kreuzberg.ExtractionConfig{
UseCache: boolPtr(true),
OCR: &kreuzberg.OCRConfig{
Backend: "tesseract",
Language: stringPtr("eng"),
},
}
result, err := kreuzberg.ExtractFileSync("scanned.pdf", cfg)
if err != nil {
log.Fatalf("extraction failed: %v", err)
}
ExtractFile¶
Extract content from a file asynchronously with context support.
Signature:
func ExtractFile(ctx context.Context, path string, config *ExtractionConfig) (*ExtractionResult, error)
Parameters:
- ctx (context.Context): Context for cancellation and timeout
- path (string): Path to the file
- config (*ExtractionConfig): Optional configuration
Returns:
- *ExtractionResult: Extraction result
- error: May include context errors (context.DeadlineExceeded, context.Canceled)
Note: Context cancellation is best-effort. The underlying C call cannot be interrupted, so when the context is cancelled or its deadline is exceeded the function returns immediately with ctx.Err() while the native call runs to completion in the background.
Example - With deadline:
ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(30*time.Second))
defer cancel()
result, err := kreuzberg.ExtractFile(ctx, "large.docx", nil)
if errors.Is(err, context.DeadlineExceeded) {
log.Println("extraction took too long")
return
}
if err != nil {
log.Fatalf("extraction failed: %v", err)
}
ExtractBytesSync¶
Extract content from an in-memory byte slice with specified MIME type.
Signature:
func ExtractBytesSync(data []byte, mimeType string, config *ExtractionConfig) (*ExtractionResult, error)
Parameters:
- data ([]byte): Document bytes
- mimeType (string): MIME type (e.g., "application/pdf", "text/plain")
- config (*ExtractionConfig): Optional configuration
Returns:
- *ExtractionResult: Extraction result
- error: KreuzbergError on extraction failure
Example - Extract from downloaded PDF:
httpResp, err := http.Get("https://example.com/document.pdf")
if err != nil {
log.Fatal(err)
}
defer httpResp.Body.Close()
data, err := io.ReadAll(httpResp.Body)
if err != nil {
log.Fatal(err)
}
result, err := kreuzberg.ExtractBytesSync(data, "application/pdf", nil)
if err != nil {
log.Fatalf("extraction failed: %v", err)
}
fmt.Printf("Extracted %d words\n", len(strings.Fields(result.Content)))
ExtractBytes¶
Extract content from in-memory bytes asynchronously.
Signature:
func ExtractBytes(ctx context.Context, data []byte, mimeType string, config *ExtractionConfig) (*ExtractionResult, error)
Parameters:
- ctx (context.Context): Context for cancellation and timeout
- data ([]byte): Document bytes
- mimeType (string): MIME type
- config (*ExtractionConfig): Optional configuration
Returns:
- *ExtractionResult: Extraction result
- error: KreuzbergError or context error
Example:
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
result, err := kreuzberg.ExtractBytes(ctx, data, "text/html", nil)
if err != nil {
log.Fatalf("extraction failed: %v", err)
}
BatchExtractFilesSync¶
Extract multiple files in a single blocking call using the optimized batch pipeline.
Signature:
func BatchExtractFilesSync(paths []string, config *ExtractionConfig) ([]*ExtractionResult, error)
Parameters:
- paths ([]string): Slice of file paths
- config (*ExtractionConfig): Configuration applied to all files
Returns:
- []*ExtractionResult: Slice of results (one per input file; may contain nils for failed extractions)
- error: Returned only if batch setup fails; individual file errors are captured in each result's Metadata.Error field (an ErrorMetadata value)
Example - Batch extract multiple PDFs:
files := []string{"doc1.pdf", "doc2.pdf", "doc3.pdf"}
results, err := kreuzberg.BatchExtractFilesSync(files, nil)
if err != nil {
log.Fatalf("batch extraction setup failed: %v", err)
}
for i, result := range results {
if result == nil {
fmt.Printf("File %d: extraction failed\n", i)
continue
}
if result.Metadata.Error != nil {
fmt.Printf("File %d: %s (%s)\n", i, result.Metadata.Error.ErrorType, result.Metadata.Error.Message)
continue
}
fmt.Printf("File %d: extracted %d chars\n", i, len(result.Content))
}
BatchExtractFiles¶
Batch extract multiple files asynchronously.
Signature:
func BatchExtractFiles(ctx context.Context, paths []string, config *ExtractionConfig) ([]*ExtractionResult, error)
Parameters:
- ctx (context.Context): Context for cancellation
- paths ([]string): File paths
- config (*ExtractionConfig): Configuration for all files
Returns:
- []*ExtractionResult: Results slice
- error: Context or setup errors
BatchExtractBytesSync¶
Extract multiple in-memory documents in a single batch operation.
Signature:
func BatchExtractBytesSync(items []BytesWithMime, config *ExtractionConfig) ([]*ExtractionResult, error)
Parameters:
- items ([]BytesWithMime): Slice of {Data, MimeType} pairs
- config (*ExtractionConfig): Configuration applied to all items
Returns:
- []*ExtractionResult: Results slice
- error: Setup error or validation error
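BytesWithMime structure:
type BytesWithMime struct {
	Data     []byte // Document bytes
	MimeType string // MIME type of the document
}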
Example - Batch extract multiple formats:
items := []kreuzberg.BytesWithMime{
{Data: pdfData, MimeType: "application/pdf"},
{Data: docxData, MimeType: "application/vnd.openxmlformats-officedocument.wordprocessingml.document"},
{Data: htmlData, MimeType: "text/html"},
}
results, err := kreuzberg.BatchExtractBytesSync(items, nil)
if err != nil {
log.Fatalf("batch extraction failed: %v", err)
}
for i, result := range results {
if result == nil || !result.Success {
log.Printf("Item %d extraction failed\n", i)
continue
}
log.Printf("Item %d: %s format\n", i, result.MimeType)
}
BatchExtractBytes¶
Batch extract in-memory documents asynchronously.
Signature:
func BatchExtractBytes(ctx context.Context, items []BytesWithMime, config *ExtractionConfig) ([]*ExtractionResult, error)
Parameters:
- ctx (context.Context): Context for cancellation
- items ([]BytesWithMime): Document slice
- config (*ExtractionConfig): Configuration
Returns:
- []*ExtractionResult: Results slice
- error: Context or setup errors
LibraryVersion¶
Get the version of the underlying Rust library.
Signature:
func LibraryVersion() string
Returns:
- string: Version string (e.g., "4.0.0-rc.1")
Example:
fmt.Println("Kreuzberg version:", kreuzberg.LibraryVersion())
Configuration¶
ExtractionConfig¶
Root configuration struct for all extraction operations. All fields are optional (pointers); omitted fields use Kreuzberg defaults.
Signature:
type ExtractionConfig struct {
UseCache *bool // Enable result caching
EnableQualityProcessing *bool // Run quality improvements
OCR *OCRConfig // OCR backend and settings
ForceOCR *bool // Force OCR even for text-extractable docs
Chunking *ChunkingConfig // Text chunking and embeddings
Images *ImageExtractionConfig // Image extraction from docs
PdfOptions *PdfConfig // PDF-specific options
TokenReduction *TokenReductionConfig // Token pruning before embeddings
LanguageDetection *LanguageDetectionConfig // Language detection settings
Keywords *KeywordConfig // Keyword extraction
Postprocessor *PostProcessorConfig // Post-processor selection
HTMLOptions *HTMLConversionOptions // HTML-to-Markdown conversion
MaxConcurrentExtractions *int // Batch concurrency limit
}
OCRConfig¶
Configure OCR backend selection and language.
Signature:
type OCRConfig struct {
Backend string // OCR backend name: "tesseract", "easyocr", "paddle", etc.
Language *string // Language code (e.g., "eng", "deu", "fra")
Tesseract *TesseractConfig // Tesseract-specific fine-tuning
}
Example:
cfg := &kreuzberg.ExtractionConfig{
OCR: &kreuzberg.OCRConfig{
Backend: "tesseract",
Language: stringPtr("eng"),
Tesseract: &kreuzberg.TesseractConfig{
PSM: intPtr(3),
MinConfidence: float64Ptr(0.5),
},
},
}
TesseractConfig¶
Fine-grained Tesseract OCR tuning.
Signature:
type TesseractConfig struct {
Language string // Language code
PSM *int // Page segmentation mode (0-13)
OutputFormat string // Output format: "text", "pdf", "hocr"
OEM *int // Engine mode (0-3)
MinConfidence *float64 // Confidence threshold (0.0-1.0)
Preprocessing *ImagePreprocessingConfig // Image preprocessing
EnableTableDetection *bool // Detect and extract tables
TableMinConfidence *float64 // Table detection confidence
TableColumnThreshold *int // Column separation threshold
TableRowThresholdRatio *float64 // Row separation ratio
UseCache *bool // Cache OCR results
// Additional Tesseract parameters...
TesseditCharWhitelist string // Character whitelist
TesseditCharBlacklist string // Character blacklist
}
ImagePreprocessingConfig¶
Configure OCR image preprocessing (DPI normalization, rotation, denoising, etc.).
Signature:
type ImagePreprocessingConfig struct {
TargetDPI *int // Target DPI for OCR (typically 300)
AutoRotate *bool // Auto-detect and correct image rotation
Deskew *bool // Correct skewed text
Denoise *bool // Remove noise
ContrastEnhance *bool // Enhance contrast
BinarizationMode string // Binarization method: "otsu", "adaptive"
InvertColors *bool // Invert black/white
}
ChunkingConfig¶
Configure text chunking for RAG and retrieval workloads.
Signature:
type ChunkingConfig struct {
MaxChars *int // Maximum characters per chunk
MaxOverlap *int // Overlap between chunks
ChunkSize *int // Alias for MaxChars
ChunkOverlap *int // Alias for MaxOverlap
Preset *string // Preset: "semantic", "sliding", "recursive"
Embedding *EmbeddingConfig // Embedding generation
Enabled *bool // Enable chunking
}
ImageExtractionConfig¶
Configure image extraction from documents.
Signature:
type ImageExtractionConfig struct {
ExtractImages *bool // Extract embedded images
TargetDPI *int // Target DPI for extraction
MaxImageDimension *int // Maximum dimension (width/height)
AutoAdjustDPI *bool // Auto-adjust DPI for small images
MinDPI *int // Minimum DPI threshold
MaxDPI *int // Maximum DPI threshold
}
PdfConfig¶
PDF-specific extraction options.
Signature:
type PdfConfig struct {
ExtractImages *bool // Extract embedded images
Passwords []string // List of passwords for encrypted PDFs
ExtractMetadata *bool // Extract document metadata
}
EmbeddingConfig¶
Configure embedding generation for chunks.
Signature:
type EmbeddingConfig struct {
Model *EmbeddingModelType // Model selection
Normalize *bool // L2 normalization
BatchSize *int // Batch size for inference
ShowDownloadProgress *bool // Show download progress
CacheDir *string // Cache directory
}
type EmbeddingModelType struct {
Type string // "preset", "fastembed", "custom"
Name string // For preset models
Model string // For fastembed/custom
ModelID string // Alias for custom
Dimensions *int // Embedding dimensions
}
KeywordConfig¶
Configure keyword extraction.
Signature:
type KeywordConfig struct {
Algorithm string // "yake" or "rake"
MaxKeywords *int // Maximum keywords to extract
MinScore *float64 // Minimum keyword score
NgramRange *[2]int // N-gram range: [min, max]
Language *string // Language code
Yake *YakeParams // YAKE-specific tuning
Rake *RakeParams // RAKE-specific tuning
}
type YakeParams struct {
WindowSize *int
}
type RakeParams struct {
MinWordLength *int
MaxWordsPerPhrase *int
}
PostProcessorConfig¶
Configure post-processing steps.
Signature:
type PostProcessorConfig struct {
Enabled *bool // Enable post-processing
EnabledProcessors []string // Specific processors to run
DisabledProcessors []string // Processors to skip
}
Results & Types¶
ExtractionResult¶
The main result struct containing all extracted data.
Signature:
type ExtractionResult struct {
Content string // Extracted text content
MimeType string // Detected MIME type
Metadata Metadata // Document metadata
Tables []Table // Extracted tables
DetectedLanguages []string // Detected languages
Chunks []Chunk // Text chunks (if enabled)
Images []ExtractedImage // Embedded images (if enabled)
Success bool // Extraction success flag
}
Example - Accessing results:
result, err := kreuzberg.ExtractFileSync("report.pdf", nil)
if err != nil {
	log.Fatalf("extraction failed: %v", err)
}
if !result.Success {
	log.Fatal("extraction reported failure")
}
fmt.Printf("Detected MIME type: %s\n", result.MimeType)
fmt.Printf("Content length: %d\n", len(result.Content))
fmt.Printf("Detected languages: %v\n", result.DetectedLanguages)
fmt.Printf("Number of tables: %d\n", len(result.Tables))
fmt.Printf("Number of chunks: %d\n", len(result.Chunks))
fmt.Printf("Number of images: %d\n", len(result.Images))
Metadata¶
Aggregated document metadata with format-specific fields.
Signature:
type Metadata struct {
Language *string // Detected language code
Date *string // Extracted document date
Subject *string // Document subject
Format FormatMetadata // Format-specific metadata
ImagePreprocessing *ImagePreprocessingMetadata // OCR preprocessing info
JSONSchema json.RawMessage // JSON Schema if available
Error *ErrorMetadata // Error info for batch operations
Additional map[string]json.RawMessage // Custom/additional fields
}
Access format-specific metadata:
// Type discriminator
fmt.Println("Format type:", result.Metadata.FormatType())
// Type-safe accessors
if pdfMeta, ok := result.Metadata.PdfMetadata(); ok {
	// Optional fields are pointers; check for nil before dereferencing
	if pdfMeta.Title != nil {
		fmt.Printf("Title: %s\n", *pdfMeta.Title)
	}
	if pdfMeta.PageCount != nil {
		fmt.Printf("Pages: %d\n", *pdfMeta.PageCount)
	}
	if len(pdfMeta.Authors) > 0 {
		fmt.Printf("Author: %s\n", *pdfMeta.Authors[0])
	}
}
if excelMeta, ok := result.Metadata.ExcelMetadata(); ok {
	fmt.Printf("Sheets: %d\n", excelMeta.SheetCount)
	fmt.Printf("Sheet names: %v\n", excelMeta.SheetNames)
}
if htmlMeta, ok := result.Metadata.HTMLMetadata(); ok {
	if htmlMeta.Title != nil {
		fmt.Printf("Page title: %s\n", *htmlMeta.Title)
	}
	if htmlMeta.OGImage != nil {
		fmt.Printf("OG image: %s\n", *htmlMeta.OGImage)
	}
}
Table¶
Extracted table structure.
Signature:
type Table struct {
Cells [][]string // 2D cell array [row][col]
Markdown string // Markdown representation
PageNumber int // Page number (PDF/Image documents)
}
Example:
for tableIdx, table := range result.Tables {
fmt.Printf("Table %d (page %d):\n", tableIdx, table.PageNumber)
for _, row := range table.Cells {
fmt.Println(strings.Join(row, " | "))
}
fmt.Println("Markdown:", table.Markdown)
}
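Joining cells with " | " as above is fine for display, but it becomes ambiguous once a cell itself contains the separator. When table data needs to be handed to another tool, encoding the cell grid as CSV is one option. The sketch below uses only the standard library; the table type is a local stand-in that mirrors the Cells field, and tableToCSV is a hypothetical helper, not part of the binding.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"log"
)

// table is a local stand-in mirroring the Cells field of the Table struct.
type table struct {
	Cells [][]string // 2D cell array [row][col]
}

// tableToCSV encodes the cell grid as CSV, quoting cells where required.
func tableToCSV(t table) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	// WriteAll writes every record and flushes the writer.
	if err := w.WriteAll(t.Cells); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	t := table{Cells: [][]string{
		{"Name", "Qty"},
		{"Widget, large", "2"}, // the comma forces quoting
	}}
	out, err := tableToCSV(t)
	if err != nil {
		log.Fatalf("csv encoding failed: %v", err)
	}
	fmt.Print(out)
}
```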
Chunk¶
Text chunk with optional embeddings and metadata.
Signature:
type Chunk struct {
Content string // Chunk text
Embedding []float32 // Embedding vector (if enabled)
Metadata ChunkMetadata // Chunk positioning
}
type ChunkMetadata struct {
CharStart int // Character offset in original content
CharEnd int // End character offset
TokenCount *int // Token count (if available)
ChunkIndex int // Index in chunk sequence
TotalChunks int // Total number of chunks
}
Example:
for _, chunk := range result.Chunks {
fmt.Printf("Chunk %d/%d\n", chunk.Metadata.ChunkIndex, chunk.Metadata.TotalChunks)
fmt.Printf("Content: %s...\n", chunk.Content[:min(50, len(chunk.Content))])
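	if chunk.Metadata.TokenCount != nil { // TokenCount is only set when available
		fmt.Printf("Tokens: %d\n", *chunk.Metadata.TokenCount)
	}
	if len(chunk.Embedding) > 0 {
		fmt.Printf("Embedding dim: %d\n", len(chunk.Embedding))
		fmt.Printf("First 5 values: %v\n", chunk.Embedding[:min(5, len(chunk.Embedding))])
	}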
}
ExtractedImage¶
Image extracted from document with optional OCR results.
Signature:
type ExtractedImage struct {
Data []byte // Raw image bytes
Format string // Image format: "jpeg", "png", "webp"
ImageIndex int // Index in images list
PageNumber *int // Page number (if applicable)
Width *uint32 // Image width in pixels
Height *uint32 // Image height in pixels
Colorspace *string // Colorspace (sRGB, CMYK, etc.)
BitsPerComponent *uint32 // Bits per color component
IsMask bool // Is image a mask?
Description *string // Image description/alt text
OCRResult *ExtractionResult // Nested OCR extraction
}
Example:
for imgIdx, img := range result.Images {
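	// Width and Height are optional, so guard the dereference
	if img.Width != nil && img.Height != nil {
		fmt.Printf("Image %d: %s, %dx%d\n", imgIdx, img.Format, *img.Width, *img.Height)
	}
	// Save image
	filename := fmt.Sprintf("image_%d.%s", imgIdx, img.Format)
	if err := os.WriteFile(filename, img.Data, 0644); err != nil {
		log.Printf("failed to write %s: %v", filename, err)
	}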
// OCR if available
if img.OCRResult != nil {
fmt.Printf("Image %d OCR: %s\n", imgIdx, img.OCRResult.Content)
}
}
Error Handling¶
Error Types¶
Kreuzberg defines a type hierarchy of errors via the KreuzbergError interface:
type KreuzbergError interface {
error
Kind() ErrorKind
}
type ErrorKind string
const (
ErrorKindUnknown ErrorKind = "unknown"
ErrorKindIO ErrorKind = "io"
ErrorKindValidation ErrorKind = "validation"
ErrorKindParsing ErrorKind = "parsing"
ErrorKindOCR ErrorKind = "ocr"
ErrorKindCache ErrorKind = "cache"
ErrorKindImageProcessing ErrorKind = "image_processing"
ErrorKindSerialization ErrorKind = "serialization"
ErrorKindMissingDependency ErrorKind = "missing_dependency"
ErrorKindPlugin ErrorKind = "plugin"
ErrorKindUnsupportedFormat ErrorKind = "unsupported_format"
ErrorKindRuntime ErrorKind = "runtime"
)
Error type classes:
- ValidationError: Input validation failed (empty paths, missing MIME types)
- ParsingError: Document parsing failed (malformed file, unsupported format)
- OCRError: OCR backend failure (library missing, invalid language)
- CacheError: Cache operation failed
- ImageProcessingError: Image manipulation failed
- SerializationError: JSON encoding/decoding failed
- MissingDependencyError: Required library not found (Tesseract, EasyOCR, etc.)
- PluginError: Plugin registration or execution failed
- UnsupportedFormatError: MIME type not supported
- IOError: File I/O failure
- RuntimeError: Unexpected runtime failure (lock poisoning, etc.)
Error Classification¶
Errors are automatically classified based on native error messages. Use errors.As() and errors.Is() to handle specific error types:
import (
"errors"
"log"
"github.com/Goldziher/kreuzberg/packages/go/kreuzberg"
)
result, err := kreuzberg.ExtractFileSync("document.pdf", nil)
if err != nil {
// Check specific error type
var parsingErr *kreuzberg.ParsingError
if errors.As(err, &parsingErr) {
log.Printf("Parsing failed: %v\n", parsingErr)
return
}
var missingDep *kreuzberg.MissingDependencyError
if errors.As(err, &missingDep) {
log.Printf("Missing dependency: %s\n", missingDep.Dependency)
return
}
// Generic error handling
log.Printf("Extraction failed: %v\n", err)
}
Error Unwrapping¶
All Kreuzberg errors support error unwrapping via errors.Unwrap():
result, err := kreuzberg.ExtractFileSync("doc.pdf", nil)
if err != nil {
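	// errors.Unwrap returns the next error in the chain (nil if there is none),
	// which is not necessarily the deepest cause
	if cause := errors.Unwrap(err); cause != nil {
		log.Printf("Underlying cause: %v\n", cause)
	}
	// Check error kind; unlike a type assertion, errors.As also matches wrapped errors
	var krErr kreuzberg.KreuzbergError
	if errors.As(err, &krErr) {
		log.Printf("Error kind: %v\n", krErr.Kind())
	}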
}
Error Handling Examples¶
Handle file not found:
result, err := kreuzberg.ExtractFileSync("missing.pdf", nil)
if err != nil {
var ioErr *kreuzberg.IOError
if errors.As(err, &ioErr) {
log.Println("File not found or unreadable")
return
}
log.Fatalf("unexpected error: %v\n", err)
}
Handle missing OCR dependency:
cfg := &kreuzberg.ExtractionConfig{
OCR: &kreuzberg.OCRConfig{
Backend: "tesseract",
Language: stringPtr("eng"),
},
}
result, err := kreuzberg.ExtractFileSync("scanned.pdf", cfg)
if err != nil {
var missingDep *kreuzberg.MissingDependencyError
if errors.As(err, &missingDep) {
log.Printf("Install %s to use OCR\n", missingDep.Dependency)
return
}
log.Fatalf("extraction failed: %v\n", err)
}
Batch error handling:
results, err := kreuzberg.BatchExtractFilesSync(files, nil)
if err != nil {
log.Fatalf("batch setup failed: %v\n", err)
}
for i, result := range results {
if result == nil {
log.Printf("File %d: extraction failed (nil result)\n", i)
continue
}
// Check for per-file errors
if result.Metadata.Error != nil {
log.Printf("File %d: %s - %s\n", i, result.Metadata.Error.ErrorType, result.Metadata.Error.Message)
continue
}
if !result.Success {
log.Printf("File %d: extraction unsuccessful\n", i)
continue
}
log.Printf("File %d: success (%d chars)\n", i, len(result.Content))
}
Advanced Usage¶
MIME Type Detection¶
Detect MIME type from file extension or content:
// Detection from a filename depends on kreuzberg-ffi binding support.
// If it is unavailable, use the standard library, system tools, or your own MIME database.
mimeType := "application/pdf" // e.g., use mime.TypeByExtension(".pdf")
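A standard-library approach to the detection mentioned above might look like the following sketch. It tries the file extension first and falls back to sniffing the leading bytes; detectMIME is a hypothetical helper, and the result of mime.TypeByExtension can vary with the MIME tables installed on the host.

```go
package main

import (
	"fmt"
	"mime"
	"net/http"
	"path/filepath"
)

// detectMIME guesses a MIME type from the file extension and falls back to
// sniffing the leading bytes of the content. Parameters such as
// "; charset=utf-8" are stripped from the result.
func detectMIME(path string, head []byte) string {
	guess := mime.TypeByExtension(filepath.Ext(path))
	if guess == "" {
		// DetectContentType inspects at most the first 512 bytes and
		// returns "application/octet-stream" when nothing matches.
		guess = http.DetectContentType(head)
	}
	if mediaType, _, err := mime.ParseMediaType(guess); err == nil {
		return mediaType
	}
	return guess
}

func main() {
	// Extension-based lookup.
	fmt.Println(detectMIME("report.pdf", nil))

	// No extension: fall back to the PDF magic bytes.
	fmt.Println(detectMIME("upload", []byte("%PDF-1.7\n")))
}
```

The returned string can then be passed as the mimeType argument of ExtractBytesSync or ExtractBytes.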
CGO-Specific Patterns¶
Memory Management¶
cgo does not manage C memory automatically; anything allocated on the C side must be freed explicitly. Kreuzberg handles C pointer cleanup internally via defer statements, so callers only receive Go-owned values:
// Safe: strings are copied to Go memory, C strings freed internally
result, err := kreuzberg.ExtractFileSync("doc.pdf", nil)
// Safe: byte slices are copied, C buffers freed internally
result, err = kreuzberg.ExtractBytesSync(data, "application/pdf", nil)
Library Path Configuration¶
Set library paths before running your program:
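Linux:
export LD_LIBRARY_PATH=$PWD/target/release:$LD_LIBRARY_PATH
macOS:
export DYLD_FALLBACK_LIBRARY_PATH=$PWD/target/release:$DYLD_FALLBACK_LIBRARY_PATH
Windows:
set PATH=%CD%\target\release;%PATH%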
Configuration as JSON¶
Internally, ExtractionConfig is serialized to JSON and passed to the C FFI:
// This internally becomes:
// {
// "use_cache": true,
// "ocr": {
// "backend": "tesseract",
// "language": "eng"
// }
// }
cfg := &kreuzberg.ExtractionConfig{
UseCache: boolPtr(true),
OCR: &kreuzberg.OCRConfig{
Backend: "tesseract",
Language: stringPtr("eng"),
},
}
result, err := kreuzberg.ExtractFileSync("doc.pdf", cfg)
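The reason the config fields are pointers becomes visible in the JSON: a nil pointer can be left out entirely, while a pointer to false is still emitted. The sketch below demonstrates this with local stand-in structs; the JSON tags (including omitempty) are an assumption inferred from the snake_case keys shown above, not copied from the binding's source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ocrConfig and extractionConfig are local stand-ins for two of the config
// structs. The struct tags are assumed, not taken from the library.
type ocrConfig struct {
	Backend  string  `json:"backend"`
	Language *string `json:"language,omitempty"`
}

type extractionConfig struct {
	UseCache *bool      `json:"use_cache,omitempty"`
	ForceOCR *bool      `json:"force_ocr,omitempty"`
	OCR      *ocrConfig `json:"ocr,omitempty"`
}

// encode serializes a config to a JSON string.
func encode(cfg extractionConfig) string {
	out, err := json.Marshal(cfg)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(out)
}

func main() {
	yes, no, lang := true, false, "eng"

	// Unset (nil) fields are omitted, so the native side applies its defaults.
	fmt.Println(encode(extractionConfig{
		UseCache: &yes,
		OCR:      &ocrConfig{Backend: "tesseract", Language: &lang},
	}))
	// {"use_cache":true,"ocr":{"backend":"tesseract","language":"eng"}}

	// A pointer to false is not nil, so an explicit false is transmitted.
	fmt.Println(encode(extractionConfig{ForceOCR: &no}))
	// {"force_ocr":false}
}
```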
Custom Post-Processors¶
Register custom post-processing logic in Go:
package main
import (
	"C"
	"encoding/json"
	"log"
	"strings"
	"github.com/Goldziher/kreuzberg/packages/go/kreuzberg"
)
//export myCustomProcessor
func myCustomProcessor(resultJSON *C.char) *C.char {
// Parse JSON result
jsonStr := C.GoString(resultJSON)
var result kreuzberg.ExtractionResult
if err := json.Unmarshal([]byte(jsonStr), &result); err != nil {
// Return error as C string (Rust will free it)
errMsg := C.CString("failed to parse JSON")
return errMsg
}
// Modify content
result.Content = strings.ToUpper(result.Content)
// Serialize back to JSON
	modified, err := json.Marshal(result)
	if err != nil {
		return C.CString("failed to serialize result")
	}
	return C.CString(string(modified))
}
func init() {
err := kreuzberg.RegisterPostProcessor(
"go-uppercase",
100, // priority
(C.PostProcessorCallback)(C.myCustomProcessor),
)
if err != nil {
log.Fatalf("failed to register post-processor: %v\n", err)
}
}
func main() {
cfg := &kreuzberg.ExtractionConfig{
Postprocessor: &kreuzberg.PostProcessorConfig{
EnabledProcessors: []string{"go-uppercase"},
},
}
	result, err := kreuzberg.ExtractFileSync("doc.pdf", cfg)
	if err != nil {
		log.Fatalf("extraction failed: %v", err)
	}
	log.Println(result.Content) // Content is now uppercase
}
Custom Validators¶
Validate extraction results:
//export myValidator
func myValidator(resultJSON *C.char) *C.char {
jsonStr := C.GoString(resultJSON)
var result kreuzberg.ExtractionResult
	if err := json.Unmarshal([]byte(jsonStr), &result); err != nil {
		return C.CString("failed to parse JSON")
	}
// Validation logic
if len(result.Content) == 0 {
errMsg := C.CString("content is empty")
return errMsg
}
// NULL means validation passed
return nil
}
func init() {
kreuzberg.RegisterValidator(
"content-not-empty",
50,
(C.ValidatorCallback)(C.myValidator),
)
}
Custom OCR Backends¶
Register a custom OCR backend:
//export customOCR
func customOCR(imageData *C.uint8_t, width C.uint32_t, height C.uint32_t, lang *C.char) *C.char {
// Call your OCR library
// Return JSON-encoded ExtractionResult
result := kreuzberg.ExtractionResult{
Content: "extracted text from custom OCR",
MimeType: "text/plain",
Success: true,
}
data, _ := json.Marshal(result)
return C.CString(string(data))
}
func init() {
kreuzberg.RegisterOCRBackend(
"custom-ocr",
(C.OcrBackendCallback)(C.customOCR),
)
}
Plugin Management¶
List and manage registered plugins:
// List validators
validators, err := kreuzberg.ListValidators()
if err == nil {
fmt.Printf("Validators: %v\n", validators)
}
// List post-processors
processors, err := kreuzberg.ListPostProcessors()
if err == nil {
fmt.Printf("Post-processors: %v\n", processors)
}
// List OCR backends
backends, err := kreuzberg.ListOCRBackends()
if err == nil {
fmt.Printf("OCR backends: %v\n", backends)
}
// Clear all validators
if err := kreuzberg.ClearValidators(); err != nil {
log.Fatalf("failed to clear validators: %v\n", err)
}
// Unregister specific validator
if err := kreuzberg.UnregisterValidator("my-validator"); err != nil {
log.Fatalf("failed to unregister: %v\n", err)
}
Performance Tips¶
- Batch Processing: Use BatchExtractFilesSync() for multiple files to leverage internal optimizations
- Context Timeouts: Set realistic timeouts; OCR can be slow on large documents
- Caching: Enable UseCache: boolPtr(true) to cache frequently extracted documents
- Library Paths: Ensure LD_LIBRARY_PATH/DYLD_FALLBACK_LIBRARY_PATH is set before the program starts
- Configuration Reuse: Create and reuse ExtractionConfig objects across multiple calls
- Goroutines: Use ExtractFile()/ExtractBytes() variants in goroutines for concurrency
Troubleshooting¶
Library Loading Errors¶
Error: cannot open shared object file: No such file or directory
Solution:
# Verify library exists
ls -la target/release/libkreuzberg_ffi.*
# Set library path
export LD_LIBRARY_PATH=$PWD/target/release:$LD_LIBRARY_PATH
# Test with ldd (Linux)
ldd target/release/libkreuzberg_ffi.so
CGO Compilation Errors¶
Error: error: kreuzberg.h: No such file or directory
Solution:
Ensure kreuzberg-ffi is built before building your Go module:
cargo build -p kreuzberg-ffi --release
Missing OCR Library¶
Error: MissingDependencyError: Missing dependency: tesseract
Solution:
Install Tesseract or use a different OCR backend:
# macOS
brew install tesseract
# Debian/Ubuntu
apt-get install tesseract-ocr
# Or use EasyOCR/PaddleOCR (Python packages)
Context Timeout on Large Documents¶
Issue: Extraction times out before completion
Solution:
Increase the timeout, and make sure OCR is not forced on documents that already contain extractable text:
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel()
cfg := &kreuzberg.ExtractionConfig{
	ForceOCR: boolPtr(false), // Do not force OCR on text-extractable documents
}
result, err := kreuzberg.ExtractFile(ctx, "large.pdf", cfg)
Testing¶
Run the test suite:
# Unit tests (from packages/go)
task go:test
# Lint (gofmt + golangci-lint)
task go:lint
# E2E tests (from e2e/go, auto-generated from fixtures)
task e2e:go:verify
# Manual test with library path
export LD_LIBRARY_PATH=$PWD/target/release:$LD_LIBRARY_PATH
go test -v ./packages/go/kreuzberg
Helper Functions¶
Add these utility functions to your code:
func stringPtr(s string) *string {
return &s
}
func boolPtr(b bool) *bool {
return &b
}
func intPtr(i int) *int {
return &i
}
func float64Ptr(f float64) *float64 {
return &f
}
func uint32Ptr(u uint32) *uint32 {
return &u
}
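Since the binding requires a Go version with generics, the per-type helpers above can also be collapsed into a single generic function. This is an optional alternative; ptr is a hypothetical name, not something the binding exports.

```go
package main

import "fmt"

// ptr returns a pointer to a copy of v, for any type.
func ptr[T any](v T) *T {
	return &v
}

func main() {
	useCache := ptr(true)  // *bool
	language := ptr("eng") // *string
	psm := ptr(3)          // *int
	minConf := ptr(0.5)    // *float64
	// Untyped integer constants default to int, so other integer types
	// need an explicit conversion.
	width := ptr(uint32(1024)) // *uint32

	fmt.Println(*useCache, *language, *psm, *minConf, *width)
}
```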
Related Resources¶
- Source: packages/go/kreuzberg/ (Go binding implementation)
- FFI Bridge: crates/kreuzberg-ffi/ (C FFI layer)
- Rust Core: crates/kreuzberg/ (extraction logic)
- E2E Tests: e2e/go/ (auto-generated test fixtures)
- CI: .github/workflows/go-test.yml (test pipeline)