WebAssembly API Reference

Complete reference for the Kreuzberg WebAssembly binding (@kreuzberg/wasm).

The WASM binding provides a browser-compatible, runtime-agnostic interface to Kreuzberg's document extraction capabilities. It works in browsers, Node.js, Deno, Bun, and Cloudflare Workers.

Installation

npm
npm install @kreuzberg/wasm

Or with other package managers:

Terminal
# Yarn
yarn add @kreuzberg/wasm

# pnpm
pnpm add @kreuzberg/wasm

Deno

TypeScript
import { extractBytes, initWasm } from "npm:@kreuzberg/wasm@^4.0.0";

Module Initialization

initWasm()

Initialize the WASM module. This must be called once before using any extraction functions.

Signature:

TypeScript
async function initWasm(): Promise<void>

Throws:

  • Error: If WASM module fails to load or is not supported in the current environment

Example - Basic initialization:

init_wasm.ts
import { initWasm } from '@kreuzberg/wasm';

async function main() {
  await initWasm();
  // Now you can use extraction functions
}

main().catch(console.error);

Example - With error handling:

init_with_error_handling.ts
import { initWasm, getWasmCapabilities } from '@kreuzberg/wasm';

async function initializeKreuzberg() {
  const caps = getWasmCapabilities();
  if (!caps.hasWasm) {
    throw new Error('WebAssembly is not supported in this environment');
  }

  try {
    await initWasm();
    console.log('Kreuzberg initialized successfully');
  } catch (error) {
    console.error('Failed to initialize Kreuzberg:', error);
    throw error;
  }
}

initializeKreuzberg().catch(console.error);

isInitialized()

Check if the WASM module is initialized.

Signature:

TypeScript
function isInitialized(): boolean

Returns:

  • boolean: True if WASM module is initialized, false otherwise

Example:

check_init.ts
import { isInitialized, initWasm } from '@kreuzberg/wasm';

if (!isInitialized()) {
  await initWasm();
}
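
Because initialization must happen once before any extraction call, code with several entry points may want a shared guard so that concurrent callers do not each trigger a load. The following is a minimal sketch, not part of the package: `createInitGuard` is a hypothetical helper, and the initializer is passed in as a parameter so the pattern can be shown without loading the real module.

```typescript
// Hypothetical helper: wraps any async initializer (for example initWasm)
// so that concurrent callers share a single in-flight promise.
function createInitGuard(init: () => Promise<void>): () => Promise<void> {
  let pending: Promise<void> | null = null;
  return () => {
    if (pending === null) {
      pending = init().catch((error: unknown) => {
        // Reset so that a later call can retry after a failed load.
        pending = null;
        throw error;
      });
    }
    return pending;
  };
}

// Stand-in initializer used only to demonstrate the call pattern.
let initCalls = 0;
const fakeInit = async (): Promise<void> => {
  initCalls += 1;
};

const ensureReady = createInitGuard(fakeInit);
const firstCall = ensureReady();
const secondCall = ensureReady(); // Reuses the promise created by the first call
```

In an application, the guard would presumably be created once with `createInitGuard(initWasm)` and awaited at the top of each entry point.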

getVersion()

Get the WASM module version.

Signature:

TypeScript
function getVersion(): string

Returns:

  • string: The version string of the WASM module

Throws:

  • Error: If WASM module is not initialized

Example:

get_version.ts
import { initWasm, getVersion } from '@kreuzberg/wasm';

await initWasm();
const version = getVersion();
console.log(`Using Kreuzberg ${version}`);

getInitializationError()

Get the error recorded when the module failed to load. Useful for debugging initialization issues.

Signature:

TypeScript
function getInitializationError(): Error | null

Returns:

  • Error | null: The error that occurred during initialization, or null if no error

Core Extraction Functions

extractBytes()

Extract content from document bytes asynchronously.

Signature:

TypeScript
async function extractBytes(
  data: Uint8Array,
  mimeType: string,
  config?: ExtractionConfig | null
): Promise<ExtractionResult>

Parameters:

  • data (Uint8Array): The document bytes to extract from
  • mimeType (string): MIME type of the document (e.g., 'application/pdf', 'image/jpeg'). Required.
  • config (ExtractionConfig | null): Optional extraction configuration. Uses defaults if not provided.

Returns:

  • Promise<ExtractionResult>: Extraction result containing content, metadata, tables, images, chunks, and more

Throws:

  • Error: If WASM module is not initialized, document data is empty, MIME type is missing, or extraction fails

Example - Extract PDF:

extract_pdf.ts
import { initWasm, extractBytes } from '@kreuzberg/wasm';

await initWasm();

const pdfBytes = new Uint8Array(buffer);
const result = await extractBytes(pdfBytes, 'application/pdf');
console.log(result.content);
console.log(`Found ${result.tables?.length ?? 0} tables`);

Example - Extract with configuration:

extract_with_config.ts
import { initWasm, extractBytes } from '@kreuzberg/wasm';
import type { ExtractionConfig } from '@kreuzberg/wasm';

await initWasm();

const config: ExtractionConfig = {
  ocr: {
    backend: 'tesseract-wasm',
    language: 'deu' // German
  },
  images: {
    extractImages: true,
    targetDpi: 200
  }
};

const result = await extractBytes(pdfBytes, 'application/pdf', config);

Example - Extract from File in browser:

extract_from_file_browser.ts
import { initWasm, extractBytes } from '@kreuzberg/wasm';
import { fileToUint8Array } from '@kreuzberg/wasm/adapters/wasm-adapter';

await initWasm();

const file = inputEvent.target.files[0];
const bytes = await fileToUint8Array(file);
const result = await extractBytes(bytes, file.type);
console.log(result.content);
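
Two of the failure cases listed under Throws (empty document data and a missing MIME type) can be caught before the call is made. The sketch below illustrates one way to do that; `checkExtractInput` is a hypothetical helper and its messages are illustrative, not the library's own error text.

```typescript
// Hypothetical pre-flight check mirroring two documented failure cases:
// empty document data and a missing MIME type.
function checkExtractInput(data: Uint8Array, mimeType: string): string | null {
  if (data.byteLength === 0) {
    return 'Document data is empty';
  }
  if (mimeType.trim() === '') {
    return 'MIME type is missing';
  }
  return null; // Input looks acceptable; extraction itself may still fail.
}

const pdfHeader = new Uint8Array([0x25, 0x50, 0x44, 0x46]); // "%PDF"
const emptyCheck = checkExtractInput(new Uint8Array(0), 'application/pdf');
const missingMimeCheck = checkExtractInput(pdfHeader, '');
const okCheck = checkExtractInput(pdfHeader, 'application/pdf');
```

Returning a message instead of throwing lets a UI show the problem next to the file input without a try/catch.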

extractFile()

Extract content from a file on the file system (Node.js, Deno, Bun only).

Signature:

TypeScript
async function extractFile(
  path: string,
  mimeType?: string | null,
  config?: ExtractionConfig | null
): Promise<ExtractionResult>

Parameters:

  • path (string): Path to the file to extract from. Required.
  • mimeType (string | null): Optional MIME type. If not provided, will be auto-detected from file content and extension.
  • config (ExtractionConfig | null): Optional extraction configuration

Returns:

  • Promise<ExtractionResult>: Extraction result

Throws:

  • Error: If WASM module is not initialized, file path is missing, file doesn't exist, runtime is not supported (browser), or extraction fails

Example - Extract with auto-detection:

extract_file_auto.ts
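import { initWasm, extractFile } from '@kreuzberg/wasm';

await initWasm();

// No MIME type is passed, so it is detected from the file content and extension
const result = await extractFile('./document.pdf');
console.log(result.content);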

Example - Extract with explicit MIME type:

extract_file_explicit.ts
import { extractFile } from '@kreuzberg/wasm';

const result = await extractFile('./document.docx', 'application/vnd.openxmlformats-officedocument.wordprocessingml.document');
console.log(result.content);

Example - Extract with configuration:

extract_file_config.ts
import { extractFile } from '@kreuzberg/wasm';

const result = await extractFile('./report.xlsx', null, {
  chunking: {
    maxChars: 1000
  }
});

extractFromFile()

Extract content from a File or Blob (browser-friendly wrapper).

Convenience function that combines fileToUint8Array() and extractBytes() for streamlined browser usage.

Signature:

TypeScript
async function extractFromFile(
  file: File | Blob,
  mimeType?: string | null,
  config?: ExtractionConfig | null
): Promise<ExtractionResult>

Parameters:

  • file (File | Blob): The File or Blob to extract from. Required.
  • mimeType (string | null): Optional MIME type. If not provided, uses file.type for File objects, defaults to 'application/octet-stream' for Blob.
  • config (ExtractionConfig | null): Optional extraction configuration

Returns:

  • Promise<ExtractionResult>: Extraction result

Throws:

  • Error: If WASM module is not initialized or extraction fails

Example - Simple file input:

extract_from_file.ts
import { initWasm, extractFromFile } from '@kreuzberg/wasm';

await initWasm();

const fileInput = document.getElementById('file') as HTMLInputElement;
fileInput.addEventListener('change', async (e) => {
  const file = (e.target as HTMLInputElement).files?.[0];
  if (file) {
    const result = await extractFromFile(file);
    console.log(result.content);
  }
});

Example - With configuration:

extract_from_file_config.ts
import { extractFromFile } from '@kreuzberg/wasm';

const result = await extractFromFile(file, file.type, {
  chunking: { maxChars: 1000 },
  images: { extractImages: true }
});
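
The MIME type fallback described in the parameter list (explicit argument, then the file's own type, then 'application/octet-stream') can be pictured as follows. This is a sketch of the documented order only; `resolveMimeType` and `TypedSource` are hypothetical names, and the real implementation may differ in detail.

```typescript
// Minimal structural stand-in for File/Blob: only the `type` field matters here.
interface TypedSource {
  type: string;
}

const FALLBACK_MIME = 'application/octet-stream';

// Hypothetical helper: explicit argument first, then the object's own type,
// then the generic binary fallback.
function resolveMimeType(source: TypedSource, explicit?: string | null): string {
  if (explicit) return explicit;
  if (source.type !== '') return source.type;
  return FALLBACK_MIME;
}

const fromFileType = resolveMimeType({ type: 'image/png' });
const fromFallback = resolveMimeType({ type: '' });
const fromExplicit = resolveMimeType({ type: 'image/png' }, 'application/pdf');
```

A Blob created without a type reports an empty string, which is why the empty case falls through to the binary fallback in this sketch.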

batchExtractBytes()

Extract content from multiple byte arrays in parallel.

Signature:

TypeScript
async function batchExtractBytes(
  dataList: Uint8Array[],
  mimeTypes: string[],
  config?: ExtractionConfig | null
): Promise<ExtractionResult[]>

Parameters:

  • dataList (Uint8Array[]): Array of document bytes to extract from. Required.
  • mimeTypes (string[]): Array of MIME types corresponding to each document. Must match length of dataList. Required.
  • config (ExtractionConfig | null): Optional extraction configuration applied to all documents

Returns:

  • Promise<ExtractionResult[]>: Array of extraction results in the same order as input

Throws:

  • Error: If WASM module is not initialized or any extraction fails

Example:

batch_extract_bytes.ts
import { initWasm, batchExtractBytes } from '@kreuzberg/wasm';

await initWasm();

const dataList = [pdfBytes1, pdfBytes2, pdfBytes3];
const mimeTypes = ['application/pdf', 'application/pdf', 'application/pdf'];

const results = await batchExtractBytes(dataList, mimeTypes, {
  extract_tables: true
});

for (const result of results) {
  console.log(`${result.mimeType}: ${result.content.length} characters`);
}
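
Since mimeTypes must match the length of dataList, it can help to validate the two arrays before starting a batch. A minimal sketch, assuming a hypothetical `pairBatchInputs` helper that is not part of the package:

```typescript
interface BatchItem {
  data: Uint8Array;
  mimeType: string;
}

// Hypothetical helper: fail fast on mismatched lengths, then pair each
// document with its MIME type so the ordering is explicit.
function pairBatchInputs(dataList: Uint8Array[], mimeTypes: string[]): BatchItem[] {
  if (dataList.length !== mimeTypes.length) {
    throw new Error(
      `Expected ${dataList.length} MIME types, received ${mimeTypes.length}`
    );
  }
  return dataList.map((data, index) => ({ data, mimeType: mimeTypes[index] }));
}

const paired = pairBatchInputs(
  [new Uint8Array([1]), new Uint8Array([2])],
  ['application/pdf', 'image/png']
);
```

The paired items can be split back into two arrays with `paired.map((item) => item.data)` and `paired.map((item) => item.mimeType)` when calling the batch function.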

batchExtractFiles()

Extract content from multiple browser File objects in parallel.

Signature:

TypeScript
async function batchExtractFiles(
  files: File[],
  config?: ExtractionConfig | null
): Promise<ExtractionResult[]>

Parameters:

  • files (File[]): Array of File objects to extract from. Required.
  • config (ExtractionConfig | null): Optional extraction configuration applied to all files

Returns:

  • Promise<ExtractionResult[]>: Array of extraction results in the same order as input

Throws:

  • Error: If WASM module is not initialized or any extraction fails

Example - Process multiple file uploads:

batch_extract_files.ts
import { initWasm, batchExtractFiles } from '@kreuzberg/wasm';

await initWasm();

const fileInput = document.getElementById('files') as HTMLInputElement;
const files = Array.from(fileInput.files ?? []);

const results = await batchExtractFiles(files, {
  extract_tables: true
});

for (const result of results) {
  console.log(`${result.mimeType}: ${result.content.length} characters`);
}

Synchronous Extraction Functions

extractBytesSync()

Extract content from document bytes synchronously.

Note: Synchronous extraction blocks the calling thread until it completes, which can stall the event loop on large documents. Prefer async extraction (extractBytes()) in most cases.

Signature:

TypeScript
function extractBytesSync(
  data: Uint8Array,
  mimeType: string,
  config?: ExtractionConfig | null
): ExtractionResult

Parameters:

  • data (Uint8Array): The document bytes to extract from
  • mimeType (string): MIME type of the document
  • config (ExtractionConfig | null): Optional extraction configuration

Returns:

  • ExtractionResult: Extraction result

Throws:

  • Error: If WASM module is not initialized or extraction fails

Example:

extract_sync.ts
import { initWasm, extractBytesSync } from '@kreuzberg/wasm';

await initWasm();

const result = extractBytesSync(pdfBytes, 'application/pdf');
console.log(result.content);

batchExtractBytesSync()

Extract content from multiple byte arrays synchronously.

Signature:

TypeScript
function batchExtractBytesSync(
  dataList: Uint8Array[],
  mimeTypes: string[],
  config?: ExtractionConfig | null
): ExtractionResult[]

Parameters:

  • dataList (Uint8Array[]): Array of document bytes
  • mimeTypes (string[]): Array of MIME types
  • config (ExtractionConfig | null): Optional extraction configuration

Returns:

  • ExtractionResult[]: Array of extraction results

Throws:

  • Error: If WASM module is not initialized or any extraction fails

OCR Functions

enableOcr()

Enable OCR functionality with the tesseract-wasm backend.

Convenience function that automatically initializes and registers the Tesseract WASM backend for browser environments.

Signature:

TypeScript
async function enableOcr(): Promise<void>

Throws:

  • Error: If WASM module is not initialized or not in browser environment

Requirements:

  • Browser environment with Web Workers support
  • Network access to jsDelivr CDN for training data (on first use)
  • createImageBitmap API support

Example - Basic OCR:

enable_ocr.ts
import { initWasm, enableOcr, extractBytes } from '@kreuzberg/wasm';

async function main() {
  // Initialize WASM module
  await initWasm();

  // Enable OCR with tesseract-wasm
  await enableOcr();

  // Now you can use OCR in extraction
  const imageBytes = new Uint8Array(buffer);
  const result = await extractBytes(imageBytes, 'image/png', {
    ocr: { backend: 'tesseract-wasm', language: 'eng' }
  });

  console.log(result.content); // Extracted text
}

main().catch(console.error);

Example - Multi-language OCR:

ocr_multilingual.ts
import { initWasm, enableOcr, extractBytes } from '@kreuzberg/wasm';

await initWasm();
await enableOcr();

// Extract English text
const englishResult = await extractBytes(engImageBytes, 'image/png', {
  ocr: { backend: 'tesseract-wasm', language: 'eng' }
});

// Extract German text - model is cached after first use
const germanResult = await extractBytes(deImageBytes, 'image/png', {
  ocr: { backend: 'tesseract-wasm', language: 'deu' }
});

OCR Backend Management

registerOcrBackend()

Register a custom OCR backend.

Signature:

TypeScript
function registerOcrBackend(backend: OcrBackendProtocol): void

Parameters:

  • backend (OcrBackendProtocol): OCR backend implementing the OcrBackendProtocol interface. Required.

Throws:

  • Error: If backend validation fails

Example:

register_ocr_backend.ts
import { registerOcrBackend, TesseractWasmBackend } from '@kreuzberg/wasm';

const backend = new TesseractWasmBackend();
await backend.initialize();
registerOcrBackend(backend);

getOcrBackend()

Get a registered OCR backend by name.

Signature:

TypeScript
function getOcrBackend(name: string): OcrBackendProtocol | undefined

Parameters:

  • name (string): Backend name. Required.

Returns:

  • OcrBackendProtocol | undefined: The OCR backend or undefined if not found

Example:

get_ocr_backend.ts
import { getOcrBackend } from '@kreuzberg/wasm';

const backend = getOcrBackend('tesseract-wasm');
if (backend) {
  console.log('Available languages:', backend.supportedLanguages());
}

listOcrBackends()

List all registered OCR backends.

Signature:

TypeScript
function listOcrBackends(): string[]

Returns:

  • string[]: Array of registered backend names

Example:

list_ocr_backends.ts
import { listOcrBackends } from '@kreuzberg/wasm';

const backends = listOcrBackends();
console.log('Available OCR backends:', backends);

unregisterOcrBackend()

Unregister an OCR backend.

Signature:

TypeScript
async function unregisterOcrBackend(name: string): Promise<void>

Parameters:

  • name (string): Backend name to unregister. Required.

Throws:

  • Error: If backend is not found

Example:

unregister_ocr_backend.ts
import { unregisterOcrBackend } from '@kreuzberg/wasm';

await unregisterOcrBackend('tesseract-wasm');

clearOcrBackends()

Clear all registered OCR backends and call their shutdown methods.

Signature:

TypeScript
async function clearOcrBackends(): Promise<void>

Example:

clear_ocr_backends.ts
import { clearOcrBackends } from '@kreuzberg/wasm';

// Clean up all backends when shutting down
await clearOcrBackends();

MIME Type Utilities

detectMimeFromBytes()

Auto-detect MIME type from file bytes.

Signature:

TypeScript
function detectMimeFromBytes(data: Uint8Array): string

Parameters:

  • data (Uint8Array): File bytes to detect MIME type from. Required.

Returns:

  • string: Detected MIME type (e.g., 'application/pdf', 'image/jpeg')

Example:

detect_mime.ts
import { detectMimeFromBytes } from '@kreuzberg/wasm';

const fileBytes = new Uint8Array(buffer);
const mimeType = detectMimeFromBytes(fileBytes);
console.log(`Detected MIME type: ${mimeType}`);

getMimeFromExtension()

Get MIME type from file extension.

Signature:

TypeScript
function getMimeFromExtension(extension: string): string | null

Parameters:

  • extension (string): File extension (with or without leading dot). Required.

Returns:

  • string | null: MIME type or null if extension is not recognized

Example:

get_mime_extension.ts
import { getMimeFromExtension } from '@kreuzberg/wasm';

const mimeType = getMimeFromExtension('pdf');  // 'application/pdf'
const mimeType2 = getMimeFromExtension('.docx'); // 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'

getExtensionsForMime()

Get file extensions for a MIME type.

Signature:

TypeScript
function getExtensionsForMime(mimeType: string): string[]

Parameters:

  • mimeType (string): MIME type to look up. Required.

Returns:

  • string[]: Array of file extensions (without leading dots)

Example:

get_extensions.ts
import { getExtensionsForMime } from '@kreuzberg/wasm';

const extensions = getExtensionsForMime('application/pdf');  // ['pdf']
const extensions2 = getExtensionsForMime('image/jpeg');      // ['jpg', 'jpeg']

normalizeMimeType()

Normalize MIME type to canonical form.

Signature:

TypeScript
function normalizeMimeType(mimeType: string): string

Parameters:

  • mimeType (string): MIME type to normalize. Required.

Returns:

  • string: Normalized MIME type

Example:

normalize_mime.ts
import { normalizeMimeType } from '@kreuzberg/wasm';

const normalized = normalizeMimeType('application/PDF');  // 'application/pdf'
const normalized2 = normalizeMimeType('text/plain');      // 'text/plain'

Configuration Loading

loadConfigFromString()

Load extraction configuration from a string in YAML, JSON, or TOML format.

Signature:

TypeScript
function loadConfigFromString(
  content: string,
  format: 'yaml' | 'toml' | 'json'
): ExtractionConfig

Parameters:

  • content (string): Configuration content as a string. Required.
  • format ('yaml' | 'toml' | 'json'): Configuration format. Required.

Returns:

  • ExtractionConfig: Parsed extraction configuration

Throws:

  • Error: If configuration parsing fails

Example - YAML configuration:

load_config_yaml.ts
import { loadConfigFromString, extractBytes } from '@kreuzberg/wasm';

const yamlConfig = `
extract_tables: true
enable_ocr: true
ocr_config:
  languages: [eng, deu]
`;

const config = loadConfigFromString(yamlConfig, 'yaml');
const result = await extractBytes(data, 'application/pdf', config);

Example - JSON configuration:

load_config_json.ts
import { loadConfigFromString } from '@kreuzberg/wasm';

const jsonConfig = '{"extract_tables":true,"enable_ocr":true}';
const config = loadConfigFromString(jsonConfig, 'json');

Example - TOML configuration:

load_config_toml.ts
import { loadConfigFromString } from '@kreuzberg/wasm';

const tomlConfig = `
extract_tables = true
enable_ocr = true

[ocr_config]
languages = ["eng", "deu"]
`;

const config = loadConfigFromString(tomlConfig, 'toml');
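
The format argument is required, so code that reads configuration from a file needs to derive it somehow. One option is to infer it from the file name, as in this sketch; `configFormatFor` is a hypothetical helper and the extension mapping is an assumption of the example.

```typescript
type ConfigFormat = 'yaml' | 'toml' | 'json';

// Hypothetical helper: picks the `format` argument from a file name.
function configFormatFor(fileName: string): ConfigFormat | null {
  const lower = fileName.toLowerCase();
  if (lower.endsWith('.yaml') || lower.endsWith('.yml')) return 'yaml';
  if (lower.endsWith('.toml')) return 'toml';
  if (lower.endsWith('.json')) return 'json';
  return null; // Unknown extension: let the caller decide how to proceed.
}

const yamlFormat = configFormatFor('kreuzberg.YML');
const tomlFormat = configFormatFor('config/kreuzberg.toml');
const unknownFormat = configFormatFor('kreuzberg.ini');
```

Returning null rather than guessing keeps an unsupported file from being parsed with the wrong format.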

Runtime Detection

detectRuntime()

Detect the current JavaScript runtime environment.

Signature:

TypeScript
function detectRuntime(): RuntimeType

Returns:

  • RuntimeType: One of 'browser', 'node', 'deno', 'bun', or 'unknown'

Example:

detect_runtime.ts
import { detectRuntime } from '@kreuzberg/wasm';

const runtime = detectRuntime();
switch (runtime) {
  case 'browser':
    console.log('Running in browser');
    break;
  case 'node':
    console.log('Running in Node.js');
    break;
  case 'deno':
    console.log('Running in Deno');
    break;
  case 'bun':
    console.log('Running in Bun');
    break;
}

getWasmCapabilities()

Get WebAssembly capabilities available in the current runtime.

Signature:

TypeScript
function getWasmCapabilities(): WasmCapabilities

Returns:

  • WasmCapabilities: Object containing capability flags:
      • runtime (RuntimeType): Detected runtime
      • hasWasm (boolean): WebAssembly support
      • hasWasmStreaming (boolean): Streaming WASM instantiation
      • hasFileApi (boolean): File API (browser)
      • hasBlob (boolean): Blob API
      • hasWorkers (boolean): Web Worker support
      • hasSharedArrayBuffer (boolean): SharedArrayBuffer (restricted)
      • hasModuleWorkers (boolean): Module Workers
      • hasBigInt (boolean): BigInt support
      • runtimeVersion (string | undefined): Runtime version if available

Example:

check_capabilities.ts
import { getWasmCapabilities } from '@kreuzberg/wasm';

const caps = getWasmCapabilities();
console.log(`Runtime: ${caps.runtime}`);
console.log(`WASM: ${caps.hasWasm}`);
console.log(`Workers: ${caps.hasWorkers}`);

if (caps.hasSharedArrayBuffer) {
  console.log('Multi-threading available');
} else {
  console.log('Running in single-threaded mode');
}
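
The capability flags can also be folded into a single decision about how to run extraction. The sketch below is illustrative only: `pickExecutionMode`, `CapabilityFlags`, and the mode names are hypothetical, and the library may make this choice differently.

```typescript
// Local subset of the documented WasmCapabilities flags.
interface CapabilityFlags {
  hasWasm: boolean;
  hasWorkers: boolean;
  hasSharedArrayBuffer: boolean;
}

type ExecutionMode = 'unsupported' | 'main-thread' | 'worker' | 'threaded';

// Hypothetical decision table built from the capability flags.
function pickExecutionMode(caps: CapabilityFlags): ExecutionMode {
  if (!caps.hasWasm) return 'unsupported';
  if (caps.hasWorkers && caps.hasSharedArrayBuffer) return 'threaded';
  if (caps.hasWorkers) return 'worker';
  return 'main-thread';
}

const isolatedBrowserMode = pickExecutionMode({
  hasWasm: true,
  hasWorkers: true,
  hasSharedArrayBuffer: true
});
const plainBrowserMode = pickExecutionMode({
  hasWasm: true,
  hasWorkers: true,
  hasSharedArrayBuffer: false
});
```

The object returned by getWasmCapabilities() carries these three fields, so it should be usable as the argument directly.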

isBrowser(), isNode(), isDeno(), isBun()

Check if code is running in a specific runtime.

Signature:

TypeScript
function isBrowser(): boolean
function isNode(): boolean
function isDeno(): boolean
function isBun(): boolean

Returns:

  • boolean: True if running in the specified runtime

Example:

runtime_checks.ts
import { isBrowser, isNode, extractFile, extractFromFile } from '@kreuzberg/wasm';

if (isNode()) {
  // Node.js: use extractFile() for file system access
  const result = await extractFile('./document.pdf');
} else if (isBrowser()) {
  // Browser: use extractFromFile() or extractBytes()
  const result = await extractFromFile(fileInput.files[0]);
}

hasWorkers(), hasSharedArrayBuffer()

Check for specific WASM capabilities.

Signature:

TypeScript
function hasWorkers(): boolean
function hasSharedArrayBuffer(): boolean

Returns:

  • boolean: True if the capability is available

Example:

capability_checks.ts
import { hasWorkers, hasSharedArrayBuffer } from '@kreuzberg/wasm';

if (hasSharedArrayBuffer()) {
  console.log('Multi-threading with SharedArrayBuffer enabled');
}

if (!hasWorkers()) {
  console.warn('Web Workers not available - some features may be limited');
}

Type Adapter Utilities

fileToUint8Array()

Convert a File or Blob to Uint8Array.

Handles both browser File API and server-side Blob-like objects with a unified interface.

Signature:

TypeScript
async function fileToUint8Array(file: File | Blob): Promise<Uint8Array>

Parameters:

  • file (File | Blob): The File or Blob to convert. Required.

Returns:

  • Promise<Uint8Array>: The byte array

Throws:

  • Error: If file cannot be read or exceeds size limit (512 MB)
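
Checking the size up front avoids starting a read that is going to be rejected. A minimal sketch, with two assumptions stated plainly: "512 MB" is read as 512 * 1024 * 1024 bytes, and a file exactly at the limit is treated as allowed. `checkFileSize` is a hypothetical helper.

```typescript
// Assumption: "512 MB" is interpreted as 512 * 1024 * 1024 bytes.
const BYTES_PER_MB = 1024 * 1024;
const MAX_FILE_BYTES = 512 * BYTES_PER_MB;

// Hypothetical helper: returns a message when the size exceeds the limit,
// or null when the file is small enough to read.
function checkFileSize(sizeInBytes: number, limit: number = MAX_FILE_BYTES): string | null {
  if (sizeInBytes <= limit) return null;
  const sizeMb = (sizeInBytes / BYTES_PER_MB).toFixed(1);
  return `File is ${sizeMb} MB, which exceeds the ${limit / BYTES_PER_MB} MB limit`;
}

const smallFileCheck = checkFileSize(1024);
const largeFileCheck = checkFileSize(600 * BYTES_PER_MB);
```

In a browser, the value to pass would be `file.size`, which is available before any bytes are read.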

Example:

file_to_bytes.ts
import { fileToUint8Array, extractBytes } from '@kreuzberg/wasm';

const input = document.getElementById('input') as HTMLInputElement;
const file = input.files?.[0];
if (file) {
  const bytes = await fileToUint8Array(file);
  const result = await extractBytes(bytes, file.type);
}

configToJS()

Normalize ExtractionConfig for WASM processing.

Converts TypeScript configuration objects to WASM-compatible format, handling null values and nested structures.

Signature:

TypeScript
function configToJS(config: ExtractionConfig | null): Record<string, unknown>

Parameters:

  • config (ExtractionConfig | null): The extraction configuration or null

Returns:

  • Record<string, unknown>: Normalized configuration object

Example:

config_normalize.ts
import { configToJS } from '@kreuzberg/wasm/adapters/wasm-adapter';

const config = {
  ocr: { backend: 'tesseract' },
  chunking: { maxChars: 1000 }
};
const wasmConfig = configToJS(config);

jsToExtractionResult()

Parse WASM extraction result and convert to TypeScript type.

Handles conversion of WASM-returned objects to proper ExtractionResult types with full validation.

Signature:

TypeScript
function jsToExtractionResult(jsValue: unknown): ExtractionResult

Parameters:

  • jsValue (unknown): The raw WASM result value

Returns:

  • ExtractionResult: Properly typed extraction result

Throws:

  • Error: If result structure is invalid

isValidExtractionResult()

Validate that a value conforms to ExtractionResult structure.

Performs structural validation without full type checking.

Signature:

TypeScript
function isValidExtractionResult(value: unknown): value is ExtractionResult

Parameters:

  • value (unknown): The value to validate

Returns:

  • boolean: True if value appears to be a valid ExtractionResult

Type Definitions

All types are exported from the @kreuzberg/wasm package and shared from @kreuzberg/core. Use these types for complete type safety when working with configuration and results.

Importing Types

TypeScript
import type {
  ExtractionResult,
  ExtractionConfig,
  OcrConfig,
  ChunkingConfig,
  ImageConfig,
  KeywordsConfig,
  Table,
  ExtractedImage,
  Chunk,
  Metadata,
  OcrBackendProtocol,
  RuntimeType,
  WasmCapabilities
} from '@kreuzberg/wasm';

Types

All types are shared via the @kreuzberg/core package. Import them for type-safe configuration and results:

TypeScript
import type {
  ExtractionResult,
  ExtractionConfig,
  OcrConfig,
  ChunkingConfig,
  ImageConfig,
  KeywordsConfig,
  Table,
  ExtractedImage,
  Chunk,
  Metadata,
  OcrBackendProtocol
} from '@kreuzberg/core';

ExtractionResult

The main result object returned from extraction functions.

Fields:

  • content (string): Extracted text content
  • mimeType (string): MIME type of the document
  • metadata (Metadata): Document metadata (page count, encoding, etc.)
  • tables (Table[] | null): Extracted tables (if extract_tables enabled)
  • images (ExtractedImage[] | null): Extracted images (if extract_images enabled)
  • chunks (Chunk[] | null): Text chunks (if enable_chunking enabled)
  • detectedLanguages (string[] | null): Detected language codes (if enable_language_detection enabled)
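
Several of these fields are null when the corresponding feature is disabled, so consumers need to guard each access. The sketch below shows one null-safe way to summarize a result; `ResultLike` is a local stand-in covering only a subset of the documented fields, and `summarizeResult` is a hypothetical helper.

```typescript
// Local subset of the documented ExtractionResult fields.
interface ResultLike {
  content: string;
  mimeType: string;
  tables: unknown[] | null;
  images: unknown[] | null;
  chunks: unknown[] | null;
  detectedLanguages: string[] | null;
}

// Hypothetical helper: builds a one-line summary, treating null as "none".
function summarizeResult(res: ResultLike): string {
  const languages = res.detectedLanguages?.join(', ') || 'n/a';
  return [
    `${res.mimeType}: ${res.content.length} characters`,
    `tables=${res.tables?.length ?? 0}`,
    `images=${res.images?.length ?? 0}`,
    `chunks=${res.chunks?.length ?? 0}`,
    `languages=${languages}`
  ].join(' | ');
}

const sampleSummary = summarizeResult({
  content: 'Hello world',
  mimeType: 'application/pdf',
  tables: [{}],
  images: null,
  chunks: null,
  detectedLanguages: ['eng']
});
// → "application/pdf: 11 characters | tables=1 | images=0 | chunks=0 | languages=eng"
```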

ExtractionConfig

Configuration object for extraction. All fields are optional; defaults are used if not provided.

Fields:

  • extract_tables (boolean): Extract tables as structured data
  • extract_images (boolean): Extract embedded images
  • extract_metadata (boolean): Extract document metadata
  • enable_ocr (boolean): Enable OCR for images and scanned PDFs
  • ocr_config (OcrConfig): OCR configuration
  • enable_chunking (boolean): Split text into semantic chunks
  • chunking_config (ChunkingConfig): Text chunking configuration
  • enable_language_detection (boolean): Detect document language
  • enable_quality (boolean): Enable encoding detection and normalization
  • extract_keywords (boolean): Extract important keywords
  • keywords_config (KeywordsConfig): Keyword extraction settings

OcrConfig

Configuration for OCR extraction.

Fields:

  • backend (string): OCR backend name (e.g., 'tesseract-wasm')
  • language (string): Language code for OCR (e.g., 'eng', 'deu', 'fra')
  • languages (string[]): Multiple languages for OCR
  • dpi (number): DPI for OCR processing
  • preprocessing (OcrPreprocessing): Image preprocessing settings

ChunkingConfig

Configuration for text chunking.

Fields:

  • maxChars (number): Maximum characters per chunk
  • maxTokens (number): Maximum tokens per chunk
  • chunkOverlap (number): Overlap between chunks in characters/tokens

ImageConfig

Configuration for image extraction.

Fields:

  • extractImages (boolean): Extract images from documents
  • targetDpi (number): Target DPI for extracted images
  • maxImageDimension (number): Maximum pixel dimension for images

KeywordsConfig

Configuration for keyword extraction.

Fields:

  • maxKeywords (number): Maximum number of keywords to extract
  • method (string): Keyword extraction method (e.g., 'yake')

Table

Extracted table structure.

Fields:

  • cells (string[][]): 2D array of table cells
  • markdown (string): Table in Markdown format
  • pageNumber (number): Page number where table appears

ExtractedImage

Image extracted from document.

Fields:

  • data (Uint8Array): Image bytes
  • format (string): Image format (e.g., 'png', 'jpeg')
  • imageIndex (number): Index within document
  • pageNumber (number | null): Page number (if applicable)
  • width (number | null): Image width in pixels
  • height (number | null): Image height in pixels
  • colorspace (string | null): Color space (e.g., 'RGB', 'CMYK')
  • bitsPerComponent (number | null): Bits per color component
  • isMask (boolean): Whether this is a mask image
  • description (string | null): Image description if available

Chunk

Text chunk from chunking operation.

Fields:

  • content (string): Chunk text content
  • metadata (ChunkMetadata): Metadata about the chunk
  • embedding (number[] | null): Vector embedding (if available)

ChunkMetadata:

  • charStart (number): Starting character position
  • charEnd (number): Ending character position
  • chunkIndex (number): Index of this chunk
  • totalChunks (number): Total number of chunks
  • tokenCount (number | null): Token count if available
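
The character positions make it possible to check how much consecutive chunks overlap, for example to confirm that a chunkOverlap setting took effect. This sketch assumes charEnd is exclusive, which the field descriptions do not state; `chunkOverlaps` and `ChunkSpan` are hypothetical names.

```typescript
// Local subset of the documented ChunkMetadata fields.
interface ChunkSpan {
  charStart: number;
  charEnd: number;
  chunkIndex: number;
}

// Overlap (in characters) between each chunk and the one that follows it.
// Assumes charEnd is exclusive; adjust by one if it turns out to be inclusive.
function chunkOverlaps(spans: ChunkSpan[]): number[] {
  const ordered = [...spans].sort((a, b) => a.chunkIndex - b.chunkIndex);
  const overlapSizes: number[] = [];
  for (let i = 1; i < ordered.length; i++) {
    overlapSizes.push(Math.max(0, ordered[i - 1].charEnd - ordered[i].charStart));
  }
  return overlapSizes;
}

const sampleOverlaps = chunkOverlaps([
  { charStart: 900, charEnd: 1900, chunkIndex: 1 },
  { charStart: 0, charEnd: 1000, chunkIndex: 0 },
  { charStart: 1900, charEnd: 2500, chunkIndex: 2 }
]);
// → [100, 0]
```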

Metadata

Document metadata.

Fields:

  • pageCount (number | null): Number of pages (if applicable)
  • encoding (string | null): Text encoding
  • format (string): Document format
  • author (string | null): Document author
  • title (string | null): Document title
  • createdAt (string | null): Creation timestamp
  • modifiedAt (string | null): Last modification timestamp
  • [Additional format-specific fields]

Platform-Specific Notes

Browser

Requirements:

  • Modern browser with WebAssembly support (Chrome 91+, Firefox 90+, Safari 16.4+)
  • File API for file uploads

SharedArrayBuffer for Multi-Threading:

To enable multi-threaded extraction, set these HTTP headers:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

Example with Express.js:

express_sab_headers.ts
import express from 'express';

const app = express();

app.use((req, res, next) => {
  res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
  res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp');
  next();
});

Example with Vite:

vite.config.ts
import { defineConfig } from 'vite';

export default defineConfig({
  server: {
    headers: {
      'Cross-Origin-Opener-Policy': 'same-origin',
      'Cross-Origin-Embedder-Policy': 'require-corp'
    }
  }
});
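
Whether the headers took effect can be verified at runtime: browsers expose a global `crossOriginIsolated` flag that is true only when both policies are active. A minimal sketch follows; `isCrossOriginIsolated` and `IsolationScope` are hypothetical names, and the scope is a parameter so the check can be exercised outside a browser.

```typescript
// Shape of the one global property this check reads.
interface IsolationScope {
  crossOriginIsolated?: boolean;
}

// Hypothetical helper: reports whether the page is cross-origin isolated.
// Runtimes that do not define the flag are treated as not isolated.
function isCrossOriginIsolated(
  scope: IsolationScope = globalThis as unknown as IsolationScope
): boolean {
  return scope.crossOriginIsolated === true;
}

const isolatedPage = isCrossOriginIsolated({ crossOriginIsolated: true });
const plainPage = isCrossOriginIsolated({});
```

Calling `isCrossOriginIsolated()` with no argument in a page served with both headers should return true; a false result usually points to a missing header or a cross-origin resource that does not opt in.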

Node.js

Requirements:

  • Node.js 18 or higher
  • WASM support (available by default)

Example:

nodejs_extraction.ts
import { extractFile, initWasm } from '@kreuzberg/wasm';

async function main() {
  await initWasm();
  const result = await extractFile('./document.pdf');
  console.log(result.content);
}

main().catch(console.error);

Deno

Requirements:

  • Deno 1.0 or higher
  • Read permissions for files (--allow-read)
  • Network permissions for OCR training data (--allow-net)

Import:

deno_import.ts
import { extractFile, initWasm } from "npm:@kreuzberg/wasm@^4.0.0";

// Must run with: deno run --allow-read --allow-net script.ts

Example:

deno_example.ts
import { extractFile, initWasm } from "npm:@kreuzberg/wasm@^4.0.0";

async function main() {
  await initWasm();
  const result = await extractFile('./document.pdf');
  console.log(result.content);
}

main().catch(console.error);

Bun

Requirements:

  • Bun 1.x or higher
  • WASM support (available by default)

Example:

bun_example.ts
import { extractFile, initWasm } from '@kreuzberg/wasm';

async function main() {
  await initWasm();
  const result = await extractFile('./document.pdf');
  console.log(result.content);
}

main().catch(console.error);

Cloudflare Workers

Requirements:

  • Cloudflare Workers runtime
  • Bundle size considerations (10MB limit compressed)

HTTP Headers:

Cloudflare Workers automatically handle necessary CORS headers. For multi-threading, set the cross-origin isolation headers on the response:

cloudflare_worker.ts
export default {
  async fetch(request: Request): Promise<Response> {
    const response = new Response(body);
    response.headers.set('Cross-Origin-Opener-Policy', 'same-origin');
    response.headers.set('Cross-Origin-Embedder-Policy', 'require-corp');
    return response;
  }
};

Memory Constraints:

For large documents, use chunking to reduce memory usage:

cloudflare_memory_efficient.ts
import { extractBytes } from '@kreuzberg/wasm';

export default {
  async fetch(request: Request): Promise<Response> {
    const formData = await request.formData();
    const file = formData.get('file') as File;
    const arrayBuffer = await file.arrayBuffer();
    const bytes = new Uint8Array(arrayBuffer);

    const result = await extractBytes(bytes, file.type, {
      chunking_config: { maxChars: 1000 }
    });

    return Response.json({
      text: result.content,
      metadata: result.metadata
    });
  }
};

Common Patterns

Pattern: Runtime-Aware File Loading

Automatically select the appropriate extraction function based on runtime:

runtime_aware_loading.ts
import {
  extractFile,
  extractFromFile,
  isNode,
  isBrowser,
  initWasm
} from '@kreuzberg/wasm';
import type { ExtractionResult } from '@kreuzberg/wasm';

await initWasm();

async function extractAny(input: string | File): Promise<ExtractionResult> {
  if (isNode() && typeof input === 'string') {
    return await extractFile(input);
  } else if (isBrowser() && input instanceof File) {
    return await extractFromFile(input);
  } else {
    throw new Error('Invalid input for current runtime');
  }
}

Pattern: Graceful OCR Initialization

Initialize OCR with fallback to text-only extraction:

ocr_graceful_init.ts
import { initWasm, enableOcr, extractBytes } from '@kreuzberg/wasm';

async function extractWithOcrFallback(bytes: Uint8Array, mimeType: string) {
  await initWasm();

  let config = {};
  try {
    await enableOcr();
    config = { ocr: { backend: 'tesseract-wasm', language: 'eng' } };
  } catch (error) {
    console.warn('OCR unavailable, continuing with text extraction', error);
  }

  return await extractBytes(bytes, mimeType, config);
}

Pattern: Batch Processing with Progress

Extract multiple files with progress tracking:

batch_with_progress.ts
import { initWasm, extractBytes } from '@kreuzberg/wasm';

async function extractWithProgress(
  files: File[],
  onProgress: (current: number, total: number) => void
) {
  await initWasm();

  const results = [];
  for (let i = 0; i < files.length; i++) {
    const fileBytes = await files[i].arrayBuffer();
    const result = await extractBytes(
      new Uint8Array(fileBytes),
      files[i].type
    );
    results.push(result);
    onProgress(i + 1, files.length);
  }

  return results;
}
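In the loop above, one failing file rejects the whole batch. A possible variation, sketched below under stated assumptions, records a per-item outcome so the remaining files still get processed. mapWithProgress and BatchOutcome are hypothetical names, and the worker function is injected so the helper does not depend on any particular extraction call.

```typescript
// Hypothetical helper (not part of @kreuzberg/wasm): sequential processing
// with progress reporting and per-item error capture.
type BatchOutcome<T> =
  | { ok: true; value: T }
  | { ok: false; error: unknown };

async function mapWithProgress<I, O>(
  items: readonly I[],
  worker: (item: I) => Promise<O>,
  onProgress: (current: number, total: number) => void
): Promise<BatchOutcome<O>[]> {
  const outcomes: BatchOutcome<O>[] = [];
  let done = 0;
  for (const item of items) {
    try {
      outcomes.push({ ok: true, value: await worker(item) });
    } catch (error) {
      // Record the failure and keep going with the next item
      outcomes.push({ ok: false, error });
    }
    done += 1;
    onProgress(done, items.length);
  }
  return outcomes;
}
```

With Kreuzberg, the worker could be a function that reads a File into a Uint8Array and passes it to extractBytes() together with file.type.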

Pattern: Configuration Management

Load configuration from an environment variable (via process.env, so Node.js-style runtimes only), falling back to defaults:

config_management.ts
import { loadConfigFromString, extractBytes } from '@kreuzberg/wasm';

async function extractWithConfig(bytes: Uint8Array, mimeType: string) {
  let config = null;

  // Try to load from environment variable
  const configStr = process.env.KREUZBERG_CONFIG;
  if (configStr) {
    try {
      config = loadConfigFromString(configStr, 'json');
    } catch (error) {
      console.warn('Failed to parse config from environment:', error);
    }
  }

  // Default config if not loaded
  if (!config) {
    config = {
      extract_tables: true,
      extract_metadata: true
    };
  }

  return await extractBytes(bytes, mimeType, config);
}
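The function above uses the defaults only when no config could be loaded. If you would rather have a loaded config override individual defaults, a shallow merge is one option. This is a sketch under assumptions: withDefaults and ConfigLike are hypothetical names, and nested objects are replaced as a whole rather than merged.

```typescript
// Hypothetical helper (not part of @kreuzberg/wasm).
type ConfigLike = Record<string, unknown>;

// Same default keys as in the example above
const DEFAULT_CONFIG: ConfigLike = {
  extract_tables: true,
  extract_metadata: true
};

// Shallow merge: keys in `loaded` win; nested objects are replaced, not merged.
function withDefaults(loaded: ConfigLike | null | undefined): ConfigLike {
  return { ...DEFAULT_CONFIG, ...(loaded ?? {}) };
}
```

For example, withDefaults({ extract_tables: false }) keeps extract_metadata enabled while turning table extraction off.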

Supported Formats

| Category | Formats |
|---|---|
| Documents | PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODP, ODS, RTF |
| Images | PNG, JPEG, JPG, WEBP, BMP, TIFF, GIF |
| Web | HTML, XHTML, XML, EPUB |
| Text | TXT, MD, RST, LaTeX, CSV, TSV, JSON, YAML, TOML, ORG, BIB, TeX, FB2 |
| Email | EML, MSG |
| Archives | ZIP, TAR, 7Z |
| Other | And 30+ more formats |

Supported MIME Types

Common MIME types supported by Kreuzberg WASM:

Documents

  • application/pdf - PDF documents
  • application/vnd.openxmlformats-officedocument.wordprocessingml.document - DOCX (Word)
  • application/msword - DOC (Word 97-2003)
  • application/vnd.openxmlformats-officedocument.presentationml.presentation - PPTX (PowerPoint)
  • application/vnd.ms-powerpoint - PPT (PowerPoint 97-2003)
  • application/vnd.openxmlformats-officedocument.spreadsheetml.sheet - XLSX (Excel)
  • application/vnd.ms-excel - XLS (Excel 97-2003)
  • application/vnd.oasis.opendocument.text - ODT (OpenDocument Text)
  • application/vnd.oasis.opendocument.presentation - ODP (OpenDocument Presentation)
  • application/vnd.oasis.opendocument.spreadsheet - ODS (OpenDocument Spreadsheet)
  • text/rtf - RTF (Rich Text Format)

Images

  • image/png - PNG
  • image/jpeg - JPEG
  • image/webp - WebP
  • image/bmp - BMP
  • image/tiff - TIFF
  • image/gif - GIF

Text

  • text/plain - Plain text
  • text/markdown - Markdown
  • text/html - HTML
  • application/json - JSON
  • text/xml - XML
  • application/xml - XML (alternative)
  • text/yaml - YAML
  • text/csv - CSV
  • text/tab-separated-values - TSV

Archives

  • application/zip - ZIP
  • application/x-tar - TAR
  • application/x-7z-compressed - 7Z
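File.type can be an empty string when the browser does not recognize a file, so a fallback from file extension to MIME type may be useful before calling extractBytes(). The sketch below is an assumption, not a library feature: mimeTypeForFilename is a hypothetical helper, and its table covers only a subset of the MIME types listed above.

```typescript
// Hypothetical helper (not part of @kreuzberg/wasm).
// A Map avoids accidental hits on Object.prototype keys such as "constructor".
const MIME_BY_EXTENSION = new Map<string, string>([
  ['pdf', 'application/pdf'],
  ['docx', 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'],
  ['xlsx', 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'],
  ['pptx', 'application/vnd.openxmlformats-officedocument.presentationml.presentation'],
  ['png', 'image/png'],
  ['jpg', 'image/jpeg'],
  ['jpeg', 'image/jpeg'],
  ['txt', 'text/plain'],
  ['md', 'text/markdown'],
  ['html', 'text/html'],
  ['json', 'application/json'],
  ['csv', 'text/csv'],
  ['zip', 'application/zip'],
  ['tar', 'application/x-tar']
]);

// Returns the MIME type for a filename, or undefined if the extension is unknown.
function mimeTypeForFilename(filename: string): string | undefined {
  const dot = filename.lastIndexOf('.');
  if (dot < 0 || dot === filename.length - 1) {
    return undefined;
  }
  return MIME_BY_EXTENSION.get(filename.slice(dot + 1).toLowerCase());
}
```

A caller might then use file.type || mimeTypeForFilename(file.name) and reject the upload if both are empty.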

Platform Support Matrix

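| Function | Browser | Node.js | Deno | Bun | Workers |
|---|---|---|---|---|---|
| initWasm() | Yes | Yes | Yes | Yes | Yes |
| extractBytes() | Yes | Yes | Yes | Yes | Yes |
| extractFile() | No | Yes | Yes | Yes | No |
| extractFromFile() | Yes | No | No | No | No |
| enableOcr() | Yes | No | No | No | No |
| initThreadPool() | Yes | No | No | No | No |
| batchExtractFiles() | Yes | No | No | No | No |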

Note: OCR is only available in browser environments with Web Worker support due to dependency on tesseract-wasm and browser APIs like createImageBitmap.
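If application code needs to branch on these capabilities, the matrix can be transcribed into a typed lookup. This is an illustrative sketch: Runtime, ApiFunction, and isSupported are hypothetical names that mirror the table, not exports of @kreuzberg/wasm.

```typescript
// Hypothetical lookup transcribed from the Platform Support Matrix.
type Runtime = 'browser' | 'node' | 'deno' | 'bun' | 'workers';
type ApiFunction =
  | 'initWasm'
  | 'extractBytes'
  | 'extractFile'
  | 'extractFromFile'
  | 'enableOcr'
  | 'initThreadPool'
  | 'batchExtractFiles';

const ALL_RUNTIMES: readonly Runtime[] = ['browser', 'node', 'deno', 'bun', 'workers'];

const SUPPORT: Record<ApiFunction, readonly Runtime[]> = {
  initWasm: ALL_RUNTIMES,
  extractBytes: ALL_RUNTIMES,
  extractFile: ['node', 'deno', 'bun'],
  extractFromFile: ['browser'],
  enableOcr: ['browser'],
  initThreadPool: ['browser'],
  batchExtractFiles: ['browser']
};

// True if the matrix lists the function as available in the given runtime.
function isSupported(fn: ApiFunction, runtime: Runtime): boolean {
  return SUPPORT[fn].indexOf(runtime) !== -1;
}
```

Because extractBytes() is available everywhere, it is the safest choice for code that has to run in more than one runtime.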


Troubleshooting

"WASM module failed to initialize"

Ensure your bundler is configured to handle WASM files:

Vite:

vite.config.ts
export default {
  optimizeDeps: {
    exclude: ['@kreuzberg/wasm']
  }
}

Webpack:

webpack.config.js
module.exports = {
  experiments: {
    asyncWebAssembly: true
  }
}

"Module not found: @kreuzberg/core"

The @kreuzberg/core package is a peer dependency. Install it:

npm install @kreuzberg/core

"SharedArrayBuffer is not available"

This is expected in some browsers or when headers are not set. Multi-threading will not be available, but extraction will continue in single-threaded mode.

To enable multi-threading, set the required HTTP headers (see Platform-Specific Notes > Browser).


Memory Issues in Cloudflare Workers

For large documents, process in smaller chunks:

cloudflare_chunked.ts
const result = await extractBytes(pdfBytes, 'application/pdf', {
  chunking_config: { maxChars: 1000 }
});

WASM Module Not Loading

Symptoms: "Failed to load WASM module" error on initialization

Causes:

  • Network issues preventing WASM download
  • Bundler misconfiguration (not handling .wasm files correctly)
  • CORS restrictions blocking module fetch
  • Module not included in bundle

Solutions:

  1. Check browser network tab for failed requests
  2. Configure bundler (see "WASM module failed to initialize" section)
  3. Ensure CORS headers allow WASM requests
  4. Use CDN-delivered version as fallback


SharedArrayBuffer Not Available

Symptoms: Multi-threading features disabled, or "SharedArrayBuffer is not available" warning

Causes:

  • HTTPS context not used (required for security)
  • Missing Cross-Origin-Opener-Policy (COOP) headers
  • Missing Cross-Origin-Embedder-Policy (COEP) headers
  • Old browser version without SharedArrayBuffer support

Solutions:

  1. Ensure application runs over HTTPS in production
  2. Set required headers (see Platform-Specific Notes > Browser section):
    • Cross-Origin-Opener-Policy: same-origin
    • Cross-Origin-Embedder-Policy: require-corp
  3. Update browser to latest version
  4. Application will automatically fall back to single-threaded mode
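To find out up front whether multi-threading is likely to work in a browser, you can check for SharedArrayBuffer and for cross-origin isolation (the standard crossOriginIsolated global, which is true only when the COOP and COEP headers are in effect). The helper below is a sketch: canUseThreads and ThreadingEnv are hypothetical names, and the environment is passed in so the check can be exercised outside a browser.

```typescript
// Hypothetical helper (not part of @kreuzberg/wasm).
// Describes the two browser globals the check depends on.
interface ThreadingEnv {
  SharedArrayBuffer?: unknown;
  crossOriginIsolated?: boolean;
}

// True only if SharedArrayBuffer exists AND the page is cross-origin isolated.
function canUseThreads(env: ThreadingEnv): boolean {
  return typeof env.SharedArrayBuffer !== 'undefined' && env.crossOriginIsolated === true;
}

// In a browser you would typically call:
//   canUseThreads(globalThis as ThreadingEnv)
```

When this returns false, skip thread pool initialization and rely on single-threaded extraction.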


OCR Not Available or Not Working

Symptoms: "OCR is only available in browser" error or OCR produces no output

Causes:

  • Attempting to use enableOcr() outside of browser environment (Node.js/Deno/Workers)
  • Web Workers not supported or blocked
  • Training data not loading from jsDelivr CDN
  • Language model not available for selected language

Solutions:

  1. Check runtime with isBrowser() before enabling OCR:

    check_browser.ts
    import { isBrowser, enableOcr } from '@kreuzberg/wasm';
    
    if (isBrowser()) {
      await enableOcr();
    }
    

  2. Verify Web Worker support:

    check_workers.ts
    import { hasWorkers } from '@kreuzberg/wasm';
    
    if (hasWorkers()) {
      console.log('Web Workers available');
    }
    

  3. Check supported languages:

    check_ocr_languages.ts
    import { getOcrBackend } from '@kreuzberg/wasm';
    
    const backend = getOcrBackend('tesseract-wasm');
    if (backend) {
      const langs = backend.supportedLanguages();
      console.log('Supported languages:', langs);
      // Verify your language is in the list
    }
    

  4. Ensure network access to jsDelivr CDN:

    • First OCR call downloads training data (~50MB for English)
    • Subsequent calls use cached data
    • May fail without internet connection

  5. Handle initialization errors gracefully:

    ocr_graceful.ts
    import { enableOcr, extractBytes } from '@kreuzberg/wasm';
    
    let ocrEnabled = false;
    try {
      await enableOcr();
      ocrEnabled = true;
    } catch (error) {
      console.warn('OCR initialization failed:', error);
    }
    
    const config = ocrEnabled
      ? { ocr: { backend: 'tesseract-wasm', language: 'eng' } }
      : {};
    
    const result = await extractBytes(bytes, 'application/pdf', config);
    


WASM Module Size and Performance

Symptoms: Large bundle size or slow initial load

Context:

  • WASM module: ~5MB uncompressed
  • Gzip compressed: ~1.5-2MB
  • OCR training data (per language): ~20-50MB (downloaded on demand, cached)

Optimization strategies:

  1. Use code splitting to load WASM only when needed
  2. Compress with gzip/brotli (bundlers do this automatically)
  3. Load training data selectively (only load languages you need)
  4. Use extractBytes() for in-memory processing to avoid file I/O
  5. For large documents, enable chunking to reduce memory usage
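Loading the module only when it is first needed usually comes down to a dynamic import behind a memoized promise, so that initialization runs once even if several callers ask for it at the same time. The sketch below shows that pattern in isolation. lazyOnce is a hypothetical helper, and the loader is injected so the example does not depend on the package being installed.

```typescript
// Hypothetical helper (not part of @kreuzberg/wasm): run an async loader at most
// once and share its result; a failed attempt is forgotten so it can be retried.
function lazyOnce<T>(loader: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (!pending) {
      pending = loader().catch((error: unknown) => {
        pending = undefined; // allow a later retry
        throw error;
      });
    }
    return pending;
  };
}

// Possible usage with a dynamic import (assumes a bundler with code splitting):
//   const loadKreuzberg = lazyOnce(async () => {
//     const kreuzberg = await import('@kreuzberg/wasm');
//     await kreuzberg.initWasm();
//     return kreuzberg;
//   });
```

Callers then await loadKreuzberg() at the point where extraction is first requested, which keeps the WASM download out of the initial page load.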


Multi-Threading with wasm-bindgen-rayon

Kreuzberg WASM leverages wasm-bindgen-rayon to enable multi-threaded document processing with SharedArrayBuffer support.

Initializing Thread Pool

Initialize the thread pool with available CPU cores:

init_thread_pool.ts
import { extractBytes, initThreadPool } from '@kreuzberg/wasm';

// Initialize thread pool for multi-threaded extraction
await initThreadPool(navigator.hardwareConcurrency);

// Now extractions will use multiple threads for better performance
const result = await extractBytes(pdfBytes, 'application/pdf');
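navigator.hardwareConcurrency may be undefined in some environments, and using every reported core is not always desirable. A small helper can normalize the value before it is passed to initThreadPool(). This is a sketch: pickThreadCount is a hypothetical name, and the default cap of 4 is an assumption rather than a library recommendation.

```typescript
// Hypothetical helper (not part of @kreuzberg/wasm): choose a thread count.
// `reported` is typically navigator.hardwareConcurrency; `cap` is an assumed limit.
function pickThreadCount(reported: number | undefined, cap: number = 4): number {
  if (reported === undefined || !isFinite(reported) || reported < 1) {
    return 1; // unknown or invalid: fall back to a single thread
  }
  return Math.max(1, Math.min(Math.floor(reported), cap));
}

// Possible usage in a browser:
//   await initThreadPool(pickThreadCount(navigator.hardwareConcurrency));
```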

Graceful Degradation

If thread pool initialization fails, catch the error and continue; extraction then runs in single-threaded mode:

thread_pool_graceful.ts
import { extractBytes, initThreadPool } from '@kreuzberg/wasm';

try {
  await initThreadPool(navigator.hardwareConcurrency);
  console.log('Multi-threading enabled');
} catch (error) {
  // Fall back to single-threaded processing
  console.warn('Multi-threading unavailable:', error);
  console.log('Using single-threaded extraction');
}

// Extraction will work in both cases
const result = await extractBytes(pdfBytes, 'application/pdf');

See Also