Changelog¶

All notable changes to Kreuzberg will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[4.0.0-rc.13] - 2025-12-19¶

Fixed¶

PDF extractor feature flag mismatch: Corrected feature flag in conditional compilation for bundled PDFium
Code was checking for legacy pdf-bundled feature in #[cfg()] but feature was renamed to bundled-pdfium
Feature aliases don't apply to Rust conditional compilation - only to dependency resolution
This caused bundled PDFium module to never compile when bundled-pdfium was enabled
Affected: Python wheels, Ruby gem, WASM bindings, Java bindings
Fixed all #[cfg(feature = "pdf-bundled")] to #[cfg(feature = "bundled-pdfium")]
Python CLI binary discovery timeout: Restored binary discovery timeout to 2 seconds (regression fix)
Fixes benchmark tests that were timing out during CI runs
Go Windows library linking: Fixed missing system libraries in cgo linking and corrected build directory paths
Ensures Windows binaries link correctly with platform-specific libraries
Resolved benchmark path issues for Go bindings
Ruby vendor workspace dependencies: Added missing toml dependency to Ruby vendor workspace generation
Enables proper workspace management during Ruby gem packaging
Docker caching: Improved Docker caching with GitHub Actions Cache (GHA) backend
Reduces build times and improves CI efficiency
Repository cleanup: Removed smoke tests and unused scripts from publish pipeline
Streamlined CI workflows for faster feedback
Smoke test functionality moved to dedicated test_apps directories
Ruby gem publish pipeline: Enhanced Ruby gem publish pipeline to rebuild and validate before pushing
Ensures gem integrity before package distribution
Added pre-publish validation steps
WASM binary publishing: Added WASM binary files to git for NPM publishing
Ensures compiled binaries are available during package distribution
Supports both browser and Node.js WASM targets

[4.0.0-rc.10] - 2025-12-16¶

Breaking Changes¶

PDFium feature names changed: pdf-static→static-pdfium, pdf-bundled→bundled-pdfium, pdf-system→system-pdfium. Feature full-bundled removed (use full + bundled-pdfium).
Default PDFium linking: pdf feature now defaults to bundled-pdfium (auto-downloads and embeds PDFium).
Go module path: Moved from github.com/kreuzberg-dev/kreuzberg/packages/go/kreuzberg to github.com/kreuzberg-dev/kreuzberg/packages/go/v4. Update your imports and run go mod tidy.

Fixed¶

Windows CLI: Now includes bundled PDFium runtime
WASM Node.js/Deno: PDFium support for native targets (edge runtimes continue using lopdf)
Go bindings: Added ExtractFileWithContext() and batch variants for context cancellation support
Node/TypeScript: Replaced any types with proper definitions. Fixed data loss in page metadata.
Ruby bindings: Complete YARD documentation for all API methods
C# bindings: Complete XML documentation for all public methods

[4.0.0-rc.12] - 2025-12-19¶

Fixed¶

Python bindings: Fixed PDFium bundling in wheels by correcting feature flag in conditional compilation
Root cause: Code was checking for legacy pdf-bundled feature in #[cfg()] but feature was renamed to bundled-pdfium
Feature aliases don't apply to Rust conditional compilation - only to dependency resolution
Result: Bundled PDFium module was never compiled when bundled-pdfium was enabled
Fixed all #[cfg(feature = "pdf-bundled")] to #[cfg(feature = "bundled-pdfium")] in pdf module
Verified: Python wheels now successfully extract and initialize PDFium at runtime
C# bindings: Fixed MSBuild target to prevent overwriting CI-downloaded native assets
Added conditional logic to detect if runtimes directory already populated with cross-platform files
Prevents build step from creating empty platform directories that overwrite CI artifacts
Supports all platforms: Windows (x64/arm64), macOS (arm64/x64), Linux (x64/arm64)
Rust 2024 compliance: Added unsafe keyword to extern blocks in Ruby bindings
Required by Rust 2024 edition for extern "C" declarations
Fixes compilation error in packages/ruby/ext/kreuzberg_rb/native/src/lib.rs
WASM: Fixed unused import warning in pdf extractor
Added #[cfg(feature = "tokio-runtime")] to conditional import of std::path::Path
Path is only used in tokio-runtime enabled code paths
Docker: Fixed ONNX Runtime package installation
Corrected Debian Trixie package name from libonnxruntime to libonnxruntime1.21
Applied to both Dockerfile.core and Dockerfile.full
Verified: Both Docker images build successfully and PDF extraction works with bundled PDFium
Homebrew CLI: Added bundled-pdfium feature to kreuzberg-cli dependency
Ensures CLI binary includes embedded PDFium without requiring system installation
User can run kreuzberg extract file.pdf without extra setup
Python CI: Fixed path traversal calculation in test script
Script in scripts/ci/python/ needed proper directory levels to reach repo root
Changed from 2 levels (/../..) to 3 levels (/../../..)
LibreOffice tests: Disabled on Windows CI to prevent hanging
soffice binary lookup hangs on Windows runners during CI
Added #[cfg(not(target_os = "windows"))] to 12 LibreOffice test cases in Rust core
Tests continue running on macOS and Linux CI

[4.0.0-rc.11] - 2025-12-18¶

Fixed¶

PDFium bundling: Now correctly bundled in all language bindings (Node.js, Python, Java, Ruby, Go, C#)
FFI library copies libpdfium.dylib/.so/.dll from Rust build output during packaging
C# Kreuzberg.csproj now includes build target to copy native libraries to runtimes directories for all platforms
Node.js package.json includes all native library extensions (*.dylib, *.so, *.dll)
Fixes PDF extraction failures with "libpdfium not found" error
C# bindings: Added native library bundling with PDFium support for all platforms
Build target in Kreuzberg.csproj copies libkreuzberg_ffi and libpdfium to runtimes/{platform}/native directories
Supports macOS (arm64/x64), Linux (x64/arm64), Windows (x64/arm64)
Smoke test suite created in test_apps/csharp with 7 tests (PDF, DOCX, XLSX, JPG, PNG + OCR tests)
All C# tests passing with bundled PDFium
Rust core: Fixed missing Path import in pdf.rs causing compilation errors
Added use std::path::Path; to support async file extraction in PDF extractor
Node.js (NAPI-RS): PDFium always included in npm packages (no longer conditional)
Ruby gems: Fixed gem publishing validation error caused by incorrect compression handling
Root cause: Gems are POSIX tar archives with gzipped internal files (metadata.gz, data.tar.gz, checksums.yaml.gz) - this is the standard RubyGems format
Removed broken manual gzip step in publish script that was double-compressing valid gems
gem spec validation now passes directly on gems produced by bundle exec rake build
Go bindings: Removed duplicate Windows CGO linker flags causing compilation failures
Fixed packages/go/v4/ffi.go and packages/go/v4/plugins_test_helpers.go to use environment-set flags
Smoke test suite created in test_apps/go with 7 tests (PDF, DOCX, XLSX, JPG, PNG + OCR tests)
All Go tests passing with bundled PDFium via CGO/pkg-config
WASM (Deno): Fixed type definition references from .d.mts to .d.ts
Corrects Deno test helper type imports
C# NuGet: Fixed artifact download path to preserve native runtime directory structure
Java FFI: Added system library path fallback for ONNX Runtime when not bundled in JAR
Enables users with system-installed ONNX Runtime (e.g., brew install onnxruntime) to use the library
Gracefully handles missing ONNX Runtime for operations that don't require embeddings
Smoke tests: All 7 tests now passing across all five language bindings (Java, Python, Node.js, C#, Go)
PDF, DOCX, XLSX, JPG, PNG extraction + OCR tests all working
Verified test suite created for each binding in test_apps/{java,python,node,csharp,go}
WASM: Added PDF support to wasm-target feature for browser and Node.js WASM targets
Fixed build.rs to use bundled-pdfium for WASM instead of system-pdfium
Fixed PDF extractor to handle WASM synchronously (no tokio::spawn_blocking in WASM context)
Fixed PDF bindings initialization for WASM using system library binding
WASM test generator: Fixed hardcoded "application/pdf" MIME types in generated tests
Now correctly uses actual fixture media_type for each document format (DOCX, XLSX, HTML, etc.)
Regenerated all WASM e2e tests with correct MIME types

[Unreleased]¶

Breaking Changes¶

Embeddings now require ONNX Runtime installation
Switched from ort-download-binaries to ort-load-dynamic for runtime detection
Users must install ONNX Runtime separately to use embeddings functionality
Benefit: ~150-200MB package size reduction per platform
Windows MSVC support enabled for embeddings (NEW)
Installation: brew install onnxruntime (macOS), apt install libonnxruntime libonnxruntime-dev (Linux), scoop install onnxruntime (Windows)

Removed¶

embeddings-dynamic feature flag (embeddings now always uses dynamic loading)

[4.0.0-rc.10] - 2025-12-16¶

Breaking Changes¶

PDFium feature names changed: pdf-static→static-pdfium, pdf-bundled→bundled-pdfium, pdf-system→system-pdfium. Feature full-bundled removed (use full + bundled-pdfium).
Default PDFium linking: pdf feature now defaults to bundled-pdfium (auto-downloads and embeds PDFium).
Go module path: Moved from github.com/kreuzberg-dev/kreuzberg/packages/go/kreuzberg to github.com/kreuzberg-dev/kreuzberg/packages/go/v4. Update your imports and run go mod tidy.

Fixed¶

Windows CLI: Now includes bundled PDFium runtime
WASM Node.js/Deno: PDFium support for native targets (edge runtimes continue using lopdf)
Go bindings: Added ExtractFileWithContext() and batch variants for context cancellation support
Node/TypeScript: Replaced any types with proper definitions. Fixed data loss in page metadata.
Ruby bindings: Complete YARD documentation for all API methods
C# bindings: Complete XML documentation for all public methods

[4.0.0-rc.9] - 2025-12-15¶

Added¶

PDFIUM_STATIC_LIB_PATH environment variable: Enables custom static PDFium paths for Docker builds and static binaries

Fixed¶

Python: Wheels now include typing metadata (.pyi stubs) for IDE support
Java: Maven packages now bundle platform-specific native libraries (including Windows DLLs)
Node: npm platform packages now contain compiled .node binaries
WASM: Node.js runtime no longer crashes with self is not defined
PDFium static linking: Fixed to correctly search for libpdfium.a (was searching for dynamic library). Added macOS fallback to dynamic when static unavailable.

[4.0.0-rc.8] - 2025-12-14¶

Added¶

MCP HTTP Stream transport (GitHub #207)
HTTP Stream transport for MCP server using rmcp's built-in support
SSE (Server-Sent Events) for bidirectional communication per MCP spec
Session management with secure session IDs handled by rmcp
CLI flag to select transport: kreuzberg mcp --transport http --host 127.0.0.1 --port 8001
Enable with mcp-http feature flag
Production-ready for cloud deployments, Docker, and serverless

Fixed¶

CI/CD reliability: Improved publish workflows, increased test timeouts, and fixed disk space issues
Go bindings: Fixed CGO library path configuration for Linux and macOS
Python wheels: Now built with correct manylinux compatibility (manylinux: auto)
Ruby gems: Removed embedding model cache from distribution (was adding 567MB of bloat)
Maven Central: Updated publishing to use modern Sonatype Central API
Docker publishing: Added checks to skip redundant builds for already-released versions

4.0.0-rc.7 - 2025-12-12¶

Added¶

Configurable PDFium linking for Rust crate via Cargo features
pdf-static: Download and statically link PDFium (no runtime dependency)
pdf-bundled: Embed library in binary (self-contained executables)
pdf-system: Use system-installed PDFium via pkg-config
System installation scripts for Linux and macOS with pkg-config support
Runtime extraction module for bundled PDFium libraries
Comprehensive documentation in docs/guides/pdfium-linking.md
CI testing for system PDFium installation on Linux and macOS
Default behavior unchanged (backward compatible)
Language bindings continue to bundle PDFium automatically
WebAssembly bindings (@kreuzberg/wasm npm package) for browser, Cloudflare Workers, Deno, and Bun
Full TypeScript API with sync and async extraction methods
Multi-threading support via wasm-bindgen-rayon and SharedArrayBuffer
Batch processing with batchExtractBytes() and batchExtractBytesSync()
Plugin system for custom post-processors, validators, and OCR backends
MIME type detection and configuration management
Comprehensive unit tests (95%+ coverage on core modules)
Production-ready error handling with detailed error messages
RTF extractor now builds structured tables (markdown + cells) and parses RTF \info metadata (authors, dates, counts), bringing parity with DOCX/ODT fixtures.
New pandoc-generated RTF fixtures with embedded metadata for word_sample, lorem_ipsum, and extraction_test to validate cross-format extraction.
Page tracking and metadata redesign (#226)
Per-page content extraction with PageContent type
Byte-accurate page boundaries with PageBoundary type for O(1) lookups
Detailed per-page metadata with PageInfo type (dimensions, counts, visibility)
Unified page structure tracking with PageStructure type
PageConfig for controlling page extraction behavior
Automatic chunk-to-page mapping with first_page/last_page in ChunkMetadata
Format-specific support:
- PDF: Full byte-accurate tracking with O(1) performance
- PPTX: Slide boundary tracking
- DOCX: Best-effort page break detection
Page markers in combined text for LLM context awareness

Changed¶

BREAKING: ChunkMetadata field renames for byte-accurate tracking (#226)
char_start → byte_start (UTF-8 byte offset)
char_end → byte_end (UTF-8 byte offset)
Existing code using char_start/char_end must be updated
See migration guide for details

Fixed¶

Comprehensive lint cleanup across the crate and tests (clippy warnings resolved).
Publish workflow now tolerates apt-managed RubyGems installations by skipping unsupported gem update --system during gem rebuild and installs a fallback .NET SDK when the runner lacks dotnet.
Docker publish now skips pushing when the target version tag already exists, avoiding redundant builds for released images.
Docker tag existence is checked upfront before any publish work, and per-variant publish jobs are skipped early when the version is already present.
Added preflight checks for CLI, Go, and Rust crates to skip build/publish when the release artifacts already exist.
Maven publishing now uses Sonatype Central's central-publishing-maven-plugin with auto-publish/wait and Central user-token credentials, replacing the legacy OSSRH endpoint.
Maven Central publish timeout increased from 30 minutes to 2 hours to accommodate slower validation/publishing process.
Python wheels are now built with manylinux: auto parameter (was incorrectly set to manylinux2014 which is not a valid maturin-action value), fixing PyPI upload rejection of linux_x86_64 platform tags.
manylinux wheel builds now detect container type (CentOS vs Debian) and set correct OPENSSL_LIB_DIR paths (/usr/lib64 for CentOS, /usr/lib/x86_64-linux-gnu for Debian) to avoid openssl-sys build failures in maturin builds.
Ruby Gemfile.lock now includes x86_64-linux platform for CI compatibility on Linux runners.
Ruby gem corruption fixed by excluding .fastembed_cache (567MB of embedding models) and target directories from gemspec fallback path.
Java Panama FFM SIGSEGV crashes on macOS ARM64 fixed by adding explicit padding fields to FFI structs (CExtractionResult and CBatchResult) to ensure struct alignment matches between Rust and Java.
TypeScript E2E test type error fixed in smoke.spec.ts by using proper expectation object format.
Node.js benchmarks now have tsx as workspace dev dependency and root-level typecheck script.
C# compilation errors (CS0136, CS0128, CS0165) resolved by fixing variable shadowing in e2e/csharp/Helpers.cs.
Python CI timeout issues resolved by marking slow office document tests with @pytest.mark.slow and skipping them in CI.
Go CI tests enhanced with comprehensive verbose logging and platform-specific diagnostics for better debugging.

4.0.0-rc.6 - 2025-12-10¶

Release Candidate 6 - FFI Core Feature & CI/Build Improvements¶

New Features¶

FFI Bindings: - Added core feature for kreuzberg-ffi without embeddings support - Provides lightweight FFI build option excluding ONNX Runtime dependency - Enables Windows MinGW compatibility for Go bindings - Includes HTML processing and all document extraction features - Use --no-default-features --features core for MinGW builds

Bug Fixes¶

ODT Extraction: - Fixed ODT table extraction producing duplicate content - Table cells were being extracted twice: once as markdown tables (correct) and again as raw paragraphs (incorrect) - Root cause: XML traversal using .descendants() included nested table cell content as document-level text - Solution: Changed to only process direct children of <office:text> element, isolating table content - Impact: ODT extraction now produces clean output without cell duplication - Enhanced ODT metadata extraction to match Office Open XML capabilities - Added comprehensive metadata extraction from meta.xml (OpenDocument standard) - New OdtProperties struct supports all OpenDocument metadata fields - Extracts: title, subject, creator, initial-creator, keywords, description, dates, language - Document statistics: page count, word count, character count, paragraph count, table count, image count - Metadata extraction now consistent between ODT, DOCX, XLSX, and PPTX formats - Impact: ODT files now provide rich metadata comparable to other Office formats

Go Bindings: - Fixed Windows MinGW builds by disabling embeddings feature - Windows ONNX Runtime only provides MSVC .lib files incompatible with MinGW - Go bindings on Windows now use core feature (no embeddings) - Full features (including embeddings) remain available on Linux, macOS, and Windows MSVC - Fixed test execution to use test_documents instead of .kreuzberg cache - Ensures reproducible test runs without relying on user cache directory - Improves CI/CD reliability and test isolation

CI/CD Infrastructure: - Upgraded upload-artifact from v4 to v5 for compatibility with download-artifact@v6 - Fixes artifact version mismatch causing benchmark and CI failures - Affects 10 workflow files with 42 total changes - Resolves "artifact not found" errors in multi-job workflows - Fixed RUSTFLAGS handling in setup-onnx-runtime action - Now appends to existing RUSTFLAGS instead of overwriting - Preserves -C target-feature=+crt-static for Windows GNU builds - Fixed Go Windows CI artifact download path causing linker failures - Changed download path from target to . to prevent double-nesting (target/target/...) - Linker can now find libkreuzberg_ffi.dll at correct location - Added debug logging to show directory structure after artifact download - Aligned all workflows to Java 25 - Updated from Java 24 to 25 across all CI and publish workflows - Resolves "release version 25 not supported" compilation errors - Affects ci-validate, ci-java, publish, and benchmarks workflows

Ruby Bindings: - Fixed rb-sys links conflict in gem build - Removed rb-sys vendoring, now uses version 0.9.119 from crates.io - Resolves Cargo error: "package rb-sys links to native library rb, but it conflicts with previous package" - Allows Cargo to unify rb-sys dependency across magnus and kreuzberg-rb

C# E2E Tests: - Fixed OCR tests failing with empty content - Added render_config_expression function to C# E2E generator - Tests now pass proper OCR config JSON instead of null - Regenerated all C# tests with tesseract backend configuration - Fixed metadata array contains assertion for single value in array - Extended ValueContains method to handle value-in-array case - Fixes sheet_names metadata assertions in Excel tests

Python Bindings: - Fixed missing format_type in text extraction metadata - TypstExtractor and LatexExtractor incorrectly claimed text/plain MIME type - Removed text/plain from both extractors' supported types - PlainTextExtractor now correctly handles text/plain with proper TextMetadata - Metadata now includes format_type, line_count, word_count, character_count - Added unit test for Metadata serialization to verify format field flattening

4.0.0-rc.5 - 2025-12-01¶

Release Candidate 5 - macOS Binary Fix & Complete Pandoc Removal¶

Breaking Changes¶

Complete Pandoc Removal: - Removed all Pandoc dependencies from v4 codebase (100% native Rust extractors) - Deleted 7 Pandoc code files (3,006 lines) - Removed pandoc-fallback feature flag from Cargo.toml - Removed Pandoc installation from all CI/CD workflows (Linux, macOS, Windows) - Removed Pandoc from Docker images (saving ~500MB-1GB per image) - Updated all documentation to reflect native-only approach - Deleted 160+ Pandoc baseline test files - Native Rust extractors now handle all 12 previously Pandoc-supported formats: - LaTeX, EPUB, BibTeX, Typst, Jupyter, FictionBook, DocBook, JATS, OPML - Org-mode, reStructuredText, RTF, Markdown variants - Benefits: Simpler installation (no system dependencies), faster CI builds (~2-5 min improvement), smaller Docker images, pure Rust codebase - Migration: No action required - native extractors are drop-in replacements with equivalent or better quality

Bug Fixes¶

Build System: - Fixed macOS CLI binary missing libpdfium.dylib (dyld error at runtime) - Build script now correctly copies libpdfium.dylib to target-specific directory when using --target flag - Resolves: dyld: Library not loaded: @rpath/libpdfium.dylib - Impact: macOS CLI binary now functional in releases

Windows Go Builds: - Fixed persistent Windows Go CI failures where ring crate failed with MSVC toolchain detection - Set GNU as default Rust toolchain on Windows: rustup default stable-x86_64-pc-windows-gnu - Updated Rust build cache keys to include target architecture, preventing MSVC cache reuse - Added MSYS2 UCRT64 setup with comprehensive GNU toolchain configuration - Resolves: TARGET = Some(x86_64-pc-windows-msvc) error in build scripts - Impact: Windows Go bindings now build successfully with proper GNU toolchain isolation

Ruby CI Bundler 4.0 Compatibility: - Fixed gem installation failures on macOS and Linux caused by empty environment variables - Removed job-level GEM_HOME="" and BUNDLE_PATH="" that broke non-Windows builds - These variables are now only set on Windows with proper short paths for MAX_PATH mitigation - Updated to bundle update --all (deprecated bundle update removed in Bundler 4.0) - Resolves: ERROR: While executing gem ... (Errno::ENOENT) No such file or directory @ dir_s_mkdir - Impact: Ruby gem builds now succeed on all platforms with Bundler 4.0.0

Note: rc.4 workflow fixes for Python, Node, Ruby, and Maven were committed after rc.4 tag, causing those packages not to publish. All fixes are now present for rc.5.

4.0.0-rc.3 - 2025-12-01¶

Release Candidate 3 - Publishing & Testing Fixes¶

Bug Fixes¶

Publishing Workflow: - Fixed crates.io publishing order (crates-io publishing now properly sequenced) - Fixed NuGet publishing to use API key authentication - Resolved all remaining publish workflow failures across CI/CD pipeline - Fixed Maven Central publishing to use NEXUS_USERNAME/PASSWORD credentials

Language Bindings: - Fixed C# tests cloning JsonNode values to avoid parent assignment violations - Resolved test failures across Ruby and Java bindings - Updated Node binding dependencies and lockfile - Fixed import paths in Node binding tests from src/ to dist/ - Removed incorrect dependencies from Node package.json

Core Library: - Prevented ONNX Runtime mutex errors during process cleanup - Fixed embeddings model initialization to prevent deadlocks - Prevented OCR backend clearing from affecting other tests - Switched from ort-load-dynamic to ort-download-binaries for better compatibility

CLI & Binaries: - Included libpdfium shared library in CLI binary packages for proper runtime linking

Documentation & Theme: - Updated documentation theme colors to align with new design system - Added CONTRIBUTING.md symlink to fix broken GitHub documentation links

CI/CD Infrastructure: - Restructured publish workflow for independent package publishing across languages - Fixed dependency updates for kreuzberg-tesseract to 4.0.0-rc.2 - Updated pnpm filters for consistent @kreuzberg/node package handling - Applied rustfmt to benchmarks and tests for code consistency

4.0.0-rc.4 - 2025-12-01¶

Release Candidate 4 - Critical CI/CD and Build Fixes¶

Bug Fixes¶

CI/CD Workflow Fixes: - Fixed RubyGems action version (v2 doesn't exist, now using v1.0.0) - Fixed pnpm workspace configuration (replaced invalid --cwd flag with -C) - Fixed Docker environment variables (undefined $LD_LIBRARY_PATH in Dockerfiles) - Fixed Maven credentials timing (env vars now available when setup-java generates settings.xml) - Fixed Maven GPG configuration (modernized arguments to --pinentry-mode=loopback format) - Removed release notes update job from publish workflow (not needed)

Core Library Fixes: - Fixed Tesseract OCR test failure (corrected API call ordering: set_image before set_source_resolution) - Fixed Go Windows CGO linking (build Rust FFI with x86_64-pc-windows-gnu target for MinGW compatibility)

Testing: - All 24 Tesseract tests now pass (was 23/24 in rc.3) - Go bindings now build successfully on Windows

4.0.0-rc.2 - 2025-11-30¶

Release Candidate 2 - C# Support & Infrastructure Improvements¶

Breaking Changes¶

TypeScript/Node.js Package Restructuring: - NPM package renamed from kreuzberg to @kreuzberg/node (scoped package) - Platform-specific packages now use @kreuzberg/{platform} naming scheme - TypeScript source consolidated into crates/kreuzberg-node (merged with native bindings) - Migration: Replace import { ... } from 'kreuzberg' with import { ... } from '@kreuzberg/node' - See Migration Guide for details

New Features¶

C#/.NET Bindings: - Complete C# bindings using .NET 9+ Foreign Function & Memory API - Native FFI bridge via kreuzberg-ffi C library - Supports .NET 9+ on Linux, macOS, and Windows - Package: Kreuzberg on NuGet (pending publication) - Full feature parity with other language bindings

Documentation Improvements: - New v3 documentation site with MkDocs - Comprehensive multi-language code examples for all 7 supported languages - API reference documentation for all bindings - Migration guides and tutorials

Bug Fixes¶

CI/CD Fixes: - Fixed all CI workflow failures (ONNX Runtime, Maven dependencies, path triggers) - Fixed benchmark harness configuration validation (single-file mode) - Added ONNX Runtime installation to all relevant CI workflows - Updated Maven plugins to latest compatible versions with enforcer plugin

Core Library Fixes: - Added lock poisoning recovery to embeddings model cache - Improved Tesseract tessdata path detection for Linux systems - Fixed Python async test configuration (pytest-asyncio) - Resolved Rust formatting issues with edition 2024 let-chain syntax

Benchmark Harness Improvements: - Added comprehensive adapter registration diagnostics - Fixed Tesseract benchmark path resolution - Improved Python async benchmark output with performance metrics - Added missing --max-concurrent parameter validation

Code Quality: - Fixed Python linting issues (ruff complexity, mypy type parameters) - Resolved all clippy warnings - Fixed C# generated file formatting - Improved error handling across all bindings

Developer Experience¶

Build System: - Go FFI library and Ruby native extension build fixes - Improved build reproducibility - Better error messages during compilation

Testing: - 268+ Rust tests passing - 13/13 Java tests passing - 32/32 benchmark harness tests passing - Fixed 27 Python async test failures - Unblocked 250+ tests across all language bindings

4.0.0-rc.1 - 2025-11-23¶

Major Release - Complete Architecture Rewrite¶

Kreuzberg v4 represents a complete architectural rewrite, transforming from a Python-only library into a multi-language document intelligence framework with a high-performance Rust core.

Architecture Changes¶

Rust-First Design¶

Complete Rust Core Rewrite (crates/kreuzberg): All extraction logic now implemented in Rust for maximum performance
Standalone Rust Crate: Can be used directly in Rust projects without Python dependencies
10-50x Performance Improvements: Text processing, streaming parsers, and I/O operations significantly faster
Memory Efficiency: Streaming parsers for multi-GB XML/text files with constant memory usage
Type Safety: Strong typing throughout the extraction pipeline

Multi-Language Support¶

Python: PyO3 bindings (crates/kreuzberg-py) with native Python extensions
TypeScript/Node.js: NAPI-RS bindings (crates/kreuzberg-node) for native Node modules
Ruby: Magnus bindings (packages/ruby/ext/kreuzberg_rb/native) with native Ruby extensions
Java: Panama/FFM bindings (packages/java, crates/kreuzberg-ffi) delivering native access for JVM applications
Rust: Direct usage of kreuzberg crate in Rust applications
CLI: Rust-based CLI (crates/kreuzberg-cli) with improved performance

New Features¶

Plugin System¶

PostProcessor Plugins: Transform extraction results (Python, TypeScript, Rust)
Validator Plugins: Enforce quality requirements with fail-fast validation (Python, TypeScript, Rust)
Custom OCR Backends: Integrate cloud OCR or custom ML models (Python, TypeScript, Rust)
NEW: TypeScript/JavaScript OCR Backend Support: Complete NAPI-RS ThreadsafeFunction bridge for JavaScript OCR backends
Guten OCR Backend: First-class TypeScript OCR implementation using @gutenye/ocr-node (PaddleOCR + ONNX Runtime)
JSON Serialization Bridge: Efficient data transfer between TypeScript and Rust across FFI boundaries
Custom Document Extractors: Add support for new file formats (Rust)
Cross-Language Plugin Architecture: Plugins can call between languages via FFI

Language Detection¶

Automatic Language Detection: Fast language detection using fast-langdetect
Multi-Language Support: Detect multiple languages in a single document
Configurable Confidence Thresholds: Control detection sensitivity
Available in: ExtractionResult.detected_languages

RAG & Embeddings Support¶

Automatic Embedding Generation: Generate embeddings for text chunks using ONNX models via fastembed-rs
RAG-Optimized Presets: 4 pre-configured presets (fast, balanced, quality, multilingual)
fast: 384-dim AllMiniLML6V2Q (~22M params) - Quick prototyping
balanced: 768-dim BGEBaseENV15 (~109M params) - Production default
quality: 1024-dim BGELargeENV15 (~335M params) - Maximum accuracy
multilingual: 768-dim MultilingualE5Base (100+ languages)
Model Caching: Thread-safe model cache with automatic download management
Batch Processing: Efficient batch embedding generation with configurable batch size
Embedding Normalization: Optional L2 normalization for similarity search
Custom Model Paths: Configure custom cache directories for model storage
Chunk Integration: Embeddings automatically generated and attached to chunks via Chunk.embedding
Available in: All languages (Rust, Python, TypeScript)

Image Extraction¶

Native Image Extraction: Extract embedded images from PDFs and PowerPoint presentations
Rich Metadata: Format, dimensions, colorspace, bits per component, page number
Cross-Language Raw Bytes: Returns raw image bytes (not PIL objects) for maximum compatibility
Nested OCR Support: Each extracted image can have an optional nested ocr_result field
Clean API Design: Images stored in ExtractionResult.images list with all metadata inline
No Backward Compatibility Required: New v4-only feature with clean, forward-looking design
Supported Formats: PDF (via lopdf), PowerPoint (via Python python-pptx)

Enhanced Extraction¶

XML Extraction: - Streaming XML parser using quick-xml - Memory-efficient processing of multi-GB XML files - Element counting and unique element tracking - Preserves text content while filtering XML structure

Plain Text & Markdown: - Streaming line-by-line parser for multi-GB text files - Markdown metadata extraction: headers, links, code blocks - Word count, line count, character count tracking - CRLF line ending support

PowerPoint (PPTX) Extraction: - Custom XML parser using roxmltree for Office Open XML format - Position-based text sorting (Y-primary, X-secondary) for accurate reading order - Table detection and extraction - List formatting (bulleted and numbered lists) - Image extraction with optional OCR integration - Text formatting preservation (bold, italic, underline) - Hyperlink detection and extraction - Speaker notes extraction - Comprehensive slide processing (30+ test cases covering complex scenarios)

Stopwords System: - 64 Language Support: Comprehensive stopword collections for Afrikaans, Arabic, Bulgarian, Bengali, Breton, Catalan, Czech, Danish, German, Greek, English, Esperanto, Spanish, Estonian, Basque, Persian, Finnish, French, Irish, Galician, Gujarati, Hausa, Hebrew, Hindi, Croatian, Hungarian, Armenian, Indonesian, Italian, Japanese, Kannada, Korean, Kurdish, Latin, Lithuanian, Latvian, Malayalam, Marathi, Malay, Nepali, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Sinhala, Slovak, Slovenian, Somali, Sesotho, Swedish, Swahili, Tamil, Telugu, Thai, Tagalog, Turkish, Ukrainian, Urdu, Vietnamese, Yoruba, Chinese, Zulu - Compile-Time Embedding: All stopword lists embedded in Rust binary using include_str!() macro - Zero Runtime I/O: No file system access required, eliminating deployment dependencies - Automatic Integration: Used by keyword extraction (YAKE/RAKE) and token reduction features

Comprehensive Metadata Extraction:

v4 introduces native metadata extraction across all major document formats:

PDF (native Rust extraction via lopdf): - Title, subject, authors, keywords - Created/modified dates, creator, producer - Page count, page dimensions, PDF version - Encryption status - Auto-generated document summary

Office Documents (native Office Open XML parsing): - DOCX: Core properties (Dublin Core metadata), app properties (page/word/character/line/paragraph counts, template, editing time), custom properties - XLSX: Core properties, app properties (worksheet names, sheet count), custom properties - PPTX: Core properties, app properties (slide count, notes, hidden slides, slide titles), custom properties - Non-blocking extraction (falls back gracefully if metadata unavailable)

Email (via mail-parser): - From, to, cc, bcc addresses - Message ID, subject, date - Attachment filenames

Images (via image crate + kamadak-exif): - Width, height, format - Comprehensive EXIF data (camera settings, GPS, timestamps, etc.)

XML (via Rust streaming parser): - Element count - Unique element names

Plain Text / Markdown (via Rust streaming parser): - Line count, word count, character count - Markdown only: Headers, links, code blocks

Structured Data (JSON/YAML/TOML): - Field count - Format type

HTML (via html-to-markdown-rs): - Comprehensive structured metadata extraction enabled by default - Parses YAML frontmatter and populates HtmlMetadata struct: - Standard meta tags: title, description, keywords, author - Open Graph: og:title, og:description, og:image, og:url, og:type, og:site_name - Twitter Card: twitter:card, twitter:title, twitter:description, twitter:image, twitter:site, twitter:creator - Navigation: base_href, canonical URL - Link relations: link_author, link_license, link_alternate - YAML frontmatter automatically stripped from markdown content - Accessible via ExtractionResult.metadata.html

Key Improvements from v3: - PDF: Pure Rust lopdf instead of Python playa-pdf for better performance - Office: Comprehensive native metadata extraction via Office Open XML parsing - All metadata extraction is non-blocking and gracefully handles failures - Python Type Safety: All metadata types now have proper TypedDict definitions with comprehensive field typing - PdfMetadata, ExcelMetadata, EmailMetadata, PptxMetadata, ArchiveMetadata - ImageMetadata, XmlMetadata, TextMetadata, HtmlMetadata - OcrMetadata, ImagePreprocessingMetadata, ErrorMetadata - IDE autocomplete and type checking for all metadata fields

Legacy MS Office Support: - LibreOffice conversion for .doc and .ppt files - Automatic fallback to modern format extractors after LibreOffice conversion - Optional system dependency (graceful degradation if unavailable)

PDF Improvements: - Better text extraction with pdfium-render - Improved image extraction - Force OCR mode for text-based PDFs - Password-protected PDF support (with crypto extra)

OCR Enhancements: - Table detection and reconstruction - Configurable Tesseract PSM modes - Custom OCR backend support - Image preprocessing and DPI adjustment - OCR result caching

API Changes¶

Core Extraction Functions¶

Async-First Design:

Python

# Asynchronous extraction (recommended for I/O-bound operations)
result = await extract_file("document.pdf")
result = await extract_bytes(data, "application/pdf")
results = await batch_extract_files(["doc1.pdf", "doc2.pdf"])

# Synchronous variants (for simple scripts or non-async contexts)
result = extract_file_sync("document.pdf")
result = extract_bytes_sync(data, "application/pdf")
results = batch_extract_files_sync(["doc1.pdf", "doc2.pdf"])

New TypeScript/Node.js API:

TypeScript

import { extractFile, extractFileSync, ExtractionConfig } from 'kreuzberg';

// Asynchronous extraction
const result = await extractFile('document.pdf');

// Synchronous extraction
const result = extractFileSync('document.pdf');

// Extraction with custom configuration (quality processing enabled)
const config = new ExtractionConfig({ enableQualityProcessing: true });
const result = await extractFile('document.pdf', null, config);

Rust API:

Rust

use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    // Initialize configuration with default settings
    let config = ExtractionConfig::default();
    // Perform asynchronous file extraction
    let result = extract_file("document.pdf", None, &config).await?;
    // Display extracted content
    println!("Extracted: {}", result.content);
    Ok(())
}

Configuration¶

Strongly-Typed Configuration: - All configuration uses typed structs/classes (no more dictionaries) - ExtractionConfig, OcrConfig, ChunkingConfig, etc. - Compile-time validation of configuration options - Better IDE autocomplete and type checking

Configuration File Support: - TOML, YAML, and JSON configuration files - Automatic discovery from current/parent directories - kreuzberg.toml, kreuzberg.yaml, or kreuzberg.json - CLI, API server, and MCP server all support config files

Result Types¶

Enhanced ExtractionResult:

Python

@dataclass
class ExtractionResult:
    content: str
    mime_type: str
    metadata: Metadata  # Format-specific strongly-typed metadata
    tables: List[ExtractedTable]
    detected_languages: Optional[List[str]]  # Language detection results (v4 feature)
    chunks: Optional[List[str]]

Strongly-Typed Metadata: - PdfMetadata, ExcelMetadata, EmailMetadata, ImageMetadata, etc. - Type-safe access to format-specific metadata - No more dictionary casting or key errors

Plugin System¶

PostProcessors¶

Python

from kreuzberg import register_post_processor, ExtractionResult

class MyPostProcessor:
    def name(self) -> str:
        return "my_processor"

    def process(self, result: ExtractionResult) -> ExtractionResult:
        # Apply custom transformations to extraction result
        return result

register_post_processor(MyPostProcessor())

Validators¶

Python

from kreuzberg import register_validator, ExtractionResult

class MyValidator:
    def name(self) -> str:
        return "my_validator"

    def validate(self, result: ExtractionResult) -> None:
        # Enforce minimum content length requirement
        if len(result.content) < 10:
            raise ValidationError("Content too short")

register_validator(MyValidator())

Custom OCR Backends¶

Python

from kreuzberg import register_ocr_backend

class CloudOCR:
    def name(self) -> str:
        return "cloud_ocr"

    def extract_text(self, image_bytes: bytes, language: str) -> str:
        # Send image to cloud OCR service and return extracted text
        return extracted_text

register_ocr_backend(CloudOCR())

Performance¶

10-50x faster text processing operations (streaming parsers)
Memory-efficient streaming for multi-GB files
Parallel batch processing with configurable concurrency
SIMD optimizations for text processing hot paths
Zero-copy operations where possible

Docker Images¶

All Docker images include LibreOffice and Tesseract by default:

kreuzberg-dev/kreuzberg:4.0.0-rc.1 - Core image with Tesseract OCR
kreuzberg-dev/kreuzberg:4.0.0-rc.1-easyocr - Core + EasyOCR
kreuzberg-dev/kreuzberg:4.0.0-rc.1-paddle - Core + PaddleOCR
kreuzberg-dev/kreuzberg:4.0.0-rc.1-vision-tables - Core + vision-based table extraction
kreuzberg-dev/kreuzberg:4.0.0-rc.1-all - All features included

Installation¶

Python:

Terminal

pip install kreuzberg               # Core functionality
pip install "kreuzberg[api]"        # With API server
pip install "kreuzberg[easyocr]"    # With EasyOCR
pip install "kreuzberg[all]"        # All features

TypeScript/Node.js:

Terminal

npm install @kreuzberg/node
# or
pnpm add @kreuzberg/node

Rust:

Cargo.toml

[dependencies]
kreuzberg = "4.0"

CLI (Homebrew):

Terminal

brew install kreuzberg-dev/tap/kreuzberg

CLI (Cargo):

Terminal

cargo install kreuzberg-cli

Breaking Changes from v3¶

Architecture¶

Rust core required: Python package now includes Rust binaries (PyO3 bindings)
Binary wheels only: No more pure-Python installation
Minimum versions: Python 3.10+, Node.js 18+, Rust 1.75+

API Changes¶

Async-first API: Primary API is now async, sync variants have _sync suffix
Configuration: All config uses typed classes, not dictionaries
Metadata: Strongly-typed metadata replaces free-form dictionaries
Function renames: extract() → extract_file(), extract_bytes() is new
Batch API: batch_extract() → batch_extract_files() with async support

Removed Features¶

Pure-Python API: No longer available (use v3 for pure Python)
Old configuration format: Dictionary-based config no longer supported
Legacy extractors: Some Python-only extractors migrated to Rust
GMFT (Give Me Formatted Tables): Vision-based table extraction using TATR (Table Transformer) models removed
v3's GMFT used deep learning models for sophisticated table detection and parsing
Provided polars DataFrames, PIL Images, and multi-level header support
v4 replaces this with native Tesseract-based table detection (OCR-based, faster, simpler)
Configure via TesseractConfig.enable_table_detection=True
Returns ExtractedTable objects with cells (2D list) and markdown output
For advanced vision-based table extraction, use v3.x or specialized libraries
Entity Extraction (spaCy): Named entity recognition removed - use external NER libraries with postprocessors
Keyword Extraction (KeyBERT): Automatic keyword extraction removed - use external keyword extractors with postprocessors
Document Classification: Automatic document type detection removed - use external classifiers with postprocessors

Migration Path¶

See Migration Guide for detailed migration instructions.

Documentation¶

New Documentation Site: https://docs.kreuzberg.dev
Multi-Language Examples: Python, TypeScript, and Rust examples
Plugin Development Guides: Comprehensive guides for each language
API Reference: Auto-generated from docstrings
Architecture Documentation: Detailed system architecture explanations

Testing¶

95%+ Test Coverage: Comprehensive test suite in Python, TypeScript, and Rust
Integration Tests: Real-world document testing
Benchmark Suite: Performance comparison with other extraction libraries
CI/CD: Automated testing on Linux, macOS, and Windows

Bug Fixes¶

Fixed memory leaks in PDF extraction
Improved error handling and error messages
Better Unicode support in text extraction
Fixed table extraction edge cases
Resolved deadlocks in plugin system

Security¶

All dependencies audited and updated
No known security vulnerabilities
Sandboxed subprocess execution (LibreOffice)
Input validation on all user-provided data

Contributors¶

Kreuzberg v4 was a major undertaking. Thank you to all contributors!

3.22.0 - 2025-11-27¶

Fixed¶

Always attempt EasyOCR import before raising MissingDependencyError to improve error handling
Hardened HTML regexes for script/style stripping to prevent edge cases

Added¶

Test coverage for EasyOCR import path edge cases

Changed¶

Updated v3 dependencies to current versions

3.21.0 - 2025-11-05¶

Major Release: Kreuzberg v4 Migration¶

This release introduces Kreuzberg v4, a complete rewrite with Rust core, polyglot bindings (Python/TypeScript/Ruby/Java/Go), and enhanced architecture. v3 remains available for legacy projects.

Added¶

Core (Rust)¶

Complete Rust core library with comprehensive document extraction pipeline
Plugin system architecture for extensible extractors, OCR backends, postprocessors, and validators
PDF extraction and rendering with advanced table detection
Format extraction for Office documents (DOCX, XLSX, PPTX), HTML, XML, and plain text
OCR subsystem with pluggable backend support (Tesseract, EasyOCR, PaddleOCR)
Image processing utilities with EXIF extraction and format conversion
Text processing with token reduction, chunking, keyword extraction, and language detection
Stopwords utilities for 50+ languages
Cache system with in-memory and persistent storage
Embeddings support via FastEmbed
MCP (Model Context Protocol) server integration
CLI binary for command-line document extraction

Language Bindings¶

Python: PyO3 FFI bindings with Python-idiomatic API and async support
TypeScript/Node.js: NAPI-RS bindings with full type definitions
Ruby: Magnus FFI bindings with RBS type definitions
Java: Java 25 Foreign Function & Memory API (FFM/Panama) bindings
Go: CGO bindings with Go 1.25+ support
C#: FFI bindings (baseline implementation)

API Server & MCP¶

REST API server for document extraction
MCP server for Claude integration
Comprehensive configuration system with builder patterns

Testing & Infrastructure¶

95%+ test coverage across all components
End-to-end test suites for all language bindings (auto-generated from fixtures)
Comprehensive Rust test suite with unit/integration/doc tests
Multi-language CI/CD with GitHub Actions
Docker builds for containerized deployment
Benchmark harness for performance analysis

Documentation¶

New documentation site: https://docs.kreuzberg.dev
Language-specific guides for all 6+ bindings
Plugin development documentation
API reference with code examples
Architecture documentation
Migration guide from v3 to v4

Changed¶

Architecture restructured around Rust core with thin language-specific wrappers
Build system upgraded to Rust Edition 2024 with Cargo workspace
Dependency management via Cargo, npm/pnpm, PyPI, Maven, and Go modules
Task automation via Taskfile.yaml
Enhanced error handling with typed exceptions across all languages
Configuration system redesigned for consistency across languages

Removed¶

Old v3 codebase superseded by v4 (v3 remains available in separate branch)
Legacy Python implementation details replaced by PyO3 bindings
Previous Node.js implementation replaced by NAPI-RS

Security¶

All dependencies audited for vulnerabilities
Sandboxed subprocess execution (LibreOffice, Tesseract)
Input validation on all user-provided data
Memory safety via Rust

Performance¶

Streaming PDF extraction for memory efficiency
Zero-copy patterns throughout Rust core
SIMD optimizations where applicable
ONNX Runtime for embeddings
Async-first design for I/O operations

3.20.2 - 2025-10-11¶

Fixed¶

Surface missing optional dependency errors in GMFT extractor
Stabilize aggregation data loading in benchmarks
Make Docker E2E API checks resilient to transient failures

Changed¶

Make comparative benchmarks manual-only to reduce resource consumption

Dependencies¶

Updated dependencies to latest versions

3.20.1 - 2025-10-11¶

Fixed¶

Correct publish docker workflow include path

Changed¶

Optimize sdist size by excluding unnecessary files

3.20.0 - 2025-10-11¶

Added¶

Python 3.14 support

Changed¶

Migrate HTML extractor to html-to-markdown v2 for improved compatibility and performance

Fixed¶

Comparative benchmarks: fix frozen dataclass mutation and aggregation workflow
CI improvements and coverage adjustments
Add pytest-benchmark to dev dependencies for benchmark workflow

Dependencies¶

Bump astral-sh/setup-uv from 6 to 7
Bump peter-evans/dockerhub-description from 4 to 5
Bump actions/download-artifact from 4 to 5
Bump actions/setup-python from 5 to 6
Bump actions/checkout from 4 to 5

Documentation¶

Document Python 3.14 support limitations

3.19.1 - 2025-09-30¶

Fixed¶

Replace mocked Tesseract test with real file-based test
Remove Rust build step from kreuzberg benchmarks workflow
Resolve prek and mypy issues in comparative-benchmarks
Add ruff config to comparative-benchmarks to allow benchmark patterns
Ensure Windows Tesseract 5.5.0 HOCR output compatibility
Properly type TypedDict configs with type narrowing and cast

Changed¶

Optimize extractor configurations for fair comparison in benchmarks
Complete comparative-benchmarks workspace with visualization and docs generation

Dependencies¶

Updated dependencies to latest versions

Testing¶

Add regression test for Issue #149 Windows Tesseract HOCR compatibility

3.19.0 - 2025-09-29¶

Added¶

Enforce critical system error policy with context-aware exception handling
Implement systematic exception handling audit across all modules
Add German language pack support for Windows CI

Fixed¶

Align sync/async OCR pipelines and fix Tesseract PSM enum handling
Remove magic library dependency completely
Prevent magic import crashes in Windows tests
Properly mock magic import in CLI tests
Correct asyncio.gather result assertion in profiler test
Resolve benchmark test failures and naming cleanup
Add Windows-safe fallbacks for CLI progress and magic
Make OCR test less brittle for character recognition
Eliminate all Rich Console instantiations at import time
Prevent Windows access violation in benchmarks CLI
Update OCR test assertion to match actual error message format
Handle ValidationError gracefully in batch processing contexts
Correct test fixture file paths and remove hardcoded paths
Ensure coverage job only runs when all tests pass

Changed¶

Remove verbose AI-pattern naming from components
Enable coverage job for branches and update changelog

Dependencies¶

Add missing needs declaration to python-tests job in workflows

Documentation¶

Align documentation with v4-dev improvements
Improve error handling policy documentation

Performance¶

Parallelize comparative benchmarks with 6-hour timeout

3.18.0 - 2025-09-27¶

Added¶

Make API server configuration configurable via environment variables
Improve spaCy model auto-download with uv fallback
Add regression tests for German image PDF extraction (issue #149)

Changed¶

Use uv to install prek for latest version
Replace pre-commit commands with Prek in Taskfile.yml
Update pre-commit instructions to use Prek
Update html-to-markdown to latest version
Auto-download missing spaCy models for entity extraction
Fix HOCR parsing issues

Fixed¶

Resolve mypy type checking issues
Prevent DeepSource upload when tests fail
Make DeepSource upload optional when DEEPSOURCE_DSN is missing
Prevent coverage-pr from running when test-pr fails

Dependencies¶

Remove pre-commit dependency from dev requirements
Updated dependencies to latest versions

CI/CD¶

Use prek instead of pre-commit in validation workflow

Documentation¶

Update contributing guide to use prek instead of pre-commit
Update uv.lock to reflect dependency changes and remove upload-time attributes

3.17.0 - 2025-09-17¶

Added¶

Add token reduction for text optimization
Add concurrency settings to cancel in-progress workflow runs
Optimize token reduction and update dependencies
Optimize token reduction performance and add streaming support

Fixed¶

Resolve excessive markdown escaping in OCR output (fixes #133)
Remove invalid table extraction tests for non-existent functions
Ensure comprehensive test coverage in CI
Resolve Windows path compatibility and improve test coverage
Critical issues from second review of token reduction

Changed¶

Complete token reduction implementation overhaul

Testing¶

Improve coverage and add pragma no cover annotations

3.16.0 - 2025-09-16¶

Added¶

Enhance JSON extraction with schema analysis and custom field detection
Add internal streaming optimization for html-to-markdown conversions
Comprehensive test coverage improvements

Fixed¶

Export HTMLToMarkdownConfig in public API
Resolve EasyOCR module-level variable issues and adjust coverage requirement
Resolve mypy type errors and test failures
Add xfail markers to chunking, ML-dependent, cache, and language detection tests
Add xfail markers for EasyOCR device validation tests
Resolve CI test failures with targeted fixes
Add missing Any import for type annotations
Update docker e2e workflow to use correct test filename
Prevent docker_e2e.py from being discovered by pytest
Resolve Windows-specific path issues in tests

Changed¶

Remove coverage fail_under threshold completely
Remove unnecessary xfail markers

Dependencies¶

Bump actions/download-artifact from 4 to 5

Testing¶

Add comprehensive tests for API configuration caching
Increase test coverage to meet CI requirements

3.15.0 - 2025-09-14¶

Added¶

Add comprehensive image extraction support
Add polars DataFrame and PIL Image serialization for API responses
Improve test coverage across core modules

Fixed¶

Resolve mypy type errors in CI
Remove unused flame_config parameter from profile_benchmark
Resolve CI formatting issues by ignoring generated AGENTS.md
Resolve pre-commit formatting issues and ruff violations
Resolve all test failures and achieve full test suite compliance
Add polars DataFrame and PIL Image serialization for API responses
Resolve TypeError unhashable type dict in API config merging
Address linting and type issues from PR #130 review

Changed¶

Apply ruff formatting across the codebase

Documentation¶

Add comprehensive image extraction documentation
Update documentation to use ImageOCRConfig instead of deprecated fields

Cleanup¶

Remove INTERFACE_PARITY.md causing mdformat issues

3.14.0 - 2025-09-13¶

Added¶

Comprehensive DPI configuration system for OCR processing with fine-grained control over resolution settings

Changed¶

Enhanced API with 1GB upload limit and comprehensive OpenAPI documentation
Completed pandas to polars migration across entire codebase for improved performance and memory efficiency

Fixed¶

Improved lcov coverage combining robustness in CI pipeline
DPI configuration tests moved to proper test directory for correct CI coverage calculation
Pre-commit formatting issues resolved
PDF content handling differences between CI and local tests
PaddleOCR test isolation and fixture path corrections

3.13.0 - 2025-09-04¶

Added¶

Runtime configuration API with query parameters and header support for flexible request handling
Comprehensive runtime configuration documentation with practical examples
OCR caching system for EasyOCR and PaddleOCR backends to improve performance

Changed¶

Replaced pandas with polars for table extraction (significant performance improvement)
Consolidated benchmarks CLI into unified interface with improved structure
Converted all class-based tests to function-based tests for consistency
Restructured benchmarks package layout for better organization
Removed all module-level docstrings from Python files for cleaner codebase

Fixed¶

MyPy errors and type annotations throughout codebase
Tesseract TSV output format and table extraction implementation
UTF-8 encoding handling across document processing pipeline
Config loading with proper error handling and validation
HTML-to-Markdown configuration now externalized for flexibility
All ruff and mypy linting errors resolved
Test failures in CI environment
Regression in PDF extraction and XLS file handling
Subprocess error analysis for CalledProcessError handling
Empty DataFrames in GMFT table extraction (pandas.errors.EmptyDataError prevention)

Performance¶

Optimized PDF processing and HTML conversion performance
Improved TSV integration test quality with concrete assertions

3.12.0 - 2025-08-30¶

Added¶

Docker E2E testing infrastructure with multi-image matrix strategy
Multilingual OCR support to Docker images with flexible backend selection
Docker documentation with clarity on image contents, sizes, and OCR backend selection

Changed¶

Simplified Docker images to base and core variants (removed intermediate images)
Docker strategy updated to 2-image approach for optimized distribution
Docker E2E tests aligned with CI patterns using matrix strategy for parallel execution
Improved Docker workflow with manual trigger options for flexibility

Fixed¶

Docker image naming conventions (kreuzberg-core:latest)
Boolean condition syntax in Docker workflow
Docker permissions and documentation accuracy
Docker E2E test failure detection (ensures proper test exit code propagation)
Docker workflow disk space management and optimization
Grep exit code handling in Docker test workflows
Naming conflict in CLI config command
grep exit code failures in shell scripts
EasyOCR test image selection (use text-containing image instead of flower)
Test fixture image file naming for clarity

Infrastructure¶

Enhanced disk space cleanup in Docker E2E workflow
Improved Docker workflow disk space management
Updated test-docker-builds.yml for new image naming scheme

3.11.1 - 2025-08-13¶

Fixed¶

EasyOCR device-related parameters removed from readtext() calls
Numpy import optimization: only import numpy inside process_image_sync rather than at module level to improve startup time
Table attribute access in documentation
Pre-commit formatting for table documentation

Infrastructure¶

Updated uv.lock after version bump
AI-rulez pre-commit hooks compatibility with Go setup in CI workflow
Temporary ai-rulez hooks disabled to unblock CI
Reverted to ai-ruyles v1.4.3 with Go setup for CI stability
Dependency updates including amannn/action-semantic-pull-request v5→v6 and actions/checkout v4→v5

3.11.0 - 2025-08-01¶

Added¶

Comprehensive tests for Tesseract OCR backend with edge case coverage
Comprehensive tests for API module with full functionality coverage
Comprehensive tests for config and image extractor modules
Comprehensive Click-based tests for CLI module with proper command testing
Comprehensive tests for structured data extractor module
Comprehensive tests for extraction and document classification modules
Comprehensive tests for pandoc extractor module
Comprehensive tests for email extractor module
Comprehensive tests for _config.py module
Comprehensive tests for _utils._errors module with error handling validation
Comprehensive tests for spreadsheet metadata extraction
Retry mechanisms for CI test reliability to handle transient failures

Changed¶

Implemented Python 3.10+ syntax optimizations (match/case, union types where applicable)
Merged comprehensive test files with regular test files for consolidated coverage
Optimized pragma no cover usage to reduce false negatives
Converted test classes to functional tests for better pytest compatibility
Updated coverage requirements documentation from 95% to 85%

Fixed¶

Linting issues in merged comprehensive test files
Image extractor async delegation tests
CI test failures with improved retry mechanisms
Timezone assertion in spreadsheet metadata test
MyPy errors in test files and config calls
ExceptionGroup import errors for Python 3.10+ compatibility
Syntax errors in test files
Bug in _parse_date_string with proper test mocks
Pre-commit formatting issues
DeepSource coverage reporting with restructured CI workflow

Infrastructure¶

Updated uv.lock to revision 3 to fix CI validation issues
Fixed CI workflow structure for improved DeepSource coverage reporting
Coverage upload now only triggers on push events (not pull requests)
Version bump to 3.10.1 with lock file updates

3.10.0 - 2025-07-29¶

Added¶

PDF password support through new crypto extra feature

Documentation¶

Updated quick start section to use available configuration options

3.9.0 - 2025-07-17¶

Added¶

Initial release of v3.9.0 series with foundational features

3.8.0 - 2025-07-16¶

Added¶

Foundation for v3.8.0 release

3.7.0 - 2025-07-11¶

Added¶

MCP server for AI integration enabling Claude integration with document extraction capabilities
MCP server configuration and optional dependencies support

Fixed¶

Chunk parameters adjusted to prevent overlap validation errors
MCP server linting issues resolved
HTML test compatibility with html-to-markdown v1.6.0 behavior

Changed¶

MCP server tests converted to function-based format
MCP server configuration documentation clarified

3.6.0 - 2025-07-04¶

Added¶

Language detection functionality integrated into extraction pipeline

Fixed¶

Entity extraction migration from gliner to spaCy completed
Docker workflow reliability improved with updated ai-rulez configuration
Optional imports refactored to use classes for better type handling
Validation and post-process logic abstracted into helper functions for sync and async paths

Changed¶

spaCy now used for entity extraction replacing gliner
Optional imports moved inside functions for proper error handling
Entity and keyword extraction validation/post-processing refactored

3.5.0 - 2025-07-04¶

Added¶

Language detection functionality with configurable backends
Performance optimization guidelines and documentation
Full synchronous support for PaddleOCR and EasyOCR backends
Docker documentation and Python 3.10+ compatibility
Benchmarks submodule integration

Fixed¶

Flaky OCR tests marked as xfail in CI environments
Typing errors resolved across codebase
Chunking default configuration reverted to stable settings
PaddleOCR sync implementation updated to correct v3.x API format
Python 3.9 support dropped and version bumped to v3.4.2
Docker workflow version detection and dependencies fixed
Docs workflow configuration and Docker manual dispatch enabled
CI documentation dependencies resolved

Changed¶

Default configurations optimized for modern document extraction
Tesseract OCR configuration optimized based on benchmark analysis
Documentation link moved to top of README
Python 3.10+ now required (3.9 support dropped)

3.4.0 - 2025-07-03¶

Added¶

API support with Litestar framework for web-based document extraction
API tests aligned with best practices
EasyOCR and GMFT Docker build variants
Docker support with comprehensive CI integration
API and Docker documentation

Fixed¶

'all' extra dependency configuration simplified
Docker workflow extras configuration corrected
Race condition in GMFT caching tests resolved
Race condition in Tesseract caching tests resolved
Docker Hub repository configuration fixed
CLI integration test using tmp_path fixture for isolation
gmft module mypy type errors resolved

Changed¶

Docker module migrated to api module with Litestar framework
Docker workflow matrix simplified
Performance documentation enhanced with competitive benchmarks

3.3.0 - 2025-06-23¶

Added¶

Isolated process wrapper for GMFT table extraction
Comprehensive statistical benchmarks proving msgpack superiority
Comprehensive CLI support with Click framework
Comprehensive benchmarking suite for sync vs async performance
Pure synchronous extractors without anyio dependencies
Document-level caching with simplified concurrency handling
Per-file locks and parallel batch processing for sync performance
Python version matrix CI testing (3.9-3.13)
Thread lock for pypdfium2 to prevent macOS segfaults

Fixed¶

All remaining mypy and pydoclint issues ensuring CI passes
Python 3.13 mock compatibility issues in process pool tests
ExceptionGroup compatibility in sync tests
GMFT config updated and test failures resolved
All Windows-specific test failures in multiprocessing and utils modules
Windows compatibility issues resolved
Test coverage improvements from 76% to 83%
Rust formatting issues with edition 2024 let-chain syntax
File existence validation added to extraction functions
Python linting issues (ruff complexity, mypy type parameters) resolved
All clippy warnings resolved
C# generated file formatting fixed

Changed¶

msgspec JSON replaced with msgpack for 5x faster cache serialization
Caching migrated from msgspec JSON to msgpack for improved performance
Async performance optimized with semaphore-based concurrency control
Sync performance optimized with per-file locks and parallel batch processing
Code formatting cleanup and non-essential comments removed
Benchmark workflows removed from CI
Python version cache key updated to include full version

3.2.0 - 2025-06-23¶

Added¶

GPU acceleration support for enhanced OCR and ML operations

Fixed¶

EasyOCR byte string issues resolved
Pandoc version issues fixed
PaddleOCR configuration updated for optimal performance
Multiple language support added to EasyOCR

Changed¶

playa-pdf pinned to version 0.4.3 for stability
PaddleOCR configuration optimized
Dependencies updated to latest compatible versions
Pandoc version string resolution improved

3.1.0 - 2025-03-28¶

Added¶

GMFT (Give Me Formatted Tables) support for vision-based table extraction

Fixed¶

Bundling issue corrected
Wrong link in README fixed
Dependencies updated and GMFT testing issues resolved

Changed¶

Image extraction now non-optional in results
Test imports and structure updated
Concurrency test added for validation

3.0.0 - 2025-03-23¶

Added¶

Chunking functionality for document segmentation
Extractor registry for managing format-specific extractors
Hooks system for pre/post-processing
OCR backend abstraction with EasyOCR and PaddleOCR support
OCR configuration system
Multiple language support in OCR backends
Comprehensive documentation

Fixed¶

Pre-commit hooks configuration
Windows error message handling
PaddleOCR integration issues
Dependencies updated to stable versions

Changed¶

Structure refactored for improved organization
Metadata extraction approach updated
OCR integration with configurable backends
Documentation added and improved