Token Reduction¶
Kreuzberg provides a token reduction capability that helps optimize extracted text for processing by large language models or storage systems. This feature can significantly reduce the size of extracted content while preserving essential information and meaning.
Overview¶
Token reduction processes extracted text to remove redundant content, normalize formatting, and optionally eliminate stopwords. This is particularly useful when working with token-limited APIs, implementing content summarization, or reducing storage costs for large document collections.
Configuration¶
Token reduction is controlled through the `ExtractionConfig` class with the `token_reduction` parameter, which accepts a `TokenReductionConfig` object:
- `mode`: The reduction level - `"off"`, `"light"`, or `"moderate"` (default: `"off"`)
- `preserve_markdown`: Whether to preserve markdown structure during reduction (default: `True`)
- `language_hint`: Language hint for stopword removal in moderate mode (default: `None`)
- `custom_stopwords`: Additional stopwords per language (default: `None`)
⚠️ Important Limitations:
- Maximum text size: 10MB (10,000,000 characters)
- Language codes must match format: alphanumeric and hyphens only (e.g., "en", "en-US")
Reduction Modes¶
Off Mode¶
No reduction is applied - text is returned exactly as extracted.
Light Mode¶
Applies formatting optimizations without changing semantic content:
- Removes HTML comments
- Normalizes excessive whitespace
- Compresses repeated punctuation
- Removes excessive newlines
Performance: ~10% character reduction, <0.1ms processing time
Moderate Mode¶
Includes all light mode optimizations plus stopword removal:
- Removes common stopwords in 64+ supported languages
- Preserves important words (short words, acronyms, words with numbers)
- Maintains markdown structure when enabled
Performance: ~35% character reduction, ~0.2ms processing time
Basic Usage¶
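A minimal sketch of enabling moderate reduction during extraction, assuming Kreuzberg's async `extract_file` API; the input path `report.pdf` is a hypothetical example:

```python
import asyncio

from kreuzberg import ExtractionConfig, TokenReductionConfig, extract_file


async def main() -> None:
    # Enable moderate reduction (formatting cleanup + stopword removal).
    config = ExtractionConfig(
        token_reduction=TokenReductionConfig(mode="moderate"),
    )
    result = await extract_file("report.pdf", config=config)
    print(result.content)  # the reduced text


asyncio.run(main())
```

A synchronous variant (`extract_file_sync`) is available if you are not in an async context.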
Language Support¶
Token reduction supports stopword removal in 64+ languages including:
- Germanic: English (en), German (de), Dutch (nl), Swedish (sv), Norwegian (no), Danish (da)
- Romance: Spanish (es), French (fr), Italian (it), Portuguese (pt), Romanian (ro), Catalan (ca)
- Slavic: Russian (ru), Polish (pl), Czech (cs), Bulgarian (bg), Croatian (hr), Slovak (sk)
- Asian: Chinese (zh), Japanese (ja), Korean (ko), Hindi (hi), Arabic (ar), Thai (th)
- And many more: Finnish, Hungarian, Greek, Hebrew, Turkish, Vietnamese, etc.
Custom Stopwords¶
You can add domain-specific stopwords for better reduction:
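As a sketch, `custom_stopwords` is assumed here to map a language code to a list of extra words; the stopword choices below are invented examples:

```python
from kreuzberg import ExtractionConfig, TokenReductionConfig

# Domain-specific filler words to strip in addition to the built-in lists.
config = ExtractionConfig(
    token_reduction=TokenReductionConfig(
        mode="moderate",
        custom_stopwords={
            "en": ["furthermore", "moreover", "hereby"],
            "de": ["diesbezüglich", "gegebenenfalls"],
        },
    ),
)
```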
Markdown Preservation¶
When `preserve_markdown=True` (default), the reducer maintains document structure:
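A config sketch; the behavior summarized in the comments follows the description above rather than a complete specification:

```python
from kreuzberg import ExtractionConfig, TokenReductionConfig

# With preserve_markdown=True (the default), reduction is applied to prose,
# while structural elements such as headings, list markers, and code fences
# are kept intact.
config = ExtractionConfig(
    token_reduction=TokenReductionConfig(
        mode="moderate",
        preserve_markdown=True,
    ),
)
```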
Reduction Statistics¶
You can get detailed statistics about the reduction effectiveness:
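The figures reported in the benchmarks below can be derived by comparing the text before and after reduction. The helper below is a hypothetical illustration of that computation, not part of Kreuzberg's public API:

```python
def reduction_stats(original: str, reduced: str) -> dict[str, float]:
    """Compute character- and token-level reduction ratios."""
    char_reduction = 1 - len(reduced) / len(original)
    token_reduction = 1 - len(reduced.split()) / len(original.split())
    return {
        "character_reduction": round(char_reduction, 3),
        "token_reduction": round(token_reduction, 3),
    }


original = "This is a very simple example of some extracted text content."
reduced = "simple example extracted text content."
stats = reduction_stats(original, reduced)
print(stats)
```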
Performance Benchmarks¶
Based on comprehensive testing across different text types:
Light Mode Performance¶
- Character Reduction: 10.1% average (8.8% - 10.9% range)
- Token Reduction: 0% (preserves all words)
- Processing Time: 0.03ms average per document
- Use Case: Format cleanup without semantic changes
Moderate Mode Performance¶
- Character Reduction: 35.3% average (11.4% - 62.3% range)
- Token Reduction: 33.7% average (1.9% - 57.6% range)
- Processing Time: 0.23ms average per document
- Use Case: Significant size reduction with preserved meaning
Effectiveness by Content Type¶
- Stopword-heavy text: Up to 62% character reduction
- Technical documentation: 23-31% character reduction
- Formal documents: 29% character reduction
- Scientific abstracts: 30% character reduction
- Minimal stopwords: 11% character reduction (mostly formatting)
Use Cases¶
Large Language Model Integration¶
Reduce token costs and fit more content within model limits:
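A sketch of fitting reduced content into a model's context budget; the 4-characters-per-token heuristic, the budget number, and the input file are rough assumptions:

```python
from kreuzberg import ExtractionConfig, TokenReductionConfig, extract_file_sync

MAX_TOKENS = 8000  # assumed model context budget

config = ExtractionConfig(token_reduction=TokenReductionConfig(mode="moderate"))
result = extract_file_sync("contract.pdf", config=config)  # hypothetical file

approx_tokens = len(result.content) // 4  # crude chars-per-token estimate
if approx_tokens > MAX_TOKENS:
    ...  # chunk or summarize before sending to the LLM
```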
Content Storage Optimization¶
Reduce storage costs for large document collections:
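For example, light mode only normalizes formatting, so archived text stays semantically identical to the extraction. The directory layout below is a hypothetical sketch:

```python
from pathlib import Path

from kreuzberg import ExtractionConfig, TokenReductionConfig, extract_file_sync

config = ExtractionConfig(token_reduction=TokenReductionConfig(mode="light"))

# Shrink each document before writing it to the archive.
for path in Path("documents").glob("*.pdf"):
    result = extract_file_sync(path, config=config)
    Path("archive", path.stem + ".txt").write_text(result.content)
```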
Search Index Optimization¶
Create more efficient search indices:
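Markdown structure is noise in a search index, so a sketch like this disables preservation and strips stopwords aggressively; the input file and the tokenization step are illustrative assumptions:

```python
from kreuzberg import ExtractionConfig, TokenReductionConfig, extract_file_sync

config = ExtractionConfig(
    token_reduction=TokenReductionConfig(
        mode="moderate",
        preserve_markdown=False,  # structure is irrelevant to the index
    ),
)
result = extract_file_sync("manual.pdf", config=config)  # hypothetical file
index_terms = result.content.lower().split()  # feed to your indexer
```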
Best Practices¶
- Choose the right mode: Use `"light"` for format cleanup only, `"moderate"` for significant reduction
- Preserve markdown for structured documents: Keep `preserve_markdown=True` when document structure matters
- Set language hints: Specify `language_hint` for better stopword detection in non-English documents
- Test with your content: Effectiveness varies by document type - benchmark with your specific use case
- Consider downstream processing: Balance reduction benefits against potential information loss
- Use custom stopwords judiciously: Add domain-specific terms but avoid over-filtering
Error Handling¶
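Inputs over the size limit or malformed language codes are rejected up front. The sketch below assumes `ValidationError` is the exception Kreuzberg raises for both limits described earlier; the input file is hypothetical:

```python
from kreuzberg import (
    ExtractionConfig,
    TokenReductionConfig,
    ValidationError,
    extract_file_sync,
)

config = ExtractionConfig(
    token_reduction=TokenReductionConfig(mode="moderate", language_hint="en-US"),
)

try:
    result = extract_file_sync("huge_dump.txt", config=config)
except ValidationError as exc:
    # Raised for oversized text (>10MB) or invalid language codes.
    print(f"Token reduction rejected the input: {exc}")
```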
Technical Details¶
The token reduction system uses:
- Lazy loading: Stopwords are loaded only when needed for specific languages
- Pre-compiled regex patterns for optimal performance
- LRU caching for frequently used languages (up to 16 cached)
- Individual language files for efficient memory usage
- Intelligent markdown parsing to preserve document structure
- Security validation including text size limits and language code validation
- Efficient stopword management with support for custom additions
The reduction process is highly optimized and adds minimal overhead to the extraction pipeline.