Powered by Kreuzberg • Comprehensive document intelligence framework benchmarking
If you find these benchmarks helpful, please consider sponsoring the development:
Your support helps maintain and improve these benchmarks for the community! 🚀
This repository provides comprehensive, automated benchmarks for Python document intelligence frameworks. We test popular multi-format document processing libraries against a diverse dataset of 94 real-world documents, measuring:
- ⚡ Performance: Document processing speed, memory usage, CPU utilization
- ✅ Reliability: Success rates, error handling, timeout behavior
- 📊 Quality: Text extraction accuracy and completeness (optional)
- 🔧 Practicality: Installation size, dependency count, format support
Our GitHub Actions workflow automatically:
- Runs benchmarks every Monday at 6 AM UTC (or on-demand)
- Tests each framework in isolated environments to prevent interference
- Generates comprehensive reports with charts, tables, and analysis
- Deploys results to GitHub Pages for easy viewing
- Stores all raw data in the repository for transparency and reproducibility
- Raw benchmark results: Available in the `results/` directory as JSON/CSV
- Test documents: 94 files in `test_documents/` (~210MB total)
- Visualizations: Charts and graphs in `results/charts/`
- Historical data: Track performance trends over time via git history
- Reproducible: Run the same benchmarks locally with our CLI
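
For example, here is a minimal sketch of loading the raw results for your own analysis. The path comes from the repository layout above; the JSON schema varies by benchmark version, so inspect it before relying on specific fields:

```python
import json
from pathlib import Path

# Load the aggregated benchmark results checked into the repository.
data = json.loads(Path("results/aggregated_results.json").read_text())

# The schema depends on the benchmark version, so inspect it first.
if isinstance(data, dict):
    print(sorted(data.keys()))
else:
    print(f"{len(data)} records")
```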
```bash
# Clone the repository
git clone https://github.com/Goldziher/python-text-extraction-libs-benchmarks.git
cd python-text-extraction-libs-benchmarks

# Install base dependencies only
uv sync

# Install specific frameworks (recommended due to conflicts)
uv sync --extra kreuzberg      # Kreuzberg framework
uv sync --extra kreuzberg-ocr  # Kreuzberg with OCR backends
uv sync --extra extractous     # Extractous framework
uv sync --extra unstructured   # Unstructured framework
uv sync --extra markitdown     # MarkItDown framework
uv sync --extra docling        # Docling framework (may conflict with kreuzberg)

# Install all compatible frameworks (excludes docling due to conflicts)
uv sync --extra all
```
```bash
# Run benchmarks for installed frameworks
uv run python -m src.cli benchmark

# Test specific frameworks
uv run python -m src.cli benchmark --framework kreuzberg_sync,extractous --category small

# Generate reports from results
uv run python -m src.cli report --output-format html
uv run python -m src.cli visualize
```
Due to dependency conflicts between frameworks:
- kreuzberg (3.10.1+) and docling cannot be installed together (lxml version conflict)
- Install frameworks individually, or use `--extra all` for all compatible frameworks
- The benchmarking tool will gracefully skip frameworks that aren't installed (see the sketch below)
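
A minimal sketch of how that skipping can work, assuming one importable top-level module per framework. The module mapping here is illustrative, not the harness's actual implementation:

```python
import importlib.util

# Map benchmark framework names to the top-level module each one imports.
# These module names are assumptions; adjust to the actual packages.
FRAMEWORK_MODULES = {
    "kreuzberg_sync": "kreuzberg",
    "extractous": "extractous",
    "unstructured": "unstructured",
    "markitdown": "markitdown",
    "docling": "docling",
}

def installed_frameworks() -> list[str]:
    """Return only the frameworks whose modules can actually be imported."""
    return [
        name
        for name, module in FRAMEWORK_MODULES.items()
        if importlib.util.find_spec(module) is not None
    ]

print(installed_frameworks())
```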
Our comprehensive CLI provides full control over benchmarking:
```bash
# List available commands
uv run python -m src.cli --help

# Benchmarking commands
uv run python -m src.cli benchmark        # Run benchmarks
uv run python -m src.cli list-frameworks  # Show available frameworks
uv run python -m src.cli list-categories  # Show document categories
uv run python -m src.cli list-file-types  # Show supported file types

# Analysis and reporting
uv run python -m src.cli report           # Generate reports
uv run python -m src.cli visualize        # Create charts
uv run python -m src.cli aggregate        # Combine results
uv run python -m src.cli quality-assess   # Add quality metrics

# Advanced options
uv run python -m src.cli benchmark \
  --framework kreuzberg_sync,extractous \
  --category tiny,small,medium \
  --iterations 5 \
  --timeout 600 \
  --enable-profiling \
  --enable-quality-assessment
```
- ⚡ Performance Rankings: Speed comparison across all frameworks and file types
- 💾 Resource Usage: Memory consumption and CPU utilization analysis
- ✅ Success Rates: Reliability metrics and failure analysis
- 📊 Interactive Dashboards: Explore data by framework, file type, and size
- 🔍 Detailed Breakdowns: Per-file extraction times and error logs
- 📈 Trend Analysis: Performance over multiple iterations
- 📋 Raw Data: All benchmark data available for download and analysis
We benchmark the following multi-format document intelligence frameworks:
- Kreuzberg (v3.8.0+)
  - Both synchronous and asynchronous APIs
  - Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR)
  - Lightweight installation (71MB)
- Extractous (v0.1.0+)
  - Rust-based with Python bindings
  - Native performance characteristics
  - Supports 1000+ formats via Apache Tika
- Unstructured (v0.18.5+)
  - Enterprise-focused solution
  - Supports 64+ file types including emails
  - Moderate installation size (146MB)
- MarkItDown (v0.0.1a2+)
  - Microsoft's Markdown converter
  - Includes ONNX Runtime for ML inference
  - Optimized for LLM preprocessing
- Docling (v2.41.0+)
  - IBM Research's document understanding framework
  - Advanced ML models included
  - Largest installation (1GB+)
Each framework is tested with identical documents and conditions for fair comparison.
Our test suite includes 94 real-world documents (~210MB total) across diverse formats:
- 📄 Office: DOCX, PPTX, XLSX, XLS, ODT (35 files)
- 📑 PDF: Academic papers, reports, scanned documents (24 files)
- 🌐 Web: HTML pages with various complexities (15 files)
- 🖼️ Images: PNG, JPG, JPEG, BMP for OCR testing (11 files)
- 📧 Email: EML and MSG with attachments (6 files)
- 📝 Text/Markup: MD, RST, ORG, TXT (12 files)
- 📊 Data: CSV, JSON, YAML (4 files)
- Tiny: < 100KB (15 files) - Quick extraction tests
- Small: 100KB - 1MB (45 files) - Typical documents
- Medium: 1MB - 10MB (12 files) - Complex documents
- Large: 10MB - 50MB (20 files) - Stress tests
- Huge: > 50MB (2 files) - Performance limits
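
These thresholds translate directly into a bucketing helper; here is a sketch (the function name and exact boundary handling are illustrative):

```python
from pathlib import Path

def size_category(path: Path) -> str:
    """Bucket a file into the size tiers listed above (illustrative helper)."""
    size = path.stat().st_size
    if size < 100 * 1024:
        return "tiny"    # < 100KB
    if size < 1024 * 1024:
        return "small"   # 100KB - 1MB
    if size < 10 * 1024 * 1024:
        return "medium"  # 1MB - 10MB
    if size < 50 * 1024 * 1024:
        return "large"   # 10MB - 50MB
    return "huge"        # > 50MB
```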
Documents in English, Hebrew, German, Chinese, Japanese, and Korean to test language-specific extraction capabilities.
```bash
# Test only PDF files
uv run python -m src.cli benchmark --file-types pdf

# Test multiple file types
uv run python -m src.cli benchmark --file-types pdf --file-types docx --file-types html

# Test by format tier (universal formats supported by all frameworks)
uv run python -m src.cli benchmark --format-tier universal
```
```bash
# Test Kreuzberg with the Tesseract OCR backend
# (framework identifier below is an assumption; run list-frameworks to confirm)
uv run python -m src.cli benchmark \
  --framework kreuzberg_tesseract
```
```bash
# Test async vs sync implementations
uv run python -m src.cli benchmark \
  --framework kreuzberg_sync,kreuzberg_async \
  --enable-profiling
```
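
To see what that comparison measures, here is a minimal sketch of timing both Kreuzberg APIs directly. It assumes Kreuzberg exposes `extract_file_sync` and an async `extract_file` (check the library's docs for the exact names), and the sample path is hypothetical:

```python
import asyncio
import time

# Assumed public API; verify against Kreuzberg's documentation.
from kreuzberg import extract_file, extract_file_sync

PATH = "test_documents/sample.pdf"  # hypothetical sample document

start = time.perf_counter()
sync_result = extract_file_sync(PATH)
print(f"sync:  {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
async_result = asyncio.run(extract_file(PATH))
print(f"async: {time.perf_counter() - start:.2f}s")
```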
```bash
# Run with quality assessment (slower but provides accuracy metrics)
uv run python -m src.cli benchmark --enable-quality-assessment

# Generate quality-enhanced reports
uv run python -m src.cli quality-assess --results-file results/results.json

# Create custom visualizations
uv run python -m src.cli visualize \
  --results-file results/aggregated_results.json \
  --output-dir custom_charts/
```
- Isolated Environments: Each framework runs in a separate CI job to prevent interference
- Cold Start: No warmup runs - we measure real-world first-use performance
- Resource Monitoring: Track memory (RSS) and CPU usage at 50ms intervals
- Timeout Protection: 300s per file, 2 hours per framework job (see the sketch after this list)
- Multiple Iterations: Default 3 runs per file to ensure consistency
- Error Tracking: Capture and categorize all failures and timeouts
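
A simple way to get the per-file timeout behavior referenced above is to run each extraction in a child process and kill it on expiry. This is a sketch, not the harness's actual mechanism, and the worker script path is hypothetical:

```python
import subprocess

def run_with_timeout(worker: str, document: str, timeout: int = 300) -> bool:
    """Run one extraction in a child process; return False if it timed out."""
    try:
        subprocess.run(
            ["python", worker, document],
            check=True,
            timeout=timeout,  # mirrors the 300s-per-file limit above
        )
        return True
    except subprocess.TimeoutExpired:
        return False  # subprocess.run kills the child on timeout
```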
- Extraction Time: Wall-clock time from start to completion
- Memory Usage: Peak RSS (Resident Set Size) during extraction
- CPU Utilization: Average CPU percentage during processing
- Throughput: Files/second and MB/second processing rates
- Success Rate: Percentage of files extracted without errors
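
A minimal sketch of collecting these metrics with psutil at the 50ms interval mentioned above (assuming psutil; the harness's actual profiler may differ):

```python
import threading
import time

import psutil

def profile(fn, *args):
    """Run fn while sampling RSS and CPU every 50ms; return (result, stats)."""
    proc = psutil.Process()
    samples: list[tuple[int, float]] = []
    done = threading.Event()

    def sampler() -> None:
        proc.cpu_percent(None)  # prime the CPU counter
        while not done.is_set():
            samples.append((proc.memory_info().rss, proc.cpu_percent(None)))
            time.sleep(0.05)  # 50ms sampling interval

    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    done.set()
    thread.join()

    peak_rss = max(rss for rss, _ in samples) if samples else 0
    avg_cpu = sum(cpu for _, cpu in samples) / len(samples) if samples else 0.0
    return result, {"seconds": elapsed, "peak_rss": peak_rss, "avg_cpu": avg_cpu}
```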
When enabled with `--enable-quality-assessment`:
- Readability Scores: Flesch Reading Ease, Gunning Fog Index
- Text Coherence: Sentence structure and flow analysis
- Completeness: Estimated content coverage
- Noise Detection: Garbage text and encoding issues
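
Readability scores like these are available off the shelf, for example via the textstat package; here is a sketch (the benchmark's own quality assessor may use different tooling):

```python
import textstat

def readability(text: str) -> dict[str, float]:
    """Readability metrics comparable to the ones listed above."""
    return {
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
        "gunning_fog": textstat.gunning_fog(text),
    }

print(readability("Document intelligence frameworks extract text from files."))
```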
Our GitHub Actions workflow (`benchmark-by-framework.yml`):
- Runs automatically every Monday at 6 AM UTC
- Can be triggered manually via GitHub Actions UI
- Tests each framework in parallel with 2-hour timeouts
- Generates reports and deploys to GitHub Pages
- Stores all data in the repository for analysis
All benchmark data is freely available:
```text
# Raw results (JSON format)
results/results.json
results/summaries.json
results/aggregated_results.json

# CSV exports for analysis
results/detailed_results.csv
results/summary_results.csv

# Visualizations
results/charts/*.png
results/charts/interactive_dashboard.html

# Framework metadata
visualizations/analysis/metadata/
visualizations/analysis/tables/
```
```bash
# Run the exact same benchmarks locally
uv run python -m src.cli benchmark --framework all

# Or download our results
wget https://github.com/Goldziher/python-text-extraction-libs-benchmarks/raw/main/results/aggregated_results.json
```
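
From there, the CSV exports slice easily with pandas; here is a sketch (column names are assumptions, so inspect the header first):

```python
import pandas as pd

df = pd.read_csv("results/detailed_results.csv")
print(df.columns.tolist())  # inspect the real schema before relying on it

# Hypothetical columns: rank frameworks by mean extraction time.
# print(df.groupby("framework")["extraction_time"].mean().sort_values())
```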
Each framework supports different file formats. Our benchmarks test:
| Framework | Tested Formats | Notable Limitations |
|---|---|---|
| Kreuzberg | 17/18 formats | No MSG support |
| Extractous | Most formats | Some edge cases |
| Unstructured | 64+ formats | None (full support) |
| MarkItDown | Office & web | Limited formats |
| Docling | 10 formats | No email/data formats |
For fair comparison across frameworks:
```bash
# Tier 1: Universal formats (supported by all frameworks)
uv run python -m src.cli benchmark --format-tier universal

# Tier 2: Common formats (supported by most frameworks)
uv run python -m src.cli benchmark --format-tier common

# All formats (shows full capabilities)
uv run python -m src.cli benchmark --format-tier all
```
We welcome contributions! Areas of interest:
- New frameworks: Add support for emerging document intelligence libraries
- More test documents: Expand our dataset with edge cases
- Performance optimizations: Improve benchmarking efficiency
- Analysis tools: Enhanced visualization and reporting capabilities
- Multi-language tests: Expand language coverage
```bash
# Set up development environment
uv sync --all-extras
uv run pre-commit install

# Run tests
uv run pytest

# Submit PR with your improvements!
```
MIT License - see LICENSE for details.
- Powered by Kreuzberg - Fast Python document intelligence
- Test documents from various public sources
- Framework maintainers for their excellent libraries
- GitHub Actions for CI/CD infrastructure