File2MD is a FastAPI-based microservice that converts 123 file formats (Word, PDF, PowerPoint, Excel, CSV, images, audio, video, the Apple iWork suite, source files in 82 programming languages, etc.) into a unified, LLM-friendly Markdown code block format.
- API Key Authentication: Supports managing multiple API keys
- Comprehensive Format Support: Supports 123 file formats with 16 parser types
- Code File Support: Converts source files in 82 programming languages, covering mainstream, functional, scripting, and configuration languages
- Smart Image Recognition: Integrates a Vision API and PaddleOCR; SVG files are converted to PNG for recognition
- High-Performance Async: Built on the FastAPI async framework
- Queue Processing Mode: Supports batch document conversion and limits the maximum number of concurrent tasks (configurable via `.env`)
- Concurrent Image Processing: Multiple images in a document are processed simultaneously with PaddleOCR and AI vision recognition, improving processing speed by 2-10x. The limit can be configured via `MAX_IMAGES_PER_DOC` (`-1` for no limit).
- Containerized Deployment: Provides Docker and Docker Compose support
- Unified Output Format: All file types are output in a unified code block format
- Built-in WebUI: Out-of-the-box web interface for file upload, API Key management, document conversion, and OCR recognition.
- Language Switch: The interface supports both English and Chinese (中文).
- No extra deployment: Access the WebUI at http://localhost:8999/webui after the backend is started.
- Unified Output Format
- Supported File Formats
- Quick Start
- Audio and Video Processing Features
- API Usage Guide
- API Endpoints Overview
- Configuration
- Error Handling
- Architecture Design
- Performance Optimization
- Monitoring and Logging
- More Resources
- Extension Development
- License
- Contributing
- Latest Updates
All file conversion results use a unified code block format for easy LLM understanding and processing:
File Type | Output Format | Content Example |
---|---|---|
Slideshow Files | `` ```slideshow `` | PowerPoint/Keynote presentation content |
Image Files | `` ```image `` | OCR text recognition + AI visual description |
Plain Text Files | `` ```text `` | Original text content |
Document Files | `` ```document `` | Word/PDF/Pages structured content |
Spreadsheet Files | `` ```sheet `` | CSV/Excel/Numbers data tables |
Audio Files | `` ```audio `` | Speech transcription + timeline information |
Video Files | `` ```video `` | SRT subtitles + audio transcription |
Code Files | `` ```python ``, `` ```javascript ``, etc. | Syntax-highlighted code blocks |
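On the consuming side, the label on the opening fence is enough to tell the file types apart. The following is a minimal sketch, assuming each result is a fenced block whose closing fence sits on its own line; `ConvertedBlock` and `split_blocks` are illustrative names, not part of File2MD, and a body that itself contains a fence at the start of a line would need a more careful parser.

```python
import re
from typing import NamedTuple


class ConvertedBlock(NamedTuple):
    label: str  # fence label, e.g. "document", "image", "python"
    body: str   # text between the opening and closing fences


# One fenced block: opening fence with a label, a lazy body, and a closing fence.
_BLOCK_RE = re.compile(
    r"^```([\w+#.-]+)[ \t]*\n(.*?)\n```[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def split_blocks(content: str) -> list[ConvertedBlock]:
    """Return every labelled fenced block found in a `content` string."""
    return [ConvertedBlock(m.group(1), m.group(2)) for m in _BLOCK_RE.finditer(content)]
```

A caller could then branch on `label`, for example sending `sheet` bodies to a table-aware prompt and `image` bodies to a captioning prompt.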
Format | Extensions | Parser | Output Format | Description |
---|---|---|---|---|
Plain Text | `.txt`, `.md`, `.markdown`, `.text` | `PlainParser` | `text` | Direct text content reading |
Word Documents | `.docx` | `DocxParser` | `document` | Extract text, tables, and formatting; concurrent image processing |
Word Documents | `.doc` | `DocParser` | `document` | Convert via mammoth; concurrent image processing |
RTF Documents | `.rtf` | `RtfParser` | `document` | Support RTF format; prefer Pandoc, fall back to striprtf |
ODT Documents | `.odt` | `OdtParser` | `document` | OpenDocument text; supports tables and lists |
PDF Documents | `.pdf` | `PdfParser` | `document` | Extract text and images; concurrent image processing |
Keynote Presentations | `.key` | `KeynoteParser` | `slideshow` | Apple Keynote presentations; extract metadata and structure |
Pages Documents | `.pages` | `PagesParser` | `document` | Apple Pages word processing documents; extract metadata and structure |
Numbers Spreadsheets | `.numbers` | `NumbersParser` | `sheet` | Apple Numbers spreadsheets; supports table data extraction |
PowerPoint Presentations | `.ppt`, `.pptx` | `PptxParser` | `slideshow` | Extract slide text content (no vision model) |
Excel Spreadsheets | `.xls`, `.xlsx` | `ExcelParser` | `sheet` | Convert to HTML table format with statistics; concurrent image processing |
CSV Data | `.csv` | `CsvParser` | `sheet` | Convert to HTML table format with data analysis |
Image Files | `.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.tiff`, `.webp`, `.ico`, `.tga` | `ImageParser` | `image` | PaddleOCR and vision recognition |
SVG Files | `.svg` | `SvgParser` | `svg` | Recognize both code structure and visual features; convert to PNG for PaddleOCR and AI vision analysis |
Audio Files | `.wav`, `.mp3`, `.mp4`, `.m4a`, `.flac`, `.ogg`, `.wma`, `.aac` | `AudioParser` | `audio` | Smart speech analysis and ASR conversion with intelligent segmentation |
Video Files | `.mp4`, `.avi`, `.mov`, `.wmv`, `.mkv`, `.webm`, `.3gp` | `AudioParser` | `video` | Video audio extraction and subtitle generation with SRT format output |
Language Category | Supported Extensions | Output Format |
---|---|---|
Mainstream Languages | `.py`, `.js`, `.ts`, `.java`, `.cpp`, `.c`, `.cs`, `.go`, `.rs`, `.php`, `.rb` | Corresponding language code blocks |
Frontend Technologies | `.html`, `.css`, `.scss`, `.sass`, `.less`, `.vue`, `.jsx`, `.tsx`, `.svelte` | Corresponding language code blocks |
Script Languages | `.r`, `.lua`, `.perl`, `.pl`, `.sh`, `.bash`, `.zsh`, `.fish`, `.ps1` | Corresponding language code blocks |
Configuration Files | `.json`, `.yaml`, `.yml`, `.toml`, `.xml`, `.ini`, `.cfg`, `.conf` | Corresponding language code blocks |
Database and Others | `.sql`, `.dockerfile`, `.makefile`, `.cmake`, `.gradle`, `.proto`, `.graphql` | Corresponding language code blocks |
Functional Languages | `.hs`, `.lhs`, `.clj`, `.cljs`, `.elm`, `.erl`, `.ex`, `.exs`, `.fs`, `.fsx` | Corresponding language code blocks |
System and Tools | `.vim`, `.vimrc`, `.env`, `.gitignore`, `.gitattributes`, `.editorconfig` | Corresponding language code blocks |
Supported languages include Python, JavaScript, TypeScript, Java, C/C++, C#, Go, Rust, PHP, Ruby, R, HTML, CSS, SCSS, Sass, Less, Vue, React (JSX), Svelte, JSON, YAML, XML, SQL, Shell scripts, PowerShell, Dockerfile, Makefile, Haskell, Clojure, Elm, Erlang, Elixir, F#, Swift, Kotlin, Dart, Julia, MATLAB, LaTeX, Vim, and others, for 82 languages in total.
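As an illustration of "corresponding language code blocks", the sketch below maps a file extension to a fence label and wraps the source text. It assumes a plain extension lookup; `FENCE_LABELS` is a hypothetical seven-entry subset and `wrap_source` is an illustrative name, neither taken from the File2MD code.

```python
from pathlib import Path

# Hypothetical subset of an extension-to-label table (the service covers 82 languages).
FENCE_LABELS = {
    ".py": "python",
    ".js": "javascript",
    ".ts": "typescript",
    ".rs": "rust",
    ".yml": "yaml",
    ".yaml": "yaml",
    ".sh": "bash",
}


def wrap_source(filename: str, source: str, default: str = "text") -> str:
    """Wrap source text in a fenced block labelled by the file extension."""
    label = FENCE_LABELS.get(Path(filename).suffix.lower(), default)
    fence = "`" * 3
    return f"{fence}{label}\n{source.rstrip()}\n{fence}"
```

Dotfiles such as `.env` have an empty `suffix` in `pathlib`, so a real mapping would also need to match on the full file name.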
We provide four deployment options to choose from:
Download the latest Docker image `medicnex-file2md.tar` from the GitHub Releases, configure `.env` in the same directory, then run the following script:
#!/bin/bash
# Check if image exists (inspect matches the exact name:tag reference)
if ! docker image inspect medicnex-file2md:latest >/dev/null 2>&1; then
    echo "Importing image..."
    docker load -i medicnex-file2md.tar
fi
# Stop and remove old container (if exists)
docker stop medicnex-file2md 2>/dev/null || true
docker rm medicnex-file2md 2>/dev/null || true
# Start new container (path quoted so directories with spaces work)
docker run -d --name medicnex-file2md -p 8999:8999 \
    -v "$(pwd)/.env:/app/.env" \
    medicnex-file2md:latest
echo "Service started, visit http://localhost:8999/docs"
or
chmod +x docker_image_deploy.sh
./docker_image_deploy.sh
This will deploy and start Docker with one click.
# Check service health
curl http://localhost:8999/v1/health
# Follow container logs
docker logs -f medicnex-file2md
# Stop the container
docker stop medicnex-file2md
# Restart the container
docker restart medicnex-file2md
A simple deployment method with one-click automated deployment:
- Clone the project:
git clone https://github.com/MedicNex/file2md.git
cd file2md
- One-click deployment:
# Automated deployment (recommended)
./docker-deploy.sh
This script will automatically:
- Check Docker environment
- Generate secure API keys and Redis password
- Build Docker images
- Start all services (API + Redis + optional Nginx)
- Perform health checks
- Access services:
- API URL: http://localhost:8999
- API Documentation: http://localhost:8999/docs
- Health Check: http://localhost:8999/v1/health
- API Key: The deployment script will display the generated key
- Manage services:
# Check service status
./docker-deploy.sh status
# View real-time logs
./docker-deploy.sh logs
# Restart services
./docker-deploy.sh restart
# Stop services
./docker-deploy.sh stop
If you need custom configuration:
- Configure environment variables:
# Copy environment variable template
cp .env.example .env
# Edit .env file, set your configurations
API_KEY=your-api-key-1,your-api-key-2
VISION_API_KEY=your-vision-api-key # Optional, for image recognition
REDIS_PASSWORD=your-redis-password
- Start services:
# Basic deployment
docker-compose up -d
# Include Nginx reverse proxy
docker-compose --profile with-nginx up -d
Suitable for direct deployment on Linux servers:
- Configure environment variables:
cp .env.example .env
# Edit .env file
- Execute deployment:
# Ubuntu 24.04 server deployment
sudo ./deploy.sh
- View logs:
./monitor_logs.sh
- Install dependencies:
pip install -r requirements.txt
- Install system dependencies:
Ubuntu/Debian:
# PaddleOCR will automatically download required models on first use
# No additional OCR system dependencies needed, PaddleOCR is pure Python
# SVG vision recognition support (ImageMagick recommended)
sudo apt-get install -y imagemagick libmagickwand-dev pkg-config
# Audio processing support
sudo apt-get install -y ffmpeg libavcodec-extra
# Python development tools
sudo apt-get install -y python3-dev python3-pip build-essential
macOS:
# SVG vision recognition support (choose one)
brew install freetype imagemagick # ImageMagick support
# or
brew install cairo pkg-config # Cairo support
# Audio processing support
brew install ffmpeg # Audio format conversion and processing
- Set environment variables:
export API_KEY="dev-test-key-123"
export VISION_API_KEY="your-vision-api-key" # optional
- Start the service:
python -m uvicorn app.main:app --host 0.0.0.0 --port 8999 --reload
If you're experiencing slow Docker deployment on macOS, you can use these steps for direct local deployment:
- Create virtual environment:
python -m venv venv
source venv/bin/activate
- Install dependencies:
# First install base tools
pip install --upgrade pip setuptools wheel
# Install core dependencies individually (to avoid version conflicts)
pip install fastapi uvicorn pydantic python-multipart starlette
pip install loguru python-dotenv
# Install specific versions of PaddleOCR and PaddlePaddle (to resolve compatibility issues)
pip install paddlepaddle==2.5.2
pip install paddleocr==2.7.0
# Then install other dependencies
pip install -r requirements.txt --no-deps
- Install system dependencies:
# SVG vision recognition support (choose one)
brew install freetype imagemagick # ImageMagick support
# or
brew install cairo pkg-config # Cairo support
# Audio processing support
brew install ffmpeg # Audio format conversion and processing
# Note: PaddleOCR will automatically download required models on first use
# On macOS, PaddleOCR is a pure Python implementation with no additional system dependencies
# However, it will download approximately 1GB of model files on first run, ensure good network connection
- Configure environment variables:
Create a `.env` file in the project root directory with the necessary configuration:
DEBUG=true
PORT=8999
MAX_CONCURRENT=5
LOG_LEVEL=INFO
REDIS_CACHE_ENABLED=false # Set to false if Redis cache is not needed
API_KEY=your_api_key_here # If API key authentication is required
# If you need vision API functionality, add the following configuration
# VISION_API_KEY=your_vision_api_key
- Start the service:
python -m app.main
Or start directly with uvicorn:
uvicorn app.main:app --host 0.0.0.0 --port 8999 --reload
On first startup, PaddleOCR will automatically download and cache required model files (approximately 1GB), which may take some time depending on your network speed. Subsequent starts will be faster once the models are cached.
Note: If you encounter an `Unknown argument: use_gpu` error during startup, this is due to PaddleOCR version compatibility issues. Use these specific versions to resolve it:
pip uninstall -y paddleocr paddlepaddle
pip install paddlepaddle==2.5.2
pip install paddleocr==2.7.0
- Optional: Redis cache: If you need Redis cache functionality, install Redis using Homebrew:
brew install redis
brew services start redis
Then enable Redis in your `.env` file:
REDIS_CACHE_ENABLED=true
REDIS_HOST=localhost
REDIS_PORT=6379
Supported audio formats: `.wav`, `.mp3`, `.mp4`, `.m4a`, `.flac`, `.ogg`, `.wma`, `.aac` (8 formats)
Core Features:
- Smart Audio Preprocessing: Automatic conversion to 16kHz mono, with an 80Hz high-pass filter to remove low-frequency noise
- RMS Energy Analysis: Calculates the RMS of the audio signal for precise voice activity detection
- Adaptive Threshold Detection: Dynamic threshold based on the 10th percentile + 3dB, which adapts to different recording environments
- Smart Segmentation: 300ms minimum silence duration, with automatic merging of short segments
- Concurrent ASR Conversion: Multiple audio segments are sent to ASR simultaneously, significantly improving processing speed
- Quality Assessment: Confidence scores calculated from average energy
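The detection steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the service's implementation: it uses non-overlapping 20 ms frames (the frame size is an assumption), a nearest-rank percentile, and no high-pass filter or short-segment merging. All function names are illustrative.

```python
import math


def frame_rms_db(samples, frame_len):
    """RMS level in dB for each non-overlapping frame of frame_len samples."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        levels.append(20 * math.log10(max(rms, 1e-10)))  # floor avoids log10(0)
    return levels


def percentile(values, pct):
    """Nearest-rank percentile; adequate for a sketch."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def speech_segments(samples, sample_rate=16000, frame_ms=20, min_silence_ms=300):
    """Return (start_s, end_s) spans split wherever silence lasts min_silence_ms or more."""
    frame_len = sample_rate * frame_ms // 1000
    levels = frame_rms_db(samples, frame_len)
    if not levels:
        return []
    threshold = percentile(levels, 10) + 3.0  # 10th percentile + 3 dB
    min_silent_frames = math.ceil(min_silence_ms / frame_ms)

    segments, start, silent_run = [], None, 0
    for i, level in enumerate(levels):
        if level > threshold:
            if start is None:
                start = i  # first voiced frame of a new segment
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silent_frames:
                # Close the segment at the first frame of the silent run.
                segments.append((start, i - silent_run + 1))
                start, silent_run = None, 0
    if start is not None:
        segments.append((start, len(levels) - silent_run))
    frame_s = frame_ms / 1000
    return [(s * frame_s, e * frame_s) for s, e in segments]
```

Each returned span could then be cut from the audio and submitted to the ASR service independently, which is what makes the concurrent conversion step possible.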
Supported video formats: `.mp4`, `.avi`, `.mov`, `.wmv`, `.mkv`, `.webm`, `.3gp` (7 formats)
Core Features:
- Automatic Audio Extraction: Detects and extracts audio tracks from video files
- SRT Subtitle Generation: Generates subtitles with standard timestamps (HH:MM:SS,mmm)
- Unified Processing Pipeline: Reuses the audio analysis algorithms for consistent processing quality
- Timeline Synchronization: Precise timestamp correspondence keeps subtitles in sync with the video
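For reference, the `HH:MM:SS,mmm` timestamp and a single subtitle cue can be produced as follows. This is a small sketch of the SRT layout only; `srt_timestamp` and `srt_cue` are illustrative names, and how the service rounds or numbers its cues is not specified here.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """Build one SRT cue: sequence number, time range, then the subtitle text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
```

Joining cues with a blank line between them yields a complete `.srt` body.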
Environment Variable Configuration:
# ASR service configuration
ASR_MODEL=whisper-1 # ASR model name
ASR_API_BASE=https://api.openai.com/v1 # ASR API base URL
ASR_API_KEY=your-openai-api-key # ASR API key
# Audio processing parameters
MAX_FILE_SIZE=100 # Maximum file size (MB)
AUDIO_CONCURRENT_LIMIT=5 # Concurrent ASR requests
System Dependencies:
# Audio processing libraries (required)
pip install pydub numpy librosa
# Audio format support (optional, for more formats)
# Ubuntu/Debian
sudo apt-get install ffmpeg
# macOS
brew install ffmpeg
- Concurrent Processing: Multiple audio segments processed simultaneously for ASR, 3-5x speed improvement
- Smart Segmentation: Avoid cutting in the middle of words, improving recognition accuracy
- Adaptive Threshold: Dynamically adjust detection parameters based on audio characteristics
- Memory Optimization: Stream processing for large files, avoiding memory overflow
- Error Recovery: Automatic fallback to time-based segmentation when ASR fails
curl -X POST "https://your-domain/v1/convert" \
-H "Authorization: Bearer your-api-key" \
-F "file=@example.py"
Response example (Python file):
{
"filename": "example.py",
"size": 1024,
"content_type": "text/x-python",
"content": "```python\ndef hello_world():\n print('Hello, World!')\n```",
"duration_ms": 150
}
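The same request can be issued from Python with only the standard library. The sketch below assumes the endpoint accepts a standard `multipart/form-data` upload under the field name `file`, as the `curl -F` call implies; `build_convert_request` and `convert_file` are illustrative helpers, and the part's content type is set generically.

```python
import json
import urllib.request
import uuid


def build_convert_request(base_url: str, api_key: str, filename: str, data: bytes) -> urllib.request.Request:
    """Build (but do not send) a multipart POST for /v1/convert."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        data,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/convert",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def convert_file(base_url: str, api_key: str, path: str) -> dict:
    """Upload the file at `path` and return the decoded JSON response."""
    with open(path, "rb") as f:
        request = build_convert_request(base_url, api_key, path.rsplit("/", 1)[-1], f.read())
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a running service, `convert_file("http://localhost:8999", "your-api-key", "example.py")["content"]` would hold the fenced Markdown shown in the response example.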
Use queue mode to submit multiple files in one batch. The number of tasks processed concurrently is controlled by `MAX_CONCURRENT` in `.env`:
curl -X POST "https://your-domain/v1/convert-batch" \
-H "Authorization: Bearer your-api-key" \
-F "files=@document1.docx" \
-F "files=@image1.png" \
-F "files=@script.py"
Response example:
{
"submitted_tasks": [
{
"task_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"message": "Task submitted to conversion queue",
"filename": "document1.docx",
"status": "pending"
}
],
"total_count": 3,
"success_count": 3,
"failed_count": 0
}
curl -X GET "https://your-domain/v1/task/{task_id}" \
-H "Authorization: Bearer your-api-key"
curl -X GET "https://your-domain/v1/queue/info" \
-H "Authorization: Bearer your-api-key"
Response example:
{
"max_concurrent": 5,
"queue_size": 2,
"active_tasks": 3,
"total_tasks": 10,
"pending_count": 2,
"processing_count": 3,
"completed_count": 4,
"failed_count": 1
}
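Because batch submission is asynchronous, a client typically polls `/v1/task/{task_id}` until the task finishes. The loop below is a sketch under assumptions: the status values `completed` and `failed` are inferred from the queue counters above rather than from a documented task schema, and `fetch_status` stands in for whatever function performs the GET request.

```python
import time
from typing import Callable

# Assumed terminal states, inferred from the completed_count / failed_count counters.
TERMINAL_STATES = {"completed", "failed"}


def wait_for_task(
    fetch_status: Callable[[], dict],
    interval_s: float = 1.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status() until the task reaches a terminal state or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        task = fetch_status()
        if task.get("status") in TERMINAL_STATES:
            return task
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task still {task.get('status')!r} after {timeout_s}s")
        sleep(interval_s)
```

Passing `sleep` as a parameter keeps the loop easy to test and makes it simple to swap in a backoff strategy.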
Response example (Image file):
{
"filename": "chart.png",
"size": 204800,
"content_type": "image/png",
"content": "```image\n# OCR:\nChart Title: Sales Data Analysis\n\n# Visual_Features:\nThis is a bar chart showing monthly sales trends...\n```",
"duration_ms": 2500
}
Response example (SVG file):
{
"filename": "icon.svg",
"size": 1024,
"content_type": "image/svg+xml",
"content": "```svg\n# Code\n<code class=\"language-svg\">\n<svg xmlns=\"http://www.w3.org/2000/svg\" viewBox=\"0 0 24 24\">\n <path d=\"M12 2l3.09 6.26L22 9.27l-5 4.87 1.18 6.88L12 17.77l-6.18 3.25L7 14.14 2 9.27l6.91-1.01L12 2z\"/>\n</svg>\n</code>\n\n# Visual_Features: This is a five-pointed star icon with clean line design, perfectly symmetrical star shape, suitable for rating or favorite functionality.\n```",
"duration_ms": 3200
}
curl -X POST "https://your-domain/v1/ocr" \
-H "Authorization: Bearer your-api-key" \
-F "file=@document.png"
Description:
Recognize text in uploaded images using OCR only (PaddleOCR), without calling Vision API.
Supported formats: JPG, JPEG, PNG, BMP, TIFF, TIF, GIF, WEBP
Request parameters:
Name | Type | Required | Description |
---|---|---|---|
file | File | Yes | Image file to recognize |
Response example:
{
"filename": "document.png",
"size": 204800,
"content_type": "image/png",
"ocr_text": "This is the recognized text from the image\nSupports multiple lines\nSupports English and Chinese",
"duration_ms": 1200,
"from_cache": false
}
Response fields:
- `filename`: Original file name
- `size`: File size (bytes)
- `content_type`: MIME type
- `ocr_text`: Recognized text content
- `duration_ms`: Processing time (ms)
- `from_cache`: Whether the result is from cache
Error example:
{
"code": "UNSUPPORTED_TYPE",
"message": "Unsupported file type: .pdf, only image formats are supported: .jpg, .jpeg, .png, .bmp, .tiff, .tif, .gif, .webp"
}
Endpoint | Method | Description |
---|---|---|
`/v1/convert` | POST | Single file synchronous conversion |
`/v1/ocr` | POST | Image OCR recognition (OCR only) |
`/v1/convert-batch` | POST | Batch file asynchronous submission |
`/v1/task/{task_id}` | GET | Query task status |
`/v1/queue/info` | GET | Query queue status |
`/v1/queue/cleanup` | POST | Clean up expired tasks |
`/v1/supported-types` | GET | Get supported file types |
`/v1/health` | GET | Health check (with queue status) |
curl -X GET "https://your-domain/v1/health"
Variable | Description | Default | Required |
---|---|---|---|
`API_KEY` | API key list (comma-separated) | `dev-test-key-123` | Yes |
`VISION_API_KEY` | Vision API key | - | No |
`VISION_API_BASE` | Vision API base URL | `https://api.openai.com/v1` | No |
`VISION_MODEL` | Vision recognition model name | `gpt-4o-mini` | No |
`ASR_API_KEY` | ASR speech recognition API key | - | Required for audio features |
`ASR_API_BASE` | ASR API base URL | `https://api.openai.com/v1` | No |
`ASR_MODEL` | ASR model name | `whisper-1` | No |
`PORT` | Service port | `8999` | No |
`LOG_LEVEL` | Log level | `INFO` | No |
`MAX_IMAGES_PER_DOC` | Max images processed per document (-1 = unlimited) | `5` | No |
- Supports multiple API keys, separated by commas
- Use the `Bearer <API_KEY>` format in the `Authorization` header
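To make the two rules concrete, here is one way a comma-separated `API_KEY` value and a `Bearer` header could be checked. It is a sketch of the idea, not the code in `app/auth.py`; `load_api_keys` and `is_authorized` are hypothetical names.

```python
import hmac
from typing import Iterable, Optional


def load_api_keys(raw: str) -> set[str]:
    """Split a comma-separated API_KEY value into a set of non-empty keys."""
    return {key.strip() for key in raw.split(",") if key.strip()}


def is_authorized(authorization_header: Optional[str], valid_keys: Iterable[str]) -> bool:
    """Return True when the header is 'Bearer <key>' and <key> is one of valid_keys."""
    scheme, _, token = (authorization_header or "").partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest compares in constant time, so timing does not reveal key contents.
    return any(hmac.compare_digest(token.encode(), key.encode()) for key in valid_keys)
```

A request failing this check corresponds to the `401` / `INVALID_API_KEY` case in the error table.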
HTTP Status Code | Error Code | Description |
---|---|---|
401 | `INVALID_API_KEY` | Invalid or missing API Key |
415 | `UNSUPPORTED_TYPE` | Unsupported file type |
422 | `PARSE_ERROR` | File parsing failed |
422 | `INVALID_FILE` | Invalid file |
- Asynchronous processing for file uploads and parsing
- Concurrent Image Processing: Multiple images in documents are processed simultaneously with OCR and AI vision recognition
  - Supported file types: PDF, DOC, DOCX, Excel
  - Performance improvement: 2-10x processing speed (depending on image count and network conditions)
  - Technical implementation: Uses `asyncio.gather()` for concurrent PaddleOCR and vision model calls
- Automatic temporary file cleanup
- Memory-optimized streaming processing
- Support for large file processing
- Smart encoding detection
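The `asyncio.gather()` pattern mentioned for concurrent image processing can be sketched as follows. The `ocr` and `vision` arguments are placeholder coroutines supplied by the caller, and `max_images` mirrors the documented `MAX_IMAGES_PER_DOC` semantics; a blocking OCR call would additionally need to run in a worker thread (for example via `asyncio.to_thread`), which this sketch leaves out.

```python
import asyncio


async def describe_image(image, ocr, vision):
    """Run OCR and the vision model for one image at the same time."""
    ocr_text, description = await asyncio.gather(ocr(image), vision(image))
    return {"ocr": ocr_text, "visual_features": description}


async def process_images(images, ocr, vision, max_images=5):
    """Process up to max_images images concurrently; -1 means no limit."""
    selected = images if max_images < 0 else images[:max_images]
    # gather preserves input order, so results line up with `selected`.
    return await asyncio.gather(*(describe_image(img, ocr, vision) for img in selected))
```

Since the per-image work is dominated by waiting on OCR and network calls, overlapping it this way is where a speedup of the kind quoted above would come from.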
- API Key authentication mechanism
- File type whitelist validation
- Secure temporary file cleanup
- Run as non-root user
- Structured JSON logging
- Health check endpoints
- Processing time statistics
- Error tracking and reporting
- Redis Cache Configuration - Redis cache optimization configuration guide
- Supported File Formats - Detailed list of 123 supported formats and feature descriptions
- Conversion Examples - Detailed real conversion cases and feature demonstrations
- Frontend Integration Guide - Frontend developer integration documentation
- Inherit from the `BaseParser` class
- Implement the `parse()` method
- Register in `ParserRegistry`
Example:
from app.parsers.base import BaseParser


class CustomParser(BaseParser):
    @classmethod
    def get_supported_extensions(cls):
        # Extensions this parser is responsible for
        return ['.custom']

    async def parse(self, file_path: str) -> str:
        # Read file content
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()
        # Format as code block
        return f"```custom\n{content}\n```"
medicnex-file2md/
├── Docker deployment files
│   ├── Dockerfile               # Docker image build file
│   ├── docker-compose.yml       # Docker Compose service orchestration
│   ├── docker-deploy.sh         # One-click Docker deployment script
│   └── .dockerignore            # Docker build ignore file
├── Traditional deployment files
│   ├── deploy.sh                # Ubuntu server one-click deployment
│   └── monitor_logs.sh          # Log monitoring script
├── Configuration files
│   ├── .env.example             # Environment variable template
│   ├── requirements.txt         # Python dependencies
│   ├── LICENSE                  # Apache License 2.0
│   └── README.md                # Project documentation (this file)
└── Application core
    └── app/
        ├── main.py              # FastAPI application entry
        ├── config.py            # Configuration management
        ├── auth.py              # API Key authentication
        ├── models.py            # Pydantic data models
        ├── vision.py            # Vision recognition service
        ├── queue_manager.py     # Queue manager
        ├── cache.py             # Redis cache management
        ├── utils.py             # Utility functions
        ├── exceptions.py        # Exception handling
        ├── routers/
        │   └── convert.py       # Conversion API routes
        └── parsers/             # Parser modules (16 parsers)
            ├── base.py          # Parser base class
            ├── registry.py      # Parser registry
            ├── audio.py         # Audio/video parser (smart chunking + ASR)
            ├── code.py          # Code file parser (82 languages)
            ├── pdf.py           # PDF parser
            ├── doc.py           # Word DOC parser (legacy)
            ├── docx.py          # Word DOCX parser
            ├── excel.py         # Excel parser
            ├── pptx.py          # PowerPoint parser
            ├── csv.py           # CSV parser
            ├── numbers.py       # Apple Numbers parser
            ├── keynote.py       # Apple Keynote parser
            ├── pages.py         # Apple Pages parser
            ├── image.py         # Image parser
            ├── svg.py           # SVG parser
            ├── markdown.py      # Markdown parser
            ├── odt.py           # OpenDocument text parser
            ├── rtf.py           # RTF document parser
            └── txt.py           # Text parser
Docker Service Components:
- file2md-api: Main API service, integrating PaddleOCR and all parsers
- redis: Cache service, improving conversion performance and queue management
- nginx: Reverse proxy service (optional, recommended for production)
Data Persistence:
- `paddleocr_models`: PaddleOCR model file persistence
- `redis_data`: Redis data persistence
- `temp_files`: Temporary file storage
- `app_logs`: Application log persistence
This project is released under the Apache License 2.0.
Copyright 2025 MedicNex
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
We warmly welcome community contributions! Here's how you can participate:
- Report bugs on the Issues page
- Provide detailed error information and reproduction steps
- Include your environment information (OS, Python version, etc.)
- Propose new features on the Issues page
- Describe use cases and expected effects
- Discuss feasibility of implementation approaches
- Fork this repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Commit changes: `git commit -m 'Add some amazing feature'`
- Push to the branch: `git push origin feature/amazing-feature`
- Submit a Pull Request
- New Parsers: Add support for new file formats
- Performance Optimization: Improve processing speed and memory efficiency
- Documentation Improvement: Enhance usage guides and API documentation
- Deployment Optimization: Improve Docker and deployment scripts
- Test Enhancement: Increase test coverage
For more detailed information, please refer to the Contributing Guide.
Thank you for your attention and contribution to the File2MD project!
- Complete Docker Support: Brand new containerized deployment solution
  - Dockerfile: Optimized image based on Ubuntu 24.04, including all PaddleOCR dependencies
  - docker-compose.yml: Complete service orchestration including API, Redis, Nginx
  - docker-deploy.sh: One-click automated deployment script with automatic secure key generation
  - Data Persistence: Persistent storage for PaddleOCR models, Redis data, and logs
  - Health Checks: Built-in service health monitoring and automatic recovery
  - Resource Limits: Reasonable memory and CPU limit configurations
  - Security Configuration: Non-root user execution, automatic strong key generation
- Documentation Optimization: Reorganized deployment guide with three deployment options
- Architecture Description: Updated project structure description with clear Docker-related files
- OCR engine switched from Tesseract to PaddleOCR, improving recognition accuracy
- Audio and Video Processing Features: Brand new audio/video file processing support
  - Audio Format Support: `.wav`, `.mp3`, `.mp4`, `.m4a`, `.flac`, `.ogg`, `.wma`, `.aac` (8 formats)
  - Video Format Support: `.mp4`, `.avi`, `.mov`, `.wmv`, `.mkv`, `.webm`, `.3gp` (7 formats)
  - Smart Audio Preprocessing: 16kHz mono conversion, 80Hz high-pass filtering for noise removal
  - RMS Energy Analysis: Precise voice detection based on signal RMS
  - Adaptive Threshold: 10th percentile + 3dB dynamic threshold, adapts to different environments
  - Smart Segmentation Algorithm: 300ms minimum silence detection, automatic short segment merging
  - Concurrent ASR Conversion: Multiple audio segments simultaneously processed for speech recognition, 3-5x speed improvement
  - SRT Subtitle Generation: Automatic standard timestamp subtitle generation for video files
  - Quality Assessment: Confidence calculation and quality metrics based on energy
- Statistics Update: Supported formats increased from 109 to 123, added AudioParser
- Dependency Enhancement: Added pydub, numpy, librosa audio processing library support
- Apple iWork Support: Added support for the Apple iWork suite
  - Keynote (.key): Presentation files, extract metadata and structure, output as `slideshow` format
  - Pages (.pages): Word processing documents, extract metadata and structure, output as `document` format
  - Numbers (.numbers): Spreadsheet files, support table data extraction, output as `sheet` format
  - Smart Parsing: Numbers files prioritize the `numbers-parser` library for complete table data extraction, with fallback to basic parsing
- Statistics Update: Supported formats increased from 106 to 109, parsers from 13 to 16
- Dependency Update: Added `numbers-parser==4.4.6` dependency for Numbers file parsing
- Data Update: Complete testing and updated supported format list
  - 109 File Formats: Complete validation of all supported extensions
  - 16 Parsers: Optimized classification and statistics
  - New Documentation: Created detailed Supported Formats List
- API Enhancement: `/v1/supported-types` endpoint returns accurate format information
- SVG Features: Enhanced SVG to PNG visual recognition (ImageMagick support)
- Security Improvements: Health check API no longer exposes sensitive information
- New: Concurrent image processing functionality
  - Multiple images in PDF, DOC, DOCX, Excel documents can now be processed concurrently
  - OCR and AI vision recognition run simultaneously, dramatically improving processing speed
  - Processing speed improved 2-10x (depending on image count)
- Optimization: Improved exception handling and error recovery mechanisms
- Fix: Resolved memory issues with large document image processing