A powerful OCR solution that converts PDF documents to Markdown format using DeepSeek-OCR with FastAPI backend. This project provides both a batch processing script and a REST API for flexible document conversion.
- Place your PDF files in the `data/` directory
- Ensure the DeepSeek-OCR API is running (see Docker setup below)
- Run the processor: `python pdf_to_markdown_processor.py`
- Build and start the Docker container
- Use the API endpoints to process documents
- Integrate with your applications
- NVIDIA GPU with CUDA 11.8+ support
- GPU Memory: Minimum 12GB VRAM (Model takes ~9GB)
- System RAM: Minimum 32GB (recommended: 64GB+)
- Storage: 50GB+ free space for model and containers
- Python 3.8+ (for local processing)
- Docker 20.10+ with GPU support
- Docker Compose 2.0+
- NVIDIA Container Toolkit installed
- CUDA 11.8 compatible drivers
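To sanity-check the GPU prerequisites before building, here is a minimal stdlib-only probe (this helper is not part of the project; it simply shells out to `nvidia-smi`):

```python
import shutil
import subprocess

def gpu_report():
    """Best-effort check that an NVIDIA GPU and driver are visible.

    Returns a dict; 'nvidia_smi' is False when the driver tooling is
    absent or the command fails.
    """
    if shutil.which("nvidia-smi") is None:
        return {"nvidia_smi": False, "output": ""}
    proc = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    return {"nvidia_smi": proc.returncode == 0, "output": proc.stdout}

if __name__ == "__main__":
    report = gpu_report()
    print("GPU visible:", report["nvidia_smi"])
```

If this reports no GPU, fix the driver and NVIDIA Container Toolkit installation before attempting the Docker build.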
Create a directory for model weights and download the DeepSeek-OCR model:

```bash
# Create models directory
mkdir -p models

# Download using Hugging Face CLI
pip install huggingface_hub
huggingface-cli download deepseek-ai/DeepSeek-OCR --local-dir models/deepseek-ai/DeepSeek-OCR

# Or using git
git clone https://huggingface.co/deepseek-ai/DeepSeek-OCR models/deepseek-ai/DeepSeek-OCR
```

Build and start the service.

Windows:

```bat
REM Build the Docker image
build.bat

REM Start the service
docker-compose up -d

REM Check logs
docker-compose logs -f deepseek-ocr
```

Linux/macOS:

```bash
# Build the Docker image
docker-compose build

# Start the service
docker-compose up -d

# Check logs
docker-compose logs -f deepseek-ocr
```

Verify the service is up:

```bash
# Health check
curl http://localhost:8000/health
```

Expected response:

```json
{
  "status": "healthy",
  "model_loaded": true,
  "model_path": "/app/models/deepseek-ai/DeepSeek-OCR",
  "cuda_available": true,
  "cuda_device_count": 1
}
```

This project provides several PDF processing scripts, each designed for different use cases. All scripts scan the `data/` directory for PDF files and convert them to Markdown format with different prompts and post-processing options.
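Because the batch processors depend on the API being reachable, it can help to wait for the health endpoint programmatically before kicking off a run. A small stdlib-only sketch (the function name is ours, not part of the project):

```python
import json
import time
from urllib.error import URLError
from urllib.request import urlopen

def wait_for_api(base_url="http://localhost:8000", timeout=120):
    """Poll /health until the server reports a loaded model, or give up."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urlopen(f"{base_url}/health", timeout=5) as resp:
                info = json.load(resp)
            if info.get("status") == "healthy" and info.get("model_loaded"):
                return info
        except (URLError, OSError, ValueError):
            pass  # server not up yet; retry
        time.sleep(2)
    raise TimeoutError(f"DeepSeek-OCR API at {base_url} not healthy after {timeout}s")
```

Call `wait_for_api()` once at the top of your own automation; it returns the health payload shown above on success.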
All processors append a suffix to the output filename to indicate the processing method used:
- `-MD.md`: Markdown conversion (preserves document structure)
- `-OCR.md`: Plain OCR extraction (raw text without formatting)
- `-CUSTOM.md`: Custom prompt processing (uses prompt from YAML file)

For example, processing `document.pdf` will create:

- `document-MD.md` (markdown processors)
- `document-OCR.md` (OCR processor)
- `document-CUSTOM.md` (custom prompt processors)
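The naming convention above is easy to reproduce when your own tooling needs to locate outputs; a small sketch (the helper name is ours):

```python
from pathlib import Path

def output_path(pdf_path, suffix):
    """Map an input PDF to the markdown filename a processor would write."""
    p = Path(pdf_path)
    return p.with_name(f"{p.stem}{suffix}.md")
```

For example, `output_path("data/document.pdf", "-MD")` yields `data/document-MD.md`.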
**pdf_to_markdown_processor.py**

Purpose: Basic PDF to Markdown conversion using the standard markdown prompt
Features:
- Uses prompt: `'<image>\n<|grounding|>Convert the document to markdown.'`
- Converts PDFs to structured Markdown format
- Simple processing without image extraction
- Outputs files with `-MD.md` suffix
Usage:
```bash
# Place PDF files in the data directory
cp your_document.pdf data/

# Run the processor
python pdf_to_markdown_processor.py

# Check results
ls data/*-MD.md
```

**pdf_to_markdown_processor_enhanced.py**

Purpose: Enhanced PDF to Markdown conversion with post-processing
Features:
- Uses the same markdown prompt as the basic version
- Post-processing features:
  - Image extraction and saving to `data/images/` folder
  - Special token cleanup
  - Reference processing for layout information
  - Content cleaning and formatting
- Outputs files with `-MD.md` suffix
Usage:
```bash
# Place PDF files in the data directory
cp your_document.pdf data/

# Run the enhanced processor
python pdf_to_markdown_processor_enhanced.py

# Check results (including extracted images)
ls data/*-MD.md
ls data/images/
```

**pdf_to_ocr_enhanced.py**

Purpose: Plain OCR text extraction without markdown formatting
Features:
- Uses OCR prompt: `'<image>\nFree OCR.'`
- Extracts raw text without markdown structure
- Includes the same post-processing features as the enhanced markdown processor
- Outputs files with `-OCR.md` suffix
Usage:
```bash
# Place PDF files in the data directory
cp your_document.pdf data/

# Run the OCR processor
python pdf_to_ocr_enhanced.py

# Check results
ls data/*-OCR.md
```

**pdf_to_custom_prompt.py**

Purpose: PDF processing with custom prompts (raw output)
Features:
- Uses custom prompt loaded from `custom_prompt.yaml`
- Returns raw model response without post-processing
- Ideal for testing and debugging different prompts
- Outputs files with `-CUSTOM.md` suffix
Configuration:
Edit `custom_prompt.yaml` to customize the prompt:

```yaml
# Custom prompt for PDF processing
prompt: '<image>\n<|grounding|>Convert the document to markdown.'
```

Usage:
```bash
# Edit the prompt in custom_prompt.yaml
nano custom_prompt.yaml

# Place PDF files in the data directory
cp your_document.pdf data/

# Run the custom prompt processor
python pdf_to_custom_prompt.py

# Check results
ls data/*-CUSTOM.md
```

**pdf_to_custom_prompt_enhanced.py**

Purpose: PDF processing with custom prompts and full post-processing
Features:
- Uses custom prompt loaded from `custom_prompt.yaml`
- Includes all post-processing features (image extraction, content cleaning, etc.)
- Combines custom prompts with enhanced output processing
- Outputs files with `-CUSTOM.md` suffix
Configuration:
Same as `pdf_to_custom_prompt.py`: edit `custom_prompt.yaml` to customize the prompt.
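Reading the prompt out of `custom_prompt.yaml` presumably looks something like this (a sketch using PyYAML; the function name and fallback default are our assumptions, not the project's actual code):

```python
import yaml  # PyYAML: pip install pyyaml

DEFAULT_PROMPT = '<image>\n<|grounding|>Convert the document to markdown.'

def load_custom_prompt(path="custom_prompt.yaml"):
    """Return the 'prompt' value from the YAML file, falling back to a default."""
    with open(path, "r", encoding="utf-8") as f:
        cfg = yaml.safe_load(f) or {}
    return cfg.get("prompt", DEFAULT_PROMPT)
```

The `or {}` guards against an empty YAML file, so a missing or blank config degrades to the default prompt instead of crashing.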
Usage:
```bash
# Edit the prompt in custom_prompt.yaml
nano custom_prompt.yaml

# Place PDF files in the data directory
cp your_document.pdf data/

# Run the enhanced custom prompt processor
python pdf_to_custom_prompt_enhanced.py

# Check results (including extracted images)
ls data/*-CUSTOM.md
ls data/images/
```

| Processor | Prompt | Post-Processing | Image Extraction | Output Suffix | Use Case |
|---|---|---|---|---|---|
| `pdf_to_markdown_processor.py` | Markdown | ❌ | ❌ | `-MD.md` | Quick markdown conversion |
| `pdf_to_markdown_processor_enhanced.py` | Markdown | ✅ | ✅ | `-MD.md` | Full-featured markdown with images |
| `pdf_to_ocr_enhanced.py` | Free OCR | ✅ | ✅ | `-OCR.md` | Raw text extraction |
| `pdf_to_custom_prompt.py` | Custom (YAML) | ❌ | ❌ | `-CUSTOM.md` | Testing custom prompts |
| `pdf_to_custom_prompt_enhanced.py` | Custom (YAML) | ✅ | ✅ | `-CUSTOM.md` | Custom prompts with full features |
To compare how different processors handle the same document:
```bash
# Place a PDF in the data directory
cp test_document.pdf data/

# Run all processors
python pdf_to_markdown_processor.py
python pdf_to_markdown_processor_enhanced.py
python pdf_to_ocr_enhanced.py
python pdf_to_custom_prompt.py

# Compare outputs
ls data/test_document-*.md
```

To use a custom prompt for specialized extraction:

- Edit `custom_prompt.yaml` with your desired prompt:

  ```yaml
  prompt: '<image>\nExtract all tables and format as CSV.'
  ```

- Run the custom processor:

  ```bash
  python pdf_to_custom_prompt_enhanced.py
  ```

- Check the specialized output:

  ```bash
  cat data/your_document-CUSTOM.md
  ```

The FastAPI backend provides several endpoints for document processing.
Health check:

```
GET http://localhost:8000/health
```

Process a single image:

```bash
curl -X POST "http://localhost:8000/ocr/image" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@your_image.jpg"
```

Process a PDF:

```bash
curl -X POST "http://localhost:8000/ocr/pdf" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@your_document.pdf"
```

Batch processing:

```bash
curl -X POST "http://localhost:8000/ocr/batch" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "files=@image1.jpg" \
  -F "files=@document.pdf" \
  -F "files=@image2.png"
```

Example response (single image):

```json
{
  "success": true,
  "result": "# Document Title\n\nThis is the OCR result in markdown format...",
  "page_count": 1
}
```

Example response (PDF):

```json
{
  "success": true,
  "results": [
    {
      "success": true,
      "result": "# Page 1 Content\n...",
      "page_count": 1
    },
    {
      "success": true,
      "result": "# Page 2 Content\n...",
      "page_count": 2
    }
  ],
  "total_pages": 2,
  "filename": "document.pdf"
}
```

Python client example:

```python
import requests

class DeepSeekOCRClient:
    def __init__(self, base_url="http://localhost:8000"):
        self.base_url = base_url

    def process_image(self, image_path):
        with open(image_path, 'rb') as f:
            response = requests.post(
                f"{self.base_url}/ocr/image",
                files={"file": f}
            )
        return response.json()

    def process_pdf(self, pdf_path):
        with open(pdf_path, 'rb') as f:
            response = requests.post(
                f"{self.base_url}/ocr/pdf",
                files={"file": f}
            )
        return response.json()

# Usage
client = DeepSeekOCRClient()
result = client.process_pdf("document.pdf")
if result["success"]:
    for page_result in result["results"]:
        print(f"Page {page_result['page_count']}:")
        print(page_result["result"])
        print("---")
```

JavaScript client example:

```javascript
class DeepSeekOCR {
  constructor(baseUrl = 'http://localhost:8000') {
    this.baseUrl = baseUrl;
  }

  async processImage(file) {
    const formData = new FormData();
    formData.append('file', file);
    const response = await fetch(`${this.baseUrl}/ocr/image`, {
      method: 'POST',
      body: formData
    });
    return await response.json();
  }

  async processPDF(file) {
    const formData = new FormData();
    formData.append('file', file);
    const response = await fetch(`${this.baseUrl}/ocr/pdf`, {
      method: 'POST',
      body: formData
    });
    return await response.json();
  }
}

// Usage in browser
const ocr = new DeepSeekOCR();
document.getElementById('fileInput').addEventListener('change', async (e) => {
  const file = e.target.files[0];
  const result = await ocr.processPDF(file);
  if (result.success) {
    result.results.forEach(page => {
      console.log(`Page ${page.page_count}:`, page.result);
    });
  }
});
```

This project includes custom files that replace the original DeepSeek-OCR library code to fix critical issues and provide enhanced functionality. These replacements are applied transparently during the Docker build process.
Issue: The original DeepSeek-OCR library has a bug where the `tokenize_with_images()` method is called without the required `prompt` parameter during model initialization, causing server startup failures.
Solution: Custom run scripts have been created to properly handle the prompt parameter and prevent startup errors.
The following custom files in the project root replace their counterparts during Docker build:
- `custom_config.py`: Custom configuration with customizable default prompt and settings
- `custom_image_process.py`: Fixed version of the image processing module that handles the prompt parameter correctly
- `custom_run_dpsk_ocr_pdf.py`: Enhanced PDF script that accepts a `--prompt` argument and fixes the initialization issue
- `custom_run_dpsk_ocr_image.py`: Enhanced image script that accepts a `--prompt` argument and fixes the initialization issue
- `custom_run_dpsk_ocr_eval_batch.py`: Enhanced batch script that accepts a `--prompt` argument and fixes the initialization issue
These custom files are automatically copied over the original library files during the Docker build process, ensuring the fixes are applied without requiring manual intervention.
- Edit the Default Prompt:

  ```python
  # Edit custom_config.py
  PROMPT = '<image>\n<|grounding|>Your custom default prompt here.'
  ```

- Use Custom Prompts with Direct Scripts:

  ```bash
  # Using default prompt from custom_config.py
  python custom_run_dpsk_ocr_pdf.py --input your_file.pdf --output output_dir

  # Using custom prompt via command line
  python custom_run_dpsk_ocr_pdf.py --prompt "<image>\n<|grounding|>Extract tables as CSV." --input your_file.pdf
  ```

- Use Custom Prompts with API:

  ```bash
  # Using default prompt
  curl -X POST "http://localhost:8000/ocr/pdf" -F "file=@your_file.pdf"

  # Using custom prompt
  curl -X POST "http://localhost:8000/ocr/pdf" -F "file=@your_file.pdf" -F "prompt=<image>\n<|grounding|>Your custom prompt here."
  ```

- Build and Run:

  ```bash
  # Rebuild with custom configuration and fixes
  docker-compose build

  # Start the container
  docker-compose up -d
  ```
The Dockerfile automatically applies the custom files during the build process:
```dockerfile
# Copy custom files to replace the originals (transparent replacement approach)
COPY custom_config.py ./DeepSeek-OCR-vllm/config.py
COPY custom_image_process.py ./DeepSeek-OCR-vllm/process/image_process.py

# Copy custom run scripts to replace the originals
COPY custom_run_dpsk_ocr_pdf.py ./DeepSeek-OCR-vllm/run_dpsk_ocr_pdf.py
COPY custom_run_dpsk_ocr_image.py ./DeepSeek-OCR-vllm/run_dpsk_ocr_image.py
COPY custom_run_dpsk_ocr_eval_batch.py ./DeepSeek-OCR-vllm/run_dpsk_ocr_eval_batch.py
```

This transparent replacement approach ensures that:
- The critical prompt parameter fix is applied
- Custom configuration options are available
- No manual modification of the original library code is required
- The fixes persist across container rebuilds
For detailed documentation, see `CUSTOM_CONFIG_README.md`.
Edit docker-compose.yml to adjust these settings:
```yaml
environment:
  - CUDA_VISIBLE_DEVICES=0  # GPU device to use
  - MODEL_PATH=/app/models/deepseek-ai/DeepSeek-OCR  # Model path
  - MAX_CONCURRENCY=50  # Max concurrent requests
  - GPU_MEMORY_UTILIZATION=0.85  # GPU memory usage (0.1-1.0)
```

High-throughput profile (powerful GPU):

```yaml
environment:
  - MAX_CONCURRENCY=100
  - GPU_MEMORY_UTILIZATION=0.95
```

Conservative profile (limited GPU memory):

```yaml
environment:
  - MAX_CONCURRENCY=10
  - GPU_MEMORY_UTILIZATION=0.7
```

Out-of-memory errors:

```yaml
# Reduce concurrency and GPU memory usage
# Edit docker-compose.yml:
environment:
  - MAX_CONCURRENCY=10
  - GPU_MEMORY_UTILIZATION=0.7
```

Model loading problems:

```bash
# Check model directory structure
ls -la models/deepseek-ai/DeepSeek-OCR/

# Verify model files are present
docker-compose exec deepseek-ocr ls -la /app/models/deepseek-ai/DeepSeek-OCR/
```

GPU issues:

```bash
# Check GPU availability
nvidia-smi

# Check Docker GPU support
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi
```

API not responding:

```bash
# Check if the API is running
curl http://localhost:8000/health

# Check container logs
docker-compose logs -f deepseek-ocr

# Restart the service
docker-compose restart deepseek-ocr
```

PDF processing failures:

```bash
# Check if PDF files are valid
file data/your_document.pdf

# Try processing a single PDF manually
curl -X POST "http://localhost:8000/ocr/pdf" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@data/your_document.pdf"
```

If you encounter this error during server startup:
```text
TypeError: DeepseekOCRProcessor.tokenize_with_images() missing 1 required positional argument: 'prompt'
```
This error has been fixed with the custom files included in the Docker build. If you still see it:
- Ensure you're using the updated Dockerfile (includes custom run scripts)
- Rebuild the container completely:

  ```bash
  docker-compose down
  docker-compose build --no-cache
  docker-compose up -d
  ```

- Verify the fix is applied:

  ```bash
  docker-compose exec deepseek-ocr ls -la /app/DeepSeek-OCR-vllm/run_dpsk_ocr_*.py
  # These should show recent timestamps from the build
  ```
The fix ensures that the `tokenize_with_images()` method is called with the correct `prompt` parameter during model initialization.
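As an illustration only (this is not the project's actual code), the shape of the fix is to always forward a prompt when tokenizing:

```python
# Hypothetical sketch of the fix applied by the custom scripts: the original
# library sometimes called tokenize_with_images() with no prompt at startup,
# so the wrapper below always supplies one.
DEFAULT_PROMPT = '<image>\n<|grounding|>Convert the document to markdown.'

def tokenize_with_prompt(processor, images, prompt=None):
    """Never call tokenize_with_images without a prompt; fall back to a default."""
    return processor.tokenize_with_images(images=images, prompt=prompt or DEFAULT_PROMPT)
```

The custom run scripts follow this pattern by threading the `--prompt` argument (or the default from `custom_config.py`) down to every tokenization call.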
For debugging, you can run the container with additional tools:
```bash
# Run with shell access
docker-compose run --rm deepseek-ocr bash

# Check model loading
python -c "
import os
import sys
sys.path.insert(0, '/app/DeepSeek-OCR-master/DeepSeek-OCR-vllm')
from config import MODEL_PATH
print(f'Model path: {MODEL_PATH}')
print(f'Model exists: {os.path.exists(MODEL_PATH)}')
"
```

- Batch Processing: Process multiple files at once using the `/ocr/batch` endpoint
- Optimize DPI: The default DPI of 144 provides a good balance between quality and speed
- GPU Utilization: Adjust `GPU_MEMORY_UTILIZATION` based on your GPU capacity
- Concurrency: Increase `MAX_CONCURRENCY` for better throughput on powerful GPUs
- File Size: For large PDFs, consider splitting them into smaller chunks
```text
DeepSeek-OCR/
├── README.md                              # This file
├── CUSTOM_CONFIG_README.md                # Custom configuration documentation
├── pdf_to_markdown_processor.py           # Basic markdown conversion
├── pdf_to_markdown_processor_enhanced.py  # Enhanced markdown with post-processing
├── pdf_to_ocr_enhanced.py                 # OCR text extraction
├── pdf_to_custom_prompt.py                # Custom prompt processing (raw)
├── pdf_to_custom_prompt_enhanced.py       # Custom prompt with post-processing
├── custom_prompt.yaml                     # Configuration for custom prompts
├── custom_config.py                       # Custom configuration (replaces original config.py)
├── custom_image_process.py                # Fixed image processing (replaces original)
├── custom_run_dpsk_ocr_pdf.py             # Custom PDF script with prompt support (replaces original)
├── custom_run_dpsk_ocr_image.py           # Custom image script with prompt support (replaces original)
├── custom_run_dpsk_ocr_eval_batch.py      # Custom batch script with prompt support (replaces original)
├── test_custom_config.py                  # Test script for custom configuration
├── start_server.py                        # FastAPI server
├── Dockerfile                             # Docker container definition (includes custom files)
├── docker-compose.yml                     # Docker compose configuration
├── build.bat                              # Windows build script
├── data/                                  # Input/output directory for PDFs
│   ├── images/                            # Extracted images (when using enhanced processors)
│   └── *.md                               # Generated markdown files
├── models/                                # Model weights directory
└── DeepSeek-OCR/                          # DeepSeek-OCR source code
    └── DeepSeek-OCR-master/
        └── DeepSeek-OCR-vllm/             # Original library files (replaced during build)
```
This project follows the same license as the DeepSeek-OCR project. Please refer to the original project's license file for details.
For issues related to:
- Docker setup: Check this README first
- DeepSeek-OCR model: Refer to the official repository
- vLLM: Refer to vLLM documentation
```mermaid
graph TD
    A[Start] --> B{Choose Method}
    B -->|Batch Processing| C[Place PDFs in data/ folder]
    B -->|API Usage| D[Start Docker Container]
    C --> E[Run python pdf_to_markdown_processor.py]
    D --> F[Use API endpoints]
    E --> G[Check data/ folder for .md files]
    F --> H[Process results from API response]
    G --> I[Done]
    H --> I

    style A fill:#e1f5fe
    style I fill:#e8f5e8
    style C fill:#fff3e0
    style D fill:#fff3e0
    style E fill:#f3e5f5
    style F fill:#f3e5f5
```