This project provides a quick and easy way to spin up an HTTP API for PDF OCR as a Docker image. The API accepts PDF files, either uploaded or fetched from a URL, and returns the extracted text with minimal setup.
This project is a basic implementation intended for personal or small-scale use. There are no security features, rate limiting, or advanced protections beyond a simple API key that can be set in the .env file. Use at your own risk if deploying in a production environment.
A big thanks to the PyMuPDF project and the pymupdf4llm extension for providing powerful PDF processing capabilities.
- Docker: Install Docker
- Docker Compose: Install Docker Compose
Please note that this project has only been tested briefly on Linux. Compatibility with other operating systems has not been extensively verified.
- Simple API: Provides a POST endpoint to upload a PDF file and return the OCR text.
- GET Request Support: Allows OCR processing of PDFs from a provided URL.
- Temporary File Handling: Files are processed within the container's temporary directory, minimizing permission issues.
- API Key Support: Basic security is implemented using an API key that can be set in the .env file.
- Efficient OCR: Uses pymupdf4llm for fast and accurate PDF text extraction.
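The temporary file handling mentioned in the feature list can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `process_in_temp_file` and its `process` callback are hypothetical names, and the only assumption is that uploaded bytes are written to the system temporary directory, processed, and then removed.

```python
import os
import tempfile
from typing import Callable


def process_in_temp_file(pdf_bytes: bytes, process: Callable[[str], str]) -> str:
    """Write the PDF bytes to a temporary file, run `process` on its path,
    and delete the file afterwards, even if processing raises."""
    # mkstemp creates the file in the system temp directory (e.g. /tmp in a container).
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(pdf_bytes)
        return process(path)
    finally:
        # Always clean up so temporary files do not accumulate.
        os.remove(path)
```

In a real service, `process` could be something like `lambda path: pymupdf4llm.to_markdown(path)`, but how the project wires this up is not shown in this README.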
git clone https://github.com/yannelli/ArchieOCR.git
cd ArchieOCR
Create a .env file in the project root with the following content:
ENABLE_KEY=True
KEY=your-secret-key
MAX_TIMEOUT=300
DPI=800
PAGE_WIDTH=1224
These environment variables control various aspects of the application:
- ENABLE_KEY: Set to True to enable API key authentication, False to disable it.
- KEY: Your secret API key for authentication (only used if ENABLE_KEY is True).
- MAX_TIMEOUT: Maximum timeout in seconds for downloading files from URLs (default: 300).
- DPI: DPI setting for PDF processing (default: 800).
- PAGE_WIDTH: Page width setting for PDF processing (default: 1224).
You can adjust these values as needed for your specific use case.
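To make the meaning of these variables concrete, here is a minimal sketch of how a service might read them and check a supplied key. It is not taken from the project's source: `Settings`, `load_settings`, and `is_authorized` are hypothetical names, the default of False for ENABLE_KEY is an assumption (the README does not state one), and the constant-time comparison is a design choice of this sketch.

```python
import hmac
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    enable_key: bool
    key: str
    max_timeout: int
    dpi: int
    page_width: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, using the documented defaults when unset."""
    return Settings(
        # Assumption: authentication is off when ENABLE_KEY is missing.
        enable_key=env.get("ENABLE_KEY", "False").strip().lower() == "true",
        key=env.get("KEY", ""),
        max_timeout=int(env.get("MAX_TIMEOUT", "300")),
        dpi=int(env.get("DPI", "800")),
        page_width=int(env.get("PAGE_WIDTH", "1224")),
    )


def is_authorized(settings: Settings, supplied_key: Optional[str]) -> bool:
    """Return True if the request may proceed under the current settings."""
    if not settings.enable_key:
        return True  # KEY is only consulted when ENABLE_KEY is True
    if not supplied_key or not settings.key:
        return False
    # compare_digest avoids leaking key contents through timing differences.
    return hmac.compare_digest(supplied_key.encode(), settings.key.encode())
```

With the example .env above, `is_authorized(load_settings(), "your-secret-key")` would return True and any other key would be rejected.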
docker compose up --build
This will build the Docker image and start the container, exposing the service on port 8080.
- POST /recognize: Upload a PDF file for OCR processing.
curl -X POST -F key=your-secret-key -F file=@/path/to/your/file.pdf http://localhost:8080/recognize
- GET /recognize: Provide a URL pointing to a PDF file for OCR processing.
curl "http://localhost:8080/recognize?file=https://example.com/file.pdf&key=your-secret-key"
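The same two requests can be made from Python using only the standard library. The sketch below is a hedged illustration rather than an official client: the function names are hypothetical, the field names `file` and `key` are taken from the curl commands above, and the response is returned as raw text because the README does not specify the response format. Percent-encoding the `file` parameter matters when the PDF URL carries its own query string.

```python
import urllib.parse
import urllib.request
import uuid
from typing import Optional, Tuple

BASE_URL = "http://localhost:8080/recognize"  # port from the Docker setup above


def build_get_url(pdf_url: str, key: Optional[str] = None, base_url: str = BASE_URL) -> str:
    """Return the GET /recognize URL with percent-encoded query parameters."""
    params = {"file": pdf_url}
    if key is not None:
        params["key"] = key
    return base_url + "?" + urllib.parse.urlencode(params)


def build_multipart(pdf_bytes: bytes, filename: str, key: Optional[str] = None) -> Tuple[bytes, str]:
    """Return (body, content_type) for a multipart/form-data upload."""
    boundary = uuid.uuid4().hex
    parts = []
    if key is not None:
        parts.append(
            ("--%s\r\nContent-Disposition: form-data; name=\"key\"\r\n\r\n%s\r\n"
             % (boundary, key)).encode("utf-8")
        )
    header = (
        "--%s\r\nContent-Disposition: form-data; name=\"file\"; filename=\"%s\"\r\n"
        "Content-Type: application/pdf\r\n\r\n" % (boundary, filename)
    ).encode("utf-8")
    parts.append(header + pdf_bytes + b"\r\n")
    parts.append(("--%s--\r\n" % boundary).encode("utf-8"))
    return b"".join(parts), "multipart/form-data; boundary=" + boundary


def recognize_file(path: str, key: Optional[str] = None, base_url: str = BASE_URL) -> str:
    """POST a local PDF and return the raw response text."""
    with open(path, "rb") as handle:
        body, content_type = build_multipart(handle.read(), path.rsplit("/", 1)[-1], key)
    request = urllib.request.Request(
        base_url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def recognize_url(pdf_url: str, key: Optional[str] = None, base_url: str = BASE_URL) -> str:
    """Ask the service to fetch and process a remote PDF; return the raw response text."""
    with urllib.request.urlopen(build_get_url(pdf_url, key, base_url)) as response:
        return response.read().decode("utf-8")
```

For example, `recognize_url("https://example.com/file.pdf", key="your-secret-key")` mirrors the GET curl command, and `recognize_file("/path/to/your/file.pdf", key="your-secret-key")` mirrors the POST one.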
To stop the Docker container:
docker compose down
- Switched from Tesseract OCR to pymupdf4llm for improved PDF text extraction.
- Removed image processing capabilities to focus on PDF handling.
- Simplified dependencies and reduced the overall codebase.
This project is licensed under the MIT License. See the LICENSE file for more details.
This is a personal project, and I provide no warranty or guarantees of any kind. Use at your own risk.
Contributions are welcome! Feel free to open an issue or submit a pull request.