
pdf-extract-api

Convert any image or PDF to Markdown text or JSON structured document with super-high accuracy, including tabular data, numbers or math formulas.

The API is built with FastAPI and uses Celery for asynchronous task processing. Redis is used for caching OCR results.
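
The same submit-a-task, cache-by-content pattern can be sketched in plain Python. This is a toy illustration only: the real project uses Celery for the worker pool and Redis for the cache, and all names below are illustrative, not the project's actual code.

```python
import concurrent.futures
import hashlib

class OcrService:
    """Toy sketch of the API's task pattern: jobs go to a worker pool
    (Celery in the real project) and results are cached by content hash
    (Redis in the real project)."""

    def __init__(self, ocr_fn):
        self._ocr = ocr_fn          # e.g. a Marker/Tesseract wrapper
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
        self._cache = {}            # stands in for Redis
        self._tasks = {}            # task_id -> Future

    def submit(self, file_bytes, ocr_cache=True):
        """Enqueue a document; returns a task_id the client can poll."""
        key = hashlib.sha256(file_bytes).hexdigest()
        if ocr_cache and key in self._cache:
            future = concurrent.futures.Future()
            future.set_result(self._cache[key])   # cache hit: skip OCR
        else:
            future = self._pool.submit(self._run, key, file_bytes, ocr_cache)
        task_id = f"task-{len(self._tasks)}"
        self._tasks[task_id] = future
        return task_id

    def _run(self, key, file_bytes, ocr_cache):
        text = self._ocr(file_bytes)
        if ocr_cache:
            self._cache[key] = text
        return text

    def result(self, task_id, timeout=None):
        """Block until the task finishes and return its text."""
        return self._tasks[task_id].result(timeout=timeout)
```

In the real service the client gets the task_id back immediately from `/ocr` and retrieves the result later from `/ocr/result/{task_id}`.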


Features:

  • No cloud/external dependencies. All you need is the PyTorch-based OCR (Marker) plus Ollama, both shipped and configured via docker-compose; no data is sent outside your dev/server environment.
  • PDF to Markdown conversion with very high accuracy using different OCR strategies, including marker, surya-ocr or tesseract
  • PDF to JSON conversion using Ollama-supported models (e.g. Llama 3.1)
  • LLM-improved OCR results - Llama is pretty good at fixing spelling and text issues in the OCR output
  • Removing PII - this tool can be used to remove Personally Identifiable Information from PDFs - see examples
  • Distributed queue processing using Celery
  • Caching using Redis - OCR results can be cached prior to LLM processing
  • CLI tool for sending tasks and processing results
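
In the project itself, PII removal is delegated to the LLM via a prompt (see the `examples/` prompt files). Purely to illustrate the input/output shape of a redaction step, here is a toy regex-based scrub; it is not how the project does it and only handles two easy patterns.

```python
import re

# Toy illustration only: the project removes PII with an LLM prompt,
# not regexes. These two patterns merely mimic the effect.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def redact(text: str) -> str:
    """Replace obvious e-mail addresses and phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```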

Screenshots

Converting MRI report to Markdown + JSON.

python client/cli.py ocr --file examples/example-mri.pdf --prompt_file examples/example-mri-2-json-prompt.txt

Before running the example see getting started

Converting MRI report to Markdown

Converting Invoice to JSON and removing PII

python client/cli.py ocr --file examples/example-invoice.pdf --prompt_file examples/example-invoice-remove-pii.txt 

Before running the example see getting started

Converting Invoice to JSON

Note: As you may observe in the example above, marker-pdf sometimes mismatches columns and rows, which can significantly impact data accuracy. To improve on this, there is a feature request #3 to add alternative support for the tabled model, which is optimized for tables.
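
One cheap way to spot such mismatches is to check that every row of an extracted Markdown table has the same number of cells. The helper below is an illustrative sketch, not part of the project:

```python
def table_row_widths(markdown: str) -> list[int]:
    """Count cells per row of a Markdown table.

    Uneven widths in the returned list hint at the column/row
    mismatch described above. Illustrative helper, not project code.
    """
    widths = []
    for line in markdown.splitlines():
        line = line.strip()
        if line.startswith("|") and line.endswith("|"):
            widths.append(len(line.strip("|").split("|")))
    return widths
```

A result such as `[2, 2, 3]` means the third row gained a spurious cell and the table likely needs a second pass.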

Getting started

Prerequisites

  • Docker
  • Docker Compose

Clone the Repository

git clone https://github.com/CatchTheTornado/pdf-extract-api.git
cd pdf-extract-api

Set up environment variables

Create a .env file in the root directory and set the necessary environment variables. You can use the .env.example file as a template:

cp .env.example .env

Then modify the variables inside the file:

REDIS_CACHE_URL=redis://redis:6379/1
OLLAMA_API_URL=http://ollama:11434/api

# CLI settings
OCR_URL=http://localhost:8000/ocr
RESULT_URL=http://localhost:8000/ocr/result/{task_id}
CLEAR_CACHE_URL=http://localhost:8000/ocr/clear_cache
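
Note that RESULT_URL contains a `{task_id}` placeholder which the client fills in per task; expanding it is a one-liner with `str.format`:

```python
RESULT_URL = "http://localhost:8000/ocr/result/{task_id}"

def result_url(template: str, task_id: str) -> str:
    # Expand the {task_id} placeholder used by RESULT_URL in .env.
    return template.format(task_id=task_id)
```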

Build and Run the Docker Containers

Build and run the Docker containers using Docker Compose:

docker-compose up --build

... for GPU support run:

docker-compose -f docker-compose.gpu.yml up --build

This will start the following services:

  • FastAPI App: Runs the FastAPI application.
  • Celery Worker: Processes asynchronous OCR tasks.
  • Redis: Caches OCR results.
  • Ollama: Runs the Ollama model.

Hosted edition

If on-prem is too much hassle, ask us about the hosted/cloud edition of pdf-extract-api; we can set it up for you, billed just for usage.

CLI tool

The project includes a CLI for interacting with the API. To set it up, first run:

cd client
pip install -r requirements.txt

Pull the Llama 3.1 model

You might want to test out different models supported by Ollama

python client/cli.py llm_pull --model llama3.1

Upload a File for OCR (converting to Markdown)

python client/cli.py ocr --file examples/example-mri.pdf --ocr_cache

Upload a File for OCR (processing by LLM)

python client/cli.py ocr --file examples/example-mri.pdf --ocr_cache --prompt_file=examples/example-mri-remove-pii.txt

Get OCR Result by Task ID

python client/cli.py result --task_id {your_task_id_from_upload_step}
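
Because OCR tasks are asynchronous, a client typically polls this endpoint until the task leaves its pending state. A minimal polling sketch, where `fetch` is any callable that GETs RESULT_URL and returns the decoded JSON; the `"state"` key and the PENDING/PROGRESS values are assumptions about the response shape, not a documented contract:

```python
import time

def poll_result(fetch, task_id, interval=0.5, timeout=60.0):
    """Poll an OCR task until it is no longer pending.

    `fetch` is any callable returning the decoded JSON for a task_id
    (e.g. a GET on RESULT_URL). The "state" key and its
    PENDING/PROGRESS values are assumed, not documented.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        payload = fetch(task_id)
        if payload.get("state") not in ("PENDING", "PROGRESS"):
            return payload
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish in {timeout}s")
```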

Clear OCR Cache

python client/cli.py clear_cache

Test LLama

python client/cli.py llm_generate --prompt "Your prompt here"

Endpoints

OCR Endpoint

  • URL: /ocr
  • Method: POST
  • Parameters:
    • file: PDF file to be processed.
    • strategy: OCR strategy to use (marker or tesseract).
    • ocr_cache: Whether to cache the OCR result (true or false).
    • prompt: Optional. When provided, the OCR result is post-processed by Ollama using this prompt.
    • model: Optional. When provided along with prompt, this model is used for the LLM processing.

Example:

curl -X POST -H "Content-Type: multipart/form-data" -F "file=@examples/example-mri.pdf" -F "strategy=marker" -F "ocr_cache=true" -F "prompt=" -F "model=" "http://localhost:8000/ocr" 
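
The same request can be built from Python with only the standard library. The sketch below mirrors the curl form fields; the per-part `Content-Type: application/pdf` header is an assumption, and the returned body would be sent with any HTTP client:

```python
import uuid

def multipart_form(fields, file_field, filename, file_bytes):
    """Build a multipart/form-data body matching the curl call above.

    Returns (content_type, body). Stdlib-only sketch; the per-part
    Content-Type of application/pdf is an assumption.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (f'--{boundary}\r\n'
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f'{value}\r\n').encode()
        )
    parts.append(
        (f'--{boundary}\r\n'
         f'Content-Disposition: form-data; name="{file_field}"; '
         f'filename="{filename}"\r\n'
         f'Content-Type: application/pdf\r\n\r\n').encode()
        + file_bytes + b"\r\n"
    )
    parts.append(f'--{boundary}--\r\n'.encode())
    return f"multipart/form-data; boundary={boundary}", b"".join(parts)
```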

OCR Result Endpoint

  • URL: /ocr/result/{task_id}
  • Method: GET
  • Parameters:
    • task_id: Task ID returned by the OCR endpoint.

Example:

curl -X GET "http://localhost:8000/ocr/result/{task_id}"

Clear OCR Cache Endpoint

  • URL: /ocr/clear_cache
  • Method: POST

Example:

curl -X POST "http://localhost:8000/ocr/clear_cache"

Ollama Pull Endpoint

  • URL: /llm_pull
  • Method: POST
  • Parameters:
    • model: Name of the model to pull. Pull a model before using it for generation.

Example:

curl -X POST "http://localhost:8000/llm_pull" -H "Content-Type: application/json" -d '{"model": "llama3.1"}'

Ollama Endpoint

  • URL: /llm_generate
  • Method: POST
  • Parameters:
    • prompt: Prompt for the Ollama model.
    • model: Model you would like to query.

Example:

curl -X POST "http://localhost:8000/llm_generate" -H "Content-Type: application/json" -d '{"prompt": "Your prompt here", "model":"llama3.1"}'

License

This project is licensed under the GNU General Public License. See the LICENSE file for details.

Important note on the marker license:

The weights for the models are licensed cc-by-nc-sa-4.0, but Marker's author will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. You also must not be competitive with the Datalab API. If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options here.

Contact

In case of any questions please contact us at: info@catchthetornado.com