/fastapi_pdfextractor

An API built with FastAPI for extracting the text content of PDFs using pdfminer. It also supports scanned images in PDFs by using Tesseract and OCRmyPDF.


fastapi_pdfextractor

A simple API built with FastAPI for extracting the text content of PDFs using pdfminer. Several PDF parsers were tried, including PyPDF2 and pdfminer, and pdfminer gave the better results. For OCR support, first install Tesseract and Ghostscript, as both are required dependencies for the code to work.
Try out and compare the output of pdfminer and Tika through the API endpoints. Results are available in the API response or in the app/results directory.
Note: if Tesseract is installed somewhere other than the default location, change the location accordingly in the pdfapi.py file.
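
Before starting the server, it can help to confirm that the OCR dependencies are actually reachable. The following is a minimal sketch, not part of this project: it assumes the executables are named tesseract for Tesseract and gs (Linux/macOS) or gswin64c/gswin32c (Windows) for Ghostscript, and the helper names find_tool and check_ocr_dependencies are hypothetical.

```python
import shutil

# Executable names are assumptions: "tesseract" for Tesseract, and "gs"
# (Linux/macOS) or "gswin64c"/"gswin32c" (Windows) for Ghostscript.
REQUIRED_TOOLS = {
    "tesseract": ["tesseract"],
    "ghostscript": ["gs", "gswin64c", "gswin32c"],
}


def find_tool(candidates):
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def check_ocr_dependencies(tools=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is missing."""
    return {tool: find_tool(names) for tool, names in tools.items()}


if __name__ == "__main__":
    for tool, path in check_ocr_dependencies().items():
        print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If a tool is reported as missing but is installed in a non-default location, either add that location to PATH or adjust the location in pdfapi.py as described in the note above.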

Clone project

git clone https://github.com/soham-1/fastapi_pdfextractor.git

Run locally

Install dependencies

pip install -r requirements.txt

Run Server

cd app
uvicorn pdfapi:app --host 0.0.0.0 --port 8000 --reload

Run on Docker

docker-compose up -d --build

Stop the container using

docker-compose stop fast_api

Restart it using

docker-compose up -d

Documentation

This API has the following endpoints:

  • /get_doc_list - returns a list of all the available PDFs

  • /parse/{doc_name} - returns the metadata and text content of a PDF. Available PDFs are sample_doc_1, sample_doc_2, and sample_doc_3

  • /pdfminer_text/{doc} - returns the text output of a PDF using the pdfminer library

  • /pdfminer_text/{doc}/{page_no} - returns the text output of the specified page_no of a PDF

  • /tika_text/{doc} - returns the text output of a PDF using the py-tika library

  • /pdfminer_xml/{doc} - returns XML output

  • /pdfminer_xml/{doc}/{page_no} - returns the XML output of the specified page_no of a PDF

  • /pdfminer_html/{doc} - returns HTML output

  • /pdfminer_html/{doc}/{page_no} - returns the HTML output of the specified page_no of a PDF

  • /pdfminer_html_char/{doc} - returns character-level HTML output

  • /pdfminer_html_char/{doc}/{page_no} - returns the character-level HTML output of the specified page_no of a PDF
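
The endpoints above can be called from a small client script. This is a hedged sketch using only the Python standard library. It assumes the server is running at http://localhost:8000 (matching the uvicorn command above) and that the endpoints respond to GET requests, which this README does not state. The helper names endpoint_url and fetch are hypothetical. Whether page_no starts at 0 or 1 is not documented here either, so the page number used below is only an example.

```python
import urllib.parse
import urllib.request

# Assumption: the server was started with the uvicorn command shown earlier.
BASE_URL = "http://localhost:8000"


def endpoint_url(endpoint, doc=None, page_no=None, base_url=BASE_URL):
    """Build a URL such as <base>/pdfminer_text/<doc>/<page_no>."""
    parts = [endpoint.strip("/")]
    if doc is not None:
        # Percent-encode the document name so it is safe as a path segment.
        parts.append(urllib.parse.quote(str(doc), safe=""))
        if page_no is not None:
            parts.append(str(int(page_no)))
    return base_url.rstrip("/") + "/" + "/".join(parts)


def fetch(url, timeout=30):
    """Send a GET request (assumed method) and return the body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Requires the API server to be running locally.
    print(fetch(endpoint_url("get_doc_list")))
    print(fetch(endpoint_url("pdfminer_text", "sample_doc_1", 1)))
```

Swapping "pdfminer_text" for "tika_text" in the last call gives a quick way to compare the two parsers on the same document.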

Text PDF (screenshots: get_doc_list, output, parse doc)

PDF with scanned image (screenshots: parse doc, output, get_doc_list)