/textract

PDF Text Extraction Script

Primary Language: Python

textract

This is a simple script that demonstrates an approach to extracting text, images, and tables from unstructured documents (PDFs).

The script reads a sample PDF file from the /data folder and writes a text file to the data/processed folder.
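The read-then-write flow described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names `output_path_for` and `process_folder` are hypothetical, and the `extract_text` callable stands in for whatever PDF library the script really uses, which this README does not name.

```python
from pathlib import Path
from typing import Callable, List


def output_path_for(pdf_path: Path, processed_dir: Path) -> Path:
    """Map e.g. data/sample.pdf to data/processed/sample.txt (hypothetical naming)."""
    return processed_dir / (pdf_path.stem + ".txt")


def process_folder(data_dir: Path, extract_text: Callable[[Path], str]) -> List[Path]:
    """Run extract_text on every *.pdf in data_dir and save the results.

    extract_text is an assumption: any function that takes a PDF path and
    returns its text. The real extraction logic is not shown here.
    """
    processed_dir = data_dir / "processed"
    # Create data/processed if it does not exist yet.
    processed_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for pdf_path in sorted(data_dir.glob("*.pdf")):
        out_path = output_path_for(pdf_path, processed_dir)
        out_path.write_text(extract_text(pdf_path), encoding="utf-8")
        written.append(out_path)
    return written
```

Calling `process_folder(Path("data"), my_extractor)` with a real extractor would leave one .txt file per input PDF under data/processed. Passing the extractor in as a parameter keeps the file handling independent of the PDF library chosen.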

This Medium article, Unstructured PDF Text Extraction (https://medium.com/@khadijamahanga/unstructured-pdf-text-extraction-3a20db14791e), explains the problem and the approach taken in more detail.

Prerequisites

This is a Poetry project. To run it, you will need the following:

  • Python 3.8+
  • Poetry

Get Started

  • Clone the repository and navigate to the project root
  • Activate your Poetry shell by running poetry shell
  • Run poetry install to install the necessary packages listed in the pyproject.toml file
  • Run the script with poetry run extract