KhadijaMahanga/textract

PDF Text Extraction Script

This is a simple script that demonstrates an approach to extracting text, images, and tables from an unstructured document (PDF).

The script reads a sample PDF file from the data folder and writes a text file to the data/processed folder.
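
The folder convention described above can be sketched roughly as follows. This is only an illustration, not the repository's actual code: the names `process_folder` and `extract_text` are hypothetical, the output naming (`<pdf name>.txt`) is an assumption, and the extraction step is passed in as a callable because this README does not state which PDF library the script uses.

```python
from pathlib import Path
from typing import Callable, List


def process_folder(data_dir: Path, extract_text: Callable[[Path], str]) -> List[Path]:
    """Write one .txt file into data_dir/processed for each PDF found in data_dir.

    `extract_text` is a stand-in for the real extraction logic (hypothetical):
    it takes the path of a PDF and returns the extracted text.
    """
    out_dir = data_dir / "processed"
    # Create the output folder if it does not exist yet.
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    # Only PDFs directly inside data_dir are considered (no recursion).
    for pdf_path in sorted(data_dir.glob("*.pdf")):
        text = extract_text(pdf_path)
        # Assumed naming scheme: sample.pdf -> processed/sample.txt
        out_path = out_dir / f"{pdf_path.stem}.txt"
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written
```

Calling `process_folder(Path("data"), my_extractor)` with any function that maps a PDF path to a string would then produce the text files under data/processed.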

This Medium article, [Unstructured PDF Text Extraction](https://medium.com/@khadijamahanga/unstructured-pdf-text-extraction-3a20db14791e), describes the problem and the approach taken in more detail.

Prerequisites

This is a Poetry project. To run it, you will need the following:

Python 3.8+
Poetry

Get Started

Clone the repository and navigate to the project root
Activate your Poetry shell by running poetry shell
Run poetry install to install the necessary packages listed in the pyproject.toml file
Run the script with poetry run extract