PDF Text Extraction Script
This is a simple script that demonstrates an approach to extracting text, images, and tables from an unstructured document (PDF).
The script reads a sample PDF file from the `/data` folder and writes a text file to the `data/processed` folder.
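The read-then-write flow described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: `process_folder` and `placeholder_extractor` are hypothetical names, and the extractor is a stand-in that does no real PDF parsing. The real script would plug in its own extraction logic at that point.

```python
from pathlib import Path
from typing import Callable, List


def process_folder(
    input_dir: Path,
    output_dir: Path,
    extract_text: Callable[[Path], str],
) -> List[Path]:
    """Write one .txt file into output_dir for every PDF found in input_dir."""
    # Create the output folder (e.g. data/processed) if it does not exist yet.
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        # Delegate the actual extraction to the supplied callable.
        text = extract_text(pdf_path)
        out_path = output_dir / f"{pdf_path.stem}.txt"
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written


def placeholder_extractor(pdf_path: Path) -> str:
    """Hypothetical stand-in: a real extractor would parse the PDF here."""
    return f"extracted text for {pdf_path.name}"
```

Calling `process_folder(Path("data"), Path("data/processed"), placeholder_extractor)` would then produce one text file per PDF. Passing the extractor in as a callable keeps the folder handling separate from whichever PDF library does the parsing.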
This Medium article, [Unstructured PDF Text Extraction](https://medium.com/@khadijamahanga/unstructured-pdf-text-extraction-3a20db14791e), explains the problem and the approach taken in more detail.
This is a Poetry project. To run it, you will need the following:
- Python 3.8+
- Poetry
- Clone the repository and navigate to the project root
- Activate your Poetry shell by running `poetry shell`
- Run `poetry install` to install the necessary packages listed in the `pyproject.toml` file
- Run the script with `poetry run extract`