Python - PDF Table Extractor App

About

This small utility app was created to help with the tedious task of extracting data from tables in vendor PDF product data sheets.

I have used Tabula previously and highly recommend it, but I needed something that I could customise a little more to my needs.

Technologies

Languages

Python3
- Used to create the main application functionality

Libraries / Packages / Modules

Pandas
- Pandas is a Python library for data analysis and manipulation.
Camelot
- Camelot is a Python library that can help you extract tables from PDFs.
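
As a rough illustration of how Camelot and Pandas fit together, the sketch below reads the tables on a page and saves each one as a CSV file. It is not the code from this repository: the function names `extract_tables` and `output_name`, the output naming scheme, and the file name `datasheet.pdf` are all assumptions made for the example. It relies only on `camelot.read_pdf` and the `df` attribute of each returned table.

```python
from pathlib import Path


def output_name(pdf_path, table_index):
    """Build a CSV file name such as 'datasheet_table_1.csv' (hypothetical naming scheme)."""
    return f"{Path(pdf_path).stem}_table_{table_index}.csv"


def extract_tables(pdf_path, pages="1", out_dir="."):
    """Save every table Camelot finds on the given pages as a CSV file."""
    import camelot  # imported here so output_name() works without Camelot installed

    tables = camelot.read_pdf(str(pdf_path), pages=pages)
    written = []
    for index, table in enumerate(tables, start=1):
        target = Path(out_dir) / output_name(pdf_path, index)
        # table.df is a pandas DataFrame, so any pandas clean-up could happen here
        table.df.to_csv(target, index=False, header=False)
        written.append(target)
    return written
```

Calling `extract_tables("datasheet.pdf", pages="1-3")` would, under these assumptions, write one CSV per detected table into the current directory and return the list of paths.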

Tools

VS Code
- Code Editor

Deployment

The application was developed in VS Code and pushed with Git to GitHub, which hosts the repository. I took the following steps to deploy it:

Cloning python

Prerequisites

Ensure the following are installed locally on your computer:

Python 3.6 or higher
PIP3 Python package installer
Git Version Control
Ghostscript (an interpreter for the PostScript® language and PDF files)

Online Camelot Installation Guide

Camelot Documentation
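
The prerequisites above can be checked with a short script before going any further. This is a minimal sketch rather than part of the project: the function name `check_prerequisites` and the `TOOLS` mapping are made up for the example, and it only checks that the executables are on your `PATH`, not that they work.

```python
import shutil
import sys

# Ghostscript's executable is "gs" on Linux/macOS and "gswin64c" or "gswin32c" on Windows
TOOLS = {
    "pip3": ["pip3"],
    "git": ["git"],
    "ghostscript": ["gs", "gswin64c", "gswin32c"],
}


def check_prerequisites(min_python=(3, 6)):
    """Return a dict mapping each prerequisite to True (found) or False (missing)."""
    results = {"python": sys.version_info >= min_python}
    for name, candidates in TOOLS.items():
        # A tool counts as found if any of its candidate executables is on PATH
        results[name] = any(shutil.which(exe) for exe in candidates)
    return results


for name, found in check_prerequisites().items():
    print(f"{name}: {'OK' if found else 'missing'}")
```

Anything reported as missing should be installed before continuing with the steps below.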

Cloning the GitHub repository

Navigate to the simonjvardy/python-pdf-table-extractor GitHub repository.
Click the Code button.
Copy the clone URL from the dropdown menu.
Open your preferred terminal in your favourite IDE.
Navigate to your desired file location.

Copy the following command and run it in your terminal to clone the repository:

git clone https://github.com/simonjvardy/python-pdf-table-extractor.git

Creation of a Python Virtual Environment

Note: The process differs depending on your OS. Please follow this Python help guide to learn how to create a virtual environment.
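
One OS-independent option is Python's built-in `venv` module, which is what `python3 -m venv .venv` uses under the hood. The sketch below is only an illustration: the function name `create_virtual_env` and the directory name `.venv` are assumptions, not something this project prescribes.

```python
import venv
from pathlib import Path


def create_virtual_env(env_dir=".venv", with_pip=True):
    """Create a virtual environment in env_dir and return its path."""
    env_path = Path(env_dir)
    # with_pip=True also installs pip into the new environment
    venv.create(env_path, with_pip=with_pip)
    return env_path
```

After calling `create_virtual_env()`, the environment still has to be activated in your terminal, and that step is OS-specific: typically `source .venv/bin/activate` on Linux and macOS, or `.venv\Scripts\activate` on Windows.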

Install the Python Libraries

Run the following command in your terminal window:

pip install -r requirements.txt

Run the application locally

TODO

python app.py

Acknowledgements

freeCodeCamp YouTube video Python automation tutorial.