
Utility app to extract data in tables from PDF documents using Python with Pandas and Camelot libraries

Python - PDF Table Extractor App

About

This small utility app was created to help with the tedious task of extracting data contained in tables of vendor PDF product data sheets.

I have used Tabula previously and highly recommend it, but I needed something I could customise more closely to my needs.


Technologies

Languages

  • Python3
    • Used to create the main application functionality

Libraries / Packages / Modules

  • Pandas

    • Pandas is a Python library for data analysis and manipulation. Camelot returns each extracted table as a Pandas DataFrame.
  • Camelot

    • Camelot is a Python library that can help you extract tables from PDFs.
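As a rough illustration of how the two libraries fit together, here is a minimal sketch that reads tables from a PDF with Camelot and writes each one to a CSV file via its Pandas DataFrame. It is not the code in app.py. The function names `extract_tables` and `save_tables`, the `table_<n>.csv` naming, and the default `lattice` flavor are assumptions made for this example.

```python
from pathlib import Path


def extract_tables(pdf_path, pages="1", flavor="lattice"):
    """Return the tables Camelot finds in pdf_path as a list of DataFrames."""
    # Imported here so this module can be loaded before Camelot is installed.
    import camelot

    # "lattice" suits tables drawn with ruling lines; "stream" suits tables
    # that are laid out with whitespace only.
    tables = camelot.read_pdf(str(pdf_path), pages=pages, flavor=flavor)
    return [table.df for table in tables]


def save_tables(frames, out_dir, stem="table"):
    """Write each DataFrame to out_dir as <stem>_<n>.csv and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, frame in enumerate(frames, start=1):
        path = out_dir / f"{stem}_{number}.csv"
        frame.to_csv(path, index=False)  # drop the numeric row index
        paths.append(path)
    return paths
```

With a placeholder file name, usage would look like `save_tables(extract_tables("datasheet.pdf", pages="all"), "output")`. Vendor data sheets vary a lot, so the `flavor` and `pages` values usually need adjusting per document.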

Tools


Deployment

The app was developed using VS Code, with Git for version control and GitHub hosting the repository. The steps below set it up locally:

Cloning python-pdf-table-extractor

Prerequisites

Ensure the following are installed locally on your computer:

Online Camelot Installation Guide

Cloning the GitHub repository

  • Navigate to the simonjvardy/python-pdf-table-extractor GitHub repository.
  • Click the Code button.
  • Copy the clone URL from the dropdown menu.
  • Using your favourite IDE, open your preferred terminal.
  • Navigate to your desired file location.

Copy the following command and run it in your terminal to clone python-pdf-table-extractor:

git clone https://github.com/simonjvardy/python-pdf-table-extractor.git

Creation of a Python Virtual Environment

Note: The process may be different depending upon your own OS - please follow this Python help guide to understand how to create a virtual environment.
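For reference, a virtual environment can also be created from Python itself using the standard library `venv` module, which works the same way on every OS. This is only a sketch: the helper name `create_virtualenv` and the `.venv` directory name are assumptions for this example, and activating the environment afterwards still depends on your OS and shell.

```python
import venv
from pathlib import Path


def create_virtualenv(env_dir=".venv", with_pip=True):
    """Create a virtual environment in env_dir and return its path."""
    env_path = Path(env_dir)
    # EnvBuilder is what `python -m venv` uses under the hood;
    # with_pip=True also bootstraps pip into the new environment.
    venv.EnvBuilder(with_pip=with_pip).create(env_path)
    return env_path
```

Calling `create_virtualenv()` from the project folder has the same effect as running `python -m venv .venv` in the terminal.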

Install the Python Libraries

Run the following command in your terminal window:

pip install -r requirements.txt

Run the application locally

  • TODO
python app.py

Acknowledgements