/imr-scrape

Python tool for scraping compliances from IMR PDFs

Primary LanguagePythonApache License 2.0Apache-2.0

imrscrape

This Python module allows scraping of IMR (Independent Monitoring Report) PDFs to extract CASA (Court Approved Settlement Agreement) paragraph compliance and page information into a tabular format.

imrscrape is available as an importable Python module and as a CLI tool.

Installation

clone this repo:

git clone https://github.com/apd-forward/imr-scrape

run setup.py

python setup.py

CLI usage

Example

imrscrape -i ./imr-8-final.pdf -o ./imr-8-data.csv

Available Commands

  • -i --input [filepath] (required)

    Takes the filepath to the PDF of the IMR to be scraped

  • -o --output [filepath] (required)

    Take the filepath to a csv for the results

  • -qa

    returns a QA/QC report of possible missing paragraphs to stdout

Development

This module is written using Python >3.7.0 syntax. Dependencies for development are managed with pipenv. Code is formatted with black.