/gazettes-liquor-extractor

A script that scrapes liquor licenses from government gazettes


Gazette Liquor Licensees

The first draft of this script was banged together out of curiosity at a Codebridge community evening. In its current state, it should be considered a very rough skeleton, to be built upon or abandoned.

The purpose of the exercise was to determine whether a data processing pipeline can be established to extract liquor licensees from the gazettes.

TL;DR

A non-comprehensive exploration of the extraction task shows that most of the data can be extracted successfully, but with occasional hiccups due to the unreliable nature of the input (PDF files converted to text).

  • If a fail-safe extraction is needed, this method is not recommended. In some places, the extraction yields "doubles": matches containing two entries that were not separated. Because of the columnar nature of the data, these doubles are less reliable than the simple matches. If only the more reliable results are wanted, it is possible to interrupt the extraction after the "simple matches" phase.
  • Instances such as the "(1)" in "(4) Tsakani Tavern: Section 41(1)(A)(C)" throw the script off its game.
  • If something is better than nothing, this script can be extended, for instance by storing the data in a usable format.
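
To make the last two points concrete, here is a minimal sketch, not the script's actual code. It assumes (optimistically) that each entry sits on its own line, which the columnar input does not guarantee. The sample text, the pattern and the function names parse_entries and to_csv are all hypothetical. It uses the standard re module so that it runs without extra installs; the same pattern should also work with regex.

```python
import csv
import io
import re

# Hypothetical sample resembling converted gazette text (not from a real gazette).
# Only the "Tsakani Tavern" line comes from this README.
SAMPLE = """(3) Example Bottle Store: Section 41(1)(B)
(4) Tsakani Tavern: Section 41(1)(A)(C)
"""

# A naive pattern treats every parenthesised number as an entry number,
# so the "(1)" inside "Section 41(1)..." is picked up as well.
NAIVE = re.compile(r"\((\d+)\)")

# Anchoring the entry number to the start of a line avoids that confusion,
# under the assumption that each entry starts its own line.
ENTRY = re.compile(
    r"^\((?P<number>\d+)\)\s+(?P<name>[^:\n]+):\s+(?P<section>Section\s+\S+)\s*$",
    re.MULTILINE,
)


def parse_entries(text):
    """Return one dict per entry found at the start of a line."""
    return [m.groupdict() for m in ENTRY.finditer(text)]


def to_csv(rows):
    """Serialise the parsed entries as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["number", "name", "section"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(NAIVE.findall(SAMPLE))  # includes the stray "1" from "41(1)"
    print(to_csv(parse_entries(SAMPLE)))
```

CSV is only one possible "usable format"; the same list of dicts could just as well be written out as JSON or loaded into a database.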

Usage

  • Add text files to the files folder
  • For the moment, add them manually to the files list in config.py
  • Run main.py
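
The manual step could be avoided by building the list from the contents of the folder. The following is a sketch under assumptions: it supposes that config.py exposes a plain Python list named files and that the input files end in .txt. The helper discover_text_files is hypothetical and not part of the script.

```python
from pathlib import Path


def discover_text_files(folder="files"):
    """Return the sorted paths of the .txt files directly inside `folder`."""
    return sorted(str(path) for path in Path(folder).glob("*.txt"))


# In config.py, this could stand in for the hand-maintained list
# (assuming main.py expects a list of paths; adjust if it expects bare names).
files = discover_text_files()
```

Sorting keeps the processing order stable from one run to the next.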

Converting before processing

In its present state, the script assumes that the PDF files have been converted to text using Adobe Acrobat XI. Different converters generate different output, so a text file generated by a different converter will break the script.

Several conversion options exist:

  • In theory, PDF files could be converted in batch using the API from Solid Documents, who provide the conversion within Acrobat. However, this is a .NET API.
  • To keep the same format, another pathway would be to use a batch conversion action within Acrobat XI.
  • Finally, another converter can be used, in which case the regular expressions will need to be rewritten.
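
As an illustration of the third option only, here is a sketch of a batch conversion using the pdftotext command-line tool from Poppler. This is not the converter the script was written for, so the regular expressions would need to be rewritten for its output. The function names and the folder layout are assumptions.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path, txt_path):
    """Build the pdftotext call; -layout asks it to keep the physical page layout."""
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def convert_folder(pdf_folder, txt_folder="files"):
    """Convert every PDF in `pdf_folder` to a text file in `txt_folder`."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on the PATH")
    out_dir = Path(txt_folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(Path(pdf_folder).glob("*.pdf")):
        target = out_dir / (pdf.stem + ".txt")
        # check=True raises CalledProcessError if the conversion fails
        subprocess.run(build_command(pdf, target), check=True)
        written.append(target)
    return written
```

Whether -layout or the default reading-order mode copes better with the columnar gazette pages would have to be tested.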

Regular expressions module: regex vs. re

The script uses the regex module in preference to re because the standard re is sub-standard in comparison to regular expression engines available in other major languages. C via PCRE, .NET languages, Java, Ruby and Perl all have robust regular expression flavors.

The regex module, by contrast, is one of the best-rounded regular expression engines around.
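
One concrete difference, shown as a small sketch: the standard re module only accepts fixed-width lookbehind, whereas regex accepts variable-width lookbehind. The pattern below is an illustration and is not taken from the script. The regex part only runs if the module is installed.

```python
import re

TEXT = "Section 41(1)(A)(C)"
PATTERN = r"(?<=Section\s+)\d+"  # lookbehind of variable width because of \s+

# The standard library rejects this pattern at compile time.
try:
    re.compile(PATTERN)
    stdlib_accepts = True
except re.error:
    stdlib_accepts = False

print("re accepts variable-width lookbehind:", stdlib_accepts)

# The third-party regex module (pip install regex) compiles the same pattern.
try:
    import regex
except ImportError:
    regex = None

if regex is not None:
    match = regex.search(PATTERN, TEXT)
    print("regex matched:", match.group(0))
```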

Resources