address_extractor

A script to extract US-style street addresses from a text file.

Primary language: Python. License: MIT.

$ address_extractor
1600 Pennsylvania Ave NW, Washington, DC 20500 ^D
1 lines in input
,1600 Pennsylvania Ave NW,Washington DC 20500
$ address_extractor -o output.csv input.csv
4361 lines in input
*snip*
11 lines unable to be parsed
$ ls
output.csv

address_extractor reads address-like text from standard input or a file, one address per line, and parses each line into a uniform format with the usaddress package.

Installation

This package is available from PyPI via pip:

pip install address_extractor

This installs the Python module and a command-line script, both named address_extractor.

Command-line Usage

address_extractor [-h] [-o OUTPUT] [--remove-post-zip] [input]

positional arguments:
  input                 the input file. Defaults to stdin.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        the output file. Defaults to stdout.
  --remove-post-zip, -r
                        when scanning the input lines, remove everything after
                        a sequence of 5 digits followed by a comma. The
                        parsing library used by this script chokes on
                        addresses containing this kind of information, often a
                        county name.
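
To make the --remove-post-zip description concrete, here is a minimal sketch of that kind of cleanup. It is not taken from the script's source: the function name strip_post_zip and the exact regular expression are assumptions for illustration, and the real implementation may differ.

```python
import re

# Five digits that are not part of a longer digit run, immediately
# followed by a comma, then everything up to the end of the line.
# (Hypothetical pattern; the script's own pattern may differ.)
_ZIP_THEN_COMMA = re.compile(r"(?<!\d)(\d{5}),.*")


def strip_post_zip(line):
    """Drop everything after a 5-digit sequence that is followed by a comma."""
    # Keep the five digits (group 1) and discard the comma and the rest.
    return _ZIP_THEN_COMMA.sub(r"\1", line)


if __name__ == "__main__":
    raw = "1600 Pennsylvania Ave NW, Washington, DC 20500, District of Columbia"
    print(strip_post_zip(raw))
```

A line with nothing after the ZIP code is returned unchanged, so a filter like this should be safe to apply to every input line.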

Lines that could not be parsed are printed to stderr. They can be saved to a file with standard shell redirection:

$ address_extractor -o good_addresses.csv has_some_bad_addresses.txt 2> bad_addresses.txt

Usage as a Module

address_extractor can be used as a Python module:

>>> import address_extractor
>>> address_extractor.main(input=input_file_object, output=output_file_object, remove_post_zip=a_bool)

There are some small issues with this implementation:

  • If sys.stdin or sys.stdout is passed as input or output, respectively, the file object is still closed when main returns, so it cannot be used again later in the same process.
  • Lines that fail to parse are still printed to sys.stderr, which may not be expected.
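
One possible way to work around both issues from calling code is sketched below. It is not part of the package: NonClosing and run_extractor are hypothetical helpers. The sketch assumes that main accepts the keyword arguments shown above, and that it writes errors to sys.stderr looked up at call time, so that contextlib.redirect_stderr can capture them.

```python
import contextlib
import io


class NonClosing:
    """Proxy a file object, but turn close() into a flush."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def close(self):
        # Flush instead of closing so the underlying stream stays usable.
        self._wrapped.flush()

    def __getattr__(self, name):
        # Delegate everything else (read, write, readline, ...) to the real file.
        return getattr(self._wrapped, name)

    def __iter__(self):
        return iter(self._wrapped)

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()


def run_extractor(extract, input_file, output_file, remove_post_zip=False):
    """Call `extract` with protected streams; return the lines it sent to stderr."""
    captured = io.StringIO()
    with contextlib.redirect_stderr(captured):
        extract(
            input=NonClosing(input_file),
            output=NonClosing(output_file),
            remove_post_zip=remove_post_zip,
        )
    return captured.getvalue().splitlines()
```

Under those assumptions, a call such as run_extractor(address_extractor.main, sys.stdin, sys.stdout) would leave both standard streams open and return the unparsed lines as a list instead of printing them.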

Versions and Stability

This package is uploaded as a 0.1.0 release. There are no tests and little error checking. It originated as a quick-and-dirty script, and I decided to release it as a package to gain familiarity with that process.

Issues, comments, and pull requests are welcome at the GitHub page!