oa_post

post processing of open address data

Primary Language: Python

this repo contains example code to post-process Open Address US statewide address CSV files into a more usable state. the primary operations on these files are:

  • remove any null street number and street name rows
  • populate null ZIP code rows with ZIP codes from US Census ZIP Code Tabulation Areas (ZCTAs)
  • populate null city rows with place names from US census place name areas
  • populate standard state codes
  • pass over any malformed CSV files (only a very small number found so far)
  • push out a new csv file
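
the non-spatial steps above can be sketched in plain Python. this is only an illustration, not the code in this repo: the column names (NUMBER, STREET, REGION) assume the usual Open Address CSV layout, the function names are hypothetical, and a row is dropped when either the street number or the street name is empty, which is one reading of the first rule. the ZIP code and city lookups need the census geometries and are left out.

```python
import csv

# Assumed column names, following the usual Open Address CSV layout.
REQUIRED = ("NUMBER", "STREET")


def clean_rows(reader):
    """Yield rows that have both a street number and a street name."""
    for row in reader:
        if None in row:
            # More values than header columns: treat the row as malformed.
            continue
        if all((row.get(col) or "").strip() for col in REQUIRED):
            yield row


def postprocess(src, dst, state_code):
    """Copy src to dst, dropping null-address rows and setting the state code.

    Returns the number of rows written, or None when the file does not have
    the expected header and is passed over.
    """
    reader = csv.DictReader(src)
    fields = list(reader.fieldnames or [])
    if any(col not in fields for col in REQUIRED):
        return None
    if "REGION" not in fields:
        fields.append("REGION")
    writer = csv.DictWriter(dst, fieldnames=fields)
    writer.writeheader()
    written = 0
    for row in clean_rows(reader):
        row["REGION"] = state_code  # standard two-letter state code
        writer.writerow(row)
        written += 1
    return written
```

both arguments are open text file objects, so a call would look like postprocess(open("in.csv", newline=""), open("out.csv", "w", newline=""), "DC"), where the file names are placeholders.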

Dependencies (software)

  • PostGIS
  • Psycopg2

Dependencies (data)

Example output

Issues

  • the code is single-threaded and quite slow, given that it touches every row in large-ish tables several independent times
  • the data is no longer pure, in the sense that rows have been updated from external sources. ZCTAs are not exact, but they are an excellent open data surrogate.
  • no row-level metadata is recorded for the post-processing yet, but it could easily be enhanced that way
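
one possible shape for row-level metadata, as a rough sketch only: each time a null field is filled, the source of the value is noted in an extra column. the POSTPROCESS column name, the fill_if_missing helper and the source labels are all hypothetical and do not exist in this repo.

```python
def fill_if_missing(row, field, value, source, meta_field="POSTPROCESS"):
    """Fill row[field] when it is empty and note where the value came from.

    row is a dict for one CSV row. meta_field is a hypothetical extra column
    that collects notes such as "POSTCODE=zcta", separated by ";".
    """
    if (row.get(field) or "").strip() or not value:
        return row  # the original value wins, or there is nothing to fill in
    row[field] = value
    notes = [n for n in (row.get(meta_field) or "").split(";") if n]
    notes.append(f"{field}={source}")
    row[meta_field] = ";".join(notes)
    return row
```

with something like this, a consumer of the output could tell original values from filled-in ones by checking whether the extra column is empty.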

Future Enhancements

  • it would be worthwhile to test the approach without PostGIS as a dependency, using the shapely and fiona libraries instead. the hypothesis is that the most expensive part of this code is currently the time to update every row with a point geometry for each address, and that shapely would handle this faster.
  • it would be fun to test this as a multi-threaded functional programming exercise to increase processing speed
  • it would be a good idea to set this up as a cron job, and incorporate some hashing or similar check to see whether anything has changed, so the entire dataset is not reprocessed every time
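
the change detection idea in the last item could look roughly like the sketch below, using only the standard library. it assumes one input CSV per state and a small JSON manifest of previously seen digests; the manifest file and both function names are hypothetical, nothing like this exists in the repo yet.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_files(paths, manifest_path):
    """Return the paths whose content differs from the last recorded run.

    manifest_path is a hypothetical JSON file mapping path -> digest. It is
    rewritten on every call with the digests seen this time.
    """
    manifest_path = Path(manifest_path)
    seen = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    changed = []
    for path in paths:
        digest = file_digest(path)
        if seen.get(str(path)) != digest:
            changed.append(path)
            seen[str(path)] = digest
    manifest_path.write_text(json.dumps(seen, indent=2))
    return changed
```

a scheduled job would then post-process only the files returned by changed_files. note that this sketch records the new digest before any processing happens; a real job would probably want to update the manifest only after a file has been processed successfully.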