post-processing of OpenAddresses data
this repo is example code for post-processing OpenAddresses US statewide address csv files into a more usable state. the primary operations on these files are:
- remove any null street number and street name rows
- populate null ZIP code rows with ZIP codes from US Census ZIP Code Tabulation Areas (ZCTAs)
- populate null city rows with place names from US Census place name areas
- populate standard state codes
- skip any malformed csv files (very few found so far)
- push out a new csv file
- PostGIS
- Psycopg2
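The row-level cleanup steps listed above can be sketched in plain Python. This is only an illustration, not the repo's actual code: the column names `NUMBER`, `STREET` and `REGION` are assumed to match the OpenAddresses csv header, the helper name `clean_rows` is hypothetical, and taking the state code from the caller (for example from the state folder name) is an assumption.

```python
import csv
import io

# Assumed OpenAddresses column names; adjust them to match the actual files.
REQUIRED = ("NUMBER", "STREET")


def clean_rows(lines, state_code):
    """Yield cleaned rows from an iterable of csv lines.

    Rows with an empty street number or street name are dropped, and the
    REGION column is overwritten with the upper-cased state code.
    Iterating raises ValueError when the header lacks the expected columns.
    """
    reader = csv.DictReader(lines)
    fields = reader.fieldnames or []
    if any(col not in fields for col in REQUIRED + ("REGION",)):
        raise ValueError("unexpected header: %r" % (fields,))
    for row in reader:
        # Short rows leave missing fields as None, so normalise before testing.
        if any(not (row[col] or "").strip() for col in REQUIRED):
            continue
        row["REGION"] = state_code.upper()
        yield row


sample = io.StringIO(
    "NUMBER,STREET,CITY,REGION,POSTCODE\n"
    "12,Main St,,,\n"
    ",Oak Ave,Juneau,ak,99801\n"
)
kept = list(clean_rows(sample, "ak"))
```

A caller looping over state files could catch `ValueError` (and `csv.Error`) around this call and move on to the next file, which is one way to skip the occasional malformed csv.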
the statewide csv downloads from OpenAddresses, found under the U.S. addresses (list of states by region). NOTE: this script requires all state subfolders to sit in one folder, e.g. us=>al, us=>ak ...
the nationwide ZIP Code Tabulation Areas, loaded as a table in postgis from the US Census ZCTA TIGER shapefiles.
the individual state place shapefiles, loaded as one table per state in postgis from the US Census Place TIGER shapefiles.
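With the ZCTA polygons in postgis, filling missing ZIP codes comes down to a point-in-polygon update. The sketch below shows one possible shape for it and is not taken from the repo: the table name `zcta`, the columns `postcode`, `geom` and `zcta5ce10`, and both function names are assumptions. It also assumes each address row already has a point geometry in the same spatial reference as the ZCTA table.

```python
import re

# Table names are interpolated into the SQL, so only plain identifiers pass.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def zip_fill_sql(address_table, zcta_table="zcta"):
    """Build an UPDATE that copies the ZCTA code onto addresses lacking a ZIP."""
    for name in (address_table, zcta_table):
        if not _IDENT.fullmatch(name):
            raise ValueError("unsafe table name: %r" % name)
    return (
        "UPDATE {addr} AS a "
        "SET postcode = z.zcta5ce10 "
        "FROM {zcta} AS z "
        "WHERE (a.postcode IS NULL OR a.postcode = '') "
        "AND ST_Contains(z.geom, a.geom)"
    ).format(addr=address_table, zcta=zcta_table)


def fill_zip_codes(conn, address_table):
    """Run the update on an open psycopg2 connection; return the rows changed."""
    with conn.cursor() as cur:
        cur.execute(zip_fill_sql(address_table))
        changed = cur.rowcount
    conn.commit()
    return changed
```

The same pattern would apply to the city fill, swapping the ZCTA table for the state's place table and the ZIP column for the place name. A spatial index on both geometry columns is likely to matter a great deal for run time.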
- the code is single-threaded and quite slow, given that it touches every row in large-ish tables several independent times
- the data is no longer pure, in the sense that rows have been updated from external sources. ZCTAs do not match ZIP code delivery areas exactly, but they are an excellent open data surrogate.
- no row-level metadata about the post-processing is recorded yet, but it could easily be added
- it would be worthwhile to test this without postgis as a dependency, using the shapely and fiona libraries instead. the hypothesis is that the most expensive step is currently updating every row with a point geometry for each address, and that shapely would handle this faster.
- it would be fun to test this as a multi-threaded functional programming exercise to increase processing speed
- it would be a good idea to set this up as a cron job, and to incorporate some hashing or similar check to detect whether the source data has changed, so the entire dataset is not reprocessed needlessly
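As a rough illustration of the hashing idea, a scheduled run could keep a small manifest of file digests and only reprocess the state files whose contents changed. Nothing here exists in the repo yet: the function names, the JSON manifest format and the choice of SHA-256 are all assumptions.

```python
import hashlib
import json
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(paths, manifest_path):
    """Return the paths whose contents differ from the last recorded run.

    The manifest is a JSON file mapping path -> digest; it is rewritten
    with the current digests before returning.
    """
    previous = {}
    if os.path.exists(manifest_path):
        with open(manifest_path) as handle:
            previous = json.load(handle)
    current = {path: file_digest(path) for path in paths}
    with open(manifest_path, "w") as handle:
        json.dump(current, handle, indent=2)
    return [path for path in paths if previous.get(path) != current[path]]
```

On a first run every file counts as changed, since there is no manifest yet. A more careful job would probably write the manifest only after processing succeeds, so that a failed run is retried.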