SimpleLab-Inc/wsb

META: modularize wsb labeled data downloader and transformer

richpauloo opened this issue · 3 comments

Presently, download_wsb_labeled.R and transform_wsb_labeled.R download and transform 11 state geometries from a GitHub repo that hosts cleaned geojson polygons for each "labeled" state. This work is affiliated with NIEPS and IoW.

This issue replaces the single current downloader with 11 modular downloaders (one per state) that depend only on the upstream URLs, which can be found by browsing the code here: https://github.com/NIEPS-Water-Program/water-affordability/blob/main/rcode/access2_utility_data.R

The same is done for transformers.
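As a rough illustration of what one modular downloader could look like, here is a minimal sketch in base R. The function name `download_wsb_state`, the `data/boundary/{state}` output layout, and the example URL are all assumptions for illustration and are not taken from the repo. The `fetch` argument defaults to `utils::download.file` and exists only so the function can be exercised without network access.

```r
# Hypothetical per-state downloader (names and paths are assumptions).
# Each state gets its own folder, and the only dependency is the upstream URL.
download_wsb_state <- function(state, url, dir_out = "data/boundary",
                               fetch = utils::download.file) {
  state <- tolower(state)
  dest_dir <- file.path(dir_out, state)
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)

  # Keep the upstream file name so the raw download stays recognizable.
  dest <- file.path(dest_dir, basename(url))

  # mode = "wb" avoids corrupting binary payloads (e.g. zip archives) on Windows.
  fetch(url, destfile = dest, mode = "wb")
  dest
}

# Usage (placeholder URL, not a real upstream source):
# download_wsb_state("nc", "https://example.com/nc_water_systems.geojson")
```

With one such call per state, a failed upstream URL only affects that state's downloader, which matches the fallback described below.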

When data cannot be recovered from an upstream URL, we can restrict the current downloader, which targets all geojson files in the repo, to only the states that could not be recovered, and focus on communication and outreach that helps finish those states.

States to address (at time of writing):

  • ca
  • ct
  • ks
  • nc
  • nj
  • nm
  • or
  • ok
  • pa
  • tx
  • wa
  • mo

More tasks:

  • Write a transformer that stitches together the transformed geojsons into "wsb_labeled.geojson"
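
The stitching step could be sketched roughly as follows. This is a base R illustration in which plain data frames stand in for the transformed per-state geojsons. Reading and writing the actual files is omitted, and the function name `stitch_states`, the three-column schema, and the sample rows are all made up for the example.

```r
# Hypothetical stitcher: check each state's table against a shared schema,
# keep only the schema columns (in schema order), then row-bind everything.
stitch_states <- function(tables, schema) {
  checked <- lapply(names(tables), function(st) {
    missing <- setdiff(schema, names(tables[[st]]))
    if (length(missing) > 0) {
      stop(st, " is missing columns: ", paste(missing, collapse = ", "))
    }
    tables[[st]][, schema, drop = FALSE]
  })
  out <- do.call(rbind, checked)
  rownames(out) <- NULL
  out
}

# Placeholder inputs (pwsid values are invented, not real systems).
schema <- c("pwsid", "state", "county")
tables <- list(
  nc = data.frame(pwsid = "NC0000001", state = "NC", county = "Wake", extra = 1),
  mo = data.frame(pwsid = "MO0000001", state = "MO", county = "Boone")
)
wsb_labeled <- stitch_states(tables, schema)
```

Failing loudly on a missing column, rather than silently filling it, is a design choice here: it surfaces schema drift in an individual state transformer before the combined file is written.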

@richpauloo this is getting more pressing as we get data for additional states; e.g. pushing NC and MO through the current transformer may be more work than it's worth. Lmk if you have 5 min to strategize.

Update 2: The work below for NC was done with data from this link, which only contains water source points, instead of this link, which contains water systems. Discount the columns selected for NC for now, but my general question about checking the specific column names for the schema still stands.

Update: I made a preliminary transformer for NC, choosing these columns for the schema: pwsid, ws_name (raw system_nam), state, county (raw sys_county), city (raw loc_city), source (raw src_name), owner (raw owner), st_areashape, centroid, area_hull, radius, geometry


Schema followup questions, starting with the transformer for NC:

The columns kept in transform_wsb_labeled.R are:
"pwsid", "gis_name", "population", "connections", "state", "county", "source", "st_areashape", "owner", "centroid_x", "centroid_y", "area_hull", "radius", "geometry"

"gis_name", "population", "connections", "source", and "owner" are columns from the California WSB data. From our conversation yesterday, I gathered that we should not keep "population", and we should add in a "city" column. For all states' transformers, should we keep any columns with the same meaning as "gis_name", "source", and "owner"? The "connections" column is specific to CA, and for a state like NC, there are four columns with different types of connections. More generally, could we make a list of desired columns and their descriptions to keep in all transformers?

[ outdated ] Here's the metadata for NC. There are about 40 columns total, and for each column we might want to keep in the schema, there are several potential matches. For example, there's "loc_city", which seems to list the water department or treatment facility, and "ownr_city". There's also both "src_name" and "system_nam", and many location columns (description, address, zip codes) for the source, system, and owner.
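
One way to make the "list of desired columns" concrete is a shared schema vector plus a small per-state map from schema names to raw column names. The sketch below is only an illustration of that idea in base R: the schema is assembled from column names mentioned in this thread, not an agreed list, and the NC mapping reuses the provisional (since discounted) choices above. `apply_schema` and the sample row are hypothetical.

```r
# Illustrative schema only; the final list of columns is still an open question.
wsb_schema <- c("pwsid", "ws_name", "state", "county", "city", "source", "owner")

# Per-state map: schema name = raw column name. NC entries are provisional.
nc_map <- c(ws_name = "system_nam", county = "sys_county",
            city = "loc_city", source = "src_name", owner = "owner")

# Build a table with exactly the schema columns, in schema order.
# Columns with no raw match are filled with NA rather than dropped.
apply_schema <- function(raw, col_map, schema, state) {
  out <- data.frame(row.names = seq_len(nrow(raw)))
  for (col in schema) {
    raw_col <- if (col %in% names(col_map)) col_map[[col]] else col
    out[[col]] <- if (raw_col %in% names(raw)) raw[[raw_col]] else rep(NA, nrow(raw))
  }
  out$state <- rep(state, nrow(raw))
  out
}

# Made-up row for demonstration (not real NC data).
raw_nc <- data.frame(pwsid = "NC0000001", system_nam = "Example Water",
                     sys_county = "Wake", loc_city = "Raleigh",
                     src_name = "Well 1", owner = "Town")
nc_clean <- apply_schema(raw_nc, nc_map, wsb_schema, "NC")
```

Keeping the map as data means that resolving an ambiguity such as "loc_city" versus "ownr_city" is a one-line change per state rather than an edit to transformer logic.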

Completed with a series of issues
Search "transformer/{state}" or "downloader/{state}" for PR record.