SimpleLab-Inc/wsb

Joins & matching playbook

jess-goddard opened this issue · 10 comments

Objective: A clear table that enables us to do EDA on the data, as well as outputs for fuzzy joins for manual review

Table deliverable format: @ryanshepherd and I discussed that for the EDA report, a final table like the one below would be most useful (@richpauloo, feel free to make suggestions, but this is what we're targeting):

| pwsid | pws_name | wsb | tigris_match | frs_match | mhp_match | state_code | county_served | city_served | population_served | connections | primacy_agency_code | service_area_type_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CA0985746 | Johnny Depp Water System | 1 | 1 | 1 | 0 | CA | Santa Barbara | Montecito | 1 | 1 | State | RS |

There may be other columns useful for feature engineering that we should include for the missingness analysis; @ryanshepherd, I'd err on the side of being over-inclusive with extra columns for now.

The order of join operations is as follows (a rough sketch of the first few steps follows the list):

1. Filter water_system to active (pws_activity_code == A, N, or P) community water systems (pws_type_code == CWS). Join the ECHO point geometry table (more comprehensive than FRS as far as lat/longs go, though some points are geocoded only to the county or state) to the EPA SDWIS pwsid in water_system (this could also use the water_system_facility table, but for EDA we should stick to pwsid-level analysis).

   1a. (Maybe later, not necessarily at this point) By spatial intersection (T/F), join the point geometry from (1) to the TIGRIS place polygon geometries (which have no pwsid).

2. Join two other tables to (1) for supplementary details: the service_area table on pwsid to get service_area_type_code, which can indicate municipal or mobile home park; and geographic_area to (1) on pwsid to get city_served (note these join on pwsid, not facility_id, so the result is a bit of a messy data frame).

3. Use regex/fuzzy joins on city_served and pws_name to match NAME from the TIGRIS place polygon geometries, so that between (1) and (3) each FRS CWS has a TIGRIS geo_id.

   3a. Output fuzzy joins with less than a 100% match for manual review.

4. Use regex/fuzzy joins on pws_name, and/or rows where service_area_type_code = MH or MP, to match mobile home park mhp_name and assign MHP centroids.

   4a. Output fuzzy joins with less than a 100% match for manual review.

5. Join (4) to the labeled wsb geojson polygons by pwsid.

6. (Potentially; might not be needed for this report) Spatially join the remaining point geometries from (5) that lack a labeled system to the labeled wsb geojson polygons.
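
A minimal pandas/geopandas sketch of steps 1–3a, assuming file paths, layer names, and the rapidfuzz library for fuzzy matching (all of which are illustrative, not the actual transformer code):

```python
import geopandas as gpd
import pandas as pd
from rapidfuzz import fuzz, process  # one option for fuzzy string matching

# Step 1: filter SDWIS water_system to active community water systems.
ws = pd.read_csv("data/sdwis_water_system.csv")  # path is an assumption
ws = ws[ws["pws_activity_code"].isin(["A", "N", "P"]) & (ws["pws_type_code"] == "CWS")]

# Join ECHO point geometries to SDWIS on pwsid.
echo = gpd.read_file("data/echo_points.gpkg")  # assumed layer: pwsid + point geometry
base = echo.merge(ws, on="pwsid", how="right")

# Step 2: supplementary details, both keyed on pwsid (not facility_id).
service = pd.read_csv("data/sdwis_service_area.csv")[["pwsid", "service_area_type_code"]]
geo_area = pd.read_csv("data/sdwis_geographic_area.csv")[["pwsid", "city_served"]]
base = base.merge(service, on="pwsid", how="left").merge(geo_area, on="pwsid", how="left")

# Step 3: fuzzy match city_served (or pws_name) against TIGRIS place NAME.
places = gpd.read_file("data/tigris_places.gpkg")  # assumed: NAME, GEOID, polygons
choices = places["NAME"].str.upper().tolist()

def best_place_match(name):
    """Return (matched NAME, score 0-100) for a candidate string, or (None, 0)."""
    if pd.isna(name):
        return None, 0
    match = process.extractOne(name.upper(), choices, scorer=fuzz.token_sort_ratio)
    return (match[0], match[1]) if match else (None, 0)

base[["tigris_name", "tigris_score"]] = base["city_served"].apply(
    lambda x: pd.Series(best_place_match(x))
)

# Step 3a: anything below a perfect score goes to a review file.
base[base["tigris_score"] < 100].to_csv("review/tigris_fuzzy_matches.csv", index=False)
```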

Future ideas for when we do spatial matching:
Additionally, we will need a heuristic to handle cases where the spatial intersection of a point falls within more than one polygon geometry. One idea is to output a review table for all "multi-intersections" and assign them manually; another is to try to automate it with a scoring system: one point for the spatial intersection, another point for a fuzzy name match (a rough sketch of this follows).
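
One possible shape for that scoring heuristic, as a sketch (column names and the 90-point fuzzy threshold are assumptions):

```python
from rapidfuzz import fuzz

def score_candidates(point_row, candidate_polys, name_col="NAME", threshold=90):
    """Score each polygon a point falls in: +1 for the spatial intersection itself,
    +1 more for a strong fuzzy name match. Highest score wins; ties go to manual review."""
    scores = []
    for _, poly in candidate_polys.iterrows():
        score = 1  # point falls within this polygon
        name_score = fuzz.token_sort_ratio(
            str(point_row["pws_name"]).upper(), str(poly[name_col]).upper()
        )
        if name_score >= threshold:
            score += 1
        scores.append((poly[name_col], score, name_score))
    return sorted(scores, key=lambda s: (s[1], s[2]), reverse=True)
```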

Some initial findings on the FRS join:

We'll join on the PGM_SYS_ID field. For most interest types, that column is the pwsid; but when INTEREST_TYPE == "WATER TREATMENT PLANT", that column contains pwsid + " " + facility_id.

After parsing out pwsids for WTPs, we find that the 60k FRS entries represent 32,329 unique pwsids (the duplication is due to multiple facilities per PWS). Joining this to the unfiltered SDWIS list gives an almost 100% match rate. However, many of these matches are inactive or transient systems. When you exclude those, it turns out we only have ~16k distinct pwsids in FRS that are also in our desired SDWIS subset.
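
A sketch of that parsing rule, assuming the FRS table has already been exported to a flat file (the path is illustrative):

```python
import pandas as pd

frs = pd.read_csv("data/frs_geospatial.csv")  # assumed export of the FRS GDB

# PGM_SYS_ID is the pwsid, except for water treatment plants,
# where it is "<pwsid> <facility_id>". Keep only the pwsid part.
is_wtp = frs["INTEREST_TYPE"] == "WATER TREATMENT PLANT"
frs["pwsid"] = frs["PGM_SYS_ID"].str.strip()
frs.loc[is_wtp, "pwsid"] = frs.loc[is_wtp, "PGM_SYS_ID"].str.split().str[0]

# Multiple facilities per PWS means duplicate pwsids; count distinct systems.
print(frs["pwsid"].nunique())
```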

I hunted around for a long time to try to figure out what's happening. I checked in the GDB prior to the transformer to see if I could find any additional PWSID's that were accidentally being filtered out, but I couldn't find any. So my conclusion is that FRS is simply missing a bunch of SDWIS ID's.

Was this expected?

So, I'm still pretty sure FRS only covers ~30% of the PWSs that we're looking for. However, I loaded all these datasets into QGIS (it's pretty cool), and at least visually it looks like the ECHO dataset is a superset of FRS. Almost all the FRS points lie on top of an ECHO point for the same PWS, and ECHO covers 96% of the PWSs of interest. So we may be able to rely entirely on ECHO for centroids instead of FRS. I'll try to verify this by joining the datasets on pwsid and taking the distance between the FRS coordinates and the ECHO coordinates.
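
A sketch of that distance check, assuming both tables carry plain lat/long columns (the column names and paths here are placeholders):

```python
import numpy as np
import pandas as pd

frs = pd.read_csv("data/frs_points.csv")    # assumed: pwsid, frs_lat, frs_long
echo = pd.read_csv("data/echo_points.csv")  # assumed: pwsid, fac_lat, fac_long

both = frs.merge(echo, on="pwsid", how="inner")

def haversine_km(lat1, lon1, lat2, lon2):
    """Rough great-circle distance in km between two coordinate pairs."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * np.arcsin(np.sqrt(a))

both["dist_km"] = haversine_km(both["frs_lat"], both["frs_long"],
                               both["fac_lat"], both["fac_long"])
print(both["dist_km"].describe())  # near-zero distances would support the superset idea
```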

@ryanshepherd this is really interesting. On the one hand, I didn't have a strong expectation of completeness for FRS, but I think Rich and I were under the initial assumption that most facilities were captured. If I understand you correctly, we have lat/long for 96% of pwsids in our master list if we use ECHO, but only ~30% using FRS? I have two thoughts about this:

  1. ECHO claims it ingests data from FRS for its lat/long, which is why I thought of FRS as the primary source; but it's possible they have amassed this data more comprehensively, and what is actually available to us as a download is incomplete.
  2. Let's indeed move forward with the larger set, ECHO. This file should be greatly filtered down in the transformer, but is quite large initially.

Thanks @ryanshepherd. This is precisely the kind of insight and direction we need to move forward. Onwards! 🚀

I found something interesting re FRS / Echo:

  • When the ECHO point is in FRS, the most common fac_collection_method values (which tell you how the centroid was picked) are "ADDRESS MATCHING-HOUSE NUMBER", "INTERPOLATION-PHOTO", "INTERPOLATION-MAP", and many other facility-specific methods.
  • When the ECHO point is not in FRS, the most common fac_collection_method values are "County Centroid", "Zip Code Centroid", and "State Centroid".

That appears to be the main differentiator: FRS only accepts lat/longs that are specifically coded to the facility, rather than falling back on a poorer-quality method.
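
A sketch of how that split could be tabulated, assuming the ECHO points and parsed FRS pwsids are available as flat files (paths are placeholders):

```python
import pandas as pd

echo_pts = pd.read_csv("data/echo_points.csv")  # assumed: pwsid, fac_collection_method, ...
frs_ids = set(pd.read_csv("data/frs_points.csv")["pwsid"])

echo_pts["in_frs"] = echo_pts["pwsid"].isin(frs_ids)

# Top collection methods, split by whether the same PWS also appears in FRS.
summary = (echo_pts.groupby("in_frs")["fac_collection_method"]
           .value_counts()
           .groupby(level="in_frs")
           .head(5))
print(summary)
```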

I wonder, when the method is "County Centroid" or some other low-quality method, if we'd get better quality by geocoding the address ourselves. I would guess that the EPA had already tried that, but...it also wouldn't surprise me at all if they hadn't.

@ryanshepherd yes... this was my gut feeling when I created the echo transformer yesterday, which is disappointing. I would agree that a geocode would be far superior in these cases. The question is which address to use. We have SDWIS administrative addresses, but we may instead want to look at FRS's bigger dataset that includes addresses. I haven't fully parsed it, but it's bigger than the geospatial database, which only has facilities with a lat/long – https://www.epa.gov/frs/epa-state-combined-csv-download-files

This would be something we could implement post-EDA, as this week is simply about organizing the data together and characterizing what is/is not present
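
If we do go the geocoding route later, a minimal sketch with geopy/Nominatim might look like this (the geocoder choice, rate limit, and address column names are all assumptions):

```python
import pandas as pd
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim

# Assumed input: systems whose ECHO point is only a county/state centroid,
# with whatever address columns we end up trusting (names below are placeholders).
to_geocode = pd.read_csv("data/low_quality_centroids.csv")

geolocator = Nominatim(user_agent="wsb-eda")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)  # be polite to the service

def full_address(row):
    parts = [row.get("address_line1"), row.get("city_served"),
             row.get("state_code"), row.get("zip_code")]
    return ", ".join(str(p) for p in parts if pd.notna(p))

to_geocode["location"] = to_geocode.apply(full_address, axis=1).apply(geocode)
to_geocode["geo_lat"] = to_geocode["location"].apply(lambda loc: loc.latitude if loc else None)
to_geocode["geo_long"] = to_geocode["location"].apply(lambda loc: loc.longitude if loc else None)
```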

Data cleaning will be an ongoing issue here; unfortunately, the data is not as accurate or complete as we would like to think. A related issue (#45) that @ryanshepherd and I found yesterday highlights the importance of spatial intersection sanity checks, cleaning functions, and log files for review.

@ryanshepherd @richpauloo just as a note:

I explored the National Combined CSV output from this link.
It is effectively the upstream data for ECHO, and varies slightly, as ECHO is updated quarterly. However, ECHO is still likely the preferred dataset because it at least geocodes to zip codes, counties, or states where lat/long are not available in FRS.

I did a bit of scratch work comparing ECHO and these combined tables, and found on the whole that we can move forward with ECHO without missing something critical in these other tables. In other words, for active CWS, the amount of missing lat/long data is on par in both datasets.

I also did a quick check on the number of county or state centroids for active CWS; it looks to be ~20K+, and of those, only ~400 have fac_zip or fac_city filled in. So while we may make some gains on those centroids, we are unlikely to end up with high-quality centroids for those records.
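
For reference, a sketch of that quick check (the centroid method strings and the pre-filtered input file are assumptions):

```python
import pandas as pd

echo_cws = pd.read_csv("data/echo_active_cws.csv")  # assumed: already filtered to active CWS

low_quality = echo_cws["fac_collection_method"].isin(
    ["County Centroid", "Zip Code Centroid", "State Centroid"]
)
print(low_quality.sum())  # how many active CWS sit on a county/zip/state centroid

# Of those, how many carry enough address detail to try re-geocoding?
has_addr = echo_cws.loc[low_quality, ["fac_zip", "fac_city"]].notna().any(axis=1)
print(has_addr.sum())
```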

Thanks for this research @jess-goddard. I think these details will be key to capture in the EDA, and I will tag you in places to help write out these learnings as I draft the report. Data quality checks on location missingness and accuracy should fall out of the analysis.