Joins & matching playbook
jess-goddard opened this issue · 10 comments
Objective: A clear table that enables us to do EDA on the data, as well as outputs for fuzzy joins for manual review
Table deliverable format: @ryanshepherd and I discussed that for the EDA report, a final table that looks like this would be most useful (@richpauloo feel free to make suggestions, but this is what we're targeting):
pwsid | pws_name | wsb | tigris_match | frs_match | mhp_match | state_code | county_served | city_served | population_served | connections | primacy_agency_code | service_area_type_code |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CA0985746 | Johnny Depp Water System | 1 | 1 | 1 | 0 | CA | Santa Barbara | Montecito | 1 | 1 | State | RS |
There may be other columns worth including for feature engineering or missingness analysis; @ryanshepherd, I'd err on the side of being over-inclusive with extra columns for now.
The order of join operations is:

1. Filter `water_system` to active (`pws_activity_code == A, N, or P`) community water systems (`pws_type_code == CWS`). Join the ECHO point geometry table (more comprehensive than FRS for lat/longs, though some points are only geocoded to the county or state level) to the EPA SDWIS `pwsid` in `water_systems` (this could be the `water_system_facility` table, but for EDA we should stick to pwsid-level analysis).
   - (1a, maybe later but not necessarily at this point) By spatial intersection (T/F), join the point geometry from (1) to the TIGRIS place polygon geometries (which have no pwsid).
2. Join two other tables to (1) for supplementary details: the `service_area` table on `pwsid` to get `service_area_type_code`, which can indicate a municipal system or a mobile home park; and `geographic_area` to (1) on `pwsid` to get `city_served` (note these join on `pwsid`, not `facility_id`, so the result is a bit of a messy data frame).
3. Use a regex/fuzzy join on `city_served` and `pws_name` to match `NAME` from the TIGRIS place polygon geometries, so that between (1) and (3) each FRS CWS has a TIGRIS `geo_id`.
   - 3a) Output fuzzy joins with less than a 100% match for manual review.
4. Use a regex/fuzzy join on `pws_name`, and/or where `service_area_code == MH or MP`, to match mobile home park `mhp_name` and assign MHP centroids.
   - 4a) Output fuzzy joins with less than a 100% match for manual review.
5. Join (4) to the labeled WSB GeoJSON polygons by `pwsid`.
6. (Potentially; might not be needed for this report) Spatially join the remaining point geometries from (5) without a labeled system to the labeled WSB GeoJSON polygons.
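Step (1) above can be sketched in pandas with toy data. The filter logic (`pws_activity_code`, `pws_type_code`) follows the SDWIS fields named in this issue; the ECHO column names `fac_lat`/`fac_long` here are illustrative assumptions, not confirmed schema:

```python
import pandas as pd

# Toy stand-ins for the SDWIS water_system table and ECHO point table.
water_system = pd.DataFrame({
    "pwsid": ["CA0985746", "CA0000001", "CA0000002"],
    "pws_activity_code": ["A", "I", "A"],   # "I" (inactive) gets filtered out
    "pws_type_code": ["CWS", "CWS", "NTNCWS"],
})
echo = pd.DataFrame({
    "pwsid": ["CA0985746"],
    "fac_lat": [34.44],      # assumed column names
    "fac_long": [-119.63],
})

# Filter to active community water systems.
cws = water_system[
    water_system["pws_activity_code"].isin(["A", "N", "P"])
    & (water_system["pws_type_code"] == "CWS")
]

# A left join keeps CWS rows that have no ECHO point, so missingness
# can be characterized in the EDA; `indicator` records match status.
joined = cws.merge(echo, on="pwsid", how="left", indicator="echo_match")
```

The `echo_match` indicator column (`both` vs. `left_only`) then doubles as the match flag for the deliverable table.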
Future ideas for when we do spatial matching:
Additionally, we will need a heuristic to handle cases when spatial intersection of a point falls within > 1 polygon geometry. One idea is to output a review table for all "multi-intersections" and to manually assign, or to try to automate it with a scoring system: one point for spatial intersection, another point for fuzzy matching names.
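A minimal sketch of that scoring idea, using the stdlib `difflib.SequenceMatcher` for the fuzzy-name component (the function name and 0.85 threshold are hypothetical choices, not anything decided in this thread):

```python
from difflib import SequenceMatcher

def match_score(point_in_polygon: bool, pws_name: str, place_name: str,
                fuzzy_threshold: float = 0.85) -> int:
    """Toy multi-intersection scoring: one point for spatial
    intersection, one point for a fuzzy name match."""
    score = 0
    if point_in_polygon:
        score += 1
    ratio = SequenceMatcher(None, pws_name.lower(), place_name.lower()).ratio()
    if ratio >= fuzzy_threshold:
        score += 1
    return score

# A candidate polygon that both contains the point and shares a similar
# name outranks one that only contains the point.
print(match_score(True, "Montecito Water District", "Montecito Water Dist"))  # 2
```

Candidates scoring below the maximum would go into the manual-review table.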
Some initial findings on the FRS join:
We'll join on the `PGM_SYS_ID` field. For most interest types, that column is the `pwsid`; but when `INTEREST_TYPE == "WATER TREATMENT PLANT"`, that column contains `pwsid + " " + facility_id`.
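The parsing rule described above can be sketched as a small helper (the `facility_id` suffix in the example value is illustrative):

```python
def parse_pwsid(pgm_sys_id: str, interest_type: str) -> str:
    """For WATER TREATMENT PLANT rows, PGM_SYS_ID holds
    'pwsid + " " + facility_id'; keep only the pwsid part.
    All other interest types already store the bare pwsid."""
    if interest_type == "WATER TREATMENT PLANT":
        return pgm_sys_id.split(" ")[0]
    return pgm_sys_id

print(parse_pwsid("CA0985746 001", "WATER TREATMENT PLANT"))  # CA0985746
print(parse_pwsid("CA0985746", "COMMUNITY WATER SYSTEM"))     # CA0985746
```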
After parsing out pwsids for WTPs, we find that the 60k FRS entries represent 32,329 unique pwsids (duplicates are due to multiple facilities per PWS). Joining this to the unfiltered SDWIS list gives a nearly 100% match rate. However, many of these matches are inactive or transient systems. When you exclude those, it turns out we've only got 16k distinct pwsids in FRS that are also in our desired SDWIS subset.
I hunted around for a long time to try to figure out what's happening. I checked in the GDB prior to the transformer to see if I could find any additional PWSID's that were accidentally being filtered out, but I couldn't find any. So my conclusion is that FRS is simply missing a bunch of SDWIS ID's.
Was this expected?
So, I'm still pretty sure FRS only covers ~30% of the PWSs that we're looking for. However, I loaded all these datasets into QGIS (it's pretty cool), and at least visually it looks like the ECHO dataset is a superset of FRS. Almost all the FRS points lie on top of an ECHO point for the same PWS. And ECHO covers 96% of the PWSs of interest. So we may be able to rely entirely on ECHO for centroids instead of FRS. I'll try to verify this by joining the datasets on pwsid and taking the distance between the FRS coordinates vs. the ECHO coordinates.
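For that distance check, a stdlib-only haversine function is enough (no geopandas required); this is a sketch, not the project's actual verification code:

```python
import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two lat/long points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# After an inner join of FRS and ECHO on pwsid, a near-zero distance
# suggests ECHO carried the FRS point through unchanged.
print(haversine_km(34.44, -119.63, 34.44, -119.63))  # 0.0
```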
@ryanshepherd this is really interesting. On the one hand, I didn't have a strong expectation of completeness for FRS, but I think Rich and I were under the initial assumption that most facilities were captured. If I understand you correctly, we have lat/long for 96% of pwsids in our master list if we use ECHO, but only 30% using FRS? I have two thoughts about this:
- ECHO claims it ingests data from FRS for its lat/long, which is why I thought of FRS as the primary source; but it's possible they have amassed this data more comprehensively, and what is actually available to us as a download is incomplete.
- Let's indeed move forward with the larger set, ECHO. This file should be greatly filtered down in the transformer, but is quite large initially.
Thanks @ryanshepherd. This is precisely the kind of insight and direction we need to move forward. Onwards! 🚀
I found something interesting re: FRS/ECHO:
- When the ECHO point is in FRS, the most common `fac_collection_method` values (which tell you how the centroid was picked) are "ADDRESS MATCHING-HOUSE NUMBER", "INTERPOLATION-PHOTO", "INTERPOLATION-MAP", and many other specific methods.
- When the ECHO point is not in FRS, the most common `fac_collection_method` values are "County Centroid", "Zip Code Centroid", and "State Centroid".
That appears to be the main differentiator. FRS only accepts lat/long that is specifically coded to the facility instead of falling back to a poorer quality method.
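That comparison can be reproduced with a groupby on FRS membership; toy frames below stand in for the ECHO table and the set of pwsids found in FRS:

```python
import pandas as pd

# Toy stand-ins; real data comes from the ECHO and FRS transformers.
echo = pd.DataFrame({
    "pwsid": ["A", "B", "C", "D"],
    "fac_collection_method": [
        "ADDRESS MATCHING-HOUSE NUMBER",
        "INTERPOLATION-MAP",
        "County Centroid",
        "Zip Code Centroid",
    ],
})
frs_pwsids = {"A", "B"}

# Flag ECHO rows whose pwsid also appears in FRS, then tabulate
# collection methods within each group.
echo["in_frs"] = echo["pwsid"].isin(frs_pwsids)
counts = echo.groupby("in_frs")["fac_collection_method"].value_counts()
print(counts)
```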
I wonder, when the method is "County Centroid" or some other low-quality method, if we'd get better quality by geocoding the address ourselves. I would guess that the EPA had already tried that, but...it also wouldn't surprise me at all if they hadn't.
@ryanshepherd yes... this was my gut feeling when I created the ECHO transformer yesterday, which is disappointing. I would agree that a geocode would be far superior in these cases. The question is which address to use. We have SDWIS administrative addresses, but we may instead want to look at FRS's bigger dataset that includes addresses. I haven't fully parsed it, but it's bigger than the geospatial database, which only has facilities with lat/long coordinates: https://www.epa.gov/frs/epa-state-combined-csv-download-files
This would be something we could implement post-EDA, as this week is simply about organizing the data together and characterizing what is/is not present
Data cleaning will be an ongoing issue here and unfortunately the data is not as accurate or complete as we would like to think. A related issue #45 that @ryanshepherd and I found yesterday highlights the importance of spatial intersection sanity checks, cleaning functions, and log files for review.
@ryanshepherd @richpauloo just as a note:
I explored the National Combined csv output from this link. This is effectively the upstream data for ECHO. It varies slightly, as ECHO is maintained quarterly. However, ECHO is still likely the preferred dataset because it at least geocodes to zip codes, counties, or states where `lat` and `long` are not available in FRS.
I did a bit of scratch work comparing ECHO and these combined tables, and found on the whole that we can move forward with ECHO without missing something critical in these other tables. In other words, for active CWS, the amount of missing lat/long data is on par in both datasets.
I also did a quick check on the number of county or state centroids for active CWS; it looks to be ~20K+, and of those, only ~400 have `fac_zip` or `fac_city` filled in. So while we may make some gains on those centroids, we are unlikely to have high-quality centroids for those data.
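That quick check amounts to a pair of boolean masks; a sketch with toy data (the method strings follow the ones quoted above, the rest is illustrative):

```python
import pandas as pd

# Toy ECHO rows: two low-quality centroids, one address-matched point.
echo = pd.DataFrame({
    "pwsid": ["A", "B", "C"],
    "fac_collection_method": [
        "County Centroid", "State Centroid", "ADDRESS MATCHING-HOUSE NUMBER",
    ],
    "fac_zip": [None, "93108", "93101"],
    "fac_city": [None, None, "Santa Barbara"],
})

# Low-quality centroids are the county/zip/state fallback methods.
low_quality = echo["fac_collection_method"].isin(
    ["County Centroid", "Zip Code Centroid", "State Centroid"]
)
# Of those, only rows with a zip or city could even be re-geocoded.
geocodable = low_quality & (echo["fac_zip"].notna() | echo["fac_city"].notna())
print(int(low_quality.sum()), int(geocodable.sum()))  # 2 1
```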
Thanks for this research @jess-goddard. I think these details will be key to capture in the EDA, and I will tag you in places to help write out these learnings as I draft the report. Data quality checks on location missingness and accuracy should fall out of the analysis.