Joins & matching playbook
jess-goddard opened this issue · 10 comments
Objective: A clear table that enables us to do EDA on the data, as well as outputs for fuzzy joins for manual review
Table deliverable format: @ryanshepherd and I discussed that for the EDA report, a final table that looks like this would be most useful (@richpauloo feel free to make suggestions, but this is what we're targeting):
pwsid | pws_name | wsb | tigris_match | frs_match | mhp_match | state_code | county_served | city_served | population_served | connections | primacy_agency_code | service_area_type_code |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CA0985746 | Johnny Depp Water System | 1 | 1 | 1 | 0 | CA | Santa Barbara | Montecito | 1 | 1 | State | RS |
There may be other columns worth including for feature engineering or missingness analysis; @ryanshepherd, I'd err on the side of being over-inclusive with extra columns for now.
The order of join operations is:

1. Filter `water_system` to active (`pws_activity_code == A, N, or P`) community water systems (`pws_type_code == CWS`). Join the ECHO point geometry table (more comprehensive than FRS for lat/longs, though some points are only geocoded to the county or state level) to the EPA SDWIS `pwsid` in `water_systems` (this could be the `water_system_facility` table, but for EDA we should stick to pwsid-level analysis).
   - (1a, maybe later but not necessarily at this point) By spatial intersection (T/F), join the point geometry from (1) to the TIGRIS place polygon geometries (which have no pwsid).
2. Join two other tables to (1) for supplementary details: the `service_area` table on `pwsid` to get `service_area_type_code`, which can indicate a municipal system or a mobile home park; and `geographic_area` to (1) on `pwsid` to get `city_served` (note these join on `pwsid`, not `facility_id`, so the result is a bit of a messy data frame).
3. Use a regex/fuzzy join on `city_served` and `pws_name` to match `NAME` from the TIGRIS place polygon geometries, so that between (1) and (3) each FRS CWS has a TIGRIS `geo_id`.
   - 3a) Output fuzzy joins with less than a 100% match for manual review.
4. Use a regex/fuzzy join on `pws_name`, and/or where `service_area_code == MH or MP`, to match mobile home park `mhp_name` and assign MHP centroids.
   - 4a) Output fuzzy joins with less than a 100% match for manual review.
5. Join (4) to the labeled WSB GeoJSON polygons by `pwsid`.
6. (Potentially; might not be needed for this report) Spatially join the remaining point geometries from (5) without a labeled system to the labeled WSB GeoJSON polygons.
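Step (1) above can be sketched in pandas with toy data. The filter logic (`pws_activity_code`, `pws_type_code`) follows the SDWIS fields named in this issue; the ECHO column names `fac_lat`/`fac_long` here are illustrative assumptions, not confirmed schema:

```python
import pandas as pd

# Toy stand-ins for the SDWIS water_system table and ECHO point table.
water_system = pd.DataFrame({
    "pwsid": ["CA0985746", "CA0000001", "CA0000002"],
    "pws_activity_code": ["A", "I", "A"],   # "I" (inactive) gets filtered out
    "pws_type_code": ["CWS", "CWS", "NTNCWS"],
})
echo = pd.DataFrame({
    "pwsid": ["CA0985746"],
    "fac_lat": [34.44],      # assumed column names
    "fac_long": [-119.63],
})

# Filter to active community water systems.
cws = water_system[
    water_system["pws_activity_code"].isin(["A", "N", "P"])
    & (water_system["pws_type_code"] == "CWS")
]

# A left join keeps CWS rows that have no ECHO point, so missingness
# can be characterized in the EDA; `indicator` records match status.
joined = cws.merge(echo, on="pwsid", how="left", indicator="echo_match")
```

The `echo_match` indicator column (`both` vs. `left_only`) then doubles as the match flag for the deliverable table.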
Future ideas for when we do spatial matching:
Additionally, we will need a heuristic to handle cases when spatial intersection of a point falls within > 1 polygon geometry. One idea is to output a review table for all "multi-intersections" and to manually assign, or to try to automate it with a scoring system: one point for spatial intersection, another point for fuzzy matching names.
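A minimal sketch of that scoring idea, using the stdlib `difflib.SequenceMatcher` for the fuzzy-name component (the function name and 0.85 threshold are hypothetical choices, not anything decided in this thread):

```python
from difflib import SequenceMatcher

def match_score(point_in_polygon: bool, pws_name: str, place_name: str,
                fuzzy_threshold: float = 0.85) -> int:
    """Toy multi-intersection scoring: one point for spatial
    intersection, one point for a fuzzy name match."""
    score = 0
    if point_in_polygon:
        score += 1
    ratio = SequenceMatcher(None, pws_name.lower(), place_name.lower()).ratio()
    if ratio >= fuzzy_threshold:
        score += 1
    return score

# A candidate polygon that both contains the point and shares a similar
# name outranks one that only contains the point.
print(match_score(True, "Montecito Water District", "Montecito Water Dist"))  # 2
```

Candidates scoring below the maximum would go into the manual-review table.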
Some initial findings on the FRS join:
We'll join on the `PGM_SYS_ID` field. For most interest types, that column is the `pwsid`; but when `INTEREST_TYPE == "WATER TREATMENT PLANT"`, that column contains `pwsid + " " + facility_id`.
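The parsing rule described above can be sketched as a small helper (the `facility_id` suffix in the example value is illustrative):

```python
def parse_pwsid(pgm_sys_id: str, interest_type: str) -> str:
    """For WATER TREATMENT PLANT rows, PGM_SYS_ID holds
    'pwsid + " " + facility_id'; keep only the pwsid part.
    All other interest types already store the bare pwsid."""
    if interest_type == "WATER TREATMENT PLANT":
        return pgm_sys_id.split(" ")[0]
    return pgm_sys_id

print(parse_pwsid("CA0985746 001", "WATER TREATMENT PLANT"))  # CA0985746
print(parse_pwsid("CA0985746", "COMMUNITY WATER SYSTEM"))     # CA0985746
```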
After parsing out pwsids for WTPs, we find that the 60k FRS entries represent 32,329 unique pwsids (duplicates are due to multiple facilities per PWS). Joining this to the unfiltered SDWIS list gives a nearly 100% match rate. However, many of these matches are inactive or transient systems. When you exclude those, it turns out we've only got 16k distinct pwsids in FRS that are also in our desired SDWIS subset.
I hunted around for a long time to try to figure out what's happening. I checked in the GDB prior to the transformer to see if I could find any additional PWSID's that were accidentally being filtered out, but I couldn't find any. So my conclusion is that FRS is simply missing a bunch of SDWIS ID's.
Was this expected?
So, I'm still pretty sure FRS only covers ~30% of the PWSs that we're looking for. However, I loaded all these datasets into QGIS (it's pretty cool), and at least visually it looks like the ECHO dataset is a superset of FRS. Almost all the FRS points lie on top of an ECHO point for the same PWS. And ECHO covers 96% of the PWSs of interest. So we may be able to rely entirely on ECHO for centroids instead of FRS. I'll try to verify this by joining the datasets on pwsid and taking the distance between the FRS coordinates vs. the ECHO coordinates.
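For that distance check, a stdlib-only haversine function is enough (no geopandas required); this is a sketch, not the project's actual verification code:

```python
import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two lat/long points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# After an inner join of FRS and ECHO on pwsid, a near-zero distance
# suggests ECHO carried the FRS point through unchanged.
print(haversine_km(34.44, -119.63, 34.44, -119.63))  # 0.0
```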
@ryanshepherd this is really interesting. On the one hand, I didn't have a strong expectation of completeness for FRS, but I think Rich and I were under the initial assumption that most facilities were captured. If I understand you correctly, we have lat/long for 96% of pwsids in our master list if we use ECHO, but only 30% using FRS? I have two thoughts about this:
- ECHO claims it ingests data from FRS for its lat/long, which is why I thought of FRS as the primary source; but it's possible they have amassed this data more comprehensively, and what is actually available to us as a download is incomplete.
- Let's indeed move forward with the larger set, ECHO. This file should be greatly filtered down in the transformer, but is quite large initially.
Thanks @ryanshepherd. This is precisely the kind of insight and direction we need to move forward. Onwards! 🚀
I found something interesting re: FRS/ECHO:
- When the ECHO point is in FRS, the most common `fac_collection_method` values (which tell you how the centroid was picked) are "ADDRESS MATCHING-HOUSE NUMBER", "INTERPOLATION-PHOTO", "INTERPOLATION-MAP", and many other specific methods.
- When the ECHO point is not in FRS, the most common `fac_collection_method` values are "County Centroid", "Zip Code Centroid", and "State Centroid".
That appears to be the main differentiator. FRS only accepts lat/long that is specifically coded to the facility instead of falling back to a poorer quality method.
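That comparison can be reproduced with a groupby on FRS membership; toy frames below stand in for the ECHO table and the set of pwsids found in FRS:

```python
import pandas as pd

# Toy stand-ins; real data comes from the ECHO and FRS transformers.
echo = pd.DataFrame({
    "pwsid": ["A", "B", "C", "D"],
    "fac_collection_method": [
        "ADDRESS MATCHING-HOUSE NUMBER",
        "INTERPOLATION-MAP",
        "County Centroid",
        "Zip Code Centroid",
    ],
})
frs_pwsids = {"A", "B"}

# Flag ECHO rows whose pwsid also appears in FRS, then tabulate
# collection methods within each group.
echo["in_frs"] = echo["pwsid"].isin(frs_pwsids)
counts = echo.groupby("in_frs")["fac_collection_method"].value_counts()
print(counts)
```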
I wonder, when the method is "County Centroid" or some other low-quality method, if we'd get better quality by geocoding the address ourselves. I would guess that the EPA had already tried that, but...it also wouldn't surprise me at all if they hadn't.
@ryanshepherd yes... this was my gut feeling when I created the ECHO transformer yesterday, which is disappointing. I would agree that a geocode would be far superior in these cases. The question is which address to use. We have SDWIS administrative addresses, but we may instead want to look at FRS's bigger dataset that includes addresses. I haven't fully parsed it, but it's bigger than the geospatial database, which only has facilities with lat/long coordinates: https://www.epa.gov/frs/epa-state-combined-csv-download-files
This would be something we could implement post-EDA, as this week is simply about organizing the data together and characterizing what is/is not present
Data cleaning will be an ongoing issue here and unfortunately the data is not as accurate or complete as we would like to think. A related issue #45 that @ryanshepherd and I found yesterday highlights the importance of spatial intersection sanity checks, cleaning functions, and log files for review.
@ryanshepherd @richpauloo just as a note:
I explored the National Combined csv output from this link. This is effectively the upstream data for ECHO. It varies slightly, as ECHO is maintained quarterly. However, ECHO is still likely the preferred dataset because it at least geocodes to zip codes, counties, or states where `lat` and `long` are not available in FRS.
I did a bit of scratch work comparing ECHO and these combined tables, and found on the whole that we can move forward with ECHO without missing something critical in these other tables. In other words, for active CWS, the amount of missing lat/long data is on par in both datasets.
I also did a quick check on the number of county or state centroids for active CWS; it looks to be ~20K+, and of those, only ~400 have `fac_zip` or `fac_city` filled in. So while we may make some gains on those centroids, we are unlikely to have high-quality centroids for those data.
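That quick check amounts to a pair of boolean masks; a sketch with toy data (the method strings follow the ones quoted above, the rest is illustrative):

```python
import pandas as pd

# Toy ECHO rows: two low-quality centroids, one address-matched point.
echo = pd.DataFrame({
    "pwsid": ["A", "B", "C"],
    "fac_collection_method": [
        "County Centroid", "State Centroid", "ADDRESS MATCHING-HOUSE NUMBER",
    ],
    "fac_zip": [None, "93108", "93101"],
    "fac_city": [None, None, "Santa Barbara"],
})

# Low-quality centroids are the county/zip/state fallback methods.
low_quality = echo["fac_collection_method"].isin(
    ["County Centroid", "Zip Code Centroid", "State Centroid"]
)
# Of those, only rows with a zip or city could even be re-geocoded.
geocodable = low_quality & (echo["fac_zip"].notna() | echo["fac_city"].notna())
print(int(low_quality.sum()), int(geocodable.sum()))  # 2 1
```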
Thanks for this research @jess-goddard. I think these details will be key to capture in the EDA, and I will tag you in places to help write out these learnings as I draft the report. Data quality checks on location missingness and accuracy should fall out of the analysis.