SimpleLab-Inc/wsb

Transformer function that drops invalid points/polygons that fall out-of-state

Closed this issue · 4 comments

Yesterday, @ryanshepherd and I were exploring FRS data and discovered that ~1.7% of unique pwsids (556/32,329) have > 1 distinct lat/lng. Furthermore, ~71% of these pwsids (395/556) have a spatial spread (max minus min of their coordinates) that exceeds what we expect for a moderately sized city (Los Angeles = 9x area of SF = 36 km in one dimension).
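For reference, a check like the one described above could be sketched as follows. This is a minimal illustration, not the code we actually ran: the toy table, the column names (lat, lng), and the flat ~111 km per degree conversion are all assumptions, and the 36 km threshold is the figure quoted above.

```r
library(dplyr)

# Toy stand-in for the FRS table; ids and coordinates are made up.
frs_toy <- data.frame(
  pwsid = c("CA0000001", "CA0000001", "CA0000002", "CA0000002", "NV0000003"),
  lat   = c(34.05, 39.77, 37.77, 37.77, 36.17),
  lng   = c(-118.25, -86.16, -122.42, -122.42, -115.14)
)

km_per_deg    <- 111  # rough length of one degree of latitude (assumption)
max_spread_km <- 36   # threshold quoted in the issue text

spread <- frs_toy %>%
  # Exact duplicate coordinates do not count as distinct locations.
  distinct(pwsid, lat, lng) %>%
  group_by(pwsid) %>%
  summarise(
    n_distinct_pts = n(),
    lat_spread_km  = (max(lat) - min(lat)) * km_per_deg,
    # Degrees of longitude shrink with latitude, so scale by cos(latitude).
    lng_spread_km  = (max(lng) - min(lng)) * km_per_deg * cos(mean(lat) * pi / 180),
    .groups = "drop"
  ) %>%
  mutate(
    suspect = n_distinct_pts > 1 &
      pmax(lat_spread_km, lng_spread_km) > max_spread_km
  )

print(spread)
```

The degree-to-km conversion is crude, but it is enough to separate "a few blocks apart" from "in a different state".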

Why is this the case? Garbage lat/lng. We know it's in FRS, but we also expect it's in ECHO and so on.

We need a simple transformer function that joins all data with geometries to known state polygons and drops rows where state_reported != state_intersected. Here, state_reported is the first 2 characters of the pwsid, and state_intersected is the state we get from the spatial intersection. The dropped rows should not be written to staging; instead, they should be written to a log file for review. We will need:

  • R transformer @richpauloo
  • rewrite the ECHO Python transformer in R to apply f_drop_imposters() @jess-goddard (Rich can apply the function if the rest of the ECHO transformer is there)
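
To make the intended behavior concrete, here is a rough sketch of what f_drop_imposters() might look like. Everything beyond the function name is an assumption for illustration: the arguments (x, states, log_path), the CSV log format, and the toy rectangular "state" polygons, which stand in for real state boundaries. The actual implementation on the branch may differ.

```r
library(dplyr)
library(sf)

# Hypothetical sketch; argument names and log format are assumptions.
# x:        sf point object with a `pwsid` column
# states:   sf polygon object with a two-letter `state_intersected` column
# log_path: CSV file that receives the dropped rows for review
f_drop_imposters <- function(x, states, log_path) {
  joined <- x %>%
    mutate(state_reported = substr(pwsid, 1, 2)) %>%
    st_join(states, join = st_intersects, left = TRUE)

  # Points that intersect no polygon get NA and are treated as imposters.
  is_valid <- !is.na(joined$state_intersected) &
    joined$state_reported == joined$state_intersected

  imposters <- joined[!is_valid, ]
  if (nrow(imposters) > 0) {
    write.csv(st_drop_geometry(imposters), log_path, row.names = FALSE)
  }
  joined[is_valid, ]
}

# Toy rectangles standing in for state polygons (not real boundaries).
make_box <- function(xmin, ymin, xmax, ymax) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmax, ymin), c(xmax, ymax), c(xmin, ymax), c(xmin, ymin)
  )))
}
states_toy <- st_sf(
  state_intersected = c("CA", "IN"),
  geometry = st_sfc(
    make_box(-124.5, 32.5, -114.0, 42.0),
    make_box(-88.1, 37.8, -84.8, 41.8),
    crs = 4326
  )
)

# Made-up points: the second row claims CA but falls inside the "IN" box.
pts_toy <- st_as_sf(
  data.frame(
    pwsid = c("CA0000001", "CA0000001", "IN0000002"),
    lng   = c(-118.25, -86.16, -86.16),
    lat   = c(34.05, 39.77, 39.77)
  ),
  coords = c("lng", "lat"), crs = 4326
)

log_file <- tempfile(fileext = ".csv")
kept <- f_drop_imposters(pts_toy, states_toy, log_file)
```

One caveat with this approach: st_join() returns one row per match, so a point sitting exactly on a shared border could be duplicated. A real version would probably need to handle that case explicitly.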

Please work on the transformer/drop-invalid-states branch.

Next, we need to address all transformers and begin this data quality review process.

  • FRS
  • MHP
  • ECHO

To illustrate the garbage data, here is a CA pwsid that plots in both CA and Indiana:

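library(tidyverse)
library(fs)
library(sf)

# staging directory is read from an environment variable
staging_path <- Sys.getenv("WSB_STAGING_PATH")

# read the staged FRS layer
frs <- path(staging_path, "frs.geojson") %>% st_read()

# map every point recorded for this one pwsid
frs %>% filter(pwsid == "CA1502034") %>% mapview::mapview()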

[image: map of the points for pwsid CA1502034]

@richpauloo I'll look into this, but quickly: for SDWIS, we don't have geocoded addresses, and the addresses we do have are administrative, so I probably wouldn't apply this step here.

Also, looking at the extent of the function and the built-in US state polygons in R, I almost wonder if it's more time-efficient for me to rewrite ECHO in R and apply this function, so that we're using the same maps and I don't have to spend time creating the equivalent function in Python.

We spoke on the phone about this yesterday, and yes, I think it's more efficient to rewrite the ECHO transformer in R to apply the f_drop_imposters() function. This seems like a key output to analyze in the EDA: imposters in FRS, ECHO, and MHP.

@richpauloo see the recent commit to #46 ~ it should be ready to merge, and I can simplify to one transformer later

Addressed in #46.