chris-prener/censusxy

Non-ASCII characters lead to NA geocoding records

dvmasterov opened this issue · 3 comments

Maybe this is obvious, but I am seeing some strange behavior with non-ASCII characters on macOS. Here's my version info:

> packageVersion("censusxy")
[1] ‘1.0.0’
> getRversion()
[1] ‘4.0.2’

library(censusxy)
library(dplyr)
library(stringi)
library(sf)

> # this works (as does the web ui)
> g<-cxy_single('412 45th Strèet','Oakland','CA','94609', return = 'geographies', vintage = 'Current_Current')
> summary(as.factor(g$cxy_status))
integer(0)
> 
> 
> # this breaks
> my_df <- data.frame(street= '412 45th Strèet', city = 'Oakland', state='CA', zip ='94609')
> geocoded_data <- cxy_geocode(my_df,
+                                     street = "street", 
+                                     city = "city", 
+                                     state = "state", 
+                                     zip = "zip",
+                                     output = "full", 
+                                     class = "dataframe", 
+                                     return="geographies",
+                                     vintage ='Current_Current')
> 
> summary(as.factor(geocoded_data$cxy_status))
NA's 
   1 
> 
> my_df <- my_df %>% 
+   mutate_if(is.character,
+             stri_trans_general,
+             id = "latin-ascii")
> 
> # this works (after transliterating to ASCII)
> geocoded_data <- cxy_geocode(my_df,
+                              street = "street", 
+                              city = "city", 
+                              state = "state", 
+                              zip = "zip",
+                              output = "full", 
+                              class = "dataframe", 
+                              return="geographies",
+                              vintage ='Current_Current')
> 
> summary(as.factor(geocoded_data$cxy_status))
Match 
    1 

This is an issue on the Census Bureau's side, not with the package itself. My guess is that their batch geocoder can't handle non-ASCII characters. If you can transliterate them to ASCII as you've done, I would recommend doing that.
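
For anyone landing here later, the transliteration step above can be wrapped in a small helper. This is only a sketch: `to_ascii_cols()` is a hypothetical name, not part of censusxy, and it assumes the stringi package is installed. It converts every character column to ASCII with `stri_trans_general()` so the data frame can then be passed to `cxy_geocode()`.

```r
library(stringi)

# Hypothetical helper (not part of censusxy): transliterate every character
# column of a data frame to ASCII before it is sent to the geocoder.
to_ascii_cols <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], stri_trans_general, id = "Latin-ASCII")
  df
}

# The accented "e" is written as an escape so the example does not depend
# on the encoding of the source file.
my_df <- data.frame(street = "412 45th Str\u00e8et", city = "Oakland",
                    state = "CA", zip = "94609", stringsAsFactors = FALSE)

clean_df <- to_ascii_cols(my_df)
clean_df$street
```

Non-character columns are left untouched, so numeric ZIP codes or ID columns pass through unchanged.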

I think a warning, or at least a mention of this requirement in the manual, would be a nice addition. It is not well documented on the Census side (as far as I could find). I agree that this is on their end, but it would be nice to spare future users this headache.
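
To make the suggestion concrete, a pre-flight check along these lines could emit such a warning. This is a rough sketch in base R, not existing censusxy behavior: `warn_non_ascii()` and its arguments are hypothetical names, and the wording of the message is only a placeholder.

```r
# Hypothetical pre-flight check (not part of censusxy): warn when any of the
# given address columns contains a character outside the ASCII range.
warn_non_ascii <- function(df, cols) {
  has_non_ascii <- function(x) {
    grepl("[^\\x01-\\x7F]", as.character(x), perl = TRUE)
  }
  # Flag a row if any of its address fields contains a non-ASCII character.
  flagged <- Reduce(`|`, lapply(df[cols], has_non_ascii))
  if (any(flagged)) {
    warning(sum(flagged), " row(s) contain non-ASCII characters and may ",
            "come back as NA; consider transliterating them first.",
            call. = FALSE)
  }
  # Return the indices of the flagged rows so the caller can inspect them.
  invisible(which(flagged))
}
```

Calling `warn_non_ascii(my_df, c("street", "city", "state", "zip"))` before geocoding would then point at the rows worth cleaning.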

Yeah, we can make a note of it somehow... do you have some "real world" examples of streets with non-ASCII characters I could use?