jessecambon/tidygeocoder

Log of non-passed addresses

darrellcarvalho opened this issue · 5 comments

I'm not sure whether this is a feature request or a request for clarification: is there a way (or could one be implemented) to log the addresses that are not passed to the geocoding services?

Example: my CSV contains ~1419 addresses. Of these, only 1395 are passed at the initial geocode() call, regardless of the method set. It would be useful to get a report of the unsubmitted entries and, ideally, why they were not passable.

Hi @darrellcarvalho do you have a small reproducible example you could share? One thing I would check is if you have duplicate addresses in your dataset. Only unique addresses are passed to geocoding services so that could explain a discrepancy between your dataset size and the number of addresses sent.
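One quick way to run that check is to count rows per address with dplyr. This is a hedged sketch on a toy tibble (the column name full_address and the addresses themselves are made up for illustration, mimicking a street-plus-city field built from the CSV):

```r
library(dplyr)

# Toy data standing in for the CSV; two rows share an address
data <- tibble(
  business_name = c("A", "B", "C", "D"),
  full_address = c("100 Main St, Pomona",
                   "100 Main St, Pomona",
                   "200 Elm St, Glendale",
                   "300 Oak Ave, Monrovia")
)

# Any address with n > 1 is a duplicate, and tidygeocoder will
# send it to the geocoding service only once
dupes <- data %>%
  count(full_address, name = "n") %>%
  filter(n > 1) %>%
  arrange(desc(n))

dupes
#> full_address = "100 Main St, Pomona", n = 2
```

The difference between nrow(data) and length(unique(data$full_address)) should then match the gap between the dataset size and the number of addresses reported as passed.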

@jessecambon that is absolutely a likely culprit; I'll check on that and get back to you. Otherwise, I don't know how I would build a reprex that captures the issue at hand, but the full dataset itself is here:

https://github.com/darrellcarvalho/ebt_restaurants/blob/cec4c4a2b2db22422015f25f3514082dce9449ba/data/raw/EBT_Restaurants.csv

Hi @darrellcarvalho ,

I had a look at your CSV and it does indeed contain duplicated addresses. Note, though (at least in this reprex), that the result still has the right number of rows.

In the reprex I selected 9 rows containing only 2 unique addresses. Although just 2 addresses are sent, the resulting table is correct, because tidygeocoder merges the results back into the original table (a left join, @jessecambon?):

library(tidygeocoder)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union


data <- read.csv("https://raw.githubusercontent.com/darrellcarvalho/ebt_restaurants/cec4c4a2b2db22422015f25f3514082dce9449ba/data/raw/EBT_Restaurants.csv")


# Geocode with OSM

# Use full address
data <- data %>%
  mutate(full_address = paste0(street_address, ", ", city))

head(data)
#>         business_name         street_address st_num           st_name
#> 1     Jack in the Box         100 E Holt Ave    100        E Holt Ave
#> 2              Subway         100 W Broadway    100        W Broadway
#> 3 Yoshinoya Beef Bowl    100 W Colorado Blvd    100   W Colorado Blvd
#> 4     Jack in the Box        100 W Duarte Rd    100       W Duarte Rd
#> 5          McDonald's          1000 E 4th St   1000          E 4th St
#> 6              Subway 1000 E Washington Blvd   1000 E Washington Blvd
#>          city zip_code area_code exchange_code line_number
#> 1      Pomona    91767       818           880        9253
#> 2  Long Beach    90802       323           795         135
#> 3    Glendale    91204       323           564        9934
#> 4    Monrovia    91016       661           273        8261
#> 5  Long Beach    90802       661           948        4119
#> 6 Los Angeles    90021       310           609        3303
#>                          full_address
#> 1              100 E Holt Ave, Pomona
#> 2          100 W Broadway, Long Beach
#> 3       100 W Colorado Blvd, Glendale
#> 4           100 W Duarte Rd, Monrovia
#> 5           1000 E 4th St, Long Beach
#> 6 1000 E Washington Blvd, Los Angeles

# All info
nrow(data)
#> [1] 1419

# No dup info
length(unique(data$full_address))
#> [1] 1376


# Some dupes
sample <- data %>% filter(full_address %in% c(
  "18111 Nordhoff St, Northridge",
  "111 E 223rd St, Carson"
))

nrow(sample)
#> [1] 9

# Uniques
length(unique(sample$full_address))
#> [1] 2


geo_result <- geocode(sample, full_address)
#> Passing 2 addresses to the Nominatim single address geocoder
#> Query completed in: 2.1 seconds

# Just 2 addresses passed, but the result still has all 9 rows!
nrow(geo_result)
#> [1] 9

head(geo_result[c("business_name", "area_code", "lat", "long")])
#> # A tibble: 6 x 4
#>   business_name                     area_code   lat  long
#>   <chr>                                 <int> <dbl> <dbl>
#> 1 Jack in the Box                         562  33.8 -118.
#> 2 Jack in the Box                         310  33.8 -118.
#> 3 Burger King                             213  34.2 -119.
#> 4 Freudian Sip (Arbor Grill Market)       818  34.2 -119.
#> 5 Freudian Sip (Building 18)              818  34.2 -119.
#> 6 Freudian Sip (Oviatt Library)           818  34.2 -119.

Created on 2022-01-12 by the reprex package (v2.0.1)

Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.1.2 (2021-11-01)
#>  os       Windows 10 x64 (build 22000)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  Spanish_Spain.1252
#>  ctype    Spanish_Spain.1252
#>  tz       Europe/Paris
#>  date     2022-01-12
#>  pandoc   2.14.0.3 @ C:/Program Files/RStudio/bin/pandoc/ (via rmarkdown)
#> 
#> - Packages -------------------------------------------------------------------
#>  package      * version date (UTC) lib source
#>  assertthat     0.2.1   2019-03-21 [1] CRAN (R 4.1.1)
#>  backports      1.4.1   2021-12-13 [1] CRAN (R 4.1.2)
#>  cli            3.1.0   2021-10-27 [1] CRAN (R 4.1.1)
#>  crayon         1.4.2   2021-10-29 [1] CRAN (R 4.1.1)
#>  curl           4.3.2   2021-06-23 [1] CRAN (R 4.1.1)
#>  DBI            1.1.2   2021-12-20 [1] CRAN (R 4.1.2)
#>  digest         0.6.29  2021-12-01 [1] CRAN (R 4.1.2)
#>  dplyr        * 1.0.7   2021-06-18 [1] CRAN (R 4.1.1)
#>  ellipsis       0.3.2   2021-04-29 [1] CRAN (R 4.1.1)
#>  evaluate       0.14    2019-05-28 [1] CRAN (R 4.1.1)
#>  fansi          0.5.0   2021-05-25 [1] CRAN (R 4.1.1)
#>  fastmap        1.1.0   2021-01-25 [1] CRAN (R 4.1.1)
#>  fs             1.5.2   2021-12-08 [1] CRAN (R 4.1.2)
#>  generics       0.1.1   2021-10-25 [1] CRAN (R 4.1.1)
#>  glue           1.6.0   2021-12-17 [1] CRAN (R 4.1.2)
#>  highr          0.9     2021-04-16 [1] CRAN (R 4.1.1)
#>  htmltools      0.5.2   2021-08-25 [1] CRAN (R 4.1.1)
#>  httr           1.4.2   2020-07-20 [1] CRAN (R 4.1.1)
#>  jsonlite       1.7.2   2020-12-09 [1] CRAN (R 4.1.1)
#>  knitr          1.37    2021-12-16 [1] CRAN (R 4.1.2)
#>  lifecycle      1.0.1   2021-09-24 [1] CRAN (R 4.1.1)
#>  magrittr       2.0.1   2020-11-17 [1] CRAN (R 4.1.1)
#>  pillar         1.6.4   2021-10-18 [1] CRAN (R 4.1.1)
#>  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.1.1)
#>  purrr          0.3.4   2020-04-17 [1] CRAN (R 4.1.1)
#>  R.cache        0.15.0  2021-04-30 [1] CRAN (R 4.1.1)
#>  R.methodsS3    1.8.1   2020-08-26 [1] CRAN (R 4.1.1)
#>  R.oo           1.24.0  2020-08-26 [1] CRAN (R 4.1.1)
#>  R.utils        2.11.0  2021-09-26 [1] CRAN (R 4.1.1)
#>  R6             2.5.1   2021-08-19 [1] CRAN (R 4.1.1)
#>  reprex         2.0.1   2021-08-05 [1] CRAN (R 4.1.1)
#>  rlang          0.4.12  2021-10-18 [1] CRAN (R 4.1.1)
#>  rmarkdown      2.11    2021-09-14 [1] CRAN (R 4.1.1)
#>  rstudioapi     0.13    2020-11-12 [1] CRAN (R 4.1.1)
#>  sessioninfo    1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
#>  stringi        1.7.6   2021-11-29 [1] CRAN (R 4.1.2)
#>  stringr        1.4.0   2019-02-10 [1] CRAN (R 4.1.1)
#>  styler         1.6.2   2021-09-23 [1] CRAN (R 4.1.1)
#>  tibble         3.1.6   2021-11-07 [1] CRAN (R 4.1.2)
#>  tidygeocoder * 1.0.5   2021-11-02 [1] CRAN (R 4.1.2)
#>  tidyselect     1.1.1   2021-04-30 [1] CRAN (R 4.1.1)
#>  utf8           1.2.2   2021-07-24 [1] CRAN (R 4.1.1)
#>  vctrs          0.3.8   2021-04-29 [1] CRAN (R 4.1.1)
#>  withr          2.4.3   2021-11-30 [1] CRAN (R 4.1.2)
#>  xfun           0.29    2021-12-14 [1] CRAN (R 4.1.2)
#>  yaml           2.2.1   2020-02-01 [1] CRAN (R 4.1.1)
#> 
#>  [1] C:/Users/diego/Documents/R/win-library/4.1
#>  [2] C:/Program Files/R/R-4.1.2/library
#> 
#> ------------------------------------------------------------------------------

@dieghernan Ah I see, so as I understand it, it only passes one instance of each duplicated address but applies the coordinates to all matching observations in the returned data? That makes sense. I really appreciate the clarification!

Yup, that's what it does by default. The code for this functionality is in this file.
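The merge-back behavior can be mimicked with a plain left join. This is an illustrative sketch, not tidygeocoder's actual implementation; the coordinates below are made up:

```r
library(dplyr)

# Original data with a duplicated address
data <- tibble(
  business_name = c("Jack in the Box", "Subway", "Jack in the Box"),
  full_address = c("100 E Holt Ave, Pomona",
                   "100 W Broadway, Long Beach",
                   "100 E Holt Ave, Pomona")
)

# Pretend results from geocoding only the 2 unique addresses
unique_results <- tibble(
  full_address = c("100 E Holt Ave, Pomona",
                   "100 W Broadway, Long Beach"),
  lat = c(34.06, 33.77),
  long = c(-117.75, -118.19)
)

# Merge back: every original row gets coordinates, even though
# each unique address was queried only once
geocoded <- left_join(data, unique_results, by = "full_address")
nrow(geocoded)
#> [1] 3
```

The duplicated rows simply pick up the same coordinates as their first occurrence.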

If you want to return only one row per unique address (i.e. remove the duplication), you can use unique_only = TRUE.
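A hedged sketch of that call, reusing the sample data from the reprex above (not run here, since it queries the live Nominatim service):

```r
library(tidygeocoder)

# With unique_only = TRUE the result has one row per unique address
# (2 rows here instead of 9); the duplicated input rows are not
# merged back into the output
geocode(sample, address = full_address, unique_only = TRUE)
```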