Log of non-passed addresses
darrellcarvalho opened this issue · 5 comments
I'm not sure if this would be a feature request or a request for clarification - is there a means (or could a means be implemented) to log addresses not passed to the geocoding services?
For example, my CSV has ~1419 addresses. Of these, only 1395 get passed at the initial geocode() function call, regardless of which method is set. It would be useful to get a report of the unsubmitted entries and, ideally, why they were not passed.
Hi @darrellcarvalho do you have a small reproducible example you could share? One thing I would check is if you have duplicate addresses in your dataset. Only unique addresses are passed to geocoding services so that could explain a discrepancy between your dataset size and the number of addresses sent.
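A quick way to check this (just a sketch; your_data and address below stand in for your actual data frame and the address column you pass to geocode()):
library(dplyr)
# rows that share the same address value (n > 1 means the address is duplicated)
your_data %>%
  count(address, sort = TRUE) %>%
  filter(n > 1)
# difference between total rows and unique addresses
nrow(your_data) - n_distinct(your_data$address)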
@jessecambon that is absolutely a likely culprit; I'll check for duplicates and get back to you. Otherwise, I don't know how I would build a reprex that captures the issue at hand, but the full dataset itself is here
Hi @darrellcarvalho ,
I had a look at your csv and it does indeed contain duplicated addresses. Note, however, that (at least in this reprex) the result still has the right number of rows.
In the reprex I selected 9 rows containing only 2 unique addresses. Although just 2 addresses are sent to the service, the resulting table is still complete. This is because tidygeocoder merges the results back onto the input table (a left join, @jessecambon?):
library(tidygeocoder)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
data <- read.csv("https://raw.githubusercontent.com/darrellcarvalho/ebt_restaurants/cec4c4a2b2db22422015f25f3514082dce9449ba/data/raw/EBT_Restaurants.csv")
# Geocode with OSM
# Use full address
data <- data %>%
  mutate(full_address = paste0(street_address, ", ", city))
head(data)
#> business_name street_address st_num st_name
#> 1 Jack in the Box 100 E Holt Ave 100 E Holt Ave
#> 2 Subway 100 W Broadway 100 W Broadway
#> 3 Yoshinoya Beef Bowl 100 W Colorado Blvd 100 W Colorado Blvd
#> 4 Jack in the Box 100 W Duarte Rd 100 W Duarte Rd
#> 5 McDonald's 1000 E 4th St 1000 E 4th St
#> 6 Subway 1000 E Washington Blvd 1000 E Washington Blvd
#> city zip_code area_code exchange_code line_number
#> 1 Pomona 91767 818 880 9253
#> 2 Long Beach 90802 323 795 135
#> 3 Glendale 91204 323 564 9934
#> 4 Monrovia 91016 661 273 8261
#> 5 Long Beach 90802 661 948 4119
#> 6 Los Angeles 90021 310 609 3303
#> full_address
#> 1 100 E Holt Ave, Pomona
#> 2 100 W Broadway, Long Beach
#> 3 100 W Colorado Blvd, Glendale
#> 4 100 W Duarte Rd, Monrovia
#> 5 1000 E 4th St, Long Beach
#> 6 1000 E Washington Blvd, Los Angeles
# All info
nrow(data)
#> [1] 1419
# No dup info
length(unique(data$full_address))
#> [1] 1376
# Some dupes
sample <- data %>% filter(full_address %in% c(
"18111 Nordhoff St, Northridge",
"111 E 223rd St, Carson"
))
nrow(sample)
#> [1] 9
# Uniques
length(unique(sample$full_address))
#> [1] 2
geocoded <- geocode(sample, full_address)
#> Passing 2 addresses to the Nominatim single address geocoder
#> Query completed in: 2.1 seconds
# Just 2 addresses passed, but the result still works!
nrow(geocoded)
#> [1] 9
head(geocoded[c("business_name", "area_code", "lat", "long")])
#> # A tibble: 6 x 4
#> business_name area_code lat long
#> <chr> <int> <dbl> <dbl>
#> 1 Jack in the Box 562 33.8 -118.
#> 2 Jack in the Box 310 33.8 -118.
#> 3 Burger King 213 34.2 -119.
#> 4 Freudian Sip (Arbor Grill Market) 818 34.2 -119.
#> 5 Freudian Sip (Building 18) 818 34.2 -119.
#> 6 Freudian Sip (Oviatt Library) 818 34.2 -119.
Created on 2022-01-12 by the reprex package (v2.0.1)
Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 4.1.2 (2021-11-01)
#> os Windows 10 x64 (build 22000)
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate Spanish_Spain.1252
#> ctype Spanish_Spain.1252
#> tz Europe/Paris
#> date 2022-01-12
#> pandoc 2.14.0.3 @ C:/Program Files/RStudio/bin/pandoc/ (via rmarkdown)
#>
#> - Packages -------------------------------------------------------------------
#> package * version date (UTC) lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.1)
#> backports 1.4.1 2021-12-13 [1] CRAN (R 4.1.2)
#> cli 3.1.0 2021-10-27 [1] CRAN (R 4.1.1)
#> crayon 1.4.2 2021-10-29 [1] CRAN (R 4.1.1)
#> curl 4.3.2 2021-06-23 [1] CRAN (R 4.1.1)
#> DBI 1.1.2 2021-12-20 [1] CRAN (R 4.1.2)
#> digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.2)
#> dplyr * 1.0.7 2021-06-18 [1] CRAN (R 4.1.1)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.1)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.1)
#> fansi 0.5.0 2021-05-25 [1] CRAN (R 4.1.1)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.1)
#> fs 1.5.2 2021-12-08 [1] CRAN (R 4.1.2)
#> generics 0.1.1 2021-10-25 [1] CRAN (R 4.1.1)
#> glue 1.6.0 2021-12-17 [1] CRAN (R 4.1.2)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.1.1)
#> htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.1)
#> httr 1.4.2 2020-07-20 [1] CRAN (R 4.1.1)
#> jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.1.1)
#> knitr 1.37 2021-12-16 [1] CRAN (R 4.1.2)
#> lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.1)
#> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.1)
#> pillar 1.6.4 2021-10-18 [1] CRAN (R 4.1.1)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.1)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.1)
#> R.cache 0.15.0 2021-04-30 [1] CRAN (R 4.1.1)
#> R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.1.1)
#> R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.1.1)
#> R.utils 2.11.0 2021-09-26 [1] CRAN (R 4.1.1)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.1)
#> reprex 2.0.1 2021-08-05 [1] CRAN (R 4.1.1)
#> rlang 0.4.12 2021-10-18 [1] CRAN (R 4.1.1)
#> rmarkdown 2.11 2021-09-14 [1] CRAN (R 4.1.1)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.1)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.2)
#> stringi 1.7.6 2021-11-29 [1] CRAN (R 4.1.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.1)
#> styler 1.6.2 2021-09-23 [1] CRAN (R 4.1.1)
#> tibble 3.1.6 2021-11-07 [1] CRAN (R 4.1.2)
#> tidygeocoder * 1.0.5 2021-11-02 [1] CRAN (R 4.1.2)
#> tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.1)
#> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.1)
#> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.1)
#> withr 2.4.3 2021-11-30 [1] CRAN (R 4.1.2)
#> xfun 0.29 2021-12-14 [1] CRAN (R 4.1.2)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.1.1)
#>
#> [1] C:/Users/diego/Documents/R/win-library/4.1
#> [2] C:/Program Files/R/R-4.1.2/library
#>
#> ------------------------------------------------------------------------------
@dieghernan Ah, I see. So, as I understand it, it only passes one instance of each duplicated address but applies the returned coordinates to all matching observations? That makes sense. I really appreciate the clarification!
Yup, that's what it does by default. The code for this functionality is in this file.
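Roughly, the behavior is equivalent to this pattern (a simplified sketch of the idea, not the package's actual code), using the sample table from the reprex above:
library(dplyr)
library(tidygeocoder)
# geocode only the unique addresses...
unique_results <- sample %>%
  distinct(full_address) %>%
  geocode(full_address)
# ...then join the coordinates back onto every original row
sample_geocoded <- sample %>%
  left_join(unique_results, by = "full_address")
nrow(sample_geocoded) # still 9 rows, now with lat/long columns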
If you want to return only one row per unique address (i.e. remove the duplication), you can use unique_only = TRUE.
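For example, with the sample from the reprex above, that would look something like:
# one row per unique address (2 rows instead of 9)
geocode(sample, full_address, unique_only = TRUE)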