jlacko/RCzechia

Reconsider return value of `geocode()` when no match

petrbouchal opened this issue · 7 comments

When using geocode() via {purrr}, working with the no-match values can get difficult. It might be more straightforward to always return a tibble, of length zero if no match is returned by the API. This is the approach taken by {tidygeocoder} and I find it a bit easier to use.

{tidygeocoder} is kind of a gold standard in realm of geocoding; flattery in form of imitation might be a good idea.

Just to summarise / clarify: you are requesting that the output is always in form of a (tidy) data frame. With zero rows if no match was found. This is of course doable.

However, since you opened the issue, can you advise me on value of knowing the difference between these two:

  • "no rows returned because no match found" - implemented as return value NA
  • "no rows returned, because ČÚZK API is down or not accepting requests (status 503)" - implemented as return value NULL

When I was implementing the API (some time before I learned about tidygeocoder) this distinction seemed as material.

Yes that's right, thanks for checking.

Good point, those are different of course. Personally, my mental model of code accessing API tends to be that when the API returns HTTP status > 399, I get an error, or at least a warning. I don't know if that is overkill in this case? Even in the vectorised case, I would need to rerun the entire call/set of calls to the API if some of them returned NULL - the HTTP status code won't be different for different inputs (from one row to the next), but depends mainly on whether the server is down or I have broken some kind of policy?

Alternatively: I don't know off the top of my head how easy it is to handle NULLs in a column of data frames (e.g. when unnesting or retrieving length of the result) - I suspect while still more difficult than a zero-length df, NULL is better than NA and might be an OK fallback to the status 503 situation?

Would you mind cloning the development branch tidygeo-flattery - https://github.com/jlacko/RCzechia/tree/tidygeo-flattery - and double checking that the result works in your use case?

I am somewhat afraid of raising errors & warnings in case of unavailable internet resources (I was through this in CRAN checks before, and prof. Ripley was quite explicit in explaining the CRAN policy in this matter; I have little desire to go through that again) but there is a message.

Apart from that both geo and revgeo functions are now guaranteed to return a (possibly empty) data frame.

Thanks, this looks great!

Here's a reprex of what I see for "normal use" and for my use case:

library(tibble)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)
library(tidyr)
library(RCzechia)
#> Loading required package: sf
#> Linking to GEOS 3.8.1, GDAL 3.1.4, PROJ 6.3.1

df <- tibble(id = 1:2,
             address_in = c("Václavské náměstí 1, Brno", # should return nothing
                            "Svinošice 115"))

geocode(address = df$address_in)
#> Simple feature collection with 3 features and 3 fields
#> Attribute-geometry relationship: 3 constant, 0 aggregate, 0 identity
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 16.57605 ymin: 49.33335 xmax: 16.5809 ymax: 49.33713
#> geographic CRS: WGS 84
#>          target                 typ                   address
#> 1 Svinošice 115        AdresniMisto č.p. 115, 67922 Svinošice
#> 2 Svinošice 115 ParcelaDefinicniBod             Svinošice 115
#> 3 Svinošice 115 ParcelaDefinicniBod         Svinošice 354/115
#>                    geometry
#> 1 POINT (16.57697 49.33662)
#> 2 POINT (16.57605 49.33335)
#> 3  POINT (16.5809 49.33713)

df2 <- df %>% 
  mutate(gcd = map(address_in, geocode),
         n_matches = map_int(gcd, nrow),
         gcd_slct = map(gcd, ~filter(., typ == "AdresniMisto")))
#> Impossible to geocode any of the addresses provided.
#> Warning in min(cc[[1]], na.rm = TRUE): no non-missing arguments to min;
#> returning Inf
#> Warning in min(cc[[2]], na.rm = TRUE): no non-missing arguments to min;
#> returning Inf
#> Warning in max(cc[[1]], na.rm = TRUE): no non-missing arguments to max;
#> returning -Inf
#> Warning in max(cc[[2]], na.rm = TRUE): no non-missing arguments to max;
#> returning -Inf

# are those min/max warnings emitted inside `geocode()`?
geocode(df$address_in[1])
#> Impossible to geocode any of the addresses provided.
#> Warning in min(cc[[1]], na.rm = TRUE): no non-missing arguments to min;
#> returning Inf
#> Warning in min(cc[[2]], na.rm = TRUE): no non-missing arguments to min;
#> returning Inf
#> Warning in max(cc[[1]], na.rm = TRUE): no non-missing arguments to max;
#> returning -Inf
#> Warning in max(cc[[2]], na.rm = TRUE): no non-missing arguments to max;
#> returning -Inf
#> Simple feature collection with 0 features and 3 fields
#> Attribute-geometry relationship: 3 constant, 0 aggregate, 0 identity
#> bbox:           xmin: Inf ymin: Inf xmax: -Inf ymax: -Inf
#> geographic CRS: WGS 84
#> [1] target   typ      address  geometry
#> <0 rows> (or 0-length row.names)

df2
#> # A tibble: 2 x 5
#>      id address_in                gcd              n_matches gcd_slct        
#>   <int> <chr>                     <list>               <int> <list>          
#> 1     1 Václavské náměstí 1, Brno <sf[,4] [0 × 4]>         0 <sf[,4] [0 × 4]>
#> 2     2 Svinošice 115             <sf[,4] [3 × 4]>         3 <sf[,4] [1 × 4]>

df2 %>% 
  unnest(gcd)
#> # A tibble: 3 x 8
#>      id address_in  target  typ     address                   geometry n_matches
#>   <int> <chr>       <chr>   <chr>   <chr>                  <POINT [°]>     <int>
#> 1     2 Svinošice … Svinoš… Adresn… č.p. 115…      (16.57697 49.33662)         3
#> 2     2 Svinošice … Svinoš… Parcel… Svinošic…      (16.57605 49.33335)         3
#> 3     2 Svinošice … Svinoš… Parcel… Svinošic…       (16.5809 49.33713)         3
#> # … with 1 more variable: gcd_slct <list>

df2 %>% 
  unnest(gcd_slct)
#> # A tibble: 1 x 8
#>      id address_in   gcd         n_matches target     typ       address         
#>   <int> <chr>        <list>          <int> <chr>      <chr>     <chr>           
#> 1     2 Svinošice 1… <sf[,4] [3…         3 Svinošice… AdresniM… č.p. 115, 67922…
#> # … with 1 more variable: geometry <POINT [°]>

Created on 2021-03-10 by the reprex package (v0.3.0)

In terms of output structure this is exactly as expected/needed and the warning is fine (as soon as I remember that it only applies to one iteration even if it says "none", but that's up to me.)

Part of me wonders: should this be just a message? Is a geocoding non-match something to worry about, or merely something normal a user might want to be simply informed about? Also, should warnings be reserverd for situations where the API is unavailable/returns 503 (if returning errors is faux pas per CRAN?)

Couple of things to consider:

  • not quite sure whether the other warnings should be there?
  • the wording of the warning might be clearer on the cause, e.g. "the geocoding API did not return matches for any of the input addresses" so it's clear it is a no-match scenario, not an API fail scenario.
  • not part of this issue, but I noticed that it might be slightly confusing that after I pass input via the address argument, in the function's output data.frame it appears in the target column, while the address column contains the address returned by the API. It might be worth updating the names?

Thanks @petrbouchal for the valuable input! I have updated the tidygeo_flattery branch, incorporating most of your suggestions (the only thing I am wary of is message / warning distinction; I prefer to keep all internet related stuff as messages - and not warnings, even if that would be more appropriate in an ideal world - as I live in constant fear of the gods of CRAN).

To be specific: I have created a fallback object that is returned in case of no match. This ensures that the output of geocode() function is guaranteed to be a {sf} data.frame. This has unfortunate consequence that the result has 1 row.

While it may be possible to create a data.frame with no rows - and just zero lenght vectors for columns - it is not possible to create a {sf} data frame with no rows; the base case is an object with a single empty (but existing) geometry. So one row, with NAs for address, type and result, it must be.

Something along these lines:

Simple feature collection with 1 feature and 3 fields (with 1 geometry empty)
Geometry type: GEOMETRY
Dimension:     XY
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  WGS 84
  address  type result                 geometry
1    <NA> <NA>   <NA> GEOMETRYCOLLECTION EMPTY

Please advise if having this return value is preferrable to simple NULL (without {sf}, or indeed data.frame, class).

In addition to this I have

  • amended the unsuccessful geocoding message to read CUZK API returned no match for any of the adresses provided. (so that it is clear it was not a fault of a technical nature, but at data level). This carries more relevant information than my original "impossible" (but why impossible?) wording; thanks!
  • revised the columns returned by geocode() - the columns are renamed (a breaking change, but your suggestion is sound & your confusion with my original logic understandable):
    • address - the address searched (address input)
    • type - type of record matched by API
    • result - address as returned by API / recorded in RÚIAN
    • geometry - hidden column with spatial point data

In other words I return the value provided in address argument to geocode() in address field, and the standardized RUIAN information is in result field (and not address as previously).

As this constitutes a breaking change I would greatly appreciate your comments before merging to master, and eventually CRAN.

Thanks, this looks good to me! I also realised that in fact the tibble returned by {tidygeocoder} also has a row with NAs when the geocoder returns no match - in its case it is NA coordinates.

Up to you of course to consider how comfortable you are introducing breaking changes - I don't want to force that on you!

In any case, thanks!

Thanks for your kind words!
I have realized that I need to introduce breaking changes in other context anyhow - the name of the 'geometry' column in the new RUIAN based datasets is not 'geometry' as should be, but varies - sometimes 'OriginalniHranice', sometimes 'GeneralizovaneHranice' and other times just 'geom'. This somehow crept in between versions 1.5.0 and the current one.
A standardization is in order if one expects to use rbind or bind_rows or the like on the datasets - a typical use case would be combining MOMC & Obce in a single data frame.
And since I will be making one breaking change I might as well do two at once; and users will always have the option of downgrading version if this should inconvenience them too much.
I will merge the tidygeo-flattery branch to DEV master, and proceed to CRAN as time allows.
Thanks again, your advice & input was greatly appreciated!