ropensci/rinat

get_inat_obs() returns unrelated observations with taxon= some invalid synonyms

Closed this issue · 9 comments

With rinat_0.1.4.99, get_inat_obs() using taxon= sometimes returns completely unrelated records. So far it appears that this happens with taxonomic names that are correct (in ITIS) but currently invalid. Note that submitting an accepted name can result in simpleError: Your search returned zero results.
Vespertilio linereus
http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=946985

Minimum reproducible example (although different values for maxresults give me different unrelated results):

library(rinat)
oops1 <- get_inat_obs(taxon="Vespertilio linereus", maxresults=10, quality='research')
str(oops1)
'data.frame': 10 obs. of 33 variables:
$ scientific_name : chr "Myiopsitta monachus" "Columbina inca" "Junonia coenia" "Columbina inca" ...
$ datetime : chr "2015-10-02 15:32:27 -0500" "2015-10-01 16:32:41 -0500" "2010-08-31 00:00:00 -0500" "2015-10-01 16:32:33 -0500" ...
$ description : chr "" "" "" "" ...
$ place_guess : chr "" "Paso de Ovejas, Veracruz, México" "Bedford Avenue, Raleigh NC" "Paso de Ovejas, Veracruz, México" ...
$ latitude : num 41.8 19.3 35.8 19.3 36 ...
$ longitude : num -87.7 -96.4 -78.7 -96.4 -75.6 ...
$ tag_list : chr "" "" "" "" ...
$ common_name : chr "Monk Parakeet" "Inca Dove" "Common Buckeye" "Inca Dove" ...
$ url : chr "http://www.inaturalist.org/observations/2034475" "http://www.inaturalist.org/observations/2034451" "http://www.inaturalist.org/observations/2034445" "http://www.inaturalist.org/observations/2034439" ...
$ image_url : chr "http://static.inaturalist.org/photos/2466260/medium.?1443818029" "http://static.inaturalist.org/photos/2466219/medium.JPG?1443817153" "http://static.inaturalist.org/photos/2466210/medium.JPG?1443816918" "http://static.inaturalist.org/photos/2466202/medium.JPG?1443816858" ...
$ user_login : chr "elfaulkner" "aureliomolinahdz" "coatlicue" "aureliomolinahdz" ...
$ id : int 2034475 2034451 2034445 2034439 2034432 2034436 2034401 2034395 2034381 2034342
$ species_guess : chr "Monk parakeet" "Tórtola cola larga" "Common Buckeye" "Tórtola cola larga" ...
$ iconic_taxon_name : chr "Aves" "Aves" "Insecta" "Aves" ...
$ taxon_id : int 19349 3544 48505 3544 48505 424575 366731 67731 67731 4956
$ id_please : chr "false" "true" "false" "true" ...
$ num_identification_agreements : int 1 1 1 1 1 1 1 1 1 1
$ num_identification_disagreements: int 0 0 0 0 0 0 0 0 0 0
$ observed_on_string : chr "2015-10-02 3:32:27 PM CDT" "2015-10-01 16:32:41" "2010-08-31" "2015-10-01 16:32:33" ...
$ observed_on : chr "2015-10-02" "2015-10-01" "2010-08-31" "2015-10-01" ...
$ time_observed_at : chr "2015-10-03 09:32:27 +1300" "2015-10-02 10:32:41 +1300" "" "2015-10-02 10:32:33 +1300" ...
$ time_zone : chr "Central Time (US & Canada)" "Central Time (US & Canada)" "Eastern Time (US & Canada)" "Central Time (US & Canada)" ...
$ positional_accuracy : int 8 NA 805 NA 192 NA 30 23 NA 52
$ geoprivacy : chr "" "" "" "" ...
$ positioning_method : chr "gps" "" "" "" ...
$ positioning_device : chr "gps" "" "" "" ...
$ out_of_range : chr "false" "false" "" "false" ...
$ user_id : int 19145 43120 62388 43120 62388 135242 56928 118621 62971 62971
$ created_at : chr "2015-10-03 09:33:33 +1300" "2015-10-03 09:19:07 +1300" "2015-10-03 09:14:32 +1300" "2015-10-03 09:14:12 +1300" ...
$ updated_at : chr "2015-10-03 09:39:15 +1300" "2015-10-03 09:30:46 +1300" "2015-10-03 09:30:26 +1300" "2015-10-03 09:24:20 +1300" ...
$ quality_grade : chr "research" "research" "research" "research" ...
$ license : chr "CC-BY-NC" "CC-BY-NC" "CC-BY-NC" "CC-BY-NC" ...
$ oauth_application_id : int NA NA NA NA NA NA NA NA NA NA

sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] rinat_0.1.4.99

loaded via a namespace (and not attached):
[1] Rcpp_0.12.1 digest_0.6.8 MASS_7.3-43 R6_2.1.1 grid_3.2.2 plyr_1.8.3
[7] jsonlite_0.9.17 gtable_0.1.2 magrittr_1.5 scales_0.3.0 httr_1.0.0 ggplot2_1.0.1
[13] stringi_0.5-5 curl_0.9.3 reshape2_1.4.1 fortunes_1.5-2 proto_0.3-10 tools_3.2.2
[19] stringr_1.0.0 munsell_0.4.2 maps_2.3-11 colorspace_1.2-6

My use case is expanding my taxon names to all synonyms and children (subspecies), then submitting to rinat::get_inat_obs() as well as rgbif::get_occ() and rbison::bison(). Given that gbif, bison, and iNaturalist sometimes disagree on accepted vs. invalid taxa, and don't quite return all observations listed for subspecies when queried at the species level, I need to do this and discard duplicate records to get complete observation holdings.
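A minimal sketch of the "discard duplicate records" step for several name queries against the same provider. It assumes every query returns a data frame with the same columns and that the `id` column (as in the str() output above) uniquely identifies an observation. The helper name `combine_obs` and the toy data are hypothetical and not part of rinat.

```r
# Combine observation data frames from several name queries and
# keep only the first occurrence of each observation id.
# Assumption: all data frames share the same columns, including `id`.
combine_obs <- function(obs_list, key = "id") {
  # Drop NULL or empty results (e.g. from names that matched nothing)
  obs_list <- Filter(function(x) is.data.frame(x) && nrow(x) > 0, obs_list)
  if (length(obs_list) == 0) return(data.frame())
  combined <- do.call(rbind, unname(obs_list))
  combined <- combined[!duplicated(combined[[key]]), , drop = FALSE]
  rownames(combined) <- NULL
  combined
}

# Toy data standing in for two queries (species name and subspecies name)
a <- data.frame(id = c(1L, 2L),
                scientific_name = c("Lasiurus cinereus", "Lasiurus cinereus"),
                stringsAsFactors = FALSE)
b <- data.frame(id = c(2L, 3L),
                scientific_name = c("Lasiurus cinereus", "Lasiurus cinereus cinereus"),
                stringsAsFactors = FALSE)
merged <- combine_obs(list(a, NULL, b))
```

Records from different providers (gbif, bison) do not share an id space, so this only covers repeated queries to one source.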

@emhart did you have time to look at this?

I will take a look when I'm off vacation in a couple days.

On Mon, Oct 5, 2015 at 1:48 PM Scott Chamberlain notifications@github.com wrote:

@emhart did you have time to look at this?



Here's the page for that taxon on iNat: http://www.inaturalist.org/taxa/198042-Lasiurus-cinereus-cinereus

Using its taxon id we get http://inaturalist.org/observations.json?&quality_grade=research&taxon_id=198042. @philippi is that the specific taxon you want?

Looks like the taxon_id param from the API is not in the get_inat_obs() function. Okay if we add it, @emhart?

@philippi note that the docs do say for taxon_name and taxon_id params that

Note that this will also select observations of descendant taxa

which I think is what is happening: no results are found for Vespertilio linereus, but you're getting other taxa in addition.

Perhaps correct about getting other taxa, but note that the other taxa I'm getting are not synonyms or downstream:

table(oops1$iconic_taxon_name)
Arachnida      Aves     Fungi   Insecta  Mollusca  Reptilia
        1         2         1         4         1         1
oops1$scientific_name
[1] "Hericium erinaceus" "Phoebis sennae" "Helix aspersa"
[4] "Perithemis tenera" "Hentzia palmarum" "Poecilanthrax lucifer"
[7] "Limenitis archippus" "Colaptes auratus" "Cyanocitta cristata"
[10] "Sceloporus grammicus"

What I want are all observations in iNaturalist that might reasonably be informative about the distribution of what ITIS considers Lasiurus cinereus (TSN 180017, status valid). To me that includes observations under names of subspecies and under synonyms. I keep both the core name (L. cinereus) and the expanded name, so if necessary I can identify which observations used which name.

What I hope would happen for "Vespertilio linereus" is that either no records are returned, or I get a message that the name isn't accepted in iNaturalist. I don't like getting observations of "random" or arbitrary taxa.

@sckott Hmmm, I took a look and I agree we should add taxon_id and taxon_name; I'll push a fix in a few. @philippi the reason you get back all those crazy results is that when you search with the taxon name field and it doesn't find any results, the API returns every record, as in this example. This is an interesting edge case, though, where a name change seems to trigger this.

As far as what you'd like to see, I don't think adding taxon_id will fix it. Part of the issue is the API design, for instance this does give you the behaviour you're looking for:
oops1 <- get_inat_obs(q="Vespertillo linereus", maxresults=10, quality='research')

Using the query parameter vs. taxon yields different behaviours from the API if there are no results. The former returns "no results", whereas the latter, for unknown reasons, returns every record in the database.
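Because a failed taxon lookup can come back as a page of unrelated records, a caller may want to check the returned names before trusting them. The following is only a sketch: `matches_expected` is a hypothetical helper, not part of rinat, and it assumes the result has a `scientific_name` column as in the str() output earlier in this thread. The unrelated names in the toy data are taken from that output.

```r
# For each returned record, report whether its scientific name equals one of
# the expected names or extends one (e.g. a subspecies of an expected species).
# Assumption: `obs` is a data frame with a `scientific_name` column.
matches_expected <- function(obs, expected) {
  if (!is.data.frame(obs) || nrow(obs) == 0) return(logical(0))
  returned <- tolower(obs$scientific_name)
  expected <- tolower(expected)
  vapply(returned, function(nm) {
    any(nm == expected | startsWith(nm, paste0(expected, " ")))
  }, logical(1), USE.NAMES = FALSE)
}

expected_names <- c("Lasiurus cinereus", "Vespertilio linereus")

# Toy data: three unrelated names from the output above, plus one subspecies
obs <- data.frame(
  scientific_name = c("Myiopsitta monachus", "Columbina inca",
                      "Junonia coenia", "Lasiurus cinereus cinereus"),
  stringsAsFactors = FALSE
)
ok <- matches_expected(obs, expected_names)
if (!any(ok)) warning("No returned record matches the requested taxon")
```

Since iNaturalist may file a synonym's records under its own accepted name, `expected` would need to include the accepted name as well as the synonyms queried.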

As far as getting:

all observations in iNaturalist that might reasonably be informative about the distribution of what ITIS considers Lasiurus cinereus (TSN 180017 status valid). To me that includes observations under names of subspecies and under synonyms.

I think your best workflow would be to use taxize to clean your species names or convert them to IDs and then search on the IDs. That functionality isn't built into the iNaturalist API.

@sckott @philippi taxon is now gone: it has been split into taxon_id and taxon_name so the following both work.

taxon_name <- get_inat_obs(taxon_name="Lasiurus cinereus", maxresults=10, quality='research')
taxon_id <- get_inat_obs(taxon_id=198042, maxresults=10, quality='research')

Now with commit 21f5db when all results are returned, it throws an error that there are no results:

oops1 <- get_inat_obs(taxon_name="Vespertillo lieus", maxresults=10, quality='research',meta=T)

Error in get_inat_obs(taxon_name = "Vespertillo lieus", maxresults = 10, :
Your search returned zero results. Either your species of interest has no records or you entered an invalid search
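When looping over many candidate names, the zero-results error above would stop the loop. One possible way to handle it, sketched here under the assumption that the error message keeps the phrase "zero results" shown above, is to catch that specific error and return NULL. The names `safe_inat_obs` and `fetch` are hypothetical; `fetch` exists only so the wrapper can be exercised without network access.

```r
# Call a fetch function (by default rinat::get_inat_obs) and turn the
# "zero results" error into NULL; any other error is re-raised unchanged.
# Assumption: the error message contains the text "zero results".
safe_inat_obs <- function(..., fetch = rinat::get_inat_obs) {
  tryCatch(
    fetch(...),
    error = function(e) {
      if (grepl("zero results", conditionMessage(e), fixed = TRUE)) {
        return(NULL)
      }
      stop(e)
    }
  )
}

# Stand-in fetchers for illustration only (no network access needed)
fake_zero <- function(...) stop("Your search returned zero results.")
fake_ok <- function(...) data.frame(id = 1L)

res_zero <- safe_inat_obs(taxon_name = "Vespertillo lieus", fetch = fake_zero)
res_ok <- safe_inat_obs(taxon_name = "Lasiurus cinereus", fetch = fake_ok)
```

With the default `fetch`, a call such as safe_inat_obs(taxon_name = "Lasiurus cinereus", maxresults = 10, quality = "research") passes its arguments straight through to get_inat_obs().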

Do these updates help @philippi ?

Yes, that is great. Thank you!

Either throwing an error or returning no records is an expected response; getting back real observations of the wrong taxa is unexpected and would trip up at least some users (those who haven't struggled with older sources of observations enough to realize how wonderful rinat, rgbif, etc., really are, and who also don't know to check whether the returned observations are responsive). Because NPS now has an official agreement with iNaturalist for all NPS BioBlitz observations (and, likely, all other species observations we collect in parks) to go directly into iNaturalist, we're likely to get lots of folks using rinat in the next year or so.

I agree completely about using taxize first, as you mentioned earlier.
For bison I use taxize to expand to all accepted names in ITIS and query by
accepted TSN.
For gbif I do the same with gbif backbone (not quite CoL).
iNaturalist does not use any single name reference: it lets area curators
overrule CoL/gbif or ITIS.

This is a major problem for spocc: a name that is accepted & valid for one
occurrence API is not valid & accepted for another provider. If I could
write R code in your style & idiom, I'd submit code adding an optional
parameter to the individual package get_obs functions, which when true
would take the names & convert them to all possible valid/accepted tags.
Maybe in 6 months or so.

Thanks again for all of your work!

Tom 2

On Sat, Oct 17, 2015 at 10:19 PM, Edmund Hart notifications@github.com wrote:

Now with commit 21f5db when all results are returned, it throws an error that there are no results:

oops1 <- get_inat_obs(taxon_name="Vespertillo lieus", maxresults=10, quality='research',meta=T)

Error in get_inat_obs(taxon_name = "Vespertillo lieus", maxresults = 10, :
Your search returned zero results. Either your species of interest has no
records or you entered an invalid search

Do these updates help @philippi ?



@philippi

This is a major problem for spocc: a name that is accepted & valid for one occurrence API is not valid & accepted for another provider.

Sorry about this issue. It's sort of a fundamental problem with trying to interact with different data sources that all (or at least most) have their own set of accepted names. One safe way to go is to just use GBIF, as a lot of things flow into GBIF in the end. However, of course that's not always possible.

Names are a hard problem 😭

I'd submit code adding an optional parameter to the individual package get_obs functions

I assume you do know you can pass additional params to each source?