Standardize taxonomy across different data sources

taxastand

Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.

The goal of taxastand is to standardize species names from different sources, a common task in biology.

Very often different biologists use different synonyms to refer to the same species. If we want to join data from different sources, their taxonomic names must be standardized first. This is what taxastand seeks to do in a reproducible and efficient manner.

Important note

This package is in early development. There may be major, breaking changes to functionality in the near future. If you use this package, I highly recommend using a package manager like renv so that later updates won’t break your code.

Taxonomic standard

taxastand is based on matching names to a single taxonomic standard, that is, a database of accepted names and synonyms. As long as a single taxonomic standard is used, we can confidently resolve names from disparate sources.

The taxonomic standard must conform to Darwin Core standards. The user must provide this database as a data frame. There are many sources of taxonomic data online, including GBIF, Catalogue of Life, and ITIS, to name a few. The taxadb package provides convenient functions for downloading various taxonomic databases that use Darwin Core.
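As a rough illustration of what such a data frame might look like, the sketch below hand-builds a tiny reference table with the four Darwin Core columns used in the Example section of this README (taxonID, acceptedNameUsageID, taxonomicStatus, scientificName). The two rows are copied from that example's output. The object names `ref_taxonomy`, `is_synonym`, `accepted_idx`, and `synonym_lookup` are made up for illustration, and this is not how taxastand resolves names internally; it only shows how a synonym row points to its accepted name.

```r
# Toy reference taxonomy with the Darwin Core columns used in this README.
# Rows are copied from the output in the Example section; a real standard
# would contain many thousands of names.
ref_taxonomy <- data.frame(
  taxonID = c(54115097, 54133783),
  acceptedNameUsageID = c(NA, 54115097),
  taxonomicStatus = c("accepted name", "synonym"),
  scientificName = c(
    "Cephalomanes crassum (Copel.) M. G. Price",
    "Trichomanes crassum Copel."
  ),
  stringsAsFactors = FALSE
)

# A synonym points to its accepted name through acceptedNameUsageID,
# which holds the taxonID of the accepted name.
is_synonym <- ref_taxonomy$taxonomicStatus == "synonym"
accepted_idx <- match(
  ref_taxonomy$acceptedNameUsageID[is_synonym],
  ref_taxonomy$taxonID
)

# Pair each synonym with the accepted name it points to
synonym_lookup <- data.frame(
  synonym = ref_taxonomy$scientificName[is_synonym],
  accepted = ref_taxonomy$scientificName[accepted_idx],
  stringsAsFactors = FALSE
)
print(synonym_lookup)
```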

Installation

taxastand can be installed from r-universe or GitHub.

install.packages("taxastand", repos = 'https://joelnitta.r-universe.dev')

OR

# install.packages("remotes")
remotes::install_github("joelnitta/taxastand")

Dependencies

taxastand depends on taxon-tools for taxonomic name matching.

There are two options for using this dependency.

  • Install Docker and set docker = TRUE when using taxastand functions.

OR

  • Install the two programs included in taxon-tools, parsenames and matchnames.
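One way to choose between the two options at run time is to check whether the taxon-tools programs are already on the search path, and fall back to Docker if they are not. The sketch below uses only base R. The variable names `taxon_tools`, `found`, and `use_docker` are illustrative, and the final call is left commented out because it needs taxastand plus one of the two setups above.

```r
# Names of the two programs shipped with taxon-tools
taxon_tools <- c("parsenames", "matchnames")

# Sys.which() returns "" for programs that are not on the PATH
found <- nzchar(Sys.which(taxon_tools))
names(found) <- taxon_tools
print(found)

# Fall back to Docker if either program is missing
use_docker <- !all(found)

# Assumed usage, passing the flag described above:
# ts_resolve_names("Gonocormus minutum", filmy_taxonomy, docker = use_docker)
```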

Similar work

  • ROpenSci has a task view summarizing many tools available for taxonomy.

  • taxize is the “granddaddy” of taxonomy packages in R. It can search around 20 different taxonomic databases for names and retrieve taxonomic information.

  • TNRS, the Taxonomic Name Resolution Service, is a web application that resolves taxonomic names of plants according to one of six databases.

  • taxizedb downloads taxonomic databases and provides tools to interface with them through SQL.

  • taxadb also downloads and searches taxonomic databases. It can interface with them either through SQL or in-memory in R.

  • Taxonstand has a very similar goal to taxastand, but it only uses The Plant List (TPL) as its taxonomic standard and does not allow users to provide their own. Note that TPL has not been updated since 2013.

Motivation

Although existing web-based solutions for taxonomic name resolution are very useful, they may not be ideal for all situations: the choice of reference database to use for standardization is limited, they may not be able to handle very large queries, and the user has no guarantee that the same input will yield the same output at a later date due to changes in the remote database.

Furthermore, matching taxonomic names is not straightforward, since they are complex data structures with multiple components (e.g., genus, specific epithet, basionym author, combination author, etc.). Of the tools mentioned above, only TNRS can fuzzily match taxonomic names based on their parsed components, but it does not allow the use of a local reference database.

The motivation for taxastand is to provide greater flexibility and reproducibility by allowing for complete version control of the code and database used for name resolution, while implementing fuzzy matching of parsed taxonomic names.
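To make the idea of fuzzy matching on parsed components a little more concrete, here is a deliberately simplified sketch in base R. It is not the algorithm used by taxon-tools or taxastand: the helper `split_name()`, the threshold `max_dist`, and the two-name reference vector are all assumptions for illustration, and authors are ignored entirely. It compares genus and specific epithet separately using edit distance (`utils::adist()`).

```r
# Crude stand-in for name parsing: split a name (without author) into
# genus and specific epithet. Real parsers also handle authors, ranks, etc.
split_name <- function(x) {
  parts <- strsplit(x, " ", fixed = TRUE)[[1]]
  list(genus = parts[1], epithet = parts[2])
}

query <- split_name("Gonocormus minutum")
reference <- c("Gonocormus minutus", "Crepidomanes minutum")

# Edit distance per component between the query and each reference name
distances <- t(sapply(reference, function(ref) {
  ref_parts <- split_name(ref)
  c(
    genus = drop(utils::adist(query$genus, ref_parts$genus)),
    epithet = drop(utils::adist(query$epithet, ref_parts$epithet))
  )
}))
print(distances)

# Keep reference names where every component is within max_dist edits
max_dist <- 1
fuzzy_hits <- rownames(distances)[
  distances[, "genus"] <= max_dist & distances[, "epithet"] <= max_dist
]
print(fuzzy_hits)
```

Comparing components separately means a one-letter slip in the epithet can still match, while a name that shares only the epithet with a different genus is rejected.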

Example

Here is an example of fuzzy matching followed by resolution of synonyms using the dataset included with the package.

library(taxastand)

# Load example reference taxonomy in Darwin Core format
data(filmy_taxonomy)

# Take a look at the columns used by taxastand
head(filmy_taxonomy[c(
  "taxonID", "acceptedNameUsageID", "taxonomicStatus", "scientificName")])
#>    taxonID acceptedNameUsageID taxonomicStatus
#> 1 54115096                  NA   accepted name
#> 2 54133783            54115097         synonym
#> 3 54115097                  NA   accepted name
#> 4 54133784            54115098         synonym
#> 5 54115098                  NA   accepted name
#> 6 54133785            54115099         synonym
#>                              scientificName
#> 1             Cephalomanes atrovirens Presl
#> 2                Trichomanes crassum Copel.
#> 3 Cephalomanes crassum (Copel.) M. G. Price
#> 4           Trichomanes densinervium Copel.
#> 5 Cephalomanes densinervium (Copel.) Copel.
#> 6         Trichomanes infundibulare Alderw.

# As a test, resolve a misspelled name
ts_resolve_names("Gonocormus minutum", filmy_taxonomy)
#>                query                        resolved_name
#> 1 Gonocormus minutum Crepidomanes minutum (Bl.) K. Iwats.
#>                     matched_name resolved_status matched_status match_type
#> 1 Gonocormus minutus (Bl.) Bosch   accepted name        synonym auto_fuzzy

# We can now use the `resolved_name` column of this result for downstream
# analyses joining on other datasets that have been resolved to the same
# reference taxonomy.
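The downstream join mentioned in the comment above could look roughly like the following base R sketch. Both data frames are hand-written stand-ins for resolved results: the `query` and `resolved_name` values in `traits` come from the output above, while `occurrences`, its query string, and the `leaf_length_cm` and `locality` values are hypothetical and only there to give the join something to combine.

```r
# Hypothetical resolved names from one dataset, with a made-up trait value
traits <- data.frame(
  query = "Gonocormus minutum",
  resolved_name = "Crepidomanes minutum (Bl.) K. Iwats.",
  leaf_length_cm = 1.2,
  stringsAsFactors = FALSE
)

# Hypothetical resolved names from a second dataset that used a different
# name for the same species, with a made-up locality
occurrences <- data.frame(
  query = "Crepidomanes minutum",
  resolved_name = "Crepidomanes minutum (Bl.) K. Iwats.",
  locality = "Site 1",
  stringsAsFactors = FALSE
)

# Join on the standardized name rather than on the original query strings
joined <- merge(
  traits, occurrences,
  by = "resolved_name",
  suffixes = c("_traits", "_occurrences")
)
print(joined)
```

Because both tables were resolved against the same reference taxonomy, rows line up on `resolved_name` even though the original query strings differ.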

Citing this package

If you use this package, please cite it! Here is an example:

Nitta, JH (2021) taxastand: Taxonomic name standardization in R. https://doi.org/10.5281/zenodo.5726390

The example DOI above is for the overall package.

Here is the latest DOI, which you should use if you are using the latest version of the package:

DOI

You can find DOIs for older versions by viewing the “Releases” menu on the right.

You should also cite the software that taxastand relies on, taxon-tools: https://github.com/camwebb/taxon-tools