Species Distribution Models for Prioritization in Denmark

Derek Corcoran 21/12, 2022

The goal of SpeciesDistributionModelsDanmark is to explore and generate species distribution models for the prioritization of Denmark based on its biological diversity and land use.

1 Taxonomic cleaning

First we will read a file with all the taxa found in arter.dk, on the 21st of September of 2022.

For this we will need the following packages (Wickham and Bryan 2022; Chamberlain et al. 2020; Chamberlain and Boettiger 2017):

Load packages

library(readxl)
library(taxize)
library(rgbif)
library(janitor)
library(dplyr)
library(stringr)

We first read the file with all the presences:

Taxa <- readxl::read_xlsx("2022-09-21.xlsx") |> 
  janitor::clean_names() |> 
  dplyr::select(videnskabeligt_navn)

This file has 60,958 entries; however, it only has 60,915 unique entries in the attribute videnskabeligt_navn.
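As a minimal sketch of this comparison, the following contrasts the total number of entries with the number of unique ones. The vector toy_names is made up for illustration and is not the arter.dk file; only base R is used.

```r
# Hypothetical names, invented for illustration; not read from the arter.dk file
toy_names <- c("Abies concolor", "Abies concolor", "Acanthis flammea", "Acari")

n_total  <- length(toy_names)          # every entry, duplicates included
n_unique <- length(unique(toy_names))  # distinct names only

message(n_total, " entries, ", n_unique, " unique")
```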

1.1 Cleaning using Taxize

First we generate a new data frame considering only the unique videnskabeligt_navn:

NewTaxa <- data.frame(Taxa = sort(unique(Taxa$videnskabeligt_navn)), score = NA, matched_name2 = NA) |> 
  tibble::rowid_to_column(var = "TaxaID")

We then clean it, starting with taxize:

Taxize clean

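dir.create("Results", showWarnings = FALSE)

for (i in seq_len(nrow(NewTaxa))) {
  try({
    # Resolve one name against data source 11 (GBIF Backbone Taxonomy)
    Temp <- taxize::gnr_resolve(NewTaxa$Taxa[i],
                                data_source_ids = "11", canonical = TRUE, best_match_only = TRUE) |> 
      dplyr::select(score, matched_name2)
    # Columns 3 and 4 of NewTaxa are score and matched_name2
    NewTaxa[i, 3:4] <- Temp
    # Report progress and save a checkpoint every 50 taxa
    if ((i %% 50) == 0) {
      message(paste(i, "of", nrow(NewTaxa), "Ready!", Sys.time()))
      readr::write_csv(NewTaxa, "Results/Cleaned_Taxa_Taxize.csv")
    }
    gc()
  }, silent = TRUE)
}

# Save once more so the rows after the last checkpoint are not lost
readr::write_csv(NewTaxa, "Results/Cleaned_Taxa_Taxize.csv")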
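The loop above relies on try() so that a failed lookup leaves the row as NA instead of stopping the run. A minimal offline sketch of that pattern follows. The function resolve_one() is a hypothetical stand-in for the web lookup, and its rule (names without a space fail) is invented purely to trigger an error; it is not how taxize decides.

```r
# Hypothetical stand-in for a web lookup; the failure rule is invented
resolve_one <- function(name) {
  if (!grepl(" ", name)) stop("no match for ", name)
  data.frame(score = 0.988, matched_name2 = name, stringsAsFactors = FALSE)
}

toy <- data.frame(
  Taxa = c("Abraeinae", "Alisma plantago-aquatica", "Anemone apennina"),
  score = NA_real_,
  matched_name2 = NA_character_,
  stringsAsFactors = FALSE
)

for (i in seq_len(nrow(toy))) {
  # try() keeps the loop alive when a lookup fails; that row simply stays NA
  try({
    res <- resolve_one(toy$Taxa[i])
    toy[i, c("score", "matched_name2")] <- res
  }, silent = TRUE)
}

print(toy)
```

Rows that remain NA after such a loop are the ones that can later be dropped or inspected by hand.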

This cleaning ends up eliminating 1,313 taxa, which are mostly families, subfamilies or hybrid species, as seen in table 1.1.

TaxaID  Taxa                             score
51      Abraeinae                        NA
52      Abraeini                         NA
89      Acaenitinae                      NA
154     Acanthocinini                    NA
168     Acanthoderini                    NA
210     Acari                            NA
381     Achelata                         NA
406     Achillea ptarmica × salicifolia  NA
452     Aciculata                        NA
463     Aciliini                         NA

Table 1.1: First 10 taxa eliminated by taxize
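One way such a list of eliminated taxa could be produced is sketched below, assuming that an unresolved taxon is one whose matched_name2 is still NA after the loop. The small data frame is a toy built from three rows shown in this document, not the full NewTaxa object.

```r
# Toy subset assembled from rows shown in tables 1.1 and 1.2
toy <- data.frame(
  TaxaID = c(51, 52, 1829),
  Taxa = c("Abraeinae", "Abraeini", "Alisma plantago-aquatica"),
  score = c(NA, NA, 0.988),
  matched_name2 = c(NA, NA, "Alisma plantago-aquatica"),
  stringsAsFactors = FALSE
)

# Assumption: taxa with no matched name are the ones eliminated
eliminated <- toy[is.na(toy$matched_name2), c("TaxaID", "Taxa", "score")]
print(eliminated)
```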

Of the remaining taxa that were identified by taxize, some names that were unique in our initial file ended up being resolved to the same species. Some examples can be seen in table 1.2.

TaxaID  Taxa                                                         score  matched_name2
1829    Alisma plantago-aquatica                                     0.988  Alisma plantago-aquatica
1830    Alisma plantago-aquatica f. submersa                         0.988  Alisma plantago-aquatica
2451    Ammophila arenaria                                           0.988  Ammophila arenaria
2453    Ammophila arenaria × Calamagrostis epigejos nm. epigeioidea  0.988  Ammophila arenaria
2454    Ammophila arenaria × Calamagrostis epigejos nm. intermedia   0.988  Ammophila arenaria
2455    Ammophila arenaria × Calamagrostis epigejos nm. subarenaria  0.988  Ammophila arenaria
3047    Anemone apennina                                             0.988  Anemone apennina
3048    Anemone apennina var. apennina                               0.988  Anemone apennina
3607    Anthyllis vulneraria subsp. vulneraria                       0.999  Anthyllis vulneraria vulneraria
3611    Anthyllis vulneraria var. vulneraria                         0.999  Anthyllis vulneraria vulneraria
4963    Arrhenia acerosa                                             0.988  Arrhenia acerosa
4964    Arrhenia acerosa var. acerosa                                0.988  Arrhenia acerosa

Table 1.2: First 12 duplicate species
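Duplicates of this kind can be found by counting how many input names share the same matched_name2. The following base R sketch does so on four rows copied from table 1.2; it is an illustration, not necessarily the code used to build the table.

```r
# Four rows copied from table 1.2
toy <- data.frame(
  TaxaID = c(1829, 1830, 2451, 3047),
  Taxa = c("Alisma plantago-aquatica",
           "Alisma plantago-aquatica f. submersa",
           "Ammophila arenaria",
           "Anemone apennina"),
  matched_name2 = c("Alisma plantago-aquatica",
                    "Alisma plantago-aquatica",
                    "Ammophila arenaria",
                    "Anemone apennina"),
  stringsAsFactors = FALSE
)

# Count how many input names map to each resolved name
n_per_match <- table(toy$matched_name2)
dup_names <- names(n_per_match)[n_per_match > 1]

# Keep only the rows whose resolved name occurs more than once
duplicates <- toy[toy$matched_name2 %in% dup_names, ]
print(duplicates)
```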

All in all, we started with 60,915 unique taxa and ended up with 59,197 unique taxa.

1.2 Cleaning using RGBIF

In order to do this cleanly, we keep just one observation of each taxon found by taxize in its column matched_name2:

Unique taxize names

Cleaned_Taxize <- NewTaxa |> 
  dplyr::filter(!is.na(matched_name2)) |> 
  dplyr::group_by(matched_name2) |> 
  dplyr::filter(TaxaID == min(TaxaID)) |> 
  ungroup()
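To make the effect of this step concrete, here is a base R sketch on rows copied from table 1.2. Instead of the grouped filter, it sorts by TaxaID and drops repeated matched_name2 values with duplicated(), which also keeps the row with the lowest TaxaID in each group.

```r
# Rows copied from table 1.2, deliberately out of order
toy <- data.frame(
  TaxaID = c(1830, 1829, 2453, 2451),
  matched_name2 = c("Alisma plantago-aquatica",
                    "Alisma plantago-aquatica",
                    "Ammophila arenaria",
                    "Ammophila arenaria"),
  stringsAsFactors = FALSE
)

# Sort so the lowest TaxaID of each name comes first
toy <- toy[order(toy$TaxaID), ]

# duplicated() flags every repeat after the first occurrence
kept <- toy[!duplicated(toy$matched_name2), ]
print(kept)
```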

We then pass these names through rgbif and rename the input name column (verbatim_name) to matched_name2, so that it is the same as in Cleaned_Taxize. We keep the matched name, the confidence of the match reported by rgbif, and the taxonomic groups:

rgbif call

rgbif_find <- rgbif::name_backbone_checklist(Cleaned_Taxize$matched_name2) |>
  # Change name to match the cleaned_taxize dataset
  dplyr::rename(matched_name2 = verbatim_name) |> 
  dplyr::relocate(matched_name2, .before = everything()) |> 
  dplyr::select(matched_name2, confidence, kingdom, phylum, order, family, genus, species)

readr::write_csv(rgbif_find, "Results/Cleaned_Taxa_rgbif.csv")

Since we are only interested in taxa that are resolved at least to the species level, we filter out the records that have not been resolved to that level:

Species only

Species_Only <- rgbif_find |> 
  dplyr::filter(!is.na(species))

This eliminates 17,278 rows of our data set, leaving us with 41,934 rows. However, we still have to filter out synonyms and subspecies. In table 1.3 we can see the first 10 records in Species_Only that lead to duplicated species names; here we find both synonyms and subspecies. In the end, we end up with 40,328 unique species.

matched_name2                    confidence  kingdom   phylum        order           species
Abies concolor                   98          Plantae   Tracheophyta  Pinales         Abies concolor
Abies lowiana                    98          Plantae   Tracheophyta  Pinales         Abies concolor
Abies nordmanniana               99          Plantae   Tracheophyta  Pinales         Abies nordmanniana
Abies nordmanniana equi-trojani  97          Plantae   Tracheophyta  Pinales         Abies nordmanniana
Abies nordmanniana nordmanniana  98          Plantae   Tracheophyta  Pinales         Abies nordmanniana
Abrothallus bertianus            97          Fungi     Ascomycota    Abrothallales   Abrothallus parmeliarum
Abrothallus parmeliarum          97          Fungi     Ascomycota    Abrothallales   Abrothallus parmeliarum
Acalitus stenaspis               99          Animalia  Arthropoda    Trombidiformes  Acalitus stenaspis
Acanthis cabaret                 98          Animalia  Chordata      Passeriformes   Acanthis flammea
Acanthis flammea                 99          Animalia  Chordata      Passeriformes   Acanthis flammea

Table 1.3: First 10 duplicate species

Finally, we make a data.frame with the final species list:

Final species list

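FinalSpeciesList <- Species_Only |> 
  dplyr::group_by(species) |> 
  # Keep exactly one row per species: the one with the highest confidence
  dplyr::slice_max(confidence, n = 1, with_ties = FALSE) |> 
  dplyr::ungroup()
readr::write_csv(FinalSpeciesList, "Results/FinalSpeciesList.csv")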
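When choosing the highest-confidence record per species, ties deserve attention: in table 1.3 both Abrothallus names have a confidence of 97 and map to the same species. The base R sketch below, on rows copied from table 1.3, contrasts a selection that keeps ties with one that keeps exactly one row per species. Which tied row should win is a judgment call; here the first one in the original order is kept, which is an assumption made for illustration.

```r
# Rows copied from table 1.3
toy <- data.frame(
  matched_name2 = c("Abies concolor", "Abies lowiana",
                    "Abrothallus bertianus", "Abrothallus parmeliarum",
                    "Acanthis cabaret", "Acanthis flammea"),
  confidence = c(98, 98, 97, 97, 98, 99),
  species = c("Abies concolor", "Abies concolor",
              "Abrothallus parmeliarum", "Abrothallus parmeliarum",
              "Acanthis flammea", "Acanthis flammea"),
  stringsAsFactors = FALSE
)

# Rows whose confidence equals the maximum within their species (ties survive)
is_max <- toy$confidence == ave(toy$confidence, toy$species, FUN = max)
with_ties <- toy[is_max, ]

# Sort by species, then by descending confidence, and keep the first row of each
ordered <- toy[order(toy$species, -toy$confidence), ]
one_per_species <- ordered[!duplicated(ordered$species), ]

message(nrow(with_ties), " rows with ties, ", nrow(one_per_species), " rows without")
```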

2 Presence download

3 Presence cleaning

Chamberlain, Scott, and Carl Boettiger. 2017. “R Python, and Ruby Clients for GBIF Species Occurrence Data.” PeerJ PrePrints. https://doi.org/10.7287/peerj.preprints.3304v1.

Chamberlain, Scott, Eduard Szoecs, Zachary Foster, Zebulun Arendsee, Carl Boettiger, Karthik Ram, Ignasi Bartomeus, et al. 2020. Taxize: Taxonomic Information from Around the Web. https://github.com/ropensci/taxize.

Wickham, Hadley, and Jennifer Bryan. 2022. Readxl: Read Excel Files. https://CRAN.R-project.org/package=readxl.