Species Distribution Models for Prioritization in Denmark

Derek Corcoran 21/12, 2022

1 Taxonomic cleaning
- 1.1 Cleaning using Taxize
- 1.2 Cleaning using RGBIF
2 Presence download
3 Presence cleaning

The goal of SpeciesDistributionModelsDanmark is to explore and generate species distribution models for the prioritization of Denmark based on its biological diversity and land use.

1 Taxonomic cleaning

First we read a file with all the taxa found in arter.dk on the 21st of September 2022.

For this we will need the following packages (Wickham and Bryan 2022; Chamberlain et al. 2020; Chamberlain and Boettiger 2017):

Load packages

library(readxl)
library(taxize)
library(rgbif)
library(janitor)
library(dplyr)
library(stringr)

We first read the file and keep only the column with the scientific names (videnskabeligt_navn):

Taxa <- readxl::read_xlsx("2022-09-21.xlsx") |> 
  janitor::clean_names() |> 
  dplyr::select(videnskabeligt_navn)

This file has 60,958 entries; however, it only has 60,915 unique entries in the attribute videnskabeligt_navn.
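The two counts can be compared directly in R. The following is a minimal sketch that uses a small made-up vector in place of the real videnskabeligt_navn column; the names and object names are illustrative only:

```r
# Toy stand-in for Taxa$videnskabeligt_navn (illustrative names, not the real file)
toy_names <- c("Abies alba", "Abies alba", "Acari", "Alisma plantago-aquatica")

n_entries  <- length(toy_names)          # total number of rows
n_unique   <- length(unique(toy_names))  # number of distinct scientific names
n_repeated <- n_entries - n_unique       # rows that repeat an earlier name
```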

1.1 Cleaning using Taxize

First we generate a new data frame considering only the unique videnskabeligt_navn:

NewTaxa <- data.frame(Taxa = sort(unique(Taxa$videnskabeligt_navn)), score = NA, matched_name2 = NA) |> 
  tibble::rowid_to_column(var = "TaxaID")

We then clean it, using taxize first:

Taxize clean

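# Create the output folder (no warning if it already exists)
dir.create("Results", showWarnings = FALSE)

for (i in seq_len(nrow(NewTaxa))) {
  try({
    # Resolve the name against data source 11 (GBIF Backbone Taxonomy in the
    # Global Names Resolver), keeping only the best match
    Temp <- taxize::gnr_resolve(NewTaxa$Taxa[i],
                                data_source_ids = "11", canonical = TRUE, best_match_only = TRUE) |> 
      dplyr::select(score, matched_name2)
    NewTaxa[i, c("score", "matched_name2")] <- Temp
    # Report and save progress every 50 taxa
    if ((i %% 50) == 0) {
      message(paste(i, "of", nrow(NewTaxa), "Ready!", Sys.time()))
      readr::write_csv(NewTaxa, "Results/Cleaned_Taxa_Taxize.csv")
    }
    gc()
  }, silent = TRUE)
}

# Save once more so that rows after the last checkpoint are also written
readr::write_csv(NewTaxa, "Results/Cleaned_Taxa_Taxize.csv")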

This cleaning ends up eliminating 1,313 taxa, which are mostly families, subfamilies or hybrid species, as seen in Table 1.1.

TaxaID	Taxa	score
51	Abraeinae	NA
52	Abraeini	NA
89	Acaenitinae	NA
154	Acanthocinini	NA
168	Acanthoderini	NA
210	Acari	NA
381	Achelata	NA
406	Achillea ptarmica × salicifolia	NA
452	Aciculata	NA
463	Aciliini	NA

Table 1.1: First 10 taxa eliminated by taxize
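The eliminated taxa are the rows for which no match was stored. As a minimal sketch, assuming a toy version of NewTaxa after the loop has run (toy_taxa and eliminated are illustrative names, and the two matched rows are invented for the example), they could be listed like this:

```r
library(dplyr)

# Toy version of NewTaxa after the taxize loop (values are illustrative)
toy_taxa <- data.frame(
  TaxaID = 1:4,
  Taxa = c("Abraeinae", "Abies alba", "Acari", "Alisma plantago-aquatica"),
  score = c(NA, 0.988, NA, 0.988),
  matched_name2 = c(NA, "Abies alba", NA, "Alisma plantago-aquatica")
)

# Rows without a matched name are the ones dropped by this cleaning step
eliminated <- toy_taxa |> 
  dplyr::filter(is.na(matched_name2)) |> 
  dplyr::select(TaxaID, Taxa, score)
```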

Of the remaining species that were identified by taxize, some names that were unique in our initial file ended up being resolved to the same species. Some examples can be seen in Table 1.2.

TaxaID	Taxa	score	matched_name2
1829	Alisma plantago-aquatica	0.988	Alisma plantago-aquatica
1830	Alisma plantago-aquatica f. submersa	0.988	Alisma plantago-aquatica
2451	Ammophila arenaria	0.988	Ammophila arenaria
2453	Ammophila arenaria × Calamagrostis epigejos nm. epigeioidea	0.988	Ammophila arenaria
2454	Ammophila arenaria × Calamagrostis epigejos nm. intermedia	0.988	Ammophila arenaria
2455	Ammophila arenaria × Calamagrostis epigejos nm. subarenaria	0.988	Ammophila arenaria
3047	Anemone apennina	0.988	Anemone apennina
3048	Anemone apennina var. apennina	0.988	Anemone apennina
3607	Anthyllis vulneraria subsp. vulneraria	0.999	Anthyllis vulneraria vulneraria
3611	Anthyllis vulneraria var. vulneraria	0.999	Anthyllis vulneraria vulneraria
4963	Arrhenia acerosa	0.988	Arrhenia acerosa
4964	Arrhenia acerosa var. acerosa	0.988	Arrhenia acerosa

Table 1.2: First 12 duplicate species
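A table like this can be obtained by keeping only the matched names that occur more than once. The following is a sketch under the assumption that the data look like the rows of Table 1.2; toy_taxa and duplicated_matches are illustrative names:

```r
library(dplyr)

# A few rows in the shape of Table 1.2
toy_taxa <- data.frame(
  TaxaID = c(1829, 1830, 2451, 3048),
  Taxa = c("Alisma plantago-aquatica",
           "Alisma plantago-aquatica f. submersa",
           "Ammophila arenaria",
           "Anemone apennina var. apennina"),
  matched_name2 = c("Alisma plantago-aquatica",
                    "Alisma plantago-aquatica",
                    "Ammophila arenaria",
                    "Anemone apennina")
)

# Keep only the input names whose matched name is shared with another input name
duplicated_matches <- toy_taxa |> 
  dplyr::group_by(matched_name2) |> 
  dplyr::filter(dplyr::n() > 1) |> 
  dplyr::ungroup()
```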

All in all, we started with 60,915 unique taxa and ended up with 59,197 unique taxa.

1.2 Cleaning using RGBIF

In order to do this cleanly, we keep just one observation of each taxon found by taxize in its column matched_name2:

Unique taxize names

Cleaned_Taxize <- NewTaxa |> 
  dplyr::filter(!is.na(matched_name2)) |> 
  dplyr::group_by(matched_name2) |> 
  dplyr::filter(TaxaID == min(TaxaID)) |> 
  ungroup()

We then pass these names through rgbif. We rename the input name column (verbatim_name) to matched_name2, so that it is the same as in Cleaned_Taxize, and keep the matched name, the confidence of the RGBIF match, and the taxonomic groups:

rgbif call

rgbif_find <- rgbif::name_backbone_checklist(Cleaned_Taxize$matched_name2) |>
  # Change name to match the cleaned_taxize dataset
  dplyr::rename(matched_name2 = verbatim_name) |> 
  dplyr::relocate(matched_name2, .before = everything()) |> 
  dplyr::select(matched_name2, confidence, kingdom, phylum, order, family, genus, species)

readr::write_csv(rgbif_find, "Results/Cleaned_Taxa_rgbif.csv")

Since we are only interested in taxa that are resolved at least to the species level, we filter out groups that have not been resolved to that level:

Species only

Species_Only <- rgbif_find |> 
  dplyr::filter(!is.na(species))

This eliminates 17,278 rows of our data set, leaving us with 41,934 rows. However, we still have to filter out synonyms and subspecies. In Table 1.3 we can see the first 10 records in Species_Only that lead to duplicated species names; here we find both synonyms and subspecies. In the end, we are left with 40,328 unique species.

matched_name2	confidence	kingdom	phylum	order	species
Abies concolor	98	Plantae	Tracheophyta	Pinales	Abies concolor
Abies lowiana	98	Plantae	Tracheophyta	Pinales	Abies concolor
Abies nordmanniana	99	Plantae	Tracheophyta	Pinales	Abies nordmanniana
Abies nordmanniana equi-trojani	97	Plantae	Tracheophyta	Pinales	Abies nordmanniana
Abies nordmanniana nordmanniana	98	Plantae	Tracheophyta	Pinales	Abies nordmanniana
Abrothallus bertianus	97	Fungi	Ascomycota	Abrothallales	Abrothallus parmeliarum
Abrothallus parmeliarum	97	Fungi	Ascomycota	Abrothallales	Abrothallus parmeliarum
Acalitus stenaspis	99	Animalia	Arthropoda	Trombidiformes	Acalitus stenaspis
Acanthis cabaret	98	Animalia	Chordata	Passeriformes	Acanthis flammea
Acanthis flammea	99	Animalia	Chordata	Passeriformes	Acanthis flammea

Table 1.3: First 10 duplicate species
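To see how several input names collapse into one species, the names can be counted per value of the species column. This is a minimal sketch using five rows from Table 1.3; toy_species, names_per_species and duplicated_species are illustrative names:

```r
library(dplyr)

# Five rows from Table 1.3 (only the columns needed here)
toy_species <- data.frame(
  matched_name2 = c("Abies concolor", "Abies lowiana", "Acalitus stenaspis",
                    "Acanthis cabaret", "Acanthis flammea"),
  confidence = c(98, 98, 99, 98, 99),
  species = c("Abies concolor", "Abies concolor", "Acalitus stenaspis",
              "Acanthis flammea", "Acanthis flammea")
)

# Number of input names that resolve to each species
names_per_species <- toy_species |> 
  dplyr::count(species, name = "n_names")

# Species reached by more than one input name (synonyms or subspecies)
duplicated_species <- names_per_species |> 
  dplyr::filter(n_names > 1)

# Number of unique species left after collapsing
n_unique_species <- dplyr::n_distinct(toy_species$species)
```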

Finally, we make a data.frame with the final species list:

Final species list

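FinalSpeciesList <- Species_Only |> 
  dplyr::group_by(species) |> 
  # Keep one row per species: the highest-confidence match, taking the first row on ties
  dplyr::slice_max(confidence, n = 1, with_ties = FALSE) |> 
  dplyr::ungroup()
readr::write_csv(FinalSpeciesList, "Results/FinalSpeciesList.csv")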
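How ties in confidence are handled matters for a step like this one. As a minimal sketch, using four rows taken from Table 1.3 (toy_species, with_ties and one_per_species are illustrative names), a filter on confidence == max(confidence) keeps every tied row, while dplyr::slice_max() with with_ties = FALSE keeps exactly one row per species:

```r
library(dplyr)

# Four rows from Table 1.3; the two Abrothallus names tie at confidence 97
toy_species <- data.frame(
  matched_name2 = c("Abrothallus bertianus", "Abrothallus parmeliarum",
                    "Acanthis cabaret", "Acanthis flammea"),
  confidence = c(97, 97, 98, 99),
  species = c("Abrothallus parmeliarum", "Abrothallus parmeliarum",
              "Acanthis flammea", "Acanthis flammea")
)

# Filtering on the maximum keeps all rows that share the top confidence
with_ties <- toy_species |> 
  dplyr::group_by(species) |> 
  dplyr::filter(confidence == max(confidence)) |> 
  dplyr::ungroup()

# slice_max() without ties keeps a single row per species
one_per_species <- toy_species |> 
  dplyr::group_by(species) |> 
  dplyr::slice_max(confidence, n = 1, with_ties = FALSE) |> 
  dplyr::ungroup()
```

In this toy case with_ties still contains both Abrothallus rows, so the species column is only guaranteed to be unique in one_per_species.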

2 Presence download

3 Presence cleaning

Chamberlain, Scott, and Carl Boettiger. 2017. “R Python, and Ruby Clients for GBIF Species Occurrence Data.” PeerJ PrePrints. https://doi.org/10.7287/peerj.preprints.3304v1.

Chamberlain, Scott, Eduard Szoecs, Zachary Foster, Zebulun Arendsee, Carl Boettiger, Karthik Ram, Ignasi Bartomeus, et al. 2020. Taxize: Taxonomic Information from Around the Web. https://github.com/ropensci/taxize.