Derek Corcoran 21/12, 2022
The goal of SpeciesDistributionModelsDanmark is to explore and generate species distribution models for the prioritization of Denmark based on its biological diversity and land use.
First, we read a file with all the taxa found in arter.dk as of 21 September 2022.
For this we need the following packages (Wickham and Bryan 2022; Chamberlain et al. 2020; Chamberlain and Boettiger 2017):
Load packages
library(readxl)
library(taxize)
library(rgbif)
library(janitor)
library(dplyr)
library(stringr)
We first read the file with all the presences:
Taxa <- readxl::read_xlsx("2022-09-21.xlsx") |>
  janitor::clean_names() |>
  dplyr::select(videnskabeligt_navn)
This file has 60,958 entries; however, the attribute videnskabeligt_navn has fewer unique entries, which can be counted with:
length(unique(Taxa$videnskabeligt_navn))
Next, we generate a new data frame with only the unique values of videnskabeligt_navn:
NewTaxa <- data.frame(Taxa = sort(unique(Taxa$videnskabeligt_navn)), score = NA, matched_name2 = NA) |>
  tibble::rowid_to_column(var = "TaxaID")
We then clean it, using taxize first:
Taxize clean
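dir.create("Results", showWarnings = FALSE)
for (i in seq_len(nrow(NewTaxa))) {
  try({
    # Resolve one name and keep only the best match
    Temp <- taxize::gnr_resolve(NewTaxa$Taxa[i],
                                data_source_ids = "11", canonical = TRUE, best_match_only = TRUE) |>
      dplyr::select(score, matched_name2)
    # Fill the score and matched_name2 columns of row i
    NewTaxa[i, c("score", "matched_name2")] <- Temp
    # Report progress and save a checkpoint every 50 taxa
    if ((i %% 50) == 0) {
      message(paste(i, "of", nrow(NewTaxa), "Ready!", Sys.time()))
      readr::write_csv(NewTaxa, "Results/Cleaned_Taxa_Taxize.csv")
    }
    gc()
  }, silent = TRUE)
}
# Save once more so the rows after the last checkpoint are also written
readr::write_csv(NewTaxa, "Results/Cleaned_Taxa_Taxize.csv")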
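The loop pattern above can be tried offline with a stand-in resolver. The following is a minimal sketch, not the code used for this analysis: `fake_resolve()` is a hypothetical replacement for `taxize::gnr_resolve()`, and the three names are taken from the tables in this document. It shows how names without a match simply keep their `NA` values.

```r
# Hypothetical stand-in for taxize::gnr_resolve(): returns a one-row data
# frame for names it "knows" and a zero-row data frame otherwise.
fake_resolve <- function(name) {
  known <- c("Alisma plantago-aquatica", "Ammophila arenaria")
  if (name %in% known) {
    data.frame(score = 0.988, matched_name2 = name)
  } else {
    data.frame(score = numeric(0), matched_name2 = character(0))
  }
}

NewTaxa <- data.frame(
  TaxaID = 1:3,
  Taxa = c("Abraeinae", "Alisma plantago-aquatica", "Ammophila arenaria"),
  score = NA_real_,
  matched_name2 = NA_character_
)

for (i in seq_len(nrow(NewTaxa))) {
  # A failed lookup yields NULL instead of stopping the loop
  Temp <- tryCatch(fake_resolve(NewTaxa$Taxa[i]), error = function(e) NULL)
  # Only write back when exactly one match came back; otherwise keep the NAs
  if (!is.null(Temp) && nrow(Temp) == 1) {
    NewTaxa[i, c("score", "matched_name2")] <- Temp
  }
}

print(NewTaxa)
```

Checking the number of returned rows before assigning is one possible alternative to relying on `try()` to swallow the error raised when no match is found.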
This cleaning eliminates 1,313 taxa, which are mostly families, subfamilies, or hybrid species, as seen in Table 1.1.
TaxaID | Taxa | score |
---|---|---|
51 | Abraeinae | NA |
52 | Abraeini | NA |
89 | Acaenitinae | NA |
154 | Acanthocinini | NA |
168 | Acanthoderini | NA |
210 | Acari | NA |
381 | Achelata | NA |
406 | Achillea ptarmica × salicifolia | NA |
452 | Aciculata | NA |
463 | Aciliini | NA |
Table 1.1: First 10 taxa eliminated by taxize
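A table like this one can be obtained by keeping the rows that taxize left unresolved. The sketch below assumes that unresolved taxa still have `NA` in `matched_name2`; the small `NewTaxa` data frame is a toy subset built from rows shown in this document, and `Eliminated` is a name introduced here for illustration.

```r
# Toy subset of NewTaxa: three unresolved taxa and one resolved taxon
NewTaxa <- data.frame(
  TaxaID = c(51, 52, 406, 1829),
  Taxa = c("Abraeinae", "Abraeini",
           "Achillea ptarmica × salicifolia", "Alisma plantago-aquatica"),
  score = c(NA, NA, NA, 0.988),
  matched_name2 = c(NA, NA, NA, "Alisma plantago-aquatica")
)

# Rows without a matched name are the ones eliminated by the cleaning
Eliminated <- NewTaxa[is.na(NewTaxa$matched_name2), c("TaxaID", "Taxa", "score")]

print(head(Eliminated, 10))
print(nrow(Eliminated))
```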
Of the remaining species that were identified by taxize, some that were unique in our initial file ended up being identified as duplicates of other species. Some examples can be seen in Table 1.2.
TaxaID | Taxa | score | matched_name2 |
---|---|---|---|
1829 | Alisma plantago-aquatica | 0.988 | Alisma plantago-aquatica |
1830 | Alisma plantago-aquatica f. submersa | 0.988 | Alisma plantago-aquatica |
2451 | Ammophila arenaria | 0.988 | Ammophila arenaria |
2453 | Ammophila arenaria × Calamagrostis epigejos nm. epigeioidea | 0.988 | Ammophila arenaria |
2454 | Ammophila arenaria × Calamagrostis epigejos nm. intermedia | 0.988 | Ammophila arenaria |
2455 | Ammophila arenaria × Calamagrostis epigejos nm. subarenaria | 0.988 | Ammophila arenaria |
3047 | Anemone apennina | 0.988 | Anemone apennina |
3048 | Anemone apennina var. apennina | 0.988 | Anemone apennina |
3607 | Anthyllis vulneraria subsp. vulneraria | 0.999 | Anthyllis vulneraria vulneraria |
3611 | Anthyllis vulneraria var. vulneraria | 0.999 | Anthyllis vulneraria vulneraria |
4963 | Arrhenia acerosa | 0.988 | Arrhenia acerosa |
4964 | Arrhenia acerosa var. acerosa | 0.988 | Arrhenia acerosa |
Table 1.2: First 12 duplicate species
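Duplicates of this kind can be flagged by counting how many input names share the same `matched_name2`. This is a minimal base R sketch on toy data: four rows come from Table 1.2, while the row with `TaxaID` 9999 is hypothetical and only added as a non-duplicated contrast. `Duplicates` and `n_unique` are names introduced here for illustration.

```r
NewTaxa <- data.frame(
  TaxaID = c(1829, 1830, 4963, 4964, 9999),  # 9999 is a hypothetical row
  Taxa = c("Alisma plantago-aquatica", "Alisma plantago-aquatica f. submersa",
           "Arrhenia acerosa", "Arrhenia acerosa var. acerosa",
           "Abies concolor"),
  matched_name2 = c("Alisma plantago-aquatica", "Alisma plantago-aquatica",
                    "Arrhenia acerosa", "Arrhenia acerosa",
                    "Abies concolor")
)

# Number of input names that resolved to each matched name
n_per_name <- ave(seq_along(NewTaxa$matched_name2), NewTaxa$matched_name2,
                  FUN = length)

# Input names that share their matched name with at least one other row
Duplicates <- NewTaxa[n_per_name > 1, ]

# Number of distinct taxa left after resolution
n_unique <- length(unique(NewTaxa$matched_name2))

print(Duplicates)
print(n_unique)
```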
All in all, we started with 60,915 unique taxa and ended up with 59,197 unique taxa.
To do this cleanly, we keep just one observation of each taxon found by taxize in its column matched_name2:
Unique taxize names
Cleaned_Taxize <- NewTaxa |>
  dplyr::filter(!is.na(matched_name2)) |>
  dplyr::group_by(matched_name2) |>
  dplyr::filter(TaxaID == min(TaxaID)) |>
  dplyr::ungroup()
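The effect of keeping the smallest `TaxaID` per matched name can be seen on a few rows from the tables above. This is a base R sketch of the same logic, not the pipeline itself; `Resolved` and `first_id` are names introduced here. Because `NewTaxa` was sorted alphabetically before the IDs were assigned, the smallest ID in these examples belongs to the shortest form of the name.

```r
NewTaxa <- data.frame(
  TaxaID = c(51, 1829, 1830, 3047, 3048),
  Taxa = c("Abraeinae",
           "Alisma plantago-aquatica", "Alisma plantago-aquatica f. submersa",
           "Anemone apennina", "Anemone apennina var. apennina"),
  matched_name2 = c(NA,
                    "Alisma plantago-aquatica", "Alisma plantago-aquatica",
                    "Anemone apennina", "Anemone apennina")
)

# Drop unresolved names first, so that NA does not form a group of its own
Resolved <- NewTaxa[!is.na(NewTaxa$matched_name2), ]

# Smallest TaxaID within each matched name, repeated for every row
first_id <- ave(Resolved$TaxaID, Resolved$matched_name2, FUN = min)

# Keep one row per matched name
Cleaned_Taxize <- Resolved[Resolved$TaxaID == first_id, ]

print(Cleaned_Taxize)
```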
We then pass these names through rgbif and rename the input name column (verbatim_name) to matched_name2, so that it is the same as in Cleaned_Taxize. We keep the matched name, the confidence of the GBIF match, and the taxonomic groups (kingdom, phylum, order, family, genus, and species):
rgbif call
rgbif_find <- rgbif::name_backbone_checklist(Cleaned_Taxize$matched_name2) |>
  # Change name to match the Cleaned_Taxize dataset
  dplyr::rename(matched_name2 = verbatim_name) |>
  dplyr::relocate(matched_name2, .before = dplyr::everything()) |>
  dplyr::select(matched_name2, confidence, kingdom, phylum, order, family, genus, species)
readr::write_csv(rgbif_find, "Results/Cleaned_Taxa_rgbif.csv")
Since we are only interested in taxa that are resolved at least to the species level, we filter out the records that were not resolved to that level:
Species only
Species_Only <- rgbif_find |>
  dplyr::filter(!is.na(species))
This eliminates 17,278 rows of our data set, leaving us with 41,934 records. However, we still have to filter out synonyms and subspecies. Table 1.3 shows the first 10 records in Species_Only that lead to duplicated species names; here we find both synonyms and subspecies. In the end, we are left with 40,328 unique species.
matched_name2 | confidence | kingdom | phylum | order | species |
---|---|---|---|---|---|
Abies concolor | 98 | Plantae | Tracheophyta | Pinales | Abies concolor |
Abies lowiana | 98 | Plantae | Tracheophyta | Pinales | Abies concolor |
Abies nordmanniana | 99 | Plantae | Tracheophyta | Pinales | Abies nordmanniana |
Abies nordmanniana equi-trojani | 97 | Plantae | Tracheophyta | Pinales | Abies nordmanniana |
Abies nordmanniana nordmanniana | 98 | Plantae | Tracheophyta | Pinales | Abies nordmanniana |
Abrothallus bertianus | 97 | Fungi | Ascomycota | Abrothallales | Abrothallus parmeliarum |
Abrothallus parmeliarum | 97 | Fungi | Ascomycota | Abrothallales | Abrothallus parmeliarum |
Acalitus stenaspis | 99 | Animalia | Arthropoda | Trombidiformes | Acalitus stenaspis |
Acanthis cabaret | 98 | Animalia | Chordata | Passeriformes | Acanthis flammea |
Acanthis flammea | 99 | Animalia | Chordata | Passeriformes | Acanthis flammea |
Table 1.3: First 10 duplicate species
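To tell the two kinds of duplicates apart, one can compare the input name with the species name returned by GBIF. The sketch below is a rough heuristic and an assumption of this example, not part of the original workflow: a name that starts with the species name followed by a space is treated as infraspecific, and any other differing name as a synonym or another kind of mismatch. The `relation` column is introduced here for illustration, and the rows come from Table 1.3.

```r
Species_Only <- data.frame(
  matched_name2 = c("Abies concolor", "Abies lowiana",
                    "Abies nordmanniana equi-trojani", "Acanthis cabaret"),
  species = c("Abies concolor", "Abies concolor",
              "Abies nordmanniana", "Acanthis flammea")
)

# Input name is identical to the species name
is_same <- Species_Only$matched_name2 == Species_Only$species

# Input name is the species name plus an extra epithet
is_infra <- startsWith(Species_Only$matched_name2,
                       paste0(Species_Only$species, " "))

Species_Only$relation <- ifelse(
  is_same, "same name",
  ifelse(is_infra, "infraspecific", "synonym or other")
)

print(Species_Only)
```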
Finally, we make a data frame with the final species list, keeping for each species the record with the highest confidence:
Final species list
FinalSpeciesList <- Species_Only |>
  dplyr::group_by(species) |>
  # Keep the highest-confidence record per species (ties are all kept)
  dplyr::filter(confidence == max(confidence)) |>
  dplyr::ungroup()
readr::write_csv(FinalSpeciesList, "Results/FinalSpeciesList.csv")
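Filtering on `confidence == max(confidence)` keeps every record tied for the highest confidence, so a species can still appear more than once. In Table 1.3, for example, Abrothallus bertianus and Abrothallus parmeliarum both have confidence 97 and the same species name. The following base R sketch reproduces that behaviour on four rows from the table and shows one possible way to break ties; preferring the record whose input name equals the species name is an assumption made here, not a rule from the original workflow. `Ties_Kept` and `One_Per_Species` are names introduced for illustration.

```r
Species_Only <- data.frame(
  matched_name2 = c("Abrothallus bertianus", "Abrothallus parmeliarum",
                    "Acanthis cabaret", "Acanthis flammea"),
  confidence = c(97, 97, 98, 99),
  species = c("Abrothallus parmeliarum", "Abrothallus parmeliarum",
              "Acanthis flammea", "Acanthis flammea")
)

# Highest confidence within each species, repeated for every row
max_conf <- ave(Species_Only$confidence, Species_Only$species, FUN = max)

# Same logic as filtering on confidence == max(confidence): ties survive
Ties_Kept <- Species_Only[Species_Only$confidence == max_conf, ]

# Alternative: sort by species, then by descending confidence, then put the
# record whose input name equals the species name first, and keep the first
# row of each species
ord <- order(Species_Only$species,
             -Species_Only$confidence,
             Species_Only$matched_name2 != Species_Only$species)
Sorted <- Species_Only[ord, ]
One_Per_Species <- Sorted[!duplicated(Sorted$species), ]

print(Ties_Kept)
print(One_Per_Species)
```

Comparing the number of rows with the number of distinct values in `species` is a quick way to check whether any ties remain in the final list.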
Chamberlain, Scott, and Carl Boettiger. 2017. “R Python, and Ruby Clients for GBIF Species Occurrence Data.” PeerJ PrePrints. https://doi.org/10.7287/peerj.preprints.3304v1.
Chamberlain, Scott, Eduard Szoecs, Zachary Foster, Zebulun Arendsee, Carl Boettiger, Karthik Ram, Ignasi Bartomeus, et al. 2020. Taxize: Taxonomic Information from Around the Web. https://github.com/ropensci/taxize.
Wickham, Hadley, and Jennifer Bryan. 2022. Readxl: Read Excel Files. https://CRAN.R-project.org/package=readxl.