README for species distribution modeling

Overview

Code and sample data for running species distribution models from data harvested from iNaturalist. Future implementations would also use data from the eButterfly project.

Dependencies

Seven additional R packages are required:

rgdal
raster
- See https://stackoverflow.com/questions/4649156/installing-gdal-config-on-my-linux for troubleshooting suggestions.
sp
dismo
maptools
gtools
SSDM

Structure

data
- inaturalist: data harvested from iNaturalist
  - 50931-iNaturalist.txt: Gray Hairstreak, Strymon melinus
  - 509627-iNaturalist.txt: Western Giant Swallowtail, Papilio rumiko
  - 59125-iNaturalist.txt: Great Copper, Lycaena xanthoides
- wc2-5: climate data at 2.5 minute resolution from WorldClim
- gbif: data harvested from GBIF for iNaturalist taxon_id values; most files not under version control (> 2GB each)
  - taxon-ids.txt: tab-delimited text file of unique species-level taxon_id values for records from Canada, Mexico, and United States; includes two columns: taxonID and scientificName
output (not included in repository, but the scripts assume this structure exists locally)
- images
- rasters
scripts
- ensemble-sdm-iNat-xanthoides.R: Development script for ensemble SDMs
- gbif-butterflies.sh: First pass of processing GBIF data dump to get taxon_ids for iNaturalist data; see also get-taxon-id-from-gbif.py
- get-observation-data.R: Harvest data from iNaturalist using their API; called from command line terminal
  - Usage: Rscript --vanilla scripts/get-observation-data.R <taxon_id>
  - Example: Rscript --vanilla scripts/get-observation-data.R 60606
  - Output: Comma-separated file (csv) of observations from iNaturalist
    - Filename: data/inaturalist/<taxon_id>-iNaturalist.csv
    - Example: data/inaturalist/60606-iNaturalist.csv
- get-taxon-id-from-gbif.py: Extract relevant taxon_id values from GBIF data dump; see also gbif-butterflies.sh. Produces data/gbif/taxon-ids.txt
- run-sdm.R: Run species distribution model and create map and raster output; called from command line terminal
  - Usage: Rscript --vanilla scripts/run-sdm.R <path/to/data/file> <output-file-prefix> <path/to/output/directory/> <number of background replicates>[optional] <threshold for occurrence>[optional]
  - Example: Rscript --vanilla scripts/run-sdm.R data/inaturalist/60606-iNaturalist.csv 60606 output/ 50 0.5
  - Output: Three files:
    - A png-formatted graphics file showing predicted distribution
      - Filename: <path/to/output/directory/>/<output-file-prefix>-prediction-<number of background replicates>.png
      - Example: output/60606-prediction-50.png
    - A pair of raster files (.grd) with:
      - The predicted distribution (presence/absence)
        
        Filename: <path/to/output/directory/>/<output-file-prefix>-prediction-threshold-<number of background replicates>.grd
        
        Example: output/60606-prediction-threshold-50.grd
      - The probability of occurrence (scaled between 0 and 1)
        
        Filename: <path/to/output/directory/>/<output-file-prefix>-prediction-<number of background replicates>.grd
        
        Example: output/60606-prediction-50.grd
- run-sdm-algo.R: Run species distribution model, choosing among three algorithms (CTA, RF, or GLM) and create map and raster output; called from command line terminal
  - Usage: Rscript --vanilla scripts/run-sdm-algo.R <path/to/data/file> <output-file-prefix> <path/to/output/directory/> <algorithm string: CTA, GLM, or RF>[optional] <number of background replicates>[optional] <threshold for occurrence>[optional]
  - Example: Rscript --vanilla scripts/run-sdm-algo.R data/inaturalist/60606-iNaturalist.csv 60606 output/ CTA 50 0.7
  - Output: Three files:
    - A png-formatted graphics file showing predicted distribution
      - Filename: <path/to/output/directory/>/<output-file-prefix>-prediction-<algorithm string>-<number of background replicates>.png
      - Example: output/60606-prediction-CTA-50.png
    - A pair of raster files (.grd) with:
      - The predicted distribution (presence/absence)
        
        Filename: <path/to/output/directory/>/<output-file-prefix>-prediction-threshold-<algorithm string>-<number of background replicates>.grd
        
        Example: output/60606-prediction-threshold-CTA-50.grd
      - The probability of occurrence (scaled between 0 and 1)
        
        Filename: <path/to/output/directory/>/<output-file-prefix>-prediction-<algorithm string>-<number of background replicates>.grd
        
        Example: output/60606-prediction-CTA-50.grd
- sdm-for-ACIC-lecture.R: Script to create map graphic used in ACIC lecture
- sdm-iNat-melinus.R: Pilot species distribution modeling for Strymon melinus
- sdm-iNat-xanthoides.R: Pilot species distribution modeling for Lycaena xanthoides
- stack-sdms.R: Stack multiple SDMs from multiple species into species richness map
  - Usage: Rscript --vanilla scripts/stack-sdms.R <path/to/raster/files> <output-file-prefix> <path/to/output/directory/>
  - Example: Rscript --vanilla scripts/stack-sdms.R output richness output/
  - Output: Two files:
    - A png-formatted graphics file showing species richness (i.e. # species) on map of North America
      - Filename: <path/to/output/directory/>/<output-file-prefix>-stack.png
      - Example: output/richness-stack.png
    - A raster file (.grd) with species richness
      - Filename: <path/to/output/directory/>/<output-file-prefix>-stack.grd
      - Example: output/richness-stack.grd
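
The output filenames documented above for run-sdm.R and run-sdm-algo.R follow a single pattern, which can be handy when locating results from another script. The following is a minimal Python sketch of that pattern only; the function name sdm_output_paths and its dictionary keys are hypothetical and are not part of the repository.

```python
def sdm_output_paths(out_dir, prefix, replicates, algorithm=None):
    """Return the three output paths documented for run-sdm.R,
    or for run-sdm-algo.R when an algorithm string is supplied.

    Hypothetical helper: it only mirrors the filename patterns in this README.
    """
    base = out_dir.rstrip("/")
    # run-sdm-algo.R places the algorithm string before the replicate count
    suffix = f"{algorithm}-{replicates}" if algorithm else str(replicates)
    return {
        "map": f"{base}/{prefix}-prediction-{suffix}.png",
        "threshold": f"{base}/{prefix}-prediction-threshold-{suffix}.grd",
        "probability": f"{base}/{prefix}-prediction-{suffix}.grd",
    }


if __name__ == "__main__":
    print(sdm_output_paths("output/", "60606", 50)["map"])
    # prints output/60606-prediction-50.png
```

Passing algorithm="CTA" gives the run-sdm-algo.R names instead, for example output/60606-prediction-threshold-CTA-50.grd for the presence/absence raster.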

General initial approach:

1. Retrieve historical climate data from http://www.worldclim.org
2. Get a list of all species in databases (eButterfly & iNaturalist)
3. Get lat/long data for one species from databases
4. Extract data for one month
5. Perform quality check (minimum # observations, appropriate latitude & longitude format)
6. Run SDM
7. Create graphic with standardized name for use on eButterfly

Repeat steps 4-7 for remaining months
Repeat steps 3-7 for remaining species
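
The monthly extraction and quality check in the approach above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the repository's implementation: the field names latitude, longitude, and observed_on, the YYYY-MM-DD date format, and the cutoff of 10 observations are all assumed for the example.

```python
from collections import defaultdict

MIN_OBSERVATIONS = 10  # hypothetical cutoff; this README does not specify one


def has_valid_coordinates(record):
    """True if latitude and longitude parse as numbers within valid ranges."""
    try:
        lat = float(record["latitude"])
        lon = float(record["longitude"])
    except (KeyError, TypeError, ValueError):
        return False
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def records_by_month(records):
    """Group records by calendar month, read from an assumed YYYY-MM-DD string."""
    months = defaultdict(list)
    for record in records:
        parts = str(record.get("observed_on") or "").split("-")
        if len(parts) == 3 and parts[1].isdigit() and 1 <= int(parts[1]) <= 12:
            months[int(parts[1])].append(record)
    return months


def months_ready_for_sdm(records, min_observations=MIN_OBSERVATIONS):
    """Return {month: clean records} for months that pass the quality check."""
    ready = {}
    for month, subset in sorted(records_by_month(records).items()):
        clean = [r for r in subset if has_valid_coordinates(r)]
        if len(clean) >= min_observations:
            ready[month] = clean
    return ready
```

Each month returned by months_ready_for_sdm would then be passed on to the SDM and graphics steps, and the whole procedure repeated per species.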

Species Identifiers

Challenge: To perform analyses on all North American species of butterflies, we will need the taxon_id for all species we are interested in. There is not an easy way to do this using the iNaturalist API (see the discussion in the Resources section below). However, we can download an iNaturalist database dump from GBIF at https://www.gbif.org/dataset/50c9509d-22c7-4a22-a47d-8c48425ef4a7. The flat csv file does not contain enough information; namely, it lacks the taxon_id field. However, the Darwin Core archive does include files with the necessary information. The files occurrence.txt and verbatim.txt have the fields we need; the latter is a smaller file, so we'll use that one (the column headers appear identical in both files, but some curation was performed to produce the occurrence.txt file). Among other fields, the ones we will be interested in are:

countryCode: We want US, CA, and MX records only
taxonID: This field has values to use in the taxon_id field in the iNaturalist API
scientificName: The name of the organism

Update: the file data/gbif/taxon-ids.txt has the taxonID and scientificName field values. However, the data are for observations of species rank; that is, subspecies taxon IDs were not recorded. We still need to check whether the API returns observations for a species-level taxon ID when the identification was made at the subspecies level.
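
As an illustration of the filtering described in this section, the following Python sketch reads tab-delimited rows, keeps US, CA, and MX records, and collects unique taxonID and scientificName pairs. It is not the code in get-taxon-id-from-gbif.py; the function names are hypothetical, and it assumes the input has a header row containing the three fields listed above.

```python
import csv

COUNTRIES = {"US", "CA", "MX"}


def unique_taxa(lines, countries=COUNTRIES):
    """Yield unique (taxonID, scientificName) pairs from tab-delimited rows.

    `lines` is any iterable of text lines whose first line is a header
    (an open file handle works).
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    seen = set()
    for row in reader:
        if row.get("countryCode") not in countries:
            continue
        pair = (row.get("taxonID"), row.get("scientificName"))
        # Skip rows with no taxonID and pairs already emitted
        if pair[0] and pair not in seen:
            seen.add(pair)
            yield pair


def write_taxon_ids(in_path, out_path):
    """Write the unique pairs to a two-column, tab-delimited text file."""
    with open(in_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        dst.write("taxonID\tscientificName\n")
        for taxon_id, name in unique_taxa(src):
            dst.write(f"{taxon_id}\t{name or ''}\n")
```

Streaming the file row by row keeps memory use low, which matters because the GBIF files are larger than 2GB each.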

jcoliver/ebutterfly-sdm
