The sWorm team (https://www.idiv.de/en/sworm), led by myself with PIs Nico Eisenhauer (https://www.idiv.de/en/groups_and_people/employees/details/7.html) and Erin Cameron (https://www.smu.ca/research/profiles/faculty/Cameron70.html), compiled a global database of earthworm communities. This database was used in a previous manuscript (Phillips et al., 2019 Science - https://science.sciencemag.org/content/366/6464/480).
In total the database contains over 10,840 sites in 60 different countries, 22,690 non-zero observations and 184 species. All data has been compiled from 182 published articles and 17 unpublished datasets.
The complete dataset will be hosted on the iDiv Data Portal (https://idata.idiv.de/), but will also be available on Edaphobase during 2021.
A manuscript outlining the database will be published in Scientific Data.
This repository contains the code needed to collate the individual datasets, process the data, clean the data and then put the data into three tables (metadata table, site-level table and species occurrence table). It also contains the code used to create the figures used in the associated manuscript.
None of this code needs to be run in order to use the data that has been deposited in the iDiv Data Portal. However, it may be useful to see how the data was cleaned.
Most of the scripts cannot be run by other users, because they rely on the original data files, which are not accessible.
This script downloads each individual GoogleSheet (from my personal GoogleDrive) and saves it as an Excel file.
Once the data has been saved as Excel files, this script combines all of those files into three csv files. The data is formatted during this script to ensure that columns contain the information they are meant to contain, and site-level metrics are calculated (i.e., species richness, total abundance and total biomass).
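To illustrate what calculating the site-level metrics involves, here is a minimal sketch in Python. It is not the code from this repository: the field names (`site`, `species`, `abundance`, `biomass`), the example rows, and the rule that a species only counts towards richness when its abundance is above zero are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical species-level records: one row per species observed at a site.
# Field names and values are invented for illustration only.
records = [
    {"site": "A", "species": "Lumbricus terrestris", "abundance": 4, "biomass": 6.0},
    {"site": "A", "species": "Aporrectodea caliginosa", "abundance": 10, "biomass": 3.5},
    {"site": "B", "species": "Lumbricus terrestris", "abundance": 2, "biomass": 2.5},
    {"site": "B", "species": "Lumbricus terrestris", "abundance": 1, "biomass": 1.5},
]


def site_metrics(rows):
    """Return {site: {"richness", "abundance", "biomass"}} from species-level rows."""
    species_seen = defaultdict(set)
    totals = defaultdict(lambda: {"abundance": 0, "biomass": 0.0})
    for row in rows:
        site = row["site"]
        # Assumption: a species counts towards richness only if it was actually found.
        if row["abundance"] > 0:
            species_seen[site].add(row["species"])
        totals[site]["abundance"] += row["abundance"]
        totals[site]["biomass"] += row["biomass"]
    return {
        site: {"richness": len(species_seen[site]), **totals[site]}
        for site in totals
    }


metrics = site_metrics(records)
for site, values in sorted(metrics.items()):
    print(site, values)
```

Collecting species into a set per site means that repeated rows for the same species at one site add to abundance and biomass but count once towards richness.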
This creates a list of all species binomials that are present in the data, as well as the country they were found in, and the name of the data collector. The 'earthworm experts' then revised the species names.
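As a rough idea of how such a list could be put together, the Python sketch below collects the unique combinations of binomial, country and data collector. It is only an illustration and not the repository's script: the field names (`binomial`, `country`, `collector`) and the example rows are hypothetical, and skipping rows with an empty binomial is an assumption.

```python
# Hypothetical occurrence rows; field names and values are invented for illustration.
rows = [
    {"binomial": "Lumbricus terrestris", "country": "Germany", "collector": "Provider 1"},
    {"binomial": "Lumbricus terrestris ", "country": "Germany", "collector": "Provider 1"},
    {"binomial": "Aporrectodea caliginosa", "country": "Canada", "collector": "Provider 2"},
    {"binomial": "", "country": "Canada", "collector": "Provider 2"},
]


def species_checklist(rows):
    """Return sorted unique (binomial, country, collector) tuples."""
    unique = set()
    for row in rows:
        name = row["binomial"].strip()
        # Assumption: rows without a binomial are left out of the list.
        if name:
            unique.add((name, row["country"], row["collector"]))
    return sorted(unique)


checklist = species_checklist(rows)
for entry in checklist:
    print(entry)
```

Keeping the country and collector next to each name gives a reviewer the context needed to judge whether a name is plausible for that region.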
The new species names are appended to the data. In addition, the data is cleaned - removing unnecessary columns and relabelling and sorting into new orders etc.
Data that was not able to be made open-access was removed at this point.
The files saved at this point were those that were made available.
A script that processes the data to find out certain characteristics of it, for example the number of sites and the number of countries. These figures were used in the manuscript.
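The kind of summary described here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the repository's script: the field names (`site`, `country`, `species`, `abundance`) and the example rows are hypothetical, and a "non-zero observation" is assumed to mean a row with abundance above zero.

```python
# Hypothetical occurrence rows; field names and values are invented for illustration.
rows = [
    {"site": "S1", "country": "Germany", "species": "Lumbricus terrestris", "abundance": 3},
    {"site": "S1", "country": "Germany", "species": "Aporrectodea caliginosa", "abundance": 0},
    {"site": "S2", "country": "Canada", "species": "Lumbricus terrestris", "abundance": 5},
]


def summarise(rows):
    """Count distinct sites, countries and species, plus non-zero observations."""
    found = [r for r in rows if r["abundance"] > 0]  # assumed meaning of "non-zero"
    return {
        "sites": len({r["site"] for r in rows}),
        "countries": len({r["country"] for r in rows}),
        "species": len({r["species"] for r in found}),
        "non_zero_observations": len(found),
    }


summary = summarise(rows)
print(summary)
```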
A script to produce all figures used in the manuscript.
A script to process which data providers needed to be asked to be co-authors. This information was then used to email them.
A script to work out which data providers had not responded to our co-authorship offer, so we could re-email them.
A script to work out which data providers had not responded to our co-authorship offer, so we could determine which data needed to be removed (i.e., raw data that had been given to us where we could no longer confirm that open-access was ok). As it happened, there was no data where this applied.
A script to amalgamate all the co-author details (names and addresses) in order to do the author and institute list on the manuscript. Remy also helped with this script (Thanks Remy! - who is not on Github :( )
Co-author Remy wrote the code for Figure 4 in the manuscript (Thanks Remy! - who is not on Github :( ). In order for him to do this, I sent him a minimum working example so that he could see what I wanted.
Co-author Remy then sent me back a completely new script that did what I wanted for Figure 4. I just needed to add in the actual numbers.
Ensures that all columns are in the format needed, and puts categorical data in an appropriate order.
Simple function that just returns the most recently made file (as all files were created with dates in the name)
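The idea of picking the newest file from dates in the file names can be sketched as follows. This Python version is only an illustration, not the repository's function: the `YYYY-MM-DD` date format, the example file names and the function name `most_recent` are all assumptions.

```python
import re
from datetime import datetime

# Assumption: file names contain a date written as YYYY-MM-DD.
DATE_PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2})")


def most_recent(filenames):
    """Return the file name with the latest embedded date, or None if none has one."""
    dated = []
    for name in filenames:
        match = DATE_PATTERN.search(name)
        if not match:
            continue  # no date in this name
        try:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d").date()
        except ValueError:
            continue  # looks like a date but is not a valid one
        dated.append((stamp, name))
    if not dated:
        return None
    return max(dated)[1]


# Hypothetical file names for illustration.
files = ["sites_2020-11-03.csv", "sites_2021-01-15.csv", "notes.txt"]
latest = most_recent(files)
print(latest)
```

Parsing the date rather than relying on file modification times keeps the result stable when files are copied or re-downloaded.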
A function to re-order a dataframe into a specific order. Not my code, link is at the bottom showing where I got the code from (Thanks Stackoverflow!!)
The sWorm team want people to use this data :) All we ask is that you cite the Phillips et al. manuscript in Scientific Data, and maybe also Phillips et al., 2019 Science. And let us know when you publish your work (we would love to keep track of the impact of this data).
In order to ensure we are not working on the same scientific questions, you can also contact any of the main people involved (myself, Nico and Erin) and let us know about your project.
If you need any more insights into the data, it would be best to contact me (helen[dot]r[dot]phillips at googlemail.com).