This repository contains instructions on how to handle the data, from its raw input until obtaining the results reported in the final paper.
Due to the fact that some files in this project are too big to fit in the repository, we had to split them into several parts so everything could be uploaded.
To use the files, one have to join them again. To do this, you simply clone
or download
the repository, go to a folder, for example the data
one, and use this command in your terminal:
$ cat data.zip.part_* > data.zip
The following subsections described how the final data was obtained in order to proceed to the analysis. We filtered data from the following groups:
- Molluscs
- Arachnids
- Crustaceans
- Insects
- Fishes
- Amphibians
- Reptiles
- Birds
- Mammals
- Algae
- Pteridophytes
- Angiosperms
We obtained the raw data for the species of each group considering these databases:
- SpeciesLink
- VertNet
- Bison
- GBIF
Not all groups have all four databases.
Therefore, in the following steps, when the word group
appears we are refering to any of the groups mentioned before. The same goes for the word database
.
Open one of the /group/raw_data/group_database.{xlsx or csv or dat}
database files and copy the columns scientificname
, eventdate
, decimallongitude
, and decimallatitude
into the respective files specie.dat
, dates.dat
, lon.dat
, and lat.dat
.
For example, for the Birds
group we select the database file /birds/raw_data/SpeciesLink.xslx
, there will be 48 columns, which we select just the ones mentioned above.
For some groups, there are files for each species. The procedure is the same as the one described above. You just have to change the database
with the species
name. For example, for the Algae group, the database file /algae/raw_data/Hedera helix.xlsx
will have 42 columns, which we select just the ones mentioned above. These files were converted to .dat
files and added to the group/sketch
folder.
This procedure can be easily done using the pandas
module from python
. However, at the time, bash
commands were used.
Once you have the file specie.dat
, dates.dat
, lon.dat
, and lat.dat
you have to use the /group/raw_data/concatenate.sh
script to put together all the columns of the current database.
After creating all the database files with four columns (specie, eventdate, longitude, latitude), use the date_corrector.sh
to correct the dates, i.e., put them all in the same format YYYY-MM-DD
.
Create a file species.dat
containing all the species of interest. It is very importante that the last line of this file is blank.
Create a folder group/sketch
.
Run create_all_files.sh
to create all the species files. This script will run find_specie.sh
, which will create a file group/sketch/specie
.
All instructions needed to create the files are within the the file python/creating_final_data.ipynb
. Several filter were applied to the data. One can find the following folders within group/data_ok
:
- 500_plus: files of species with more than 500 occurrences;
- 1000_plus: files of species with more than 1000 occurrences;
- 1900: files of species with occurrences after 1900;
- all: files of species with all their occurrences;
- all_species_together_origin_all: a unique file pooling native occurrences from all species from the group;
- before_1900: files of species with occurrences before 1900;
- origin: files of species with more than 500 or 1000 native occurrences;
- origin_all: files of species with all their native occurrences;
- origin_before_1900: files of species with native occurrences before 1900.
The notebook python/creating_final_data.ipynb
is also used to filter the occurrences into native and non-native. This procedure is done for each species. The files within the folder world_boundaries
are used to asses if the occurrences are in or out the native range of each specie. The filtered data can be found within the folders python/filtered_regions/origin_data
and python/filtered_regions/non_origin_data
. Within these folder we have subdivisions as mentioned before. However, we have other folders within python/filtered/regions
:
- origin_continents: files for each group with their species and native continents;
- raw_data: the raw data from before, already joined from the databases.
Within the notebook python/data_analysis.ipynb
one can find the code to perform the fit considering the number of occurrences within native and non-native regions.