Fetch North American Breeding Bird Survey (BBS) data with mistnet train & test split
The code in this repository uses the scripts provided with David Harris's mistnet model to process the BBS data into training and test datasets, as well as cross-validation folds. The mistnet paper is available here.
Please note: The dataset fetched here is proprietary. Please make sure to read the terms of use here.
The code downloads the BBS data, uses David Harris's (modified) scripts to split
them into training and test sets, and saves the results as CSVs in the
subfolder csv_bird_data.
Please note that the scripts have been updated to use the latest release of the BBS dataset. This meant I had to remove some checks. I will run further checks on the data in the coming months and make updates if required, but use at your own risk for now!
Requirements
To run, the code requires:
- Python (tested under Python 2.7.14)
- R (tested under 3.4.3) with packages geosphere, raster, caret and lubridate
- The UNIX command line tool wget
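Before starting the long download, it can help to confirm that the command line tools are actually on your PATH. The following is a minimal sketch and not part of this repository: the function name missing_tools is hypothetical, the executable names R and wget are assumptions about a typical UNIX install, and shutil.which requires Python 3.3 or newer (the repository itself was tested under Python 2.7). It does not check the R packages.

```python
import shutil

# Executable names are assumptions: "R" and "wget" as they are usually
# named on UNIX-like systems.
REQUIRED_TOOLS = ["R", "wget"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing command line tools: " + ", ".join(missing))
    else:
        print("All required command line tools were found.")
```

An empty list means both tools were found; anything else names what still needs to be installed.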
How to run
Make sure (!) to clone this repository with its submodules by using:
git clone --recurse-submodules CLONE_URL
Once cloned, you should be able to simply run:
python prepare_dataset.py
Note that this can take a while (roughly 30 minutes in total), since it has to
download many files and process the results. If everything goes to plan, you
should find a folder called csv_bird_data with the following contents:
├── fold.ids.csv
├── in.test.csv
├── in.train.csv
├── latlon.csv
├── route.presence.absence.csv
├── species.data.csv
└── x.csv
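
To confirm that the run finished completely, you could compare the output folder against the listing above. This is a small sketch under the assumption that the script is run from the repository root; the function name missing_outputs is hypothetical and not part of the repository. It only checks that the files exist, not that their contents are valid.

```python
import os

# File names copied from the folder listing above.
EXPECTED_FILES = [
    "fold.ids.csv",
    "in.test.csv",
    "in.train.csv",
    "latlon.csv",
    "route.presence.absence.csv",
    "species.data.csv",
    "x.csv",
]


def missing_outputs(folder="csv_bird_data", expected=EXPECTED_FILES):
    """Return the expected file names that are absent from `folder`."""
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(folder, name))
    ]


if __name__ == "__main__":
    missing = missing_outputs()
    if missing:
        print("Missing output files: " + ", ".join(missing))
    else:
        print("All expected output files are present.")
```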