Species distribution modelling using machine learning (SDM_ML)
This repository contains code to fit and evaluate (multi-) species distribution models in python. It accompanies the manuscript "Multi-Output Gaussian Processes for Species Distribution Modelling", which is currently under review.
Requirements & setup
- Python 3 (preferably using anaconda / miniconda)
- The requirements listed in
requirements.txt
(preferably installed usingconda install --file requirements.txt
) - If you would like to use the BRT model, you will also need to make sure that
your R installation has the packages
dismo
andgbm
installed. - The library
ml_tools
. Please clone the repository from here: https://github.com/martiningram/ml_tools . Then, install by cd ing into the cloned directory and runningpip install -e .
- The library
svgp
. Please clone from here: https://github.com/martiningram/svgp . The install just likeml_tools
by cd-ing into the cloned directory and runningpip install -e .
. - With all these requirements in place, install the
sdm_ml
package by runningpip install -e .
in the base directory.
A Dockerfile is also available in docker/Dockerfile
.
Getting started
All models implemented in this framework share the same API, which consists of
the functions fit
, predict_log_marginal_probabilities
,
predict_marginal_probabilities
, save_model
, and calculate_log_likelihood
.
We start by showing how to fit the simplest model -- a logistic regression.
from sdm_ml.scikit_model import ScikitModel
from sklearn.linear_model import LogisticRegression
from functools import partial
import numpy as np
model = ScikitModel(model_fun=partial(LogisticRegression, penalty='none',
solver='newton-cg'))
# Make up some fake data
X = np.random.randn(5, 4)
y = np.random.randint(0, 2, size=(5, 2))
# We can now fit the model like so:
model.fit(X, y)
# We can predict the training set like so:
y_pred = model.predict_marginal_probabilities(X)
print('The predicted probabilities are:')
print(y_pred)
# And the log likelihood:
train_lik = model.calculate_log_likelihood(X, y)
print(f'The log likelihood at each site is: {train_lik}')
# Finally, we save the model results in case we want to look at them later
model.save_model('./saved_logistic_regression')
Fitting a multi-output GP (MOGP)
Fitting the MOGP is very similar, see below.
from sdm_ml.gp.hierarchical_mogp import HierarchicalMOGP
import numpy as np
# Configuration for the MOGP
n_latent = 2
n_inducing = 10
kernel = 'matern_3/2'
# Build the model
model = HierarchicalMOGP(n_inducing=n_inducing, n_latent=n_latent,
kernel=kernel)
# The rest is the same as the logistic regression example:
n_data = 20
n_cov = 4
n_outputs = 5
np.random.seed(2)
# Make up some fake data
X = np.random.randn(n_data, n_cov)
y = np.random.randint(0, 2, size=(n_data, n_outputs))
# We can now fit the model like so (please allow a few minutes for this,
# especially on CPU):
model.fit(X, y)
# We can predict the training set like so:
y_pred = model.predict_marginal_probabilities(X)
print('The predicted probabilities are:')
print(y_pred)
# And the log likelihood:
train_lik = model.calculate_log_likelihood(X, y)
print(f'The log likelihood at each site is: {train_lik}')
# Finally, we save the model results in case we want to look at them later
model.save_model('./saved_mogp')
Reproducing the paper experiments
Code to run the experiments in the paper (with the exception of HMSC; see below) can be found in the script:
scripts/evaluate_model.py
To run this code, you will have to obtain the datasets used in the paper:
- The breeding bird survey dataset can be fetched using the helpers in this
repository: repo.
Once downloaded, please set the environment variable
BBS_PATH
to point to the foldercsv_bird_data
generated. - The datasets used in Anna Norberg's review paper are available here: zenodo
link. Once downloaded, please
set the environment variable
NORBERG_PATH
to point to theDATA
folder in that dataset containing the individual.csv
files.
Please also set the environment variable SDM_ML_EVAL_PATH
to the directory you
would like to save results to.
In the script, you will find a dictionary called models
which defines which
models are run. You can comment and uncomment them to select which to run.
Please note that trying to run all of them at once may consume too much RAM on
your machine. If you run into trouble, try running them one (or a few) at a
time.
Please note also that the SOGP and particularly the MOGP models benefit strongly from GPU acceleration and will run much more quickly if one is available.
Running HMSC
The only model not implemented in our common evaluation framework is HMSC. Code
to run HMSC can be found in the subfolder ./R/hmsc/
.
To run HMSC, please do the following:
- Run
./R/hmsc/fetch_datasets.py
to obtain the datasets (see the previous section for how to allow sdm_ml to find them). - Take a look at the
./R/hmsc/run.sh
shell script to see how to run HMSC. The parameters for the number of samples etc. are given in our paper. - Once run, convergence can be checked using
./R/hmsc/compute_convergence.R
.