A classification framework to enhance your habitat distribution models

This is the code for the framework presented in the paper "Phytosociology meets artificial intelligence: a deep learning classification framework for biodiversity monitoring of European flora through accurate habitat type prediction based on vegetation-plot records", published in Applied Vegetation Science.
This framework aims to facilitate the training and sharing of Habitat Distribution Models (HDMs) using various types of input covariates including cover abundances of plant species and information about plot location.

Python version 3.7 or higher and CUDA are required.

On many systems, Python comes pre-installed. Run the following command to check whether a suitable version is already installed:

python --version

If Python is not already installed, or if the installed version is 3.6 or lower, you will need to install a functional version of Python on your system by following the official documentation, which contains a detailed guide on how to set up Python.

To check whether CUDA is already installed on your system, run the following command:

nvcc --version

If it is not, make sure to follow the instructions here.
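Both checks can also be scripted. The following is a minimal sketch using only the Python standard library; the function names are illustrative and are not part of hdm-framework. Finding `nvcc` on the PATH only suggests that the CUDA toolkit is installed; it does not verify that a compatible GPU driver is present.

```python
import shutil
import sys

MIN_PYTHON = (3, 7)  # minimum version stated in the prerequisites above


def python_is_recent_enough(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def nvcc_on_path():
    """Return True if the CUDA compiler driver `nvcc` is found on PATH."""
    return shutil.which("nvcc") is not None


if __name__ == "__main__":
    print("Python version OK:", python_is_recent_enough())
    print("nvcc found on PATH:", nvcc_on_path())
```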

The framework is optimized for data files from the European Vegetation Archive (EVA). These files contain all the information the framework needs: for each vegetation plot, the full list of vascular plant species, the estimated cover abundance of each species, the location and the EUNIS classification. Once the database is downloaded (more information here), make sure you rename the species and header data files to eva_species.csv and eva_header.csv, respectively. Not all columns in these files are needed, but if you decide to remove some of them to save disk space, make sure that the values are still tab-separated and that you keep at least:

  • the columns PlotObservationID, Matched concept and Cover % from the species file (vegetation-plot data)
  • the columns PlotObservationID, Cover abundance scale, Date of recording, Expert System, Longitude and Latitude from the header file (plot attributes)
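If you trim the files, it can help to confirm that the required columns survived. The sketch below reads only the first line of a tab-separated file and reports which of the columns listed above are missing. The helper name `missing_columns` is hypothetical (it is not part of hdm-framework), and UTF-8 encoding is an assumption that may need adjusting for your export.

```python
import csv

# Minimum columns listed above for each file.
REQUIRED_SPECIES = {"PlotObservationID", "Matched concept", "Cover %"}
REQUIRED_HEADER = {
    "PlotObservationID", "Cover abundance scale", "Date of recording",
    "Expert System", "Longitude", "Latitude",
}


def missing_columns(path, required):
    """Return the required column names absent from a tab-separated file."""
    with open(path, newline="", encoding="utf-8") as handle:
        # Only the header row is read, so large files are cheap to check.
        header = next(csv.reader(handle, delimiter="\t"), [])
    return sorted(set(required) - set(header))
```

For example, `missing_columns("Datasets/eva_species.csv", REQUIRED_SPECIES)` returns an empty list when every required species column is present.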

Firstly, install hdm-framework by cloning the repository:

git clone https://github.com/cesar-leblanc/hdm-framework.git
cd hdm-framework

Secondly, make sure that the dependencies listed in the environment.yml and requirements.txt files are installed. One way to do so is to use conda:

conda env create -f environment.yml
conda activate hdm-env

Thirdly, to check that the installation went well, use the following command:

python main.py --pipeline 'check' 

If the framework was properly installed, it should output:

No missing files.

No missing dependencies.

Environment is properly configured.

Make sure to place the species and header data files inside the Datasets folder before going further.

To pre-process the data from the European Vegetation Archive and create the input data and the target labels, run the following command:

python main.py --pipeline 'dataset' 

The command accepts options to create a different dataset. Here is an example that keeps only vegetation plots from France and Germany that were recorded after 2000 and classified to level 2 of the EUNIS hierarchy:

python main.py --pipeline 'dataset' --countries 'France, Germany' --min_year 2000 --level 2

To evaluate the parameters of a classifier on the dataset previously obtained using cross validation, run the following command:

python main.py --pipeline 'evaluation' 

Some changes can be made from this command to evaluate other parameters. Here is an example to evaluate a TabNet Classifier using the top-3 macro average multiclass accuracy:

python main.py --pipeline 'evaluation' --model 'tnc' --average 'macro' --k 3 
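To make the metric in this example concrete, here is a minimal pure-Python sketch of one common definition of top-k macro-averaged accuracy: a sample counts as correct when its true class is among the k highest-scoring classes, the hit rate is computed per class, and the per-class rates are then averaged with equal weight. This is an illustration under that assumed definition, not the framework's own implementation, and the function name is hypothetical.

```python
def top_k_macro_accuracy(scores, labels, k=3):
    """Mean over classes of the per-class top-k hit rate.

    scores: list of per-sample lists of class scores.
    labels: list of true class indices, one per sample.
    """
    hits, totals = {}, {}
    for row, label in zip(scores, labels):
        # Indices of the k highest-scoring classes for this sample.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        totals[label] = totals.get(label, 0) + 1
        hits[label] = hits.get(label, 0) + (label in top_k)
    # Classes that never appear in `labels` are left out of the average.
    return sum(hits[c] / totals[c] for c in totals) / len(totals)
```

Because every class contributes equally, rare habitat types weigh as much as common ones, which is why the macro average can differ from the plain (micro) proportion of correct samples.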

To train a classifier from the dataset previously obtained and save its weights, run the following command:

python main.py --pipeline 'training' 

Some changes can be made from this command to train another classifier. Here is an example to train a Random Forest Classifier with 50 trees using the cross-entropy loss:

python main.py --pipeline 'training' --model 'rfc' --n_estimators 50 --criterion 'log_loss'

Before making predictions, make sure you include two new files that describe the vegetation data of your choice in the Datasets folder: test_species.csv and test_header.csv. The two files should contain the following columns (with tab-separated values):

  • PlotObservationID (integer), Matched concept (string) and Cover % (float) for the species data, which respectively describe the plot identifier, the taxon names and the percentage cover
  • PlotObservationID (integer), Longitude (float) and Latitude (float) for the header data, which respectively describe the plot identifier, the plot longitude and the plot latitude
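As an illustration of the expected layout, the sketch below writes two small tab-separated files with the columns listed above. The plot values are placeholders for illustration only (not real observations), and the helper names are hypothetical rather than part of hdm-framework.

```python
import csv
from pathlib import Path


def write_tsv(path, columns, rows):
    """Write rows to a tab-separated file with the given column header."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(columns)
        writer.writerows(rows)


def write_example_test_files(folder="Datasets"):
    """Create test_species.csv and test_header.csv with placeholder rows."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    # Placeholder values, for illustration only.
    write_tsv(folder / "test_species.csv",
              ["PlotObservationID", "Matched concept", "Cover %"],
              [(1, "Quercus robur", 40.0), (1, "Fagus sylvatica", 25.0)])
    write_tsv(folder / "test_header.csv",
              ["PlotObservationID", "Longitude", "Latitude"],
              [(1, 3.87, 43.61)])
```

Each species row shares its PlotObservationID with exactly one header row, which is how the two files are linked.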

To predict the classes of the new samples using a previously trained classifier, make sure the weights of the desired model are stored in the Models folder and then run the following command:

python main.py --pipeline 'prediction' 

The command accepts options to predict differently. Here is an example that predicts using an XGBoost classifier with neither the external criteria nor the GBIF normalization:

python main.py --pipeline 'prediction' --model 'xgb' --features 'species' --gbif_normalization False

This section lists the major frameworks and libraries used to create the models included in the project:

  • PyTorch - MultiLayer Perceptron classifier (MLP)
  • scikit-learn - Random Forest Classifier (RFC)
  • XGBoost - XGBoost classifier (XGB)
  • pytorch_tabnet - TabNet Classifier (TNC)
  • RTDL - Feature Tokenizer + Transformer classifier (FTT)

This roadmap outlines the planned features and milestones for the project. Please note that the roadmap is subject to change and may be updated as the project progresses.

  • Implement multilingual user support
    • English
    • French
  • Integrate new popular algorithms
    • MLP
    • RFC
    • XGB
    • TNC
    • FTT
    • KNN
    • GNB
  • Add more habitat typologies
    • EUNIS
    • NPMS
  • Include other data aggregators
    • EVA
    • TAVA
  • Offer several powerful frameworks
    • PyTorch
    • TensorFlow
    • JAX
  • Allow data parallel training
    • Multithreading
    • Multiprocessing
  • Supply different classification strategies
    • Top-k classification
    • Average-k classification

This framework is distributed under the Unlicense, meaning that it is dedicated to the public domain. See UNLICENSE.txt for more information.

If you plan to contribute new features, please first open an issue and discuss the feature with us. See CONTRIBUTING.md for more information.

It is strongly advised not to:

  • skip the normalization of species names against the GBIF backbone, as it could become a major obstacle in your ecological studies if you seek to combine multiple datasets
  • omit the external criteria when preprocessing the datasets, as it could lead to inconsistencies while training models or making predictions

hdm-framework is a community-driven project with several skillful engineers and researchers contributing to it.
hdm-framework is currently maintained by César Leblanc with major contributions coming from Alexis Joly, Pierre Bonnet, Maximilien Servajean, and the amazing people from the Pl@ntNet Team in various forms and means.

┌── data                               <- Folder containing data-related scripts.
│   ├── __init__.py                    <- Initialization script for the 'data' package.
│   ├── load_data.py                   <- Module for loading data into the project.
│   ├── preprocess_data.py             <- Module for data preprocessing operations.
│   └── save_data.py                   <- Module for saving data or processed data.
├── Data                               <- Folder containing the created data.
├── Datasets                           <- Folder containing various datasets for the project.
│   ├── EVA                            <- Folder containing original EVA datasets.
│   ├── NPMS                           <- Folder containing original NPMS datasets.
│   ├── arborescent_species.npy        <- List of all arborescent species.
│   ├── digital_elevation_model.tif    <- Digital elevation model data in TIFF format.
│   ├── eunis_habitats.xlsx            <- Excel file containing the list of EUNIS habitats.
│   ├── red_list_habitats.xlsx         <- Excel file containing the list of red list habitat data.
│   ├── ecoregions.dbf                 <- Database file for ecoregion data.
│   ├── ecoregions.prj                 <- Projection file for ecoregion shapefile.
│   ├── ecoregions.shp                 <- Shapefile for ecoregion data.
│   ├── ecoregions.shx                 <- Index file for ecoregion shapefile.
│   ├── united_kingdom_regions.dbf     <- Database file for United Kingdom regions data.
│   ├── united_kingdom_regions.prj     <- Projection file for United Kingdom regions shapefile.
│   ├── united_kingdom_regions.shp     <- Shapefile for United Kingdom regions data.
│   ├── united_kingdom_regions.shx     <- Index file for United Kingdom regions shapefile.
│   ├── vegetation.dbf                 <- Database file for vegetation data.
│   ├── vegetation.prj                 <- Projection file for vegetation shapefile.
│   ├── vegetation.shp                 <- Shapefile for vegetation data.
│   ├── vegetation.shx                 <- Index file for vegetation shapefile.
│   ├── world_countries.dbf            <- Database file for world countries data.
│   ├── world_countries.prj            <- Projection file for world countries shapefile.
│   ├── world_countries.shp            <- Shapefile for world countries data.
│   ├── world_countries.shx            <- Index file for world countries shapefile.
│   ├── world_seas.dbf                 <- Database file for world seas data.
│   ├── world_seas.prj                 <- Projection file for world seas shapefile.
│   ├── world_seas.shp                 <- Shapefile for world seas data.
│   └── world_seas.shx                 <- Index file for world seas shapefile.
├── Experiments                        <- Folder for experiment-related files.
│   ├── ESy                            <- Folder containing the expert system.
│   ├── cmd_lines.txt                  <- Text file with command line instructions.
│   ├── data_visualization.ipynb       <- Jupyter notebook for data visualization.
│   ├── results_analysis.ipynb         <- Jupyter notebook for results analysis.
│   ├── model_interpretability.py      <- Module for model interpretability.
│   └── test_set.ipynb                 <- Jupyter notebook for creating a test set.
├── Images                             <- Folder for image resources.
│   ├── hdm-framework.pdf              <- Overview of hdm-framework image.
│   ├── logo.png                       <- Project logo image.
│   ├── neuron-based_models.pdf        <- Key aspect of neuron-based models image.
│   ├── transformer-based_models.pdf   <- Key aspect of transformer-based models image.
│   └── tree-based_models.pdf          <- Key aspect of tree-based models image.
├── models                             <- Folder for machine learning models.
│   ├── ftt.py                         <- Module for the FTT model.
│   ├── __init__.py                    <- Initialization script for the 'models' package.
│   ├── mlp.py                         <- Module for the MLP model.
│   ├── rfc.py                         <- Module for the RFC model.
│   ├── tnc.py                         <- Module for the TNC model.
│   └── xgb.py                         <- Module for the XGB model.
├── Models                             <- Folder containing the trained models.
├── pipelines                          <- Folder containing pipeline-related scripts.
│   ├── check.py                       <- Module for checking the configuration.
│   ├── dataset.py                     <- Module for creating the train dataset.
│   ├── evaluation.py                  <- Module for evaluating the models.
│   ├── __init__.py                    <- Initialization script for the 'pipelines' package.
│   ├── prediction.py                  <- Module for making predictions.
│   └── training.py                    <- Module for training the models.
├── .github                            <- Folder for GitHub-related files.
│   ├── ISSUE_TEMPLATE                 <- Folder for issues-related files.
│   │   ├── bug_report.md              <- Template for reporting bugs.
│   │   └── feature_request.md         <- Template for requesting new features.
│   │
│   └── pull_request_template.md       <- Template for creating pull requests.
├── cli.py                             <- Command-line interface script for the project.
├── CODE_OF_CONDUCT.md                 <- Code of conduct document for project contributors.
├── CONTRIBUTING.md                    <- Guidelines for contributing to the project.
├── environment.yml                    <- YAML file specifying project dependencies.
├── __init__.py                        <- Initialization script for the root package.
├── main.py                            <- Main script for running the project.
├── README.md                          <- README file containing project documentation.
├── requirements.txt                   <- Text file listing project requirements.
├── SECURITY.md                        <- Security guidelines and information for the project.
├── UNLICENSE.txt                      <- License information for the project (Unlicense).
└── utils.py                           <- Utility functions for the project.

