The goal of the cloud-mask library is to process data and run ML algorithms at the pixel level. It is based on the only publicly available dataset of manually labelled Sentinel-2 scenes for cloud masking. The dataset gathers 97 tiles with more than 5 million manually labelled pixels, curated by Hollstein et al. The database is provided by EnMAP and is available from the GitLab repository "Database File of Manually classified Sentinel-2A Data".

A report summarizing the findings and results is available here.

Installation

To install the package, first clone the git repository into the desired folder:

git clone git@github.com:j-desloires/s2-pixel-detection.git

Then open an Anaconda prompt and create the environment from environment.yml:

cd s2-pixel-detection
conda env create -f environment.yml
conda activate cloud-mask
pip install .

If you want to run the Jupyter notebooks, you must register the environment as a kernel:

ipython kernel install --user --name=cloud-mask

Dependencies

This package depends on:

eo-flow, a Python module developed by Sinergise. You must clone and install this library in your environment.
adapt, a Python module developed by Michelin. The CNN model class has the attributes self.encoder, self.task and self.discriminator to be compatible with this library.

Usage

# Load and prepare experiments
from cloudmask.data_load import load_file, experiments

# Perform the experiments
## Load the prepared train and test dataset
from cloudmask.data_load.utils_exp import load_test_data, load_train_data, subset_data
## Hyperparameter tuning for a standard model with .fit and .predict methods
from cloudmask.base_models.base_ml_tuning import get_predictions, get_dictionary_metrics
## LGBM with focal loss for multiclass classification (one-vs-rest)
from cloudmask.ml_task.ml_models import OneVsRestLightGBMWithCustomizedLoss, FocalLoss
## CNN1D applied along the spectral dimension
from cloudmask.ml_task import convnet_models

The Python scripts in the /examples/scripts folder show how to call the different modules to run the experiments. They assume that you have already downloaded the data from this link into the examples/data folder of the repo. The scripts run the experiments in the following order:

Data loading
1. Load the data and group pixels per tile. All pixels are saved into a NumPy array of shape (N, 13).
2. Prepare the experiments for the ML task by splitting the data into training, validation and test sets at a given geographical scale (tile, country or continent).
Explore the data and compute descriptive statistics here.
Run standard ML algorithms (e.g. RandomForest) with random-search hyperparameter tuning and leave-one-group-out cross-validation here.
Run LightGBM with focal loss to handle the imbalanced multiclass problem here.
Run CNN1D models. You can adjust the hyperparameters through the dictionary here.
Load results for ML algorithms. The metrics were already saved during the training phase here. We focus on LightGBM with focal loss.
Load results for deep learning algorithms. For a given test set, we consider the model that gave the best metric on the validation sets (to be improved) here.
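
The random-search step with leave-one-group-out cross-validation can be illustrated with plain scikit-learn. This is only a minimal sketch, not the code used by the scripts: the synthetic (N, 13) array, the tile identifiers in groups, the parameter grid and the f1_macro scoring are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, RandomizedSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in for the labelled pixels (assumption): 300 pixels, 13 bands.
n_pixels, n_bands, n_classes, n_tiles = 300, 13, 3, 5
X = rng.random((n_pixels, n_bands))
y = rng.integers(0, n_classes, size=n_pixels)

# One group id per pixel; here each hypothetical tile contributes 60 pixels.
groups = np.repeat(np.arange(n_tiles), n_pixels // n_tiles)

# Hypothetical search space, kept small so the sketch runs quickly.
param_distributions = {
    "n_estimators": [10, 20, 50],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 10],
}

# Each fold holds out every pixel of one tile, so no tile is split
# between the training and validation sides.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,
    cv=LeaveOneGroupOut(),
    scoring="f1_macro",
    random_state=0,
)
search.fit(X, y, groups=groups)

n_splits = LeaveOneGroupOut().get_n_splits(groups=groups)
best_params = search.best_params_
print(n_splits, best_params)
```

Grouping by tile (or by country or continent) rather than shuffling pixels keeps pixels from the same scene out of both sides of a split, which would otherwise tend to inflate the validation scores.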

You can then test the scripts after setting root_path to the location where the repo is saved locally.
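
For the LightGBM experiment with focal loss listed above, the loss itself can be sketched in NumPy. This is an illustrative sketch of the standard binary focal loss applied one-vs-rest, not the FocalLoss class shipped with the package: the function names, the alpha and gamma defaults and the example scores are assumptions, and a real LightGBM custom objective would also need the gradient and Hessian, which are omitted here.

```python
import numpy as np


def sigmoid(z):
    """Logistic function mapping raw scores to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_focal_loss(y_true, raw_score, alpha=0.25, gamma=2.0):
    """Element-wise binary focal loss (hypothetical helper).

    y_true is 0 or 1, raw_score is the model output before the sigmoid.
    """
    p = np.clip(sigmoid(raw_score), 1e-12, 1.0 - 1e-12)
    # Positive samples: down-weight confident (easy) predictions by (1 - p)^gamma.
    pos = -alpha * y_true * (1.0 - p) ** gamma * np.log(p)
    # Negative samples: the symmetric term, down-weighted by p^gamma.
    neg = -(1.0 - alpha) * (1.0 - y_true) * p ** gamma * np.log(1.0 - p)
    return pos + neg


# One-vs-rest: turn multiclass labels into one binary target per class.
labels = np.array([0, 2, 1, 2])
raw_scores = np.array([
    [2.0, -1.0, -3.0],
    [-2.0, -1.5, 0.5],
    [-0.5, 0.3, -1.0],
    [1.0, -2.0, -0.2],
])
y_ovr = (labels[:, None] == np.arange(3)).astype(float)

# Mean focal loss of each binary (class versus rest) problem.
per_class_loss = binary_focal_loss(y_ovr, raw_scores).mean(axis=0)
print(per_class_loss)
```

With gamma set to 0 and alpha to 0.5 the expression reduces to half the usual binary cross-entropy; increasing gamma shrinks the contribution of well-classified pixels, which is why this loss is often used for imbalanced classes.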