Project Canopy - DRC slash-and-burn detection model documentation

This repository contains the code used to train a new model that detects slash-and-burn agriculture in Sentinel-2 satellite images of the Congo Basin. The code covers every step of the pipeline, from downloading the satellite imagery to post-processing the model's predictions.

The model is based on the one developed in CB Feature Detection. The previous model was trained specifically to detect logging roads (ISL), while this model targets slash-and-burn agriculture (SAB). Notebooks and files pertaining to the previous model can be found in the "old_notebooks" or "old_files" folders within each directory.

The final model files can be found on S3 here: s3://canopy-production-ml/inference/model_files/ (currently in glacier)

  • best_SAB_model.h5 and best_SAB_weights.h5 are the SAB model
  • model-best.h5 and model_weights_best.h5 are the ISL model

This README contains general information on each directory in this repository, listed in the rough order in which the code in those directories should be run.

Please contact David Nagy (davidanagy@gmail.com) or Misha Lepetic (misha@projectcanopy.org) with any questions.


sample-code


Description: Sample code intended as an introduction to downloading and loading the data and using it to train a model

Requirements:

  • Access to the Project Canopy S3 storage

Assets:

  • DRC_labels_SAB_train/val/test_v1.csv: Copies of the CSV files used to train and test the final SAB model
  • DRC_labels_SAB_train/val_sample.csv: A 10% sample of the train and test CSV files; use these if you lack the storage or computing power to work with the full dataset
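
If you need to regenerate a smaller sample, a stratified draw keeps the class balance of the full file. This is a hypothetical sketch, not the code that produced the sample files; the "label" column name is an assumption about the CSV layout.

```python
import pandas as pd

def sample_labels(csv_path, frac=0.1, label_col="label", seed=42):
    """Draw a stratified sample from a label CSV.

    Sampling within each class (rather than over the whole file) keeps
    rare classes from being lost entirely in a small sample.
    """
    df = pd.read_csv(csv_path)
    return (
        df.groupby(label_col, group_keys=False)
          .apply(lambda g: g.sample(frac=frac, random_state=seed))
          .reset_index(drop=True)
    )
```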

Notebooks:

  • sample_notebook - A notebook with basic code used to download training data, load it, and use it to train a basic model

Scripts:

  • dataloader.py - Code for loading training chips
  • sample_model.py - An extremely bare-bones model, intended purely for introductory purposes
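
In the same introductory spirit, the core of a chip dataloader can be sketched as a batch generator. This is a simplified, hypothetical version of what dataloader.py does: the real pipeline reads GeoTIFF chips from S3, so a loader function is injected here to keep the sketch self-contained.

```python
import numpy as np

def chip_generator(paths, labels, load_fn, batch_size=8, shuffle=True, seed=0):
    """Yield (chips, labels) batches from a list of chip paths.

    load_fn(path) should return one chip as a (H, W, bands) array;
    in the real pipeline it would read a GeoTIFF from S3.
    """
    rng = np.random.default_rng(seed)
    order = np.arange(len(paths))
    if shuffle:
        rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        chips = np.stack([load_fn(paths[i]) for i in idx])
        yield chips, np.asarray([labels[i] for i in idx])
```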

s2cloudless


Description: Downloading cloudfree imagery from Google Earth Engine using the S2Cloudless algorithm

Requirements:

  • A registered Google Earth Engine account
  • That account linked to a Google Cloud Services account

Assets:

  • polygons_101320.csv: A CSV file containing information on the "Misha polygons": polygons that Misha selected as containing forest disturbances, for use in training
  • DRC_squares_3.geojson: A GeoJSON file containing a polygon that's "gridded out" into 10km x 10km squares. Gridding is required because Google Earth Engine has a size limit for individual downloads.
  • reuse_training_data/labels_and_boundaries_sab.csv: A CSV file containing labels and coordinates for the training data used in our previous model
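
The gridding idea behind DRC_squares_3.geojson can be illustrated with a minimal sketch: split a bounding box into 10km x 10km cells so each cell stays under GEE's per-download size limit. This is not the code that produced the actual GeoJSON; it assumes a projected CRS with metre units (e.g. UTM).

```python
def grid_squares(minx, miny, maxx, maxy, cell=10_000):
    """Split a bounding box into cell x cell squares (coordinates in metres).

    Returns a list of (minx, miny, maxx, maxy) tuples; edge cells are
    clipped to the bounding box.
    """
    squares = []
    y = miny
    while y < maxy:
        x = minx
        while x < maxx:
            squares.append((x, y, min(x + cell, maxx), min(y + cell, maxy)))
            x += cell
        y += cell
    return squares
```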

Notebooks:

  • s2cloudless_polygon_export - Example code showing how to use the s2cloudless algorithm to download a satellite image from within a certain area of interest
  • s2cloudless_DRC_export - Uses the s2cloudless algorithm to download cloudfree satellite images from 2019 and 2021 within our area of interest (mainly the forests in the Democratic Republic of the Congo).
  • reuse_training_data/Filter_old_polygons - Re-downloads the training data used in our previous model with the s2cloudless algorithm, keeping the original labels. This code was largely written by Wendy Mak.

Scripts:

  • reuse_training_data/s2cloudless_pipeline.py - A pipeline for downloading images in Google Earth Engine using the s2cloudless algorithm. Mostly written by Wendy Mak.
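
The core compositing idea can be sketched outside Earth Engine with plain arrays: given a per-pixel cloud probability layer for each candidate scene, take each pixel from the least-cloudy scene and drop pixels that are cloudy everywhere. This is an illustrative sketch only; the 40% threshold and the actual GEE calls live in s2cloudless_pipeline.py.

```python
import numpy as np

def cloudfree_composite(scenes, cloud_probs, threshold=40):
    """Build a per-pixel least-cloudy composite from stacked scenes.

    scenes, cloud_probs: arrays of shape (n_scenes, H, W); probabilities
    are on a 0-100 scale as produced by s2cloudless.
    """
    scenes = np.asarray(scenes, dtype=float)
    probs = np.asarray(cloud_probs, dtype=float)
    best = np.argmin(probs, axis=0)  # least-cloudy scene index per pixel
    composite = np.take_along_axis(scenes, best[None], axis=0)[0]
    # Pixels cloudy in *every* scene are left as NaN for later gap-filling
    composite[np.min(probs, axis=0) > threshold] = np.nan
    return composite
```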

Suggested future directions:

  • Download images from 2017 as well. When I tried, the search returned no satellite images from that year, and I lacked the time to investigate why.

data-prep


Description: Additional data preparation before training the model, mainly creating label files and adding a "Normalized Burn Ratio" band

Requirements:

  • Access to the Project Canopy AWS account, specifically filenames in S3

Assets:

  • All CSV files in this folder are label files for the training, validation, and test data. The "SAB" files are the ones used to train and evaluate the final model.
  • The files in deprecated_label_files were made in order to train a model on multiple deforestation drivers (an effort that was eventually abandoned).

Notebooks:

  • making_label_files - Makes a "base" (raw, unbalanced) label file from the training data stored in S3
  • rebalance_csvs - Rebalances the label CSV to eliminate the huge class imbalance in the raw data
  • add_nbr - Per Lloyd Hughes's suggestion, adds a Normalized Burn Ratio band to geotiff files
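
The Normalized Burn Ratio is NBR = (NIR - SWIR) / (NIR + SWIR), using Sentinel-2 bands B8 (NIR) and B12 (SWIR). A minimal sketch of appending it to a chip as an extra band follows; the band indices are assumptions about the chip layout, not taken from the actual GeoTIFFs or the add_nbr notebook.

```python
import numpy as np

def add_nbr(chip, nir_idx=7, swir_idx=11):
    """Append an NBR band to a (H, W, bands) chip.

    Pixels where NIR + SWIR == 0 are set to 0.0 to avoid division by zero.
    """
    nir = chip[..., nir_idx].astype(float)
    swir = chip[..., swir_idx].astype(float)
    with np.errstate(divide="ignore", invalid="ignore"):
        nbr = np.where(nir + swir == 0, 0.0, (nir - swir) / (nir + swir))
    return np.concatenate([chip, nbr[..., None]], axis=-1)
```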

Scripts:

  • none

Suggested future directions:

  • Maybe try out additional "extra bands" to see if they improve the model?

sagemaker-staging


Description: Code used both to train the model and run inference on the full dataset in SageMaker

Requirements:

  • Access to AWS S3 (for training data) and SageMaker (for running the training)

Assets:

  • Training chips can be found here: s3://canopy-production-ml/chips/model2_s2cloudless/training_v2/null/ (currently in Glacier)
  • Full dataset can be found here: s3://canopy-production-ml/full_drc/ (currently in Glacier)
  • Pre-trained models can be found here: s3://canopy-production-ml/pretrained_models/ (currently in Glacier)
  • resnet50.pth: A ResNet backbone pre-trained on Sentinel-2 data; can also be found here
  • weights_resnet.onnx: The above model converted from PyTorch to Onnx by Lloyd Hughes
  • sentinel_resnet_tf: The above model converted from Onnx to Tensorflow using onnx2keras
  • Current best versions of the model can be found here: s3://canopy-production-ml/inference/model_files/ (currently in Glacier)
  • model-best.h5 and model_weights_best.h5 are the ISL (logging roads) model
  • best_SAB_model.h5 and best_SAB_weights.h5 are the SAB (slash-and-burn) model

Notebooks:

  • training/training_notebook - Used to train the model in SageMaker using the scripts in docker_test_folder
  • inference/inference_pipeline - Used to run inference on the full dataset in SageMaker using the best model and best weights

Scripts:

  • training/docker_test_folder/Dockerfile - Dockerfile used when training the model on SageMaker
  • training/docker_test_folder/training.py - The version of the model code I was working on until the deadline
  • training/docker_test_folder/training_used_for_current_model.py - The version of the model code used to train the "best SAB model," using the "sentinel_resnet" pretrained model found on line 500

Suggested future directions:

  • Adjust the "resnet_sentinel" code in training.py (starting on line 581) so it provides good results (unfreeze more layers?)
  • Try adding a Dropout() layer and retraining the model (suggested by Daniel Firebanks-Quevado)
  • Depending on the results, experiment with the classification threshold, or try Monte Carlo Dropout (suggested by Daniel Firebanks-Quevado)

model-development


Description: Notebooks relevant to both developing model code and testing trained models for accuracy, recall, etc.

Assets:

  • DRC_labels_SAB_train/val/test.csv: Labels for training, eval, and test data
  • SAB_labels.json: Necessary json file for testing trained SAB models

Notebooks:

  • Canopy_RGB_Train - A model trained purely on RGB bands. This notebook was written by Shailesh Sridhar and Midhush
  • Canopy_Additional_Bands_model - A model using the ResNet50 architecture that first separates out the non-RGB bands, then adds them back in later. We were unable to figure out how to get this code to run on SageMaker. This notebook was written by Shailesh Sridhar and Midhush
  • evaluation_master - Code used to evaluate trained models

Scripts:

  • test_generator.py - Builds a generator for testing trained models. Used with evaluation_master.ipynb

Suggested future directions:

  • Integrate the code found in Canopy_Additional_Bands_model.ipynb into the model code found in sagemaker-staging/training/docker_test_folder
  • Find both ISL and SAB models that improve on our current metrics: ~80% accuracy and recall for ISL (old model); 69% accuracy and 64% recall for SAB (new model)
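
The headline metrics above can be computed from a binary confusion matrix as sketched below; evaluation_master presumably uses a library equivalent (e.g. scikit-learn's classification_report), so this is only for reference.

```python
def accuracy_and_recall(y_true, y_pred):
    """Accuracy and recall for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, recall
```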

inference


Description: Local code for making predictions on the full DRC dataset

Assets:

  • raw_predictions.zip: The results of the inference code found in sagemaker-staging/inference/inference_pipeline.ipynb
  • predictions/ISL_2019_preds.geojson, etc: Model predictions in the correct geojson format

Notebooks:

  • json_to_geojson - Code used to translate the raw predictions into the correct geojson format; results in the files in the predictions folder
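
The conversion step amounts to wrapping each prediction's footprint in a GeoJSON Feature inside a FeatureCollection. This is a hypothetical sketch: the raw record layout (a dict with "bounds" and "prediction" keys) is an assumption, not the actual schema produced by the SageMaker inference pipeline.

```python
import json

def predictions_to_geojson(records, threshold=0.5):
    """Convert raw prediction records into a GeoJSON FeatureCollection.

    Each record is assumed to hold a bounding box and a score; records
    below the threshold are dropped.
    """
    features = []
    for rec in records:
        if rec["prediction"] < threshold:
            continue
        minx, miny, maxx, maxy = rec["bounds"]
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Polygon",
                # Exterior ring, closed by repeating the first vertex
                "coordinates": [[
                    [minx, miny], [maxx, miny], [maxx, maxy],
                    [minx, maxy], [minx, miny],
                ]],
            },
            "properties": {"score": rec["prediction"]},
        })
    return {"type": "FeatureCollection", "features": features}
```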

Scripts:

  • cortex_drc - My attempts to use Cortex to run inference on the full dataset. I could not get it to work, but kept the code here in case it is useful.

Suggested future directions:

  • Figure out Cortex

analytics


Description: Post-inference refinement of model predictions

Assets:

  • none

Notebooks:

  • remove_orphans - Code used to remove "orphan" predictions, i.e., isolated single or double predictions with no other predictions nearby
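
The orphan-removal idea can be sketched as a neighbor count: keep a prediction only if at least two others fall within a given radius, so lone predictions and isolated pairs are both dropped. The real notebook works on GeoJSON geometries; this sketch assumes simple (x, y) points in a projected CRS, and the 1 km radius is illustrative.

```python
def remove_orphans(points, radius=1_000, min_neighbors=2):
    """Drop points with fewer than min_neighbors other points within radius.

    points: list of (x, y) tuples in metre units. With min_neighbors=2,
    singletons and isolated pairs (the "orphans") are removed.
    """
    def close(a, b):
        # Compare squared distances to avoid a sqrt per pair
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 <= radius ** 2

    kept = []
    for i, p in enumerate(points):
        neighbors = sum(1 for j, q in enumerate(points) if j != i and close(p, q))
        if neighbors >= min_neighbors:
            kept.append(p)
    return kept
```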

Scripts:

  • none

Suggested future directions:

  • Integrate the Open Street Map filtering found in old_notebooks/osm_filter

display


Description: Code used to display results online (no new code compared to the previous model so everything is in old_files)