Contrastive-sc

This repository contains the PyTorch implementation of the paper "Contrastive self-supervised clustering of scRNA-seq data", by Madalina Ciortan under the supervision of Matthieu Defrance (BMC Bioinformatics).

We adapted the self-supervised contrastive learning framework, initially proposed for image processing, to scRNA-seq data. An artificial neural network learns an embedding for each cell during a representation training phase. The embedding is then clustered with a general-purpose clustering algorithm (KMeans or Leiden community detection). Our method, contrastive-sc, was compared with ten other state-of-the-art techniques. A broad experimental study was conducted on both simulated and real-world datasets, assessing multiple external and internal clustering performance metrics (ARI, NMI, Silhouette, Calinski).
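The clustering and evaluation step described above can be sketched as follows. This is a minimal illustration built on scikit-learn, not the code in train.py: the helper name `evaluate_embedding` is hypothetical, and the embedding is synthetic data standing in for the output of the trained network.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    adjusted_rand_score,
    calinski_harabasz_score,
    normalized_mutual_info_score,
    silhouette_score,
)


def evaluate_embedding(embedding, true_labels, n_clusters, seed=0):
    """Cluster an embedding with KMeans and score the partition.

    Hypothetical helper for illustration only.
    """
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    pred = kmeans.fit_predict(embedding)
    return {
        # External metrics: compare predictions with known labels
        "ARI": adjusted_rand_score(true_labels, pred),
        "NMI": normalized_mutual_info_score(true_labels, pred),
        # Internal metrics: use only the embedding and the predictions
        "Silhouette": silhouette_score(embedding, pred),
        "Calinski": calinski_harabasz_score(embedding, pred),
    }


# Synthetic stand-in for a learned cell embedding: 3 well-separated groups
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
true_labels = np.repeat(np.arange(3), 50)
embedding = centers[true_labels] + rng.normal(scale=0.5, size=(150, 2))

scores = evaluate_embedding(embedding, true_labels, n_clusters=3)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

ARI and NMI need ground-truth labels, whereas Silhouette and Calinski can be computed on unlabeled data, which is why the two groups of metrics are reported separately.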

Overview of the repository

  • notebooks folder contains all Jupyter notebooks needed to run the project, as detailed below.
  • others folder contains the code to reproduce all experiments with scanpy, sczi, scDeepCluster
  • R folder contains the scripts to generate the simulated data in folder R/simulated_data (both balanced and imbalanced)
  • output contains model dumps and the results of running all experiments, needed to reproduce the plots
  • docker contains the Dockerfile to create the image used to run all python experiments
  • real_data contains the biological scRNA-seq data, downloaded from scDeepCluster, as detailed below
  • train.py contains the main functionalities for training and evaluating the model results
  • model.py contains the network definition
  • st_loss.py contains the implementation of the loss functions
  • utils.py contains various utility functions
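As a pointer to what a contrastive loss computes, here is a generic NT-Xent loss in the style of SimCLR, written in NumPy so it runs without a GPU. It is a sketch under assumptions, not a copy of st_loss.py: the function name `nt_xent_loss`, the default temperature, and the noise-based "augmentation" in the example are all illustrative choices.

```python
import numpy as np


def nt_xent_loss(z1, z2, temperature=0.5):
    """Generic NT-Xent loss for two batches of embeddings of shape (n, d).

    Row i of z1 and row i of z2 are assumed to be two augmented views
    of the same cell. Illustrative only; the temperature is an assumption.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    # L2-normalise so that dot products are cosine similarities
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    # A sample must not be contrasted with itself
    np.fill_diagonal(sim, -np.inf)
    # Index of the positive partner for each of the 2n rows
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-softmax over each row
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - row_max - log_norm
    return -log_prob[np.arange(2 * n), pos].mean()


rng = np.random.default_rng(0)
view_a = rng.normal(size=(8, 16))
# Second view: the same cells with small noise (stand-in for augmentation)
view_b = view_a + 0.01 * rng.normal(size=(8, 16))
unrelated = rng.normal(size=(8, 16))

loss_aligned = nt_xent_loss(view_a, view_b)
loss_unrelated = nt_xent_loss(view_a, unrelated)
```

The loss is lower when the two views of each cell are close to each other and far from the other cells in the batch, which is the property the representation training phase exploits.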

Overview of notebooks

  • Main.ipynb is the main entry point and contains code snippets to train the model on scRNA-seq data
  • Benchmark_real_data, Benchmark_simulated_data contain the code to reproduce all experiments on contrastive-sc
  • Plots_simulated_data, Plots_real_scRNAseq contain the code to reproduce all figures
  • Grid_search* notebooks contain all ablation studies on network architecture, learning rate, data augmentation strategies, and gene selection strategy

Environment Setup

We use Docker containers to make the paper's results easier to reproduce.

Python environment

The Docker image for the Python experiments can be built by running the following:

cd docker  
docker build -t contrastive-sc .

The image is built for GPU usage. To run it on CPU, replace the base image "pytorch/pytorch:1.4-cuda10.1-cudnn7-runtime" in the Dockerfile with a CPU version.

The command above creates a Docker image tagged contrastive-sc. Assuming the project has been cloned locally into a parent folder named notebooks, a container can be started from the image with:

docker run -it --runtime=nvidia -v ~/notebooks:/workspace/notebooks -p 8888:8888 contrastive-sc

This starts up a jupyter notebook server, which can be accessed at http://localhost:8888/tree/notebooks

R environment

We followed the instructions in this tutorial to create an R Docker container that comes with most single-cell libraries already installed. To launch it on port 8787, execute the following:

docker run -d -p 8787:8787 -e USER='rstudio' -e PASSWORD='rstudioSC' -e ROOT=TRUE -v ~/notebooks/deep_clustering:/home/rstudio/projects vbarrerab/rstudio_singlecell

Data

The simulated datasets can be downloaded from this Google Drive link (~400MB). Alternatively, they can be generated by running R/all_balanced.r or R/all_imbalanced.R.

The single-cell data were collected from the scDeepCluster repository and the scziDesk repository. They should be saved to the real_data folder.

Reproducing the competing methods' results

The competing methods implemented in R were benchmarked with the script made available by scziDesk, which can be found in R/run_methods.r. It has been extended to also compute the Silhouette and Calinski scores.

The remaining Python methods are available in the others folder.