This repository contains the PyTorch implementation of the paper "Contrastive self-supervised clustering of scRNA-seq data", by Madalina Ciortan under the supervision of Matthieu Defrance (BMC Bioinformatics).
We adapted the self-supervised contrastive learning framework, initially proposed for image processing, to scRNA-seq data. An artificial neural network learns an embedding for each cell through a representation training phase. The embedding is then clustered with a general clustering algorithm (e.g. KMeans or Leiden community detection). Our method, contrastive-sc, has been compared with ten other state-of-the-art techniques. A broad experimental study has been conducted on both simulated and real-world datasets, assessing multiple external and internal clustering performance metrics (e.g. ARI, NMI, Silhouette, Calinski).
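The second phase described above (clustering a learned embedding and scoring it with external metrics) can be sketched with standard scikit-learn calls. The synthetic embedding below is a stand-in for the representation produced by the network; the metric functions are the real scikit-learn APIs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Toy stand-in for a learned embedding: 300 cells x 32 latent dims,
# drawn from three well-separated groups
rng = np.random.default_rng(0)
true_labels = np.repeat([0, 1, 2], 100)
centers = rng.normal(size=(3, 32)) * 10
embedding = centers[true_labels] + rng.normal(scale=0.5, size=(300, 32))

# Cluster the representation with KMeans (Leiden would be the alternative)
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embedding)

# External clustering metrics reported in the paper
ari = adjusted_rand_score(true_labels, pred)
nmi = normalized_mutual_info_score(true_labels, pred)
print(f"ARI={ari:.2f}, NMI={nmi:.2f}")
```

Both scores are label-permutation invariant, which is why they are the standard choice when ground-truth cell types are available.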
- notebooks folder contains all jupyter notebooks to run the project, as detailed below.
- others folder contains the code to reproduce all experiments with scanpy, scziDesk, scDeepCluster
- R folder contains the scripts to generate the simulated data in folder R/simulated_data (both balanced and imbalanced)
- output contains model dumps and the results of running all experiments, needed to reproduce the plots
- docker contains the Dockerfile to create the image used to run all python experiments
- real_data contains the biological scRNA-seq data, downloaded from scDeepCluster, as detailed below
- train.py contains the main functionalities for training and evaluating the model results
- model.py contains the network definition
- st_loss.py contains the implementation of the loss functions
- utils.py contains various utility functions
- Main.ipynb is the main entry point and contains code snippets to train the model on scRNA-seq data
- Benchmark_real_data, Benchmark_simulated_data contain the code to reproduce all experiments on contrastive-sc
- Plots_simulated_data, Plots_real_scRNAseq contain code to reproduce all figures
- Grid_search* comprise all ablation studies on network architecture, learning rate, data augmentation strategies, gene selection strategy
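Among the ablation axes listed above, data augmentation is the one specific to contrastive learning on expression data. A minimal sketch of one plausible augmentation, random gene dropout, is shown below; the function name and the dropout rate are illustrative, not the repository's actual API:

```python
import torch

def random_gene_dropout(x: torch.Tensor, keep_prob: float = 0.9) -> torch.Tensor:
    # Zero out each gene count independently with probability 1 - keep_prob,
    # producing one stochastic "view" of the expression profile.
    mask = (torch.rand_like(x) < keep_prob).float()
    return x * mask

# Two independent views of the same cells form a positive pair
# for the contrastive loss
cells = torch.rand(4, 1000)  # 4 cells x 1000 genes (toy data)
view_a = random_gene_dropout(cells)
view_b = random_gene_dropout(cells)
```

Because the two views come from the same cell, the contrastive objective pulls their embeddings together while pushing apart embeddings of different cells.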
We provide a Docker container to facilitate reproducing the paper's results.
The image can be built by running:
```shell
cd docker
docker build -t contrastive-sc .
```
The image has been created for GPU usage. To run it on CPU, replace the base image "pytorch/pytorch:1.4-cuda10.1-cudnn7-runtime" in the Dockerfile with a CPU version.
The command above creates a Docker image tagged contrastive-sc. Assuming the project has been cloned locally into a parent folder named notebooks, the image can be launched locally with:
```shell
docker run -it --runtime=nvidia -v ~/notebooks:/workspace/notebooks -p 8888:8888 contrastive-sc
```
This starts up a jupyter notebook server, which can be accessed at http://localhost:8888/tree/notebooks
We followed the instructions in this tutorial to create an R docker container that comes with most single-cell libraries preinstalled. To launch it on port 8787, run:
```shell
docker run -d -p 8787:8787 -e USER='rstudio' -e PASSWORD='rstudioSC' -e ROOT=TRUE -v ~/notebooks/deep_clustering:/home/rstudio/projects vbarrerab/rstudio_singlecell
```
The simulated datasets can be downloaded from this Google Drive link (~400MB). Alternatively, they can be generated by running R/all_balanced.r or R/all_imbalanced.R.
The single-cell data has been collected from the scDeepCluster and scziDesk repositories. It should be saved to the real_data folder.
The R implementation used for benchmarking the methods is based on the script made available by scziDesk and can be found in R/run_methods.r. It has been enriched with the computation of Silhouette and Calinski scores.
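For reference, the two internal metrics added to the benchmark (Silhouette and Calinski-Harabasz) have direct scikit-learn equivalents on the Python side; the toy data below is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Two well-separated toy clusters in 2D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 5)])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Internal metrics: no ground-truth labels needed
sil = silhouette_score(X, labels)          # in [-1, 1], higher is better
ch = calinski_harabasz_score(X, labels)    # unbounded, higher is better
print(f"Silhouette={sil:.2f}, Calinski-Harabasz={ch:.1f}")
```

Unlike ARI and NMI, these two scores need no ground-truth annotation, which makes them usable on datasets without known cell types.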
The remaining Python methods are available in the others folder.