This repository contains the PyTorch implementation of the paper "Contrastive self-supervised clustering of scRNA-seq data", by Madalina Ciortan under the supervision of Matthieu Defrance (BMC Bioinformatics).
We adapted the self-supervised contrastive learning framework, initially proposed for image processing, to scRNA-seq data. An artificial neural network learns an embedding for each cell through a representation training phase. The embedding is then clustered with a general clustering algorithm (e.g. KMeans or Leiden community detection). Our method, contrastive-sc, has been compared with ten other state-of-the-art techniques. A broad experimental study has been conducted on both simulated and real-world datasets, assessing multiple external and internal clustering performance metrics (e.g. ARI, NMI, Silhouette, Calinski).
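The second phase described above (clustering a learned embedding and scoring it with external metrics) can be sketched with standard scikit-learn calls. The synthetic embedding below is a stand-in for the representation produced by the network; the metric functions are the real scikit-learn APIs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Toy stand-in for a learned embedding: 300 cells x 32 latent dims,
# drawn from three well-separated groups
rng = np.random.default_rng(0)
true_labels = np.repeat([0, 1, 2], 100)
centers = rng.normal(size=(3, 32)) * 10
embedding = centers[true_labels] + rng.normal(scale=0.5, size=(300, 32))

# Cluster the representation with KMeans (Leiden would be the alternative)
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embedding)

# External clustering metrics reported in the paper
ari = adjusted_rand_score(true_labels, pred)
nmi = normalized_mutual_info_score(true_labels, pred)
print(f"ARI={ari:.2f}, NMI={nmi:.2f}")
```

Both scores are label-permutation invariant, which is why they are the standard choice when ground-truth cell types are available.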
- notebooks folder contains all jupyter notebooks to run the project, as detailed below.
- others folder contains the code to reproduce all experiments with scanpy, scziDesk, scDeepCluster
- R folder contains the scripts to generate the simulated data in folder R/simulated_data (both balanced and imbalanced)
- output contains model dumps and the results of running all experiments, needed to reproduce the plots
- docker contains the Dockerfile to create the image used to run all python experiments
- real_data contains the biological scRNA-seq data, downloaded from scDeepCluster, as detailed below
- train.py contains the main functionalities for training and evaluating the model results
- model.py contains the network definition
- st_loss.py contains the implementation of the loss functions
- utils.py contains various utility functions
- Main.ipynb is the main entry point and contains code snippets to train the model on scRNA-seq data
- Benchmark_real_data, Benchmark_simulated_data contain the code to reproduce all experiments on contrastive-sc
- Plots_simulated_data, Plots_real_scRNAseq contain code to reproduce all figures
- Grid_search* comprise all ablation studies on network architecture, learning rate, data augmentation strategies, gene selection strategy
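Among the ablation axes listed above, data augmentation is the one specific to contrastive learning on expression data. A minimal sketch of one plausible augmentation, random gene dropout, is shown below; the function name and the dropout rate are illustrative, not the repository's actual API:

```python
import torch

def random_gene_dropout(x: torch.Tensor, keep_prob: float = 0.9) -> torch.Tensor:
    # Zero out each gene count independently with probability 1 - keep_prob,
    # producing one stochastic "view" of the expression profile.
    mask = (torch.rand_like(x) < keep_prob).float()
    return x * mask

# Two independent views of the same cells form a positive pair
# for the contrastive loss
cells = torch.rand(4, 1000)  # 4 cells x 1000 genes (toy data)
view_a = random_gene_dropout(cells)
view_b = random_gene_dropout(cells)
```

Because the two views come from the same cell, the contrastive objective pulls their embeddings together while pushing apart embeddings of different cells.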
We provide a Docker container to facilitate reproducing the paper's results.
The image can be built by running:
```shell
cd docker
docker build -t contrastive-sc .
```
The image has been created for GPU usage. To run it on CPU, replace the base image "pytorch/pytorch:1.4-cuda10.1-cudnn7-runtime" in the Dockerfile with a CPU version.
The command above creates a Docker image tagged contrastive-sc. Assuming the project has been cloned locally into a parent folder named notebooks, the image can be launched locally with:
```shell
docker run -it --runtime=nvidia -v ~/notebooks:/workspace/notebooks -p 8888:8888 contrastive-sc
```
This starts up a jupyter notebook server, which can be accessed at http://localhost:8888/tree/notebooks
We followed the instructions in this tutorial to create an R docker container that comes with most single-cell libraries preinstalled. To launch it on port 8787, run:
```shell
docker run -d -p 8787:8787 -e USER='rstudio' -e PASSWORD='rstudioSC' -e ROOT=TRUE -v ~/notebooks/deep_clustering:/home/rstudio/projects vbarrerab/rstudio_singlecell
```
The simulated datasets can be downloaded from this Google Drive link (~400MB). Alternatively, they can be generated by running R/all_balanced.r or R/all_imbalanced.R.
The single-cell data has been collected from the scDeepCluster and scziDesk repositories. It should be saved to the real_data folder.
The R implementation used for benchmarking the methods is based on the script made available by scziDesk and can be found in R/run_methods.r. It has been enriched with the computation of Silhouette and Calinski scores.
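For reference, the two internal metrics added to the benchmark (Silhouette and Calinski-Harabasz) have direct scikit-learn equivalents on the Python side; the toy data below is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Two well-separated toy clusters in 2D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 5)])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Internal metrics: no ground-truth labels needed
sil = silhouette_score(X, labels)          # in [-1, 1], higher is better
ch = calinski_harabasz_score(X, labels)    # unbounded, higher is better
print(f"Silhouette={sil:.2f}, Calinski-Harabasz={ch:.1f}")
```

Unlike ARI and NMI, these two scores need no ground-truth annotation, which makes them usable on datasets without known cell types.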
The remaining Python methods are available in the others folder.