MarcoPolo is a clustering-free approach to the exploration of bimodally expressed genes along with group information in single-cell RNA-seq data

MarcoPolo

MarcoPolo is a method to discover differentially expressed genes in single-cell RNA-seq data without depending on prior clustering

Overview

MarcoPolo is a novel clustering-independent approach to identifying differentially expressed genes (DEGs) in scRNA-seq data. MarcoPolo identifies informative DEGs without depending on prior clustering, and is therefore robust to uncertainties from clustering or cell type assignment. Since DEGs are identified independently of clustering, one can use them to detect subtypes of a cell population that standard clustering misses, or to augment highly variable gene (HVG) methods to improve clustering. An advantage of our method is that, by fitting a bimodal distribution, it automatically learns in which cells a gene is expressed and in which it is not. Additionally, our framework provides analysis results in the form of an HTML file so that researchers can conveniently visualize and interpret the results.
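To give a rough feel for what "fitting a bimodal distribution" to one gene can look like, here is a minimal sketch that fits a two-component Poisson mixture to a vector of counts with EM and labels each cell as "on" or "off". This is a generic illustration under our own assumptions (the function name `fit_two_poisson`, the initialization, and the 0.5 threshold are all hypothetical); it is not MarcoPolo's implementation, whose model and fitting procedure may differ.

```python
import numpy as np


def fit_two_poisson(counts, n_iter=200):
    """Illustrative EM fit of a two-component Poisson mixture (not MarcoPolo's code)."""
    counts = np.asarray(counts, dtype=float)
    # Hypothetical initialization: one low-rate and one high-rate component.
    rates = np.array([0.5 * counts.mean() + 1e-3, 2.0 * counts.mean() + 1.0])
    weights = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: per-cell responsibilities. The log(k!) term is shared by
        # both components, so it cancels and is omitted.
        log_p = counts[:, None] * np.log(rates) - rates + np.log(weights)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update mixing weights and Poisson rates.
        totals = np.maximum(resp.sum(axis=0), 1e-12)
        weights = np.maximum(totals / counts.size, 1e-12)
        rates = np.maximum((resp * counts[:, None]).sum(axis=0) / totals, 1e-8)
    # Report components in order of increasing rate ("off" first, "on" second).
    order = np.argsort(rates)
    is_on = resp[:, order[1]] > 0.5
    return rates[order], weights[order], is_on


# Simulated gene: 300 cells with near-zero expression, 100 cells expressing it.
rng = np.random.default_rng(0)
counts = np.concatenate([rng.poisson(0.2, 300), rng.poisson(8.0, 100)])
rates, weights, is_on = fit_two_poisson(counts)
print("rates:", rates, "weights:", weights, "cells called on:", int(is_on.sum()))
```

The point of the sketch is only that the split into expressing and non-expressing cells comes from the fitted distribution itself rather than from cluster labels.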

Datasets URL
Human liver cells (MacParland et al.) https://chanwkimlab.github.io/MarcoPolo/HumanLiver/
Human embryonic stem cells (Koh et al.) https://chanwkimlab.github.io/MarcoPolo/hESC/
Peripheral blood mononuclear cells (Zheng et al.) https://chanwkimlab.github.io/MarcoPolo/Zhengmix8eq/

Preparing dataset

MarcoPolo works with AnnData, a flexible and efficient data format for scRNA-seq data that is widely used in the Python community. Because scanpy and, more broadly, the other packages in the scverse project are also built on AnnData, MarcoPolo works seamlessly alongside these popular single-cell software packages.

Prepare your scRNA-seq data as an AnnData object before running MarcoPolo. Please refer to AnnData's Getting started page for more information about AnnData. If your data is in a Seurat object, you can convert it to AnnData by following the instructions here.

As MarcoPolo runs on raw count data, the AnnData object should contain the raw counts in .X. The structure of AnnData is described here.
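Before running MarcoPolo, it can help to sanity-check that the matrix you plan to store in .X really holds raw counts rather than normalized or log-transformed values. The sketch below is one possible check using only NumPy; the helper name `looks_like_raw_counts` and the `sample_rows` parameter are hypothetical and not part of MarcoPolo or AnnData. It assumes a dense array or a row-sliceable SciPy sparse matrix (CSR or CSC).

```python
import numpy as np


def looks_like_raw_counts(matrix, sample_rows=1000):
    """Heuristic check: are the first rows non-negative integers? (hypothetical helper)"""
    if hasattr(matrix, "toarray"):
        # SciPy sparse matrices expose .toarray(); densify only a slice of rows.
        values = np.asarray(matrix[:sample_rows].toarray())
    else:
        values = np.asarray(matrix)[:sample_rows]
    if values.size == 0:
        return False
    non_negative = bool((values >= 0).all())
    integral = bool(np.equal(np.mod(values, 1), 0).all())
    return non_negative and integral


raw = np.array([[0, 3, 1], [2, 0, 0]])
print(looks_like_raw_counts(raw))            # raw counts
print(looks_like_raw_counts(np.log1p(raw)))  # log-transformed values
```

With an AnnData object named `adata` (an assumed variable name), the same check would be called as `looks_like_raw_counts(adata.X)`. Passing the check does not prove the data are raw counts, but failing it is a strong hint that they are not.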

Running MarcoPolo with Google Colab

You can easily try MarcoPolo with Google Colab: Open In Colab

Google Colab is a free cloud environment for running Python code. Colab lets you run MarcoPolo in your browser without any configuration and without GPU resources of your own.

Running MarcoPolo with your local machine

How to install MarcoPolo

We recommend using the following pipeline to install MarcoPolo.

  1. Anaconda

Please refer to https://docs.anaconda.com/anaconda/install/linux/ to install Anaconda. Then, please make a new conda environment and activate it.

conda create -n MarcoPolo python=3.8
conda activate MarcoPolo
  2. PyTorch

Please install PyTorch from https://pytorch.org/. (If you want to install CUDA-supported PyTorch, please install CUDA in advance.)

  3. MarcoPolo

You can simply install MarcoPolo by using the pip command:

pip install marcopolo-pytorch

If MarcoPolo installed on your machine is outdated, you can get an updated version of MarcoPolo by using the pip command:

pip install marcopolo-pytorch --upgrade
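After installing, you may want to confirm which versions ended up in the active conda environment. The following sketch uses only the Python standard library and looks packages up by their PyPI distribution names; the helper name `installed_version` is hypothetical, and including `anndata` in the list is our assumption based on the data format described above.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as used with pip, not necessarily the import names.
for name in ("torch", "anndata", "marcopolo-pytorch"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

If `marcopolo-pytorch` is reported as not installed, check that the MarcoPolo conda environment is activated before running pip.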

How to run MarcoPolo

Please refer to this notebook for the usage of MarcoPolo.

Citation

If you use any part of this code or our data, please cite our paper.

@article{kim2022marcopolo,
  title={MarcoPolo: a method to discover differentially expressed genes in single-cell RNA-seq data without depending on prior clustering},
  author={Kim, Chanwoo and Lee, Hanbin and Jeong, Juhee and Jung, Keehoon and Han, Buhm},
  journal={Nucleic Acids Research},
  year={2022}
}

Contact

If you have any inquiries, please feel free to contact

  • Chanwoo Kim (Paul G. Allen School of Computer Science & Engineering @ the University of Washington)