License: MIT

SCORE: Single-cell Chromatin Organization Representation and Embedding

A Python package developed by the Jin Lab for combining, benchmarking, and extending methods of embedding, clustering, and visualizing single-cell Hi-C data.

Installation

git clone https://github.com/JinLabBioinfo/SCORE.git;
cd SCORE;
pip install .

Installation should only take a few minutes.

(Optional) TensorFlow and PyTorch GPU support

Some methods, such as Va3DE, rely on GPU-accelerated TensorFlow builds. To install a GPU-enabled build, run

pip install tensorflow[and-cuda]

We also provide Va3DE as a standalone package, which can be installed from https://github.com/JinLabBioinfo/Va3DE

Other methods, such as Higashi, rely on GPU-accelerated builds of PyTorch.

SCORE Usage

You can verify that the installation was successful by running

score --help

Tutorials

We provide some tutorials to help you get started:

Supported Embedding Methods

The following embedding methods can be run using the --embedding_algs argument (not case sensitive):

We also provide additional baseline methods for benchmarking:

  • 1D_PCA (sum all interactions at each bin, embed 1D counts with PCA)
  • 2D_PCA (extract band of interactions, embed with PCA)
  • scVI (sum all interactions at each bin, train scVI model)
  • scVI_2D (extract band of interactions, train scVI model)
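
The two feature constructions named in these baselines can be illustrated on a toy contact matrix. The sketch below is an assumption-laden illustration, not SCORE's implementation: bin_coverage and band_features are hypothetical names, the matrix is a small dense symmetric list of lists for a single chromosome, and the "band" is taken to be the first n_strata diagonals. The resulting per-cell feature vectors are what a PCA or scVI model would then be fitted on.

```python
def bin_coverage(matrix):
    """1D features: total contacts touching each bin (row sums)."""
    return [sum(row) for row in matrix]


def band_features(matrix, n_strata):
    """2D features: concatenate the first n_strata diagonals (strata)."""
    n_bins = len(matrix)
    features = []
    for k in range(min(n_strata, n_bins)):
        # Stratum k holds contacts between bins that are k bins apart.
        features.extend(matrix[i][i + k] for i in range(n_bins - k))
    return features


# Toy 4-bin contact matrix for one cell (illustrative values only).
toy = [
    [4, 2, 1, 0],
    [2, 5, 3, 1],
    [1, 3, 6, 2],
    [0, 1, 2, 7],
]

coverage = bin_coverage(toy)
band = band_features(toy, n_strata=2)
print(coverage)
print(band)
```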

Basic CLI usage

We provide a small example dataset in the examples/data directory. To run SCORE, you simply need to provide an input .scool file and a metadata reference file. Specify the embedding tool(s) you wish to test using the --embedding_algs argument:

# --dset: name for saving results
# --scool: path to scool file
# --reference: metadata reference
# --embedding_algs: embedding method name
score embed --dset oocyte_zygote \
            --scool oocyte_zygote_mm10_1M.scool \
            --reference oocyte_zygote_ref \
            --embedding_algs InnerProduct \
            --n_strata 20

This creates a new results directory (or the directory specified by --out), where results are stored under the name given by --dset. Visualizations are generated for cell types and any other metadata provided; if multiple cell type labels are provided, clustering metrics are computed and stored as well. Each run also saves an anndata_obj.h5ad file, which can be loaded with Scanpy for additional analysis and visualization. Most baseline methods should take only a few minutes to run on this small dataset.
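
For follow-up analysis, the saved anndata_obj.h5ad files first need to be located. The sketch below searches an output directory recursively instead of assuming an exact folder layout, which this README does not spell out. find_anndata_files and load_run are hypothetical helper names; load_run uses anndata.read_h5ad and therefore needs the anndata package installed.

```python
from pathlib import Path


def find_anndata_files(out_dir="results", dset=None):
    """Return every anndata_obj.h5ad under out_dir, optionally filtered by dataset name."""
    root = Path(out_dir)
    if not root.is_dir():
        return []
    paths = sorted(root.rglob("anndata_obj.h5ad"))
    if dset is not None:
        # Assumption: the --dset name appears somewhere in the path below out_dir.
        paths = [p for p in paths if dset in p.relative_to(root).as_posix()]
    return paths


def load_run(path):
    """Load one saved run as an AnnData object."""
    import anndata  # imported lazily so the search helper works without it
    return anndata.read_h5ad(path)


if __name__ == "__main__":
    for path in find_anndata_files("results", dset="oocyte_zygote"):
        print(path)
```

The object returned by load_run can then be passed to the usual Scanpy plotting and clustering functions.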

The datasets analyzed in our benchmark publication are also available at various resolutions. Download them to reproduce our results:

wget hiview10.gene.cwru.edu/~xww/scHi-C_data.tar.gz

For example, to reproduce the short-range complex tissue analysis, we can run:

# --dset: name for saving results
# --scool: path to scool file
# --reference: metadata reference
# --embedding_algs: embedding method name
# --n_strata 10: 0-2Mb
# --min_depth: filter low depth cells
score embed --dset pfc \
            --scool pfc_200kb.scool \
            --reference pfc_ref \
            --embedding_algs InnerProduct \
            --n_strata 10 \
            --min_depth 50000

To skip the 0-2Mb band and embed longer-range contacts instead:

# --dset: name for saving results
# --scool: path to scool file
# --reference: metadata reference
# --embedding_algs: embedding method name
# --strata_offset 10: ignore first 10 strata (i.e. 0-2Mb)
score embed --dset pfc \
            --scool pfc_200kb.scool \
            --reference pfc_ref \
            --embedding_algs InnerProduct \
            --strata_offset 10 \
            --n_strata 100 \
            --min_depth 50000
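
The 0-2Mb figure in the commands above follows from the bin size and the number of strata. The sketch below works through that arithmetic under two assumptions: each stratum is one diagonal of the contact matrix (contacts k bins apart), and the resolution is 200kb, as suggested by the file name pfc_200kb.scool. strata_to_bp is a hypothetical helper, not a SCORE function.

```python
def strata_to_bp(n_strata, resolution_bp):
    """Genomic distance spanned by n_strata diagonals at a given bin size."""
    return n_strata * resolution_bp


def to_mb(bp):
    """Convert base pairs to megabases."""
    return bp / 1_000_000


RESOLUTION_BP = 200_000  # assumed from the file name pfc_200kb.scool

short_range_limit = strata_to_bp(10, RESOLUTION_BP)
print(f"--n_strata 10 covers 0-{to_mb(short_range_limit):g}Mb")

skipped = strata_to_bp(10, RESOLUTION_BP)
print(f"--strata_offset 10 skips the first {to_mb(skipped):g}Mb")
```

The same calculation can be used to pick --n_strata for a dataset at a different resolution.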

Including multiple embedding methods and executing multiple runs using --n_runs will produce a local benchmark on the dataset provided:

# --dset: name for saving results
# --scool: path to scool file
# --reference: metadata reference
# --ignore_filter: keep all cells
score embed --dset embryo \
            --scool embryo_500kb.scool \
            --reference embryo_ref \
            --ignore_filter \
            --embedding_algs 1d_pca InnerProduct scHiCluster \
            --n_runs 10
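
When benchmarking several datasets, it can be convenient to assemble the command from a script. The following is a minimal sketch that uses only flags shown in this README; build_embed_command is a hypothetical helper, and it only builds the argument list without running anything.

```python
import shlex


def build_embed_command(dset, scool, reference, embedding_algs,
                        n_runs=None, ignore_filter=False, out=None):
    """Assemble a `score embed` argument list from flags documented above."""
    argv = [
        "score", "embed",
        "--dset", dset,
        "--scool", scool,
        "--reference", reference,
        "--embedding_algs", *embedding_algs,
    ]
    if n_runs is not None:
        argv += ["--n_runs", str(n_runs)]
    if ignore_filter:
        argv.append("--ignore_filter")
    if out is not None:
        argv += ["--out", out]
    return argv


argv = build_embed_command(
    dset="embryo",
    scool="embryo_500kb.scool",
    reference="embryo_ref",
    embedding_algs=["1d_pca", "InnerProduct", "scHiCluster"],
    n_runs=10,
    ignore_filter=True,
)
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run(argv, check=True) to launch the benchmark.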