DISC

A highly scalable and accurate inference of gene expression and structure for single-cell transcriptomes using semi-supervised deep learning.

  • Free software: Apache License 2.0

Requirements

Installation

  • Install TensorFlow

    If you have an NVIDIA GPU, first install a version of TensorFlow that supports it; DISC runs much faster on a GPU:

    pip install "tensorflow-gpu>=1.13.1,<2.0.0"
    

    We typically use tensorflow-gpu==1.13.1.

    Requirements for the GPU version of TensorFlow:

    * Hardware
        * NVIDIA GPU card with CUDA Compute Capability 3.5 or higher.
    * Software
        * NVIDIA GPU drivers - CUDA 10.0 requires 410.x or higher.
        * CUDA Toolkit - TensorFlow supports CUDA 10.0 (TensorFlow >= 1.13.0)
        * CUPTI ships with the CUDA Toolkit.
        * cuDNN SDK (>= 7.4.1)
    

    See this for further information.
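
    To confirm that an installed TensorFlow falls inside the range given in the pip requirement above, a check along these lines may help. This is a minimal sketch and not part of DISC: the helper names `parse_release`, `in_supported_range` and `report_installed_tensorflow` are hypothetical, and pre-release suffixes such as `rc1` are ignored, so the result can differ from pip's own resolution for pre-release builds.

```python
import re

# Bounds copied from the pip requirement above: >=1.13.1,<2.0.0
MIN_VERSION = (1, 13, 1)
MAX_VERSION_EXCLUSIVE = (2, 0, 0)


def parse_release(version):
    """Return the leading numeric part of a version string as a 3-tuple.

    Missing components count as 0 and any suffix (e.g. "rc1") is ignored.
    """
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", version.strip())
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def in_supported_range(version):
    """True if the release is >= 1.13.1 and < 2.0.0."""
    return MIN_VERSION <= parse_release(version) < MAX_VERSION_EXCLUSIVE


def report_installed_tensorflow():
    """Describe whether the TensorFlow in this environment is in range."""
    try:
        import tensorflow as tf
    except ImportError:
        return "TensorFlow is not installed"
    status = "inside" if in_supported_range(tf.__version__) else "outside"
    return "TensorFlow {} is {} the range >=1.13.1,<2.0.0".format(
        tf.__version__, status
    )
```

    Calling `print(report_installed_tensorflow())` in the environment where DISC will run gives a one-line summary. The check only looks at the version number; it does not verify the GPU driver, CUDA or cuDNN requirements listed above.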

  • Install DISC with pip

    To install with pip, run the following from a terminal:

    pip install disc
    
  • Install DISC from GitHub

    To clone the repository and install manually, run the following from a terminal:

    git clone https://github.com/iyhaoo/DISC.git
    
    cd DISC
    
    python setup.py install
    

Usage

  • Quick Start

    (1). How to run DISC:

    disc \
    --dataset=matrix.loom \
    --out-dir=out_dir
    

    Here, matrix.loom is a raw count matrix in loom format with genes in rows and cells in columns, and out_dir is the path of the output folder.
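
    When DISC is launched from a Python script (for example, to process several matrices in turn), the same command can be assembled programmatically. The following is a minimal sketch that uses only the two options shown above; the function names `build_disc_command` and `run_disc` are hypothetical, and the sketch assumes the `disc` executable is on the PATH.

```python
import subprocess


def build_disc_command(dataset, out_dir):
    """Assemble the argument list for the `disc` command shown above.

    Only --dataset and --out-dir are used, as in the Quick Start example.
    """
    return [
        "disc",
        "--dataset={}".format(dataset),
        "--out-dir={}".format(out_dir),
    ]


def run_disc(dataset, out_dir):
    """Run DISC; raises subprocess.CalledProcessError on a non-zero exit."""
    subprocess.run(build_disc_command(dataset, out_dir), check=True)
```

    For example, `run_disc("matrix.loom", "out_dir")` is equivalent to the shell command above. Passing the arguments as a list avoids shell quoting problems when paths contain spaces.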

    (2). What DISC outputs:

    • log.tsv: records DISC training information.
    • summary.pdf: shows the fitting line and the optimal point; it is updated in real time while DISC is running.
    • summary.tsv: records the raw data shown in summary.pdf.
    • result: the imputation result folder, which contains:
      • imputation.loom: the imputed matrix with genes in rows and cells in columns.
      • feature.loom: the feature matrix with features in rows and cells in columns.
      • running_info.hdf5: an HDF5 file containing useful information about matrix.loom (e.g. library size, the expressed counts and cells for each gene, imputed genes).
    • models: at every save interval, DISC freezes its parameters into this folder (in pb format).
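
    After a run, it can be useful to check that the output folder contains everything listed above. The sketch below is a minimal example, not part of DISC: the helper name `missing_outputs` is hypothetical, and it assumes that the files and the result and models folders sit directly under out_dir with exactly the names given in the list.

```python
from pathlib import Path

# Relative paths taken from the output list above (layout is an assumption).
EXPECTED_DIRS = ["result", "models"]
EXPECTED_FILES = [
    "log.tsv",
    "summary.pdf",
    "summary.tsv",
    "result/imputation.loom",
    "result/feature.loom",
    "result/running_info.hdf5",
]


def missing_outputs(out_dir):
    """Return the expected folders and files that are absent from out_dir."""
    root = Path(out_dir)
    missing = [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return missing
```

    An empty list from `missing_outputs("out_dir")` means every listed item is present. The check looks only at names; it does not open the loom or HDF5 files or inspect the contents of the models folder.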
  • Data availability

    The sources of our data are listed here.

    • MELANOMA :
      8,640 cells from the melanoma WM989 cell line were sequenced using Drop-seq, where 32,287 genes were detected (scRNA-seq). In addition, an RNA FISH experiment across 7,000-88,000 cells from the same cell line was conducted, in which 26 genes were detected (FISH).
    • SSCORTEX :
      The somatosensory cortex of CD-1 mice at ages p28 and p29 was profiled by 10X, where 7,477 cells were detected (scRNA-seq). In addition, an osmFISH experiment on 4,839 cells from the somatosensory cortex, hippocampus and ventricle of a CD-1 mouse at age p22 was conducted, in which 33 genes were detected (FISH).
    • CBMC :
      Cord blood mononuclear cells were profiled by CITE-seq, where 8,005 human cells were detected in total (scRNA-seq).
    • PBMC :
      2,700 freeze-thaw peripheral blood mononuclear cells (PBMC) from a healthy donor were profiled by 10X, where 32,738 genes were detected (scRNA-seq).
    • JURKAT_293T :
      3,258 Jurkat cells (scRNA-seq) and 2,885 293T cells (scRNA-seq) were profiled by 10X separately. This dataset has bulk RNA-seq data (bulk RNA-seq).
    • 10X_5CL :
      5,001 cells from 5 human lung adenocarcinoma cell lines H2228, H1975, A549, H838 and HCC827 were profiled by 10X (scRNA-seq). This dataset has bulk RNA-seq data (bulk RNA-seq).
    • BONE_MARROW :
      6,941 human bone marrow cells from sample MantonBM6 were profiled by 10X. The original single-cell RNA sequencing data provided by HCA were aligned to hg19, and 6,939 cells remained after cell filtering (scRNA-seq). This dataset has bulk RNA-seq data (bulk RNA-seq).
    • RETINA :
      Retinas of mice at age p14 were profiled in 7 different replicates by Drop-seq, where 6,600, 9,000, 6,120, 7,650, 7,650, 8,280, and 4,000 (49,300 in total) STAMPs (single-cell transcriptomes attached to micro-particles) were collected (scRNA-seq). The dataset has cell annotation.
    • BRAIN_SPLiT :
      156,049 mouse nuclei from the developing brain and spinal cord of mice at age p2 or p11 were profiled by SPLiT-seq (scRNA-seq). The cell annotation of this dataset is included in the file GSM3017261_150000_CNS_nuclei.mat.gz on the same GEO page.
    • BRAIN_1.3M :
      1,306,127 cells from combined cortex, hippocampus, and subventricular zone of 2 E18 C57BL/6 mice were profiled by 10X (scRNA-seq).

    We provide our pre-processed data here.

    Dataset        Raw Data          DS Data     FISH Data   Bulk Data   Cell Type Annotation
    -----------    --------------    --------    ---------   ---------   --------------------
    MELANOMA       YES               0.5         YES         NO          NO
    SSCORTEX       YES               0.5         YES         NO          NO
    CBMC           YES               0.5         NO          NO          NO
    PBMC           YES               0.3, 0.5    NO          NO          YES
    JURKAT_293T    YES               NO          NO          YES         NO
    10X_5CL        YES               NO          NO          YES         NO
    BONE_MARROW    YES               NO          NO          YES         YES
    RETINA         YES               0.3, 0.5    NO          NO          YES
    BRAIN_SPLiT    YES               0.3, 0.5    NO          NO          YES
    BRAIN_1.3M     NO (Too large)    NO          NO          NO          Clustering Result

  • Evaluations

References

Yao He#, Hao Yuan#, Cheng Wu#, Zhi Xie*. DISC: a highly scalable and accurate inference of gene expression and structure for single-cell transcriptomes using semi-supervised deep learning. Genome Biology 21, 170 (2020). https://doi.org/10.1186/s13059-020-02083-3

History

1.1 (2020-06-06)

  • Update CLI.

1.0 (2019-12-16)

  • First release on PyPI.