DISC

A highly scalable and accurate inference of gene expression and structure for single-cell transcriptomes using semi-supervised deep learning.

Free software: Apache License 2.0

Requirements

Python >=3.6
TensorFlow >=1.13.1,<2.0.0
numpy >=1.14.0
pandas >=0.21.0
h5py >=2.9.0
matplotlib >=3.0.0
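
Once the packages are installed, a short script can confirm that the environment matches the version ranges listed above. The following is a minimal sketch, not part of DISC: the helper names (`parse_version`, `satisfies`, `check_requirements`) are hypothetical, and it assumes each package exposes a `__version__` attribute.

```python
import importlib

# Minimum (inclusive) and maximum (exclusive) versions from the list above.
# Keys are import names; None means no upper bound.
REQUIREMENTS = {
    "tensorflow": ("1.13.1", "2.0.0"),
    "numpy": ("1.14.0", None),
    "pandas": ("0.21.0", None),
    "h5py": ("2.9.0", None),
    "matplotlib": ("3.0.0", None),
}


def parse_version(text):
    """Turn '1.13.1' or '1.13.1rc2' into a tuple of three ints, e.g. (1, 13, 1)."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def satisfies(version, minimum, maximum=None):
    """Return True if minimum <= version < maximum (when a maximum is given)."""
    parsed = parse_version(version)
    if parsed < parse_version(minimum):
        return False
    return maximum is None or parsed < parse_version(maximum)


def check_requirements(requirements=REQUIREMENTS):
    """Map each module name to 'ok', 'missing', or 'found <version>'."""
    report = {}
    for name, (minimum, maximum) in requirements.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = "missing"
            continue
        version = getattr(module, "__version__", "0")
        if satisfies(version, minimum, maximum):
            report[name] = "ok"
        else:
            report[name] = "found " + version
    return report
```

Calling `check_requirements()` returns a dictionary such as `{"numpy": "ok", ...}`; any entry other than `"ok"` points to a package to install or pin.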

Installation

Install TensorFlow

If you have an NVIDIA GPU, first install a GPU-enabled version of TensorFlow; DISC runs much faster on a GPU:

pip install "tensorflow-gpu>= 1.13.1,<2.0.0"

We typically use tensorflow-gpu==1.13.1.

The GPU version of TensorFlow has the following requirements:

* Hardware
    * NVIDIA GPU card with CUDA Compute Capability 3.5 or higher.
* Software
    * NVIDIA GPU drivers - CUDA 10.0 requires 410.x or higher.
    * CUDA Toolkit - TensorFlow supports CUDA 10.0 (TensorFlow >= 1.13.0)
    * CUPTI ships with the CUDA Toolkit.
    * cuDNN SDK (>= 7.4.1)

See the TensorFlow GPU support guide for further information.

Install DISC with pip

To install with pip, run the following from a terminal:
```
pip install disc
```
Install DISC from GitHub

To clone the repository and install manually, run the following from a terminal:
```
git clone https://github.com/iyhaoo/DISC.git

cd DISC

python setup.py install
```

Usage

Quick Start

(1). How to run DISC:
```
disc \
--dataset=matrix.loom \
--out-dir=out_dir
```
Here, matrix.loom is a loom-formatted raw count matrix with genes in rows and cells in columns, and out_dir is the path to the output folder.
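
The same call can be issued from a Python script, for example when several matrices are processed in a loop. This is a minimal sketch that only wraps the command shown above with the standard library; `build_disc_command` and `run_disc` are hypothetical helper names, and only the `--dataset` and `--out-dir` options from the quick start are used.

```python
import shutil
import subprocess


def build_disc_command(dataset, out_dir):
    """Return the argument list for the quick-start call shown above."""
    return ["disc", "--dataset={}".format(dataset), "--out-dir={}".format(out_dir)]


def run_disc(dataset, out_dir):
    """Run DISC; raises CalledProcessError if it exits with a non-zero status."""
    if shutil.which("disc") is None:
        raise FileNotFoundError("the 'disc' executable is not on PATH; is DISC installed?")
    return subprocess.run(build_disc_command(dataset, out_dir), check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.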

(2). What DISC outputs:
- log.tsv: records DISC training information.
- summary.pdf: shows the fitting line and optimal point and will be updated in real time when DISC is running.
- summary.tsv: records the raw data in summary.pdf.
- result: imputation result folder, which contains:
  - imputation.loom: the imputed matrix with genes in rows and cells in columns.
  - feature.loom: the feature matrix with features in rows and cells in columns.
  - running_info.hdf5: an HDF5-formatted file containing useful information about matrix.loom (e.g. library size, the expressed counts and cells for each gene, imputed genes).
- models: For every save interval, DISC freezes its parameters into this folder (in pb format).
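
After a run finishes, it can be handy to verify that the output folder has the layout described above before starting downstream analysis. The sketch below is an illustration rather than part of DISC: `missing_outputs` is a hypothetical helper, and the paths are taken directly from the list above.

```python
from pathlib import Path

# Relative paths taken from the output list above.
EXPECTED_DIRS = ["result", "models"]
EXPECTED_FILES = [
    "log.tsv",
    "summary.pdf",
    "summary.tsv",
    "result/imputation.loom",
    "result/feature.loom",
    "result/running_info.hdf5",
]


def missing_outputs(out_dir):
    """Return the expected folders and files that are absent from out_dir."""
    root = Path(out_dir)
    missing = [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return missing
```

An empty list means every documented output is present; otherwise the returned names show what is still missing, for instance because training has not yet reached a save interval.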

Data availability

The sources of our data are listed here.

MELANOMA :

8,640 cells from the melanoma WM989 cell line were sequenced using Drop-seq, where 32,287 genes were detected (scRNA-seq). In addition, an RNA FISH experiment across 7,000-88,000 cells from the same cell line was conducted, in which 26 genes were detected (FISH).

SSCORTEX :

The somatosensory cortex of CD-1 mice at ages p28 and p29 was profiled by 10X, where 7,477 cells were detected (scRNA-seq). In addition, an osmFISH experiment on 4,839 cells from the somatosensory cortex, hippocampus and ventricle of a CD-1 mouse at age p22 was conducted, in which 33 genes were detected (FISH).

CBMC :

Cord blood mononuclear cells were profiled by CITE-seq, where 8,005 human cells were detected in total (scRNA-seq).

PBMC :

2,700 freeze-thaw peripheral blood mononuclear cells (PBMC) from a healthy donor were profiled by 10X, where 32,738 genes were detected (scRNA-seq).

JURKAT_293T :

3,258 Jurkat cells (scRNA-seq) and 2,885 293T cells (scRNA-seq) were profiled by 10X separately. This dataset has bulk RNA-seq data (bulk RNA-seq).

10X_5CL :

5,001 cells from 5 human lung adenocarcinoma cell lines (H2228, H1975, A549, H838 and HCC827) were profiled by 10X (scRNA-seq). This dataset has bulk RNA-seq data (bulk RNA-seq).

BONE_MARROW :

6,941 human bone marrow cells from sample MantonBM6 were profiled by 10X. The original single-cell RNA sequencing data provided by HCA was aligned to hg19, and 6,939 cells remained after cell filtering (scRNA-seq). This dataset has bulk RNA-seq data (bulk RNA-seq).

RETINA :

Retinas of mice at age p14 were profiled in 7 replicates by Drop-seq, where 6,600, 9,000, 6,120, 7,650, 7,650, 8,280, and 4,000 (49,300 in total) STAMPs (single-cell transcriptomes attached to micro-particles) were collected (scRNA-seq). The dataset has cell annotation.

BRAIN_SPLiT :

156,049 nuclei from the developing brain and spinal cord of mice at age p2 or p11 were profiled by SPLiT-seq (scRNA-seq). The cell annotation of this dataset is included in the file GSM3017261_150000_CNS_nuclei.mat.gz on the same GEO page.

BRAIN_1.3M :

1,306,127 cells from combined cortex, hippocampus, and subventricular zone of 2 E18 C57BL/6 mice were profiled by 10X (scRNA-seq).

We provide our pre-processed data here.

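| Dataset | Raw Data | DS Data | FISH Data | Bulk Data | Cell Type Annotation |
| --- | --- | --- | --- | --- | --- |
| MELANOMA | YES | 0.5 | YES | NO | NO |
| SSCORTEX | YES | 0.5 | YES | NO | NO |
| CBMC | YES | 0.5 | NO | NO | NO |
| PBMC | YES | 0.3, 0.5 | NO | NO | YES |
| JURKAT_293T | YES | NO | NO | YES | NO |
| 10X_5CL | YES | NO | NO | YES | NO |
| BONE_MARROW | YES | NO | NO | YES | YES |
| RETINA | YES | 0.3, 0.5 | NO | NO | YES |
| BRAIN_SPLiT | YES | 0.3, 0.5 | NO | NO | YES |
| BRAIN_1.3M | NO (Too large) | NO | NO | NO | Clustering Result |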
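
The DS Data column lists fractions (0.3, 0.5), which we read as down-sampling fractions applied to the raw counts, in line with the down-sampling evaluations listed under Evaluations. One common way to produce such data is binomial thinning, where each counted molecule is kept independently with the given probability. The sketch below illustrates that idea only; it is an assumption about the general technique, not the pre-processing code used for the provided files, and `downsample_counts` is a hypothetical name.

```python
import random


def downsample_counts(matrix, fraction, seed=0):
    """Binomial thinning: keep each counted molecule with probability `fraction`.

    `matrix` is a list of rows (genes), each a list of non-negative integer
    counts (cells). A new matrix of the same shape is returned.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    return [
        [sum(rng.random() < fraction for _ in range(count)) for count in row]
        for row in matrix
    ]
```

This pure-Python version is meant for small examples; for real count matrices, a vectorized call such as `numpy.random.binomial(counts, fraction)` performs the same thinning far more efficiently.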

Evaluations
- Data Preparation, Imputation and Computational Resource Evaluation
  
  (1). Data Pre-processing
  
  MELANOMA, SSCORTEX, PBMC, CBMC, JURKAT_293T, 10X_5CL, BONE_MARROW, RETINA, BRAIN_SPLiT, BRAIN_1.3M
  
  (2). Imputation
  
  (3). Computational Resource Evaluation (Results, Test Program)
- Data Structure Recovery Evaluation
  (1). Gene Expression Structures (FISH)
  - Tutorial : MELANOMA
  (2). Gene and Cell Structures (Down-sampling)
  - Tutorial : MELANOMA
  (S1). Spearman Correlation (Bulk)
  - Tutorial : JURKAT_293T
  (S2). Identification of True Zeros (Down-sampling)
  - Tutorial : MELANOMA, SSCORTEX, CBMC and PBMC
- Downstream Analysis Improvement
  (1). Cell Type Identification (Down-sampling)
  - Tutorial : PBMC
  (2). DEG Identification (Bulk)
  - Tutorial : JURKAT_293T
  (3). Solution for Large Dataset Analysis
  - Tutorial : PBMC
  (S1). Trajectory Analysis
  - Tutorial : BONE_MARROW
- Other Utility Scripts

| Script | Output |
| --- | --- |
| Violin Plot | PBMC, RETINA |

References

Yao He^#, Hao Yuan^#, Cheng Wu^#, Zhi Xie^*. DISC: a highly scalable and accurate inference of gene expression and structure for single-cell transcriptomes using semi-supervised deep learning. Genome Biology 21, 170 (2020). https://doi.org/10.1186/s13059-020-02083-3

History

1.1 (2020-06-06)

Update CLI.

1.0 (2019-12-16)

First release on PyPI.
