DAISM-XMBD

Highly accurate cell type proportion estimation with in silico data augmentation and deep neural networks

Primary language: Python. License: MIT.

DAISM-DNN

 

We propose data augmentation through in silico mixing with deep neural networks (DAISM-DNN) to achieve highly accurate and unbiased immune-cell proportion estimation from bulk RNA sequencing (RNA-seq) data. Our method tackles the batch effect problem by creating a data-specific training dataset from a small subset of calibration samples with ground truth cell proportions. This dataset is further augmented with publicly available RNA-seq data from purified cells, single-cell RNA-seq (scRNA-seq) data, or CITE-seq data.

Installation

DAISM-DNN runs on Linux, Windows and macOS with Python 3.

Pip

To install DAISM-DNN via pip, run the following command:

pip install daism

All package dependencies should be handled automatically when installing with pip. The dependencies are:

python (v3.7.7)
torch (v1.5.1)
pandas (v1.2.4)
numpy (v1.18.1)
scikit-learn (v0.24.2)
argh (v0.26.2) 
anndata (v0.7.6)
scanpy (v1.8.1)
tqdm (v4.46.0)

Docker

We provide a docker image with DAISM-DNN installed: https://hub.docker.com/r/zoelin1130/daism

Pull the docker image:

docker pull zoelin1130/daism:latest

Create a container (GPU):

docker run --gpus all -i -t --name run_daism -v example/:/workspace/example/ zoelin1130/daism:latest /bin/bash

Create a container (CPU):

docker run -i -t --name run_daism -v example/:/workspace/example/ zoelin1130/daism:latest /bin/bash

run_daism is your container name. We strongly recommend using the -v option to mount your data and scripts: mount the local directory example (on your machine) to /workspace/example/ (in the container) instead of copying the files into the container directly. If Docker rejects the relative path example/, pass the absolute path to the directory instead.
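When scripting the container setup, it can be convenient to resolve the mount path to an absolute path first. The sketch below assembles the docker run arguments shown above, assuming a Linux or macOS host. The helper name build_docker_command is hypothetical and is not part of DAISM-DNN.

```python
from pathlib import Path
import shlex


def build_docker_command(local_dir, use_gpu=False, name="run_daism",
                         image="zoelin1130/daism:latest"):
    """Assemble the `docker run` arguments with an absolute host mount path.

    Hypothetical helper for illustration; it only builds the argument list
    and does not start a container.
    """
    host_path = Path(local_dir).resolve()  # turn e.g. "example" into an absolute path
    cmd = ["docker", "run"]
    if use_gpu:
        cmd += ["--gpus", "all"]  # same flag as the GPU command above
    cmd += ["-i", "-t", "--name", name,
            "-v", f"{host_path}:/workspace/example/",
            image, "/bin/bash"]
    return cmd


if __name__ == "__main__":
    # Print a shell-ready command line for the GPU variant.
    print(" ".join(shlex.quote(arg) for arg in build_docker_command("example", use_gpu=True)))
```

The returned list can be passed to subprocess.run if you prefer to launch the container from Python.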

Cell Types Supported

The example we provide contains the following cell types. The purified dataset for data augmentation can be downloaded from: https://doi.org/10.5281/zenodo.6481157

pbmc8k.h5ad contains 5 cell types: B.cells, CD4.T.cells, CD8.T.cells, monocytic.lineage, NK.cells.

pbmc8k_fine.h5ad contains 11 cell types: naive.B.cells, memory.B.cells, naive.CD4.T.cells, memory.CD4.T.cells, naive.CD8.T.cells, memory.CD8.T.cells, regulatory.T.cells, monocytes, macrophages, myeloid.dendritic.cells, NK.cells.

RNA_TPM_coarse.h5ad contains 5 cell types: B.cells, CD4.T.cells, CD8.T.cells, monocytic.lineage, NK.cells.

Note: each cell type must be named using the format above.
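A quick check of your labels can catch naming mismatches before running a simulation. The sketch below compares a list of labels against the coarse cell type names listed above for the example datasets. It assumes the labels are already available as plain strings, and the function name unexpected_labels is hypothetical.

```python
# Coarse cell type names copied from the list above.
COARSE_CELL_TYPES = {
    "B.cells", "CD4.T.cells", "CD8.T.cells", "monocytic.lineage", "NK.cells",
}


def unexpected_labels(labels, allowed=COARSE_CELL_TYPES):
    """Return the sorted labels that do not exactly match an allowed name."""
    return sorted(set(labels) - set(allowed))


# "CD4 T cells" uses spaces instead of dots, so it is reported.
print(unexpected_labels(["B.cells", "CD4 T cells", "NK.cells"]))
# prints ['CD4 T cells']
```

For other cell types, pass your own set of names as the allowed argument.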

DAISM-DNN can predict any cell type, as long as calibration samples with ground truth and purified expression profiles of the corresponding cell types are provided.

Usage

In the examples below, the working directory is set to daism. Use -h to print help information on the DAISM-DNN modules.

daism -h

DAISM-DNN consists of four modules:

DAISM modules:

daism DAISM -platform S -caliexp ../example/caliexp.txt -califra ../example/califra.txt -aug ../example/pbmc8k.h5ad -N 16000 -testexp ../example/testexp.txt -outdir ./

DAISM is a one-stop mode for running DAISM-DNN that integrates simulation, training and prediction in one module. Example: we use pbmc8k.h5ad, a single-cell RNA-seq dataset placed under the example directory, as the purified samples for data augmentation, so we set the platform parameter to S. The calibration data is the RNA-seq expression profile caliexp.txt.
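If you want to launch the one-stop mode from a Python script, for example to loop over several test sets, the command line can be assembled as an argument list. The sketch below uses only the flags shown in the command above. The helper name build_daism_command is hypothetical and is not part of the package.

```python
import shlex


def build_daism_command(platform, caliexp, califra, aug, n_mixtures, testexp, outdir):
    """Assemble the argument list for the one-stop `daism DAISM` command.

    Hypothetical helper: it mirrors the flags from the example command
    and does not run anything by itself.
    """
    return [
        "daism", "DAISM",
        "-platform", platform,
        "-caliexp", caliexp,
        "-califra", califra,
        "-aug", aug,
        "-N", str(n_mixtures),
        "-testexp", testexp,
        "-outdir", outdir,
    ]


cmd = build_daism_command(
    platform="S",
    caliexp="../example/caliexp.txt",
    califra="../example/califra.txt",
    aug="../example/pbmc8k.h5ad",
    n_mixtures=16000,
    testexp="../example/testexp.txt",
    outdir="./",
)
print(" ".join(shlex.quote(arg) for arg in cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True).
```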

simulation modules:

There are two modules for simulating a training set. One is DAISM_simulation, which uses the DAISM strategy to generate mixtures.

daism DAISM_simulation -platform S -caliexp ../example/caliexp.txt -califra ../example/califra.txt -aug ../example/pbmc8k.h5ad -N 16000 -testexp ../example/testexp.txt -outdir ./
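To illustrate the idea of mixing a calibration sample with purified profiles, here is a minimal sketch. It assumes that the calibration sample contributes a random share f of the mixture, that the remaining 1 - f is split at random across the purified cell types, and that the ground truth of the mixture is the matching weighted combination. This is an assumed formulation for illustration, not the package's actual implementation, and all names are hypothetical.

```python
import numpy as np


def mix_with_calibration(cali_expr, cali_frac, purified, rng):
    """Create one synthetic mixture (illustrative sketch only).

    cali_expr: (n_genes,) expression of one calibration sample
    cali_frac: (n_celltypes,) ground truth fractions of that sample
    purified:  (n_celltypes, n_genes) one expression profile per cell type
    """
    f = rng.uniform(0.0, 1.0)                    # assumed share of the calibration sample
    w = rng.dirichlet(np.ones(len(cali_frac)))   # random split among purified cell types
    aug_frac = (1.0 - f) * w                     # fractions contributed by purified cells
    mix_expr = f * cali_expr + aug_frac @ purified
    mix_frac = f * cali_frac + aug_frac          # sums to 1 when cali_frac sums to 1
    return mix_expr, mix_frac


rng = np.random.default_rng(0)
cali_expr = np.array([10.0, 0.0, 5.0])
cali_frac = np.array([0.6, 0.4])
purified = np.array([[20.0, 0.0, 0.0],
                     [0.0, 0.0, 12.0]])
expr, frac = mix_with_calibration(cali_expr, cali_frac, purified, rng)
print(expr, frac)
```

Repeating the call N times with different calibration samples would yield a training set of N mixtures with known fractions.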

The other is Generic_simulation, which generates the training set using only purified cells.

daism Generic_simulation -platform S -aug ../example/pbmc8k.h5ad -N 16000 -testexp ../example/testexp.txt -outdir ./
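As a rough picture of a purified-only simulation with single-cell data, the sketch below draws random target fractions, samples that many cells of each type with replacement, and sums their expression. The sampling scheme and all names are assumptions made for illustration and may differ from what the package does.

```python
import numpy as np


def mix_purified_cells(cell_expr, cell_labels, cell_types, n_cells, rng):
    """Build one pseudo-bulk mixture from single cells (illustrative sketch only).

    cell_expr:   (n_total_cells, n_genes) expression matrix
    cell_labels: (n_total_cells,) cell type label of each cell
    cell_types:  cell type names to mix; each must occur in cell_labels
    """
    cell_labels = np.asarray(cell_labels)
    target = rng.dirichlet(np.ones(len(cell_types)))  # random target fractions
    counts = rng.multinomial(n_cells, target)         # number of cells drawn per type
    mix = np.zeros(cell_expr.shape[1])
    for cell_type, k in zip(cell_types, counts):
        idx = np.flatnonzero(cell_labels == cell_type)
        picked = rng.choice(idx, size=k, replace=True)  # sample with replacement
        mix += cell_expr[picked].sum(axis=0)
    return mix, counts / n_cells                      # realized fractions as ground truth


rng = np.random.default_rng(1)
labels = ["B.cells", "B.cells", "NK.cells", "NK.cells"]
expr = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
mix, frac = mix_purified_cells(expr, labels, ["B.cells", "NK.cells"], 100, rng)
print(mix, frac)
```

In this toy input each cell expresses one unit of a single marker gene, so the summed expression divided by the number of cells equals the returned fractions.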

training modules:

# If you use DAISM_simulation mode:
daism training -trainexp ./output/DAISM_mixsam.txt -trainfra ./output/DAISM_mixfra.txt -outdir ./

# If you use Generic_simulation mode:
daism training -trainexp ./output/Generic_mixsam.txt -trainfra ./output/Generic_mixfra.txt -outdir ./

We use the DAISM-generated mixtures DAISM_mixsam.txt and corresponding artificial cell fractions DAISM_mixfra.txt to train the neural networks.

prediction modules:

daism prediction -testexp ../example/testexp.txt -model ./output/DAISM_model.pkl -celltype ./output/DAISM_model_celltypes.txt -feature ./output/DAISM_model_feature.txt -outdir ./

Both the result file and the intermediate files will be saved in the output folder.

Citation

Lin Y, Li H, Xiao X, et al. DAISM-DNNXMBD: Highly accurate cell type proportion estimation with in silico data augmentation and deep neural networks. Patterns (2022) https://doi.org/10.1016/j.patter.2022.100440