We propose data augmentation through in silico mixing with deep neural networks (DAISM-DNN) to achieve highly accurate and unbiased immune-cell proportion estimation from bulk RNA sequencing (RNA-seq) data. Our method tackles the batch effect problem by creating a data-specific training dataset from a small subset of calibration samples with ground-truth cell proportions, which is further augmented with publicly available RNA-seq data from purified cells, single-cell RNA-seq (scRNA-seq) data, or CITE-seq data.
DAISM-DNN runs on all major operating systems (Linux, Windows and macOS) with Python 3.
To install DAISM-DNN via pip, run the following command:
pip install daism
pip should resolve all package dependencies automatically. DAISM-DNN depends on the following packages (versions as listed):
python (v3.7.7)
torch (v1.5.1)
pandas (v1.2.4)
numpy (v1.18.1)
scikit-learn (v0.24.2)
argh (v0.26.2)
anndata (v0.7.6)
scanpy (v1.8.1)
tqdm (v4.46.0)
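To compare an existing environment against the versions listed above, a small check such as the following may help. This is a minimal sketch, not part of DAISM-DNN: the helper name `installed_versions` and the `LISTED` dictionary are illustrative, and `importlib.metadata` requires Python 3.8 or newer.

```python
from importlib import metadata  # standard library, Python 3.8+

# Versions copied from the dependency list above.
LISTED = {
    "torch": "1.5.1",
    "pandas": "1.2.4",
    "numpy": "1.18.1",
    "scikit-learn": "0.24.2",
    "argh": "0.26.2",
    "anndata": "0.7.6",
    "scanpy": "1.8.1",
    "tqdm": "4.46.0",
}


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, found in installed_versions(LISTED).items():
        status = "missing" if found is None else found
        print(f"{name}: listed {LISTED[name]}, installed {status}")
```

A differing version does not necessarily mean the package will fail; the listing is only a reference point.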
We provide a docker image with DAISM-DNN installed: https://hub.docker.com/r/zoelin1130/daism
Pull the docker image:
docker pull zoelin1130/daism:latest
Create a container (GPU):
docker run --gpus all -i -t --name run_daism -v example/:/workspace/example/ zoelin1130/daism:latest /bin/bash
Create a container (CPU):
docker run -i -t --name run_daism -v example/:/workspace/example/ zoelin1130/daism:latest /bin/bash
Here run_daism is the container name. We strongly recommend using the -v option to mount your data and scripts: it mounts the local directory example (on your machine) at /workspace/example/ (in the container), which avoids copying the files into the container.
The purified datasets used for data augmentation in our example can be downloaded from https://doi.org/10.5281/zenodo.6481157. They contain the following cell types:
pbmc8k.h5ad contains 5 cell types: B.cells, CD4.T.cells, CD8.T.cells, monocytic.lineage, NK.cells.
pbmc8k_fine.h5ad contains 11 cell types: naive.B.cells, memory.B.cells, naive.CD4.T.cells, memory.CD4.T.cells, naive.CD8.T.cells, memory.CD8.T.cells, regulatory.T.cells, monocytes, macrophages, myeloid.dendritic.cells, NK.cells.
RNA_TPM_coarse.h5ad contains 5 cell types: B.cells, CD4.T.cells, CD8.T.cells, monocytic.lineage, NK.cells.
Note: each cell type must be named exactly as in the lists above.
DAISM-DNN can predict any cell type, as long as calibration samples with ground truth and purified expression profiles of the corresponding cell types are provided.
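Because the names must match exactly, it can be worth comparing the cell-type labels in your calibration fractions against those in the purified dataset before running anything. The sketch below works on plain lists of names; the function `unmatched_cell_types` is hypothetical and not part of DAISM-DNN, and how you extract the names from your own files is left open.

```python
# Coarse cell-type names as listed above for pbmc8k.h5ad.
COARSE_TYPES = {
    "B.cells",
    "CD4.T.cells",
    "CD8.T.cells",
    "monocytic.lineage",
    "NK.cells",
}


def unmatched_cell_types(calibration_types, purified_types):
    """Return calibration cell types that have no exact match in the purified data."""
    return sorted(set(calibration_types) - set(purified_types))


# "NK cells" (with a space) does not match "NK.cells" and is reported.
print(unmatched_cell_types(["B.cells", "NK cells"], COARSE_TYPES))
```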
In our example below, we set working directory to daism. Use -h to print out help information on DAISM-DNN modules.
daism -h
DAISM-DNN consists of four modules: DAISM, simulation (DAISM_simulation and Generic_simulation), training and prediction.
daism DAISM -platform S -caliexp ../example/caliexp.txt -califra ../example/califra.txt -aug ../example/pbmc8k.h5ad -N 16000 -testexp ../example/testexp.txt -outdir ./
DAISM is a one-stop mode that runs DAISM-DNN end to end, integrating simulation, training and prediction in one module.
Example: we use pbmc8k.h5ad, a single-cell RNA-seq dataset, as the purified samples for data augmentation and place it under the example directory, so we set the platform parameter to S. The calibration data is an RNA-seq expression profile, caliexp.txt.
There are two modules for simulating a training set. One is DAISM_simulation, which uses the DAISM strategy to generate mixtures.
daism DAISM_simulation -platform S -caliexp ../example/caliexp.txt -califra ../example/califra.txt -aug ../example/pbmc8k.h5ad -N 16000 -testexp ../example/testexp.txt -outdir ./
The other is Generic_simulation, which generates the training set from purified cells only.
daism Generic_simulation -platform S -aug ../example/pbmc8k.h5ad -N 16000 -testexp ../example/testexp.txt -outdir ./
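The difference between the two simulation strategies can be pictured with a toy example. The sketch below is an illustration under stated assumptions, not the package's implementation: the function names are hypothetical, each purified cell type is reduced to a single expression vector, fractions are drawn uniformly from the simplex, and the DAISM-style mixture is assumed to be a weighted blend of one calibration sample with a purified-only mixture. The real modules may sample, aggregate and normalise differently.

```python
import random


def random_fractions(n, rng):
    """Draw n non-negative fractions that sum to 1 (normalised exponentials)."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def generic_mix(purified, rng):
    """Mix purified profiles only, in the spirit of Generic_simulation."""
    cell_types = sorted(purified)
    fractions = dict(zip(cell_types, random_fractions(len(cell_types), rng)))
    n_genes = len(purified[cell_types[0]])
    expr = [
        sum(fractions[c] * purified[c][g] for c in cell_types)
        for g in range(n_genes)
    ]
    return expr, fractions


def daism_mix(cali_expr, cali_fra, purified, rng):
    """Blend a calibration sample with a purified-only mixture (assumed scheme)."""
    w = rng.random()  # weight given to the real calibration sample
    aug_expr, aug_fra = generic_mix(purified, rng)
    expr = [w * c + (1 - w) * a for c, a in zip(cali_expr, aug_expr)]
    # The known fractions are blended with the same weight as the expression.
    fractions = {c: w * cali_fra[c] + (1 - w) * aug_fra[c] for c in aug_fra}
    return expr, fractions


rng = random.Random(0)
purified = {"B.cells": [10.0, 0.0, 1.0], "NK.cells": [0.0, 8.0, 1.0]}
cali_expr = [5.0, 4.0, 1.0]
cali_fra = {"B.cells": 0.5, "NK.cells": 0.5}

mix_expr, mix_fra = daism_mix(cali_expr, cali_fra, purified, rng)
print(mix_expr, mix_fra)
```

In this picture, every DAISM-style training sample carries part of a real calibration sample, which is how the training set inherits the characteristics of the target data, whereas the generic mixtures depend on the purified profiles alone.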
# If you use DAISM_simulation mode:
daism training -trainexp ./output/DAISM_mixsam.txt -trainfra ./output/DAISM_mixfra.txt -outdir ./
# If you use Generic_simulation mode:
daism training -trainexp ./output/Generic_mixsam.txt -trainfra ./output/Generic_mixfra.txt -outdir ./
We use the DAISM-generated mixtures DAISM_mixsam.txt and the corresponding artificial cell fractions DAISM_mixfra.txt to train the neural networks.
daism prediction -testexp ../example/testexp.txt -model ./output/DAISM_model.pkl -celltype ./output/DAISM_model_celltypes.txt -feature ./output/DAISM_model_feature.txt -outdir ./
Both the result file and the intermediate files are saved in the output folder.
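When the simulation, training and prediction modules are run one after another, the three commands can be assembled from a script. The sketch below only builds and prints the command lines, using the flags and paths from the examples in this README; the helper `daism_command` is hypothetical and not part of the package, and the call that would actually execute the commands is left commented out.

```python
import shlex
import subprocess  # only needed if the run() call below is enabled


def daism_command(module, **options):
    """Build an argv list such as ['daism', 'training', '-outdir', './']."""
    argv = ["daism", module]
    for flag, value in options.items():
        argv += [f"-{flag}", str(value)]
    return argv


steps = [
    daism_command(
        "DAISM_simulation",
        platform="S",
        caliexp="../example/caliexp.txt",
        califra="../example/califra.txt",
        aug="../example/pbmc8k.h5ad",
        N=16000,
        testexp="../example/testexp.txt",
        outdir="./",
    ),
    daism_command(
        "training",
        trainexp="./output/DAISM_mixsam.txt",
        trainfra="./output/DAISM_mixfra.txt",
        outdir="./",
    ),
    daism_command(
        "prediction",
        testexp="../example/testexp.txt",
        model="./output/DAISM_model.pkl",
        celltype="./output/DAISM_model_celltypes.txt",
        feature="./output/DAISM_model_feature.txt",
        outdir="./",
    ),
]

for argv in steps:
    print(" ".join(shlex.quote(arg) for arg in argv))
    # subprocess.run(argv, check=True)  # uncomment to execute each step
```

Passing the arguments as a list avoids shell quoting problems, and `check=True` would stop the chain as soon as one step fails.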
Lin Y, Li H, Xiao X, et al. DAISM-DNNXMBD: Highly accurate cell type proportion estimation with in silico data augmentation and deep neural networks. Patterns (2022) https://doi.org/10.1016/j.patter.2022.100440