A pipeline to generate a Deep Neural Network cell type deconvolution model for bulk RNASeq samples from single cell rna-seq data. As described in Torroja and Sanchez-Cabo, 2019.
The pipeline runs in R 3.5.1 and python 3.6.
Install R with the following packages:
- splatter
- zimbwave
- SingleCellExperiment
- Matrix
- dplyr
- gtools
- ggplot2
- resahpe2
- edgeR
- pbapply
- optparse
Install Python 3 with the following packages:
- numpy
- pandas
- argparse
- matplotlib
- keras
Using conda enviroment:
conda create -n digitalDLSorter
conda activate digitalDLSorter
conda install tensorflow=1.5.0 python=3.6 numpy pandas argparse matplotlib keras
conda install r=3.5.1 bioconductor-splatter bioconductor-zinbwave bioconductor-SingleCellExperiment r-gtools r-dplyr r-ggplot2 r-reshape2 bioconductor-edgeR r-pbapply r-optparse
Each script has its own help instructions.
Rscript estimateZinbwaveParams.R --help
python ~/Documents/GitLabProjects/uploadedDLSorter/digitalDLSorter.py --help
Under the data folder there are counts, cellsMetada, genesMetada and probPriors files to test the pipeline. These data has been obtained from GSE81861 (Li et al., 2017), a collection of single cell RNA-Seq profiles from 11 CRC patients.
There is also a a set of bulk RNA-Seq profiles from TCGA colorectal samples obtained from TCGA project data portal.
countsMatrix.tsv.gz: File with single cell counts. Genes in rows and Cells in columns.
zcat data/countsMatrix.tsv.gz | cut -f1-5 | head -n6
RHC3934 | RHC3944 | RHC3962 | RHC4005 | RHC4220 |
---|---|---|---|---|
7SK | 0 | 0 | 0 | 0 |
A1BG | 0 | 0 | 0 | 0 |
A1BG-AS1 | 0 | 0 | 0 | 0 |
A1CF | 0 | 0 | 0.075 | 0 |
A2M | 0 | 0 | 0 | 0 |
cellsMetadata.tsv.gz: file with annotations for each cell (rows) in countsMatrix
zcat cellsMetadata.tsv.gz | cut -f1-4,12 | head -n6
Cell_ID | Patient | issue_type | gender | Cell_Type |
---|---|---|---|---|
RHC3934 | CRC03 | NM | Female | pB |
RHC3944 | CRC03 | NM | Female | gB |
RHC3962 | CRC03 | NM | Female | CD8Gn |
RHC4005 | CRC03 | NM | Female | gB |
RHC4220 | CRC04 | NM | Female | pB |
genesMetadata.tsv.gz: file with annotations for each gene (rows) in countsMatrix
zcat data/genesMetadata.tsv.gz | head -n 6
ensembl_gene_id | external_gene_name | gene_length |
---|---|---|
ENSG00000000003 | TSPAN6 | 2959 |
ENSG00000000005 | TNMD | 1603 |
ENSG00000000419 | DPM1 | 1197 |
ENSG00000000457 | SCYL3 | 5561 |
ENSG00000000460 | C1ORF112 | 4936 |
probPriors.txt: file with the frequency ranges expected for each cell type annotated in cellsMatadata. This information can be estimated from literature or from the single cell experiment itself. It is a priory guess to build the simulated bulk profiles.
head -n6 data/probPriors.txt
cellType | from | to |
---|---|---|
CRC | 30 | 70 |
CD4 | 0 | 15 |
CD8Gn | 0 | 15 |
CD8Gp | 0 | 15 |
Ep | 1 | 50 |
In cases where there is low numbers of cells in the experiment or there is a particular cell type which is under represented we can increase the sample size by simulating new data based on the real one. This pipeline uses the ZinbWave framework implemented in R to estimate the parameters to simulate single cell profiles (Risso et al., 2018).
We can adjust the cell type model --cellTypeColumn=Cell_Type
based on our experimental design adding covariates such as patient and gender for example --cellCovColumns=Patient,gender
. Also we can add covariates at gene level, for example the gene length --geneCovColumns=gene_length
.
We do also filter genes based on minimum expression at a minimum number of cells --minCounts=1 --minCells=1
.
conda activate digitalDLSorter
Rscript estimateZinbwaveParams.R \
-t All \
-m data/cellsMetadata.tsv.gz --cellTypeColumn=Cell_Type --cellIDColumn=Cell_ID --cellCovColumns=Patient,gender \
-g data/genesMetadata.tsv.gz --geneIDColumn=external_gene_name --geneCovColumns=gene_length \
-c data/countsMatrix.tsv.gz --minCounts=1 --minCells=1 \
-o digitalDLSorterDemo -p Demo -d 4
Be patient this may take a few hours to run. It is not very optimized. -d 4
will use 4 threads for some steps of the estimation. Adjust it depending on your resources.
Results will be stored at digitalDLSorterDemo/
folder.
Using the Demo.All.zinbParams.rds
r object file with the estimated parameters we will generate new single cell profiles.
We will simulate 100 cell profiles per cell type -t All -n 100
. In our dataset we have 10 cells types so we will end up with 1000 simulated single cell profiles.
conda activate digitalDLSorter
Rscript simSingleCellProfiles.R \
-t All -n 100 \
-z digitalDLSorterDemo/Demo.All.zinbParams.rds \
-c data/cellsMetadata.tsv.gz -q Cell_Type \
-m data/countsMatrix.tsv.gz \
-o digitalDLSorterDemo -p Demo
The script generates a counts matrix stored as a sparse matrix in this folder digitalDLSorterDemo/Demo.All.simCellsCountsMtx.1000
and a metadata file for the simulated cells in two formats, a gzip tsv file Demo.All.simCellsMetadata.1000.tsv.gz
and an R object Demo.All.simCellsMetadata.1000.rds
. This files contains both , the original single cell profiles and the simulated ones.
To train the digitalDLSorter deconvolution model we will first generate bulk samples with known proportions of cells.
First we will separate the cells into training and test set and create a list of vectors with the cell type fractions for each bulk sample.
We will split the Single Cell profiles in 65% for training and 35% for testing with -f 0.65
. The script will need the simulated metadata file in tsv or r object format -c digitalDLSorterDemo/Demo.All.simCellsMetadata.1000.rds
, the column that contains the target classification we want to train on -q Cell_Type
and its corresponding counts matrix -s digitalDLSorterDemo/Demo.All.simCellsCountsMtx.1000/matrix.mtx
from the previous step.
conda activate digitalDLSorter
Rscript generateTrainAndTestBulkProbMatrix.R \
-f 0.65 -g data/genesMetadata.tsv.gz -d data/probPriors.txt \
-c digitalDLSorterDemo/Demo.All.simCellsMetadata.1000.rds -q Cell_Type \
-s digitalDLSorterDemo/Demo.All.simCellsCountsMtx.1000/matrix.mtx \
-o digitalDLSorterDemo -p Demo
This script will generate several files separated in Train and Test sets.
Demo.trainProbsMatrix
and Demo.testProbMatrix
with the list of vectors with cell type fractions to simulate. Cell types in columns and simulated sample in rows.
Demo.trainProbMatrixNames
and Demo.testProbMatrixNames
with the names of the cells that have been sampled from the training and test pools repectively to build the bulk samples. Each sample in rows made of a 100 cells.
Demo.trainSetList
and Demo.testSetList
files containing the list of cell types with the pool of cells belonging to each cell type
Using the vectors of cell type fractions created for Train and Test and the Single Cell matrix counts, we will generate the counts profiles for the simulated Bulk samples.
We will run this step twice, one for the training set and one for the test set.
Training Set
conda activate digitalDLSorter
Rscript generateBulkSamples.R \
-m digitalDLSorterDemo/Demo.trainProbMatrixNames.rds \
-s digitalDLSorterDemo/Demo.All.simCellsCountsMtx.1000/matrix.mtx \
-o digitalDLSorterDemo -p Demo.Train -n 6
Test set
conda activate digitalDLSorter
Rscript generateBulkSamples.R \
-m digitalDLSorterDemo/Demo.testProbMatrixNames.rds \
-s digitalDLSorterDemo/Demo.All.simCellsCountsMtx.1000/matrix.mtx \
-o digitalDLSorterDemo -p Demo.Test -n 6
This step will generate the Demo.Train.simBulkCounts and
Demo.Test.simBulkCounts` matrix counts in tsv.gz and r .rds object. With genes in rows and samples in columns.
Now we will prepare the data to feed the digitalDLSorter model by combining the simulated bulk and single cell profiles, normalize, shuffle and scale them.
We will need the Demo.trainProbsMatrix
or Demo.testProbsMatrix
file with the vectors cell type fractions, the Demo.trainSetList
or Demo.testSetList
file with the list of cells selected for each set and the counts matrix for the single cell profiles Demo.All.simCellsCountsMtx.1000/matrix.mtx
and the bulk profiles Demo.Train.simBulkCounts
or Demo.Test.simBulkCounts
.
Again, we will run this step twice, one for the training set and one for the test set.
Training Set
conda activate digitalDLSorter
Rscript generateCombinedScaledShuffledSet.R \
-m digitalDLSorterDemo/Demo.trainProbsMatrix.rds -c digitalDLSorterDemo/Demo.trainSetList.rds \
-s digitalDLSorterDemo/Demo.All.simCellsCountsMtx.1000/matrix.mtx -b digitalDLSorterDemo/Demo.Train.simBulkCounts.rds \
-o digitalDLSorterDemo -p Demo.Train
Test Set
conda activate digitalDLSorter
Rscript generateCombinedScaledShuffledSet.R \
-m digitalDLSorterDemo/Demo.testProbsMatrix.rds -c digitalDLSorterDemo/Demo.testSetList.rds \
-s digitalDLSorterDemo/Demo.All.simCellsCountsMtx.1000/matrix.mtx -b digitalDLSorterDemo/Demo.Test.simBulkCounts.rds \
-o digitalDLSorterDemo -p Demo.Test
This step will generate two versions of the data, one with all combined single cell and bulk samples and counts without further transformation (combinedCounts
and combinedProbsMatrix
) and one with the samples normalized, scaled and shuffled (combinedCounts.log2CPMScaledShuffledTransposed
and combinedProbsMatrix.log2CPMScaledShuffledTransposed
). It also provides the list of genes used combinedCounts.geneList
.
With the log2CPMScaledShuffledTransposed
files of data (samples in rows and genes in columns) we will train the model.
We need to provide the number of training --num_train_samples 16056
and test samples --num_test_samples 10800
and the final number of genes used after filtering steps --num_genes 36477
.
You could use this bash lines to find out the numbers
zcat digitalDLSorterDemo/Demo.Train.combinedCounts.log2CPMScaledShuffledTransposed.tsv.gz | wc -l | awk '{print $0-1}'
zcat digitalDLSorterDemo/Demo.Test.combinedCounts.log2CPMScaledShuffledTransposed.tsv.gz | wc -l | awk '{print $0-1}'
zcat digitalDLSorterDemo/Demo.Train.combinedCounts.geneList.tsv.gz | wc -l
The number of cell types --num_classes 10
, batch size to process the data in chunks --batch_size 100
and number of rounds for training --num_epochs 20
.
Finaly we provide the loss function to reduce --loss kullback_leibler_divergence
for this demo.
conda activate digitalDLSorter
python digitalDLSorter.py \
--trainCountsFile digitalDLSorterDemo/Demo.Train.combinedCounts.log2CPMScaledShuffledTransposed.tsv.gz \
--trainProbsFile digitalDLSorterDemo/Demo.Train.combinedProbsMatrix.log2CPMScaledShuffledTransposed.tsv.gz \
--testCountsFile digitalDLSorterDemo/Demo.Test.combinedCounts.log2CPMScaledShuffledTransposed.tsv.gz \
--testProbsFile digitalDLSorterDemo/Demo.Test.combinedProbsMatrix.log2CPMScaledShuffledTransposed.tsv.gz \
--num_classes 10 --num_genes 36477 --num_train_samples 16056 --num_test_samples 10800 \
--batch_size 100 --num_epochs 10 --loss kullback_leibler_divergence \
--outputPath digitalDLSorterDemo --prefix Demo
The training step will generate an h5 file with the model Demo.digitalDLSorterTrainedModel.kullback_leibler_divergence.h5
and its weights Demo.digitalDLSorterTrainedWeightsModel.kullback_leibler_divergence
, the list of genes for the input Demo.digitalDLSorterTrainedModel.inputGeneList
, the target class names or cell types Demo.digitalDLSorterTrainedModel.targetClassNames
, the model performance evaluation Demo.digitalDLSorterTrainedModel.evalStats
and the predictions on the test set Demo.digitalDLSorterTrainedModel.DeconvTestPredictions
.
Now with the model trained we can provide new samples to be deconvoluted. In example, these colorectal samples obtained from TCGA site. The counts matrix of bulk samples should have samples in rows and genes in columns. If the data is not normalized and scaled we can do so by --normData True
. If we have trained the model starting from TPMs we should feed it with TPMs.
conda activate digitalDLSorter
python digitalDLSorterModelDeconv.py \
--modelFile digitalDLSorterDemo/Demo.digitalDLSorterTrainedModel.kullback_leibler_divergence.h5 \
--modelGenesList digitalDLSorterDemo/Demo.digitalDLSorterTrainedModel.inputGeneList.kullback_leibler_divergence.txt.gz \
--modelClassNames digitalDLSorterDemo/Demo.digitalDLSorterTrainedModel.targetClassNames.kullback_leibler_divergence.txt.gz \
--countsFile data/TCGA.Colon.Counts.transpose.tsv.gz \
--batch_size 100 --num_samples 521 --normData True\
--outputPath digitalDLSorterDemo --prefix Demo.PredictTCGA