/$$$$$$ /$$$$$$ /$$$$$$$$ /$$$$$$$ /$$__ $$ /$$__ $$|__ $$__/| $$__ $$ | $$ \__/| $$ \ $$ | $$ | $$ \ $$ | $$ | $$$$$$$$ | $$ | $$ | $$ | $$ | $$__ $$ | $$ | $$ | $$ | $$ $$| $$ | $$ | $$ | $$ | $$ | $$$$$$/| $$ | $$ | $$ | $$$$$$$/ \______/ |__/ |__/ |__/ |_______/
This package extends the functionalities in deconv_benchmark, which was published at: Nature Communications.
Zhichao Miao, Anna Vathrakokili Pournara, Irene Papatheodoru. Multiple-level benchmarking of cell type deconvolution.
The scripts in CATD include CATD.R and helper_functions.R, which do not require an installation. The installation instruction for the deconvolution programs can be found at install.md.
The input reference data (a single-cell dataset) is formatted as ExpressionSet.
ExpressionSet (storageMode: lockedEnvironment)
assayData: 80 features, 1000 samples
element names: exprs
protocolData: none
phenoData
sampleNames: cell_1 cell_2 ... cell_1000 (1000 total)
varLabels: cellID cellType sampleID
varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation:
The expression matrix is available at: x@assayData$exprs
. Rows for genes, columns for cells. A matrix: M × N of type int
.
The metadata is available at: x@phenoData@data
, which includes THREE columns: cellID, cellType and sampleID.
An example below:
A data.frame: 1000 × 3
cellID cellType sampleID
<chr> <chr> <chr>
cell_1 cell_1 cell_type_1 indiv1
cell_2 cell_2 cell_type_1 indiv2
cell_3 cell_3 cell_type_1 indiv1
cell_4 cell_4 cell_type_1 indiv2
cell_5 cell_5 cell_type_1 indiv1
cell_6 cell_6 cell_type_1 indiv2
cell_7 cell_7 cell_type_1 indiv1
cell_8 cell_8 cell_type_1 indiv2
cell_9 cell_9 cell_type_1 indiv1
cell_10 cell_10 cell_type_1 indiv2
......
Python formatted AnnData can be transformed to required ExpressionSet format using the anndata2expressionset
function in helper_functions.R.
An example here: anndata2expressionset('downsample200/Ximerakis2019.h5', 'downsample200R/Ximerakis2019.rds')
.
In this evaluation framework, we have THREE levels of experiments: 1. self-reference, 2. cross-reference and 3. real bulk.
Both the pseudobulk and the cell type reference are generated from the same dataset, which is the same setting as the Nature Communications paper.
In this case, the self_reference
function in CATD.R is used.
When we have two reference (single-cell) datasets for the same type of tissue available, one of the dataset can be used to generate pseudobulk and the other one can be used as cell type reference.
In this case, the cross_reference
function in CATD.R is used.
When we have real bulk RNA-seq data available, we can deconvolve real bulk data with a single-cell reference dataset. However, there is no way to know the cell type proportion in the bulk dataset in priori. Therefore, we use TWO reference datasets to deconvolve the real bulk dataset. We assume that a program can deconvolve the real bulk dataset with the reference dataset very well (suppose the accuracy is 100%). The resulted cell proportion can be used as a gold standard to compare with the deconvolution using the second reference dataset. There can be possibility that the deconvolution with the two reference datasets are accidentally equally wrong. However, such a situation can be almost excluded after applying the self-reference and cross-reference evaluations.
The bulk_reference
function in CATD.R can deconvolve a real bulk dataset with a reference dataset, while the bulk_2references
function compares the deconvolutions of a real bulk dataset using two reference datasets.
- self-reference
# With the example we provided with this repository + no cell type removed:
Rscript CATD.R s example.rds none bulk TMM all nnls 100 none 1 F T T
#Expected output:
# RMSE Pearson
#1 0.0351 0.9866
- cross-reference
# With the example we provided with this repository + no cell type removed:
Rscript CATD.R c example1.rds example2.rds none bulk TMM all nnls 10000 none 1 F T T
#Expected output:
# RMSE Pearson
#1 0.0351 0.9866
- real bulk
# With the example we provided with this repository + no cell type removed:
Rscript CATD.R b bulk.rds example1.rds example2.rds none bulk TMM all nnls 10000 none 1 T
#Expected output:
# RMSE Pearson
#1 0.0351 0.9866