The current code was implemented using R v4.1.0, Bioconductor v3.13, Snakemake v5.5.0, and Python v3.6.8. All R dependencies (from GitHub, CRAN and Bioconductor) are listed under code/10-session_info.R and may be installed using the command contained therein.
config.yaml
specifies the R library and version to usecode
contains all R scripts used in the Snakemake workflowdata
contains raw, filtered and simulated scRNA-seq datasets,
as well as simulation parameter estimatesmeta
contains two .json files that specify simulation method (methods.json
) and reference subset (subsets.json
) configurationsouts
contains all results from computations (typicallydata.frame
s) as .rds filesfigs
contains all visual outputs as .pdf files, and correspondingggplot
objects as .rds files (for subsequent arrangement into 'super'-figures)
Simulation methods are tagged with one or many of the following labels, according to which scenario(s) they can accommodate:
n
for none: no clusters or batchesb
for batch: multiple batches, no clustersk
for cluster: multiple clusters, no batches
Similarly, we tag subsets (see below) with exactly one of these labels. This allows running each method on subsets they are capable of simulating.
1. Data retrieval
Each code/00-get_data-<datset_id>.R
script retrieves a publicly available scRNA-seq dataset through from which a SingleCellExperiment is constructed and written to data/00-raw/<datset_id>.rds
2. Filtering
code/01-fil_data.R
is applied to each raw dataset as to
- remove batches, cluster, or batch-cluster instances with fewer than 50 cells (depending on the dataset's complexity)
- keep genes with a count of at least 1 in at least 10 cells, and remove cells with fewer than 100 detected genes
Filtered data are written to data/01-fil/<datset_id>.rds
.
3. Subsetting
Because different methods can accommodate only some features (e.g. multiple batches or clusters, both or neither), code/02-sub_data.R
creates specific subsets in data/02-sub/<datset_id>.<subset_id>,rds
. We term these ref(erence)sets (i.e. <datset_id>.<subset_id> = <refset_id>
), as they serve as the input reference data for simulation.
1. Parameter estimation
Simulation parameters are estimated with code/03-est_pars.R
, which in term sources a code/03-est_pars-<method_id>.R
script that executes a method's parameter estimation function(s). In cases where no separate estimation takes place, this returns NULL
. Parameter estimates for each combination of <refset_id.<method_id> = <simset_id>
are written to data/04-est/<simset_id>.rds
.
2. Data simulation
Data is simulated with code/04-sim_data.R
, which in term sources a code/04-sim_data-<method_id>.R
script that executes a method's simulation function. Simulations for each combination of <refset_id>
and method_id
are written to data/05-sim/<refset_id>,<method_id>.rds
.
Various quality control (QC) summaries are computed with code/05-calc_qc.R
, which in term sources a set of code/05-calc_qc-<metric_id>.R
scripts. QC results for reference and simulated data are written to outs/qc_ref-<refset_id>,<metric_id>.rds
and outs/qc_sim-<simset_id>,<metric_id>.rds
, respectively. At current, we consider:
1. Gene-level
frq
: detection frequency (i.e., fraction of cells with non-zero counts)avg/var
: average/variance of logCPMcv
: coefficient of variationcor
: gene-to-gene correlation
2. Cell-level
frq
: detection frequency (i.e., fraction of genes with non-zero counts)lls
: log-transformed library size (total counts)cor
: cell-to-cell correlationpcd
: cell-to-cell distance (in PCA space)knn
: number of KNN occurrencesldf
: local density factor
3. Global
sw
: Silhouette width (using batch/cluster labels as classes)cms
: cell-specific mixing score (using batch/cluster labels as batches)pve
: percent variance explained (of gene expression = logCPM, by batch/cluster)
Noteworthily, we compute each summary for different groupings of cells (depending on the dataset's complexity):
- globally, i.e. across all cells
- at the batch-level, i.e. for each batch
- at the cluster-level, i.e. for each cluster
Global summaries are computed at the batch-/cluster-level only, as they require a grouping variable.
We compare summaries between reference and simulated data in both one- (code/06-stat_1d.R
) and two-dimensional settings (code/06-statl_2d.R
). For the latter, every combination of gene- and cell-level metrics is considered, excluding correlations and global summaries. Furthermore, metrics are evaluated for each cell grouping, i.e. we perform a test globally, for each batch and cluster (again, depending on the dataset's complexity). Test results are written to outs/stat_1d,<refset_id>,<metric_id>,<stat1d_id>.rds
for 1D, and outs/stat_2d,<refset_id>,<metric1_id>,<metric2_id>,<stat2d_id>.rds
for 2D tests.
1. One-dimensional
- Kolmogorov-Smirnov (KS) test
- Wasserstein metric
2. Two-dimensional
- two-dimensional KS test
- Earth Mover's Distance (EMD)
Each 05-calc_batch-x.R
script wraps around an integration method that is applied in 05-calc_batch.R
to the set of type b subsets. The output corrected assay data or integrated cell embeddings (depending on the method) are written to outs/batch_ref/sim-<ref/simset_id>,<batch_method>.rds
for every reference and simulation, respectively. Results are evaluated by 06-eval_batch.R
, which computes the following set of metrics:
- cell-specific mixing score (CMS)
- difference in local density factor ($\Delta$LDF)
- batch correction score (BCS)
Each 05-calc_clust-x.R
script wraps around an integration method that is applied in 05-calc_clust.R
to the set of type b subsets. The output cluster assignments are written to outs/clust_ref/sim-<ref/simset_id>,<clust_method>.rds
for every reference and simulation, respectively. Results are evaluated by 06-eval_clust.R
, which computes the following set of metrics:
- precision (P) and recall (R)
- F1 score (harmonic mean of P and R)
Finally, results are collected across refset_id
s and method_id
s (jointly or separated by type), and visualized in various ways using as set of 07-plot_x.R
scripts. Output figures are written to plts
as .pdf files, along with the corresponding ggplot
objects as .rds files. Lastly, 08-fig_x.R
scripts are used to combined various ggplot
s into figures that are saved to figs
as .pdf files.
In principle, any dataset for which a code/00-get_data-<dataset_id>.R
script exists will be accessible to the workflow. However, data will only be retrieved if the dataset appears in meta/subsets.json
. Hence,
To exclude a dataset from the workflow, i) (re)move the corresponding code/00-get_data-<dataset_id>.R
script; or, ii) remove or comment out any associated meta/subsets.json
entries.
Similarly, a new dataset can be added by supplying an adequate code/00-get_data-<dataset_id>.R
script, and adding an entry to the meta/subsets.json
configuration that specifies the subset ID, the number of genes/cells to sample (NULL
for all), which batch(es)/cluster(s) to retain, as well as the resulting subset's type (one of n,b,k,g).
The Snakemake will automatically include any simulation method for which a code/03-est_pars-<method_id>.R
and code/04-sim_data-<method_id>.R
script exists. Secondly, meta/methods.json
will determine on which type(s) of dataset(s) each method should be run. Thus,
To exclude a method from the workflow, either i) set "<method_id>": "x"
in the meta/methods.json
file (or anything other than n,b,k,g); or, ii) (re)move the parameter estimation and/or simulation script from the code
directory.
Analogous to the above, adding a method to the benchmark requires i) adding a code/03-est_pars-<method_id>.R
and code/04-sim-data-<method_id>.R
script; and, ii) adding an entry for the method_id
to the meta/methods.json
file. Importantly, the R script for parameter estimation should handle batches (colData
column batch
), clusters (colData
column cluster
), both or neither. And the method's type(s) should be specified accordingly (n
for neither, b/k
for batches/clusters, g
for groups), e.g. "<method_id>": ["n", "k"]
for a method that supports 'singular' datasets, as well as ones with multiple clusters.