Example code from prepSCE does not work with aggregateData

Question

Example code from prepSCE does not work with aggregateData

Closed this issue 3 years ago · 1 comments

Hi- I'm creating an example dataset using the code from the documentation of prepSCE:

ng <- 50
nc <- 200
 
# generate some cell metadata
gids <- sample(c("groupA", "groupB"), nc, TRUE)   
sids <- sample(paste0("sample", seq_len(3)), nc, TRUE) 
kids <- sample(paste0("cluster", seq_len(5)), nc, TRUE) 
batch <- sample(seq_len(3), nc, TRUE)
cd <- data.frame(group = gids, id = sids, cluster = kids, batch)

# construct SCE
library(scuttle)
sce <- mockSCE(ncells = nc, ngenes = ng)
colData(sce) <- cbind(colData(sce), cd)

# prep. for workflow
sce <- prepSCE(sce, kid = "cluster", sid = "id", gid = "group")
head(colData(sce))
metadata(sce)$experiment_info

When I pass the object sce to aggregateData, the output is an object with colData having 0 columns:

pb <- aggregateData(sce)
colData(pb)
DataFrame with 3 rows and 0 columns

# Then 
pbDS(pb)
Error in .check_pbs(pb, check_by = TRUE) : 
  !is.null(pbs[["group_id"]]) is not TRUE

Am I doing something wrong?

sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS/LAPACK: /home1/db291g/miniconda3/envs/pseudobulk_dge/lib/libopenblasp-r0.3.18.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] scuttle_1.4.0               SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0 Biobase_2.54.0              GenomicRanges_1.46.0        GenomeInfoDb_1.30.0         IRanges_2.28.0             
 [8] S4Vectors_0.32.0            BiocGenerics_0.40.0         MatrixGenerics_1.6.0        matrixStats_0.61.0          muscat_1.8.0               

loaded via a namespace (and not attached):
  [1] backports_1.4.0           circlize_0.4.13           blme_1.0-5                plyr_1.8.6                TMB_1.7.22                splines_4.1.1             BiocParallel_1.28.0      
  [8] listenv_0.8.0             ggplot2_3.3.5             scater_1.22.0             digest_0.6.29             foreach_1.5.1             viridis_0.6.2             lmerTest_3.1-3           
 [15] fansi_0.4.2               magrittr_2.0.1            memoise_2.0.1             ScaledMatrix_1.2.0        cluster_2.1.2             doParallel_1.0.16         limma_3.50.0             
 [22] ComplexHeatmap_2.10.0     globals_0.14.0            Biostrings_2.62.0         annotate_1.72.0           prettyunits_1.1.1         colorspace_2.0-2          blob_1.2.2               
 [29] ggrepel_0.9.1             dplyr_1.0.7               crayon_1.4.2              RCurl_1.98-1.5            genefilter_1.76.0         lme4_1.1-27.1             survival_3.2-13          
 [36] iterators_1.0.13          glue_1.5.1                gtable_0.3.0              zlibbioc_1.40.0           XVector_0.34.0            GetoptLong_1.0.5          DelayedArray_0.20.0      
 [43] BiocSingular_1.10.0       future.apply_1.8.1        shape_1.4.6               scales_1.1.1              DBI_1.1.1                 edgeR_3.36.0              Rcpp_1.0.7               
 [50] viridisLite_0.4.0         xtable_1.8-4              progress_1.2.2            clue_0.3-60               bit_4.0.4                 rsvd_1.0.5                httr_1.4.2               
 [57] gplots_3.1.1              RColorBrewer_1.1-2        ellipsis_0.3.2            pkgconfig_2.0.3           XML_3.99-0.8              locfit_1.5-9.4            utf8_1.2.2               
 [64] tidyselect_1.1.1          rlang_0.4.12              reshape2_1.4.4            AnnotationDbi_1.56.1      munsell_0.5.0             tools_4.1.1               cachem_1.0.6             
 [71] generics_0.1.1            RSQLite_2.2.8             broom_0.7.10              stringr_1.4.0             fastmap_1.1.0             bit64_4.0.5               caTools_1.18.2           
 [78] purrr_0.3.4               KEGGREST_1.34.0           future_1.23.0             nlme_3.1-153              sparseMatrixStats_1.6.0   compiler_4.1.1            pbkrtest_0.5.1           
 [85] beeswarm_0.4.0            png_0.1-7                 variancePartition_1.24.0  tibble_3.1.6              geneplotter_1.72.0        stringi_1.7.6             lattice_0.20-45          
 [92] Matrix_1.3-4              nloptr_1.2.2.3            vctrs_0.3.8               pillar_1.6.4              lifecycle_1.0.1           GlobalOptions_0.1.2       BiocNeighbors_1.12.0     
 [99] data.table_1.14.2         bitops_1.0-7              irlba_2.3.3               R6_2.5.1                  KernSmooth_2.23-20        gridExtra_2.3             vipor_0.4.5              
[106] parallelly_1.29.0         codetools_0.2-18          boot_1.3-28               MASS_7.3-54               gtools_3.9.2              DESeq2_1.34.0             rjson_0.2.20             
[113] sctransform_0.3.2         GenomeInfoDbData_1.2.7    parallel_4.1.1            hms_1.1.1                 grid_4.1.1                beachmat_2.10.0           tidyr_1.1.4              
[120] glmmTMB_1.1.2.3           minqa_1.2.4               DelayedMatrixStats_1.16.0 numDeriv_2016.8-1.1       ggbeeswarm_0.6.0

Answer 1 · 2022-01-30T19:42:56.000Z

Nope, nothing's wrong here, but I do understand and apologise for the bad example and resulting confusion! The mock data demonstrates how to compute pseudobulks using aggregateData, but is wasn't intended to make sense for DS analysis. Specifically, the sampling of each samples group assignment is random, so that cells from a given sample could in fact have different group assignments. To mock up reasonable data, something like the following should work:

library(muscat)
ng <- 400 # number of genes
nc <- 200 # number of cells
ns <- 4   # number of samples
nk <- 3   # number of clusters
 
# generate some cell metadata
# (the following assures there are 2 samples from each group,
# and that each sample is uniquely assigned to one group)
sids <- rep(paste0("sample", seq_len(ns)), each = nc/ns)
gids <- rep(c("groupA", "groupB"), each = nc/2)   

kids <- sample(paste0("cluster", seq_len(nk)), nc, TRUE) 
batch <- sample(seq_len(3), nc, TRUE)
cd <- data.frame(group = gids, id = sids, cluster = kids, batch)

# construct SCE
library(scuttle)
sce <- mockSCE(ncells = nc, ngenes = ng)
colData(sce) <- cbind(colData(sce), cd)

# prep. for workflow
sce <- prepSCE(sce, kid = "cluster", sid = "id", gid = "group")
head(colData(sce))
metadata(sce)$experiment_info

# perform DS analysis
pb <- aggregateData(sce)
res <- pbDS(pb)