Example code from prepSCE does not work with aggregateData
Closed this issue · 1 comments
dariober commented
Hi- I'm creating an example dataset using the code from the documentation of prepSCE
:
ng <- 50
nc <- 200
# generate some cell metadata
gids <- sample(c("groupA", "groupB"), nc, TRUE)
sids <- sample(paste0("sample", seq_len(3)), nc, TRUE)
kids <- sample(paste0("cluster", seq_len(5)), nc, TRUE)
batch <- sample(seq_len(3), nc, TRUE)
cd <- data.frame(group = gids, id = sids, cluster = kids, batch)
# construct SCE
library(scuttle)
sce <- mockSCE(ncells = nc, ngenes = ng)
colData(sce) <- cbind(colData(sce), cd)
# prep. for workflow
sce <- prepSCE(sce, kid = "cluster", sid = "id", gid = "group")
head(colData(sce))
metadata(sce)$experiment_info
When I pass the object sce
to aggregateData
, the output is an object with colData having 0 columns:
pb <- aggregateData(sce)
colData(pb)
DataFrame with 3 rows and 0 columns
# Then
pbDS(pb)
Error in .check_pbs(pb, check_by = TRUE) :
!is.null(pbs[["group_id"]]) is not TRUE
Am I doing something wrong?
sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS
Matrix products: default
BLAS/LAPACK: /home1/db291g/miniconda3/envs/pseudobulk_dge/lib/libopenblasp-r0.3.18.so
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8
[8] LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] scuttle_1.4.0 SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0 Biobase_2.54.0 GenomicRanges_1.46.0 GenomeInfoDb_1.30.0 IRanges_2.28.0
[8] S4Vectors_0.32.0 BiocGenerics_0.40.0 MatrixGenerics_1.6.0 matrixStats_0.61.0 muscat_1.8.0
loaded via a namespace (and not attached):
[1] backports_1.4.0 circlize_0.4.13 blme_1.0-5 plyr_1.8.6 TMB_1.7.22 splines_4.1.1 BiocParallel_1.28.0
[8] listenv_0.8.0 ggplot2_3.3.5 scater_1.22.0 digest_0.6.29 foreach_1.5.1 viridis_0.6.2 lmerTest_3.1-3
[15] fansi_0.4.2 magrittr_2.0.1 memoise_2.0.1 ScaledMatrix_1.2.0 cluster_2.1.2 doParallel_1.0.16 limma_3.50.0
[22] ComplexHeatmap_2.10.0 globals_0.14.0 Biostrings_2.62.0 annotate_1.72.0 prettyunits_1.1.1 colorspace_2.0-2 blob_1.2.2
[29] ggrepel_0.9.1 dplyr_1.0.7 crayon_1.4.2 RCurl_1.98-1.5 genefilter_1.76.0 lme4_1.1-27.1 survival_3.2-13
[36] iterators_1.0.13 glue_1.5.1 gtable_0.3.0 zlibbioc_1.40.0 XVector_0.34.0 GetoptLong_1.0.5 DelayedArray_0.20.0
[43] BiocSingular_1.10.0 future.apply_1.8.1 shape_1.4.6 scales_1.1.1 DBI_1.1.1 edgeR_3.36.0 Rcpp_1.0.7
[50] viridisLite_0.4.0 xtable_1.8-4 progress_1.2.2 clue_0.3-60 bit_4.0.4 rsvd_1.0.5 httr_1.4.2
[57] gplots_3.1.1 RColorBrewer_1.1-2 ellipsis_0.3.2 pkgconfig_2.0.3 XML_3.99-0.8 locfit_1.5-9.4 utf8_1.2.2
[64] tidyselect_1.1.1 rlang_0.4.12 reshape2_1.4.4 AnnotationDbi_1.56.1 munsell_0.5.0 tools_4.1.1 cachem_1.0.6
[71] generics_0.1.1 RSQLite_2.2.8 broom_0.7.10 stringr_1.4.0 fastmap_1.1.0 bit64_4.0.5 caTools_1.18.2
[78] purrr_0.3.4 KEGGREST_1.34.0 future_1.23.0 nlme_3.1-153 sparseMatrixStats_1.6.0 compiler_4.1.1 pbkrtest_0.5.1
[85] beeswarm_0.4.0 png_0.1-7 variancePartition_1.24.0 tibble_3.1.6 geneplotter_1.72.0 stringi_1.7.6 lattice_0.20-45
[92] Matrix_1.3-4 nloptr_1.2.2.3 vctrs_0.3.8 pillar_1.6.4 lifecycle_1.0.1 GlobalOptions_0.1.2 BiocNeighbors_1.12.0
[99] data.table_1.14.2 bitops_1.0-7 irlba_2.3.3 R6_2.5.1 KernSmooth_2.23-20 gridExtra_2.3 vipor_0.4.5
[106] parallelly_1.29.0 codetools_0.2-18 boot_1.3-28 MASS_7.3-54 gtools_3.9.2 DESeq2_1.34.0 rjson_0.2.20
[113] sctransform_0.3.2 GenomeInfoDbData_1.2.7 parallel_4.1.1 hms_1.1.1 grid_4.1.1 beachmat_2.10.0 tidyr_1.1.4
[120] glmmTMB_1.1.2.3 minqa_1.2.4 DelayedMatrixStats_1.16.0 numDeriv_2016.8-1.1 ggbeeswarm_0.6.0
HelenaLC commented
Nope, nothing's wrong here, but I do understand and apologise for the bad example and resulting confusion! The mock data demonstrates how to compute pseudobulks using aggregateData
, but is wasn't intended to make sense for DS analysis. Specifically, the sampling of each samples group assignment is random, so that cells from a given sample could in fact have different group assignments. To mock up reasonable data, something like the following should work:
library(muscat)
ng <- 400 # number of genes
nc <- 200 # number of cells
ns <- 4 # number of samples
nk <- 3 # number of clusters
# generate some cell metadata
# (the following assures there are 2 samples from each group,
# and that each sample is uniquely assigned to one group)
sids <- rep(paste0("sample", seq_len(ns)), each = nc/ns)
gids <- rep(c("groupA", "groupB"), each = nc/2)
kids <- sample(paste0("cluster", seq_len(nk)), nc, TRUE)
batch <- sample(seq_len(3), nc, TRUE)
cd <- data.frame(group = gids, id = sids, cluster = kids, batch)
# construct SCE
library(scuttle)
sce <- mockSCE(ncells = nc, ngenes = ng)
colData(sce) <- cbind(colData(sce), cd)
# prep. for workflow
sce <- prepSCE(sce, kid = "cluster", sid = "id", gid = "group")
head(colData(sce))
metadata(sce)$experiment_info
# perform DS analysis
pb <- aggregateData(sce)
res <- pbDS(pb)