Error when simulating data use my own sce object

Question

Error when simulating data use my own sce object

Closed this issue 2 years ago · 10 comments

Hi, when I tried to perform data simulation using my own sce object, I always encounter an error saying
Error in DataFrame(beta0 = cs[, 1], row.names = rownames(x)) : invalid length of row names Calls: prepSim -> DataFrame

I tried to check:

rownames(sce), which equals to my gene names;
my sce contains group_id, sample_id, and cluster_id.

Any idea what might cause the problem and how I may fix it?
Thank you so much!

Answer 1 · 2021-10-28T01:07:31.000Z

Hi, also, as I was going over the source code, I found that when adding the dispersion, the code does not remove the i from the y:

cs <- y$coefficients
i <- !rowAnyNAs(cs)
cs <- cs[i, , drop = FALSE]
x <- x[i, , drop = FALSE]

.......

ds <- y$dispersion
names(ds) <- rownames(x)
rowData(x)$disp <- ds

This causes an error that the dim of X does not match the length of ds when I walked through the source code to identify the problem. If that works for everyone, did I miss something here?
Thanks!

Answer 2 · 2021-10-28T07:48:14.000Z

could you post your session info please?

Answer 3 · 2021-10-28T16:03:29.000Z

Here it is. Thank you for looking into that.

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] muscat_1.6.0 SingleCellExperiment_1.14.1
[3] SummarizedExperiment_1.22.0 Biobase_2.52.0
[5] GenomicRanges_1.44.0 GenomeInfoDb_1.28.0
[7] IRanges_2.26.0 S4Vectors_0.30.0
[9] BiocGenerics_0.38.0 MatrixGenerics_1.4.0
[11] matrixStats_0.59.0 SCF_4.1.0

loaded via a namespace (and not attached):
[1] backports_1.2.1 circlize_0.4.12
[3] blme_1.0-5 plyr_1.8.6
[5] TMB_1.7.20 splines_4.1.0
[7] listenv_0.8.0 BiocParallel_1.26.0
[9] ggplot2_3.3.5 scater_1.20.1
[11] TH.data_1.0-10 digest_0.6.27
[13] foreach_1.5.1 viridis_0.6.1
[15] lmerTest_3.1-3 fansi_0.5.0
[17] magrittr_2.0.1 memoise_2.0.0
[19] ScaledMatrix_1.0.0 cluster_2.1.1
[21] doParallel_1.0.16 limma_3.48.0
[23] ComplexHeatmap_2.8.0 globals_0.14.0
[25] Biostrings_2.60.0 annotate_1.70.0
[27] sandwich_3.0-1 prettyunits_1.1.1
[29] colorspace_2.0-2 blob_1.2.1
[31] dplyr_1.0.7 crayon_1.4.1
[33] RCurl_1.98-1.3 genefilter_1.74.0
[35] lme4_1.1-27 survival_3.2-11
[37] zoo_1.8-9 iterators_1.0.13
[39] glue_1.4.2 gtable_0.3.0
[41] zlibbioc_1.38.0 emmeans_1.6.1
[43] XVector_0.32.0 GetoptLong_1.0.5
[45] DelayedArray_0.18.0 BiocSingular_1.8.0
[47] future.apply_1.7.0 shape_1.4.6
[49] scales_1.1.1 mvtnorm_1.1-1
[51] DBI_1.1.1 edgeR_3.34.0
[53] Rcpp_1.0.7 viridisLite_0.4.0
[55] xtable_1.8-4 progress_1.2.2
[57] clue_0.3-59 bit_4.0.4
[59] rsvd_1.0.5 httr_1.4.2
[61] gplots_3.1.1 RColorBrewer_1.1-2
[63] ellipsis_0.3.2 pkgconfig_2.0.3
[65] XML_3.99-0.6 scuttle_1.2.0
[67] locfit_1.5-9.4 utf8_1.2.2
[69] tidyselect_1.1.1 rlang_0.4.11
[71] reshape2_1.4.4 AnnotationDbi_1.54.0
[73] munsell_0.5.0 tools_4.1.0
[75] cachem_1.0.5 generics_0.1.0
[77] RSQLite_2.2.7 broom_0.7.9
[79] stringr_1.4.0 fastmap_1.1.0
[81] bit64_4.0.5 caTools_1.18.2
[83] purrr_0.3.4 KEGGREST_1.32.0
[85] future_1.21.0 nlme_3.1-152
[87] sparseMatrixStats_1.4.0 pbkrtest_0.5.1
[89] compiler_4.1.0 rstudioapi_0.13
[91] beeswarm_0.4.0 png_0.1-7
[93] variancePartition_1.22.0 tibble_3.1.3
[95] geneplotter_1.70.0 stringi_1.7.3
[97] lattice_0.20-41 Matrix_1.3-4
[99] nloptr_1.2.2.2 vctrs_0.3.8
[101] pillar_1.6.2 lifecycle_1.0.0
[103] GlobalOptions_0.1.2 BiocNeighbors_1.10.0
[105] estimability_1.3 data.table_1.14.0
[107] bitops_1.0-7 irlba_2.3.3
[109] colorRamps_2.3 R6_2.5.1
[111] KernSmooth_2.23-18 gridExtra_2.3
[113] parallelly_1.25.0 vipor_0.4.5
[115] codetools_0.2-18 gtools_3.8.2
[117] boot_1.3-27 MASS_7.3-54
[119] assertthat_0.2.1 DESeq2_1.32.0
[121] rjson_0.2.20 sctransform_0.3.2
[123] multcomp_1.4-17 GenomeInfoDbData_1.2.6
[125] hms_1.1.0 grid_4.1.0
[127] beachmat_2.8.0 tidyr_1.1.3
[129] coda_0.19-4 glmmTMB_1.0.2.1
[131] minqa_1.2.4 DelayedMatrixStats_1.14.0
[133] Cairo_1.5-12.2 numDeriv_2016.8-1.1
[135] ggbeeswarm_0.6.0

Answer 4 · 2021-11-03T22:25:53.000Z

Hello, any update on this issue? thanks!

Answer 5 · 2021-11-04T06:25:51.000Z

Hey, sorry, slipped my mind - thanks for the ping.

I am actually not sure what is causing the issue. The error occurs in

bs <- DataFrame(
  beta0 = cs[, 1],
  row.names = rownames(x))

which is independent of y. Both cs and x are being subset right above via

cs <- y$coefficients
i <- !rowAnyNAs(cs)
cs <- cs[i, , drop = FALSE]
x <- x[i, , drop = FALSE]

Also, whether or not the input SCE x has row names set shouldn't matter, as this is assured in the beginning via

if (is.null(rownames(x))) rownames(x) <- paste0("gene", seq(nrow(x)))
if (is.null(colnames(x))) colnames(x) <- paste0("cell", seq(ncol(x)))

Would you be able to do look into it via calling debug("prepSim"), then rerunning prepSim(...) and possible tracking down what's causing this? Currently, I can't reproduce the error with various test cases I tried.

Answer 6 · 2021-11-04T21:57:42.000Z

Hi,
Thank you for looking into this. I'll try your suggestions and see if I may figure out what happens here.
BTW, I tried to walk through the source code and I didn't hit the error.

However, though bs is independent of Y, when I tried to assign ds to x, the length of ds is not the same as x so I hit an error as y has not be subset for i. For example, before subsetting, x and y have the same row numbers. But after subsetting, x may have less rows than y while length of ds is the same as y's row number.
Let me know if I misunderstood anything here.
Thanks!

Answer 7 · 2021-11-16T01:11:25.000Z

Hi, also, I'm not sure if this is an issue of my data or the simulation algorithm. My reference data has 5 samples and 4 clusters. I tried to simulate the data with 10 clusters and 20 samples in each group. Before making changes to the lfc or rel_lfc, the UMAP plot somehow make sense to me.

However, when I started to make change to rel_lfc

and lfc

some of the clusters overlapped with each other.

Any idea what those overlap means? And why they would group in 5 groups in the last 2 plots? Thanks!

Answer 8 · 2021-12-04T04:39:55.000Z

Hello, just want to follow up on my previous post. If I set all the genes to be EE, I would get the following results. I cannot figure out why there would be sub clusters within the same simulated clusters and some of the clusters overlapped with each other.

Answer 9 · 2021-12-17T09:40:34.000Z

I would say this is somewhat to be expected. The way the simulation works, simply put, is that every cell belongs to come sample-cluster combination. Now, for every sample-cluster, we sample a reference sample-cluster. In your dataset, there are 5 samples and 4 clusters. Hence for a given cluster, cells can come from one of 5 samples. Hence in the last UMAP above, every cluster is split into 5. This, my intuition says, indicates that the sample-sample differences are of similar magnitude as the cluster-cluster differences. Conversely, is all samples were pretty similar, the clusters shouldn't split in the UMAP. Conversely again, if the clusters were pretty similar, they would overlap but might be split by sample.

The take home messages for me are: 1) in a reference-based simulation such as this, I would refrain from simulating many more samples and clusters than are available as it can't be avoided that some samples/cluster will be resamples/used more than once; and 2) we aware of how (dis)similar the samples/clusters are in your reference (e.g., what does the non-simulated UMAP look like?) or, in other words, what sample-to-sample/cluster-to-cluster variability is to be expected. As a reference-based simulation will try and mimic the data, these difference will manifest/propagate in the synthetic data.

Answer 10 · 2022-01-14T08:59:33.000Z

closing as part 1 is related to #86, and I assume part 2 has been resolved due to inactivity. Feel free to continue the discussion here should you have further questions regarding the UMAP outputs, and I'll be happy to help (if I can)!