campbio/celda

Problem Running decontX with z argument

astarr97 opened this issue · 19 comments

Hello,

I am trying to run decontX on a counts matrix (called count_mat) with dim 1819 x 16382. My z argument is a vector (called names) of length 1819 containing cluster names. However when I run celda::decontX(count_mat, z = names) I get the error "Error in .processCellLabels(z, numCells = nC) : 'z' must be of the same length as the number of cells in the 'counts' matrix.". Because of this, I tried running it on the transpose of the counts matrix: celda::decontX(t(count_mat), z = names). Doing this, I get the error: "Error in rowSums(phi) : 'x' must be an array of at least two dimensions". First, should rows be the cells or should rows be the genes? Second, any idea how I can fix these errors?

Best, Alex.

zhewa commented

Hi Alex,

The rows should be genes and columns should be cells in the input counts matrix. Your second error seems to be dataset dependent. Make sure that no rows or columns in the counts matrix are all zeros. Also check if input variables count_mat and t(count_mat) are classes of matrix with 2 dimensions. Let us know if the error persists. If possible, can you provide a reproducible example script of the error so it would be easier for us to pinpoint where the program went wrong?

I will check those things, thank you for the very quick response! By reproducible example script, should it include my data?

zhewa commented

Yes, that would work, if you feel open doing that. A simulated data which generates this error would also work.

Okay I may try to simulate it, although it looks like there are zero columns so hopefully that is it!

That seems to have resolved it, although now RStudio is encountering a fatal error every time I run it.

Does decontX require an extreme amount of RAM? I cannot figure out why R encounters a fatal error when I run the program with z, but runs fine when I don't enter z.

zhewa commented

Hi @astarr97,

decontX do convert input matrix object to a sparse matrix dgCMatrix before calculation. A RAM of 8GB or more should be enough to deal with a dataset of size 16382 x 1819. I recommend making sure no unnecessary objects are loaded in the memory by executing rm(list = ls()) before your analysis. This removes all defined variables in the global environment and frees up memory. Also, if you are using Rstudio, uncheck Restore .RData into workspace at startup to ensure no unwanted workspace is loaded at Rstudio startup. Also, can you paste the fatal error message so we can be sure it's a memory related error?

Another thing you can check is how you defined names for the argument z. Make sure the total number of cell clusters is as you expected by executing length(unique(names)) in R.

I will try those things, thank you. Although I have been sure to clear my global environment each time at run time. In addition, it terminates with a fatal error within a few seconds so it may not be a RAM issue. For the z argument, it should be the same length as the number of cells, right? And the input counts matrix should just be the counts without column or row labels? Thank you again.

zhewa commented

Yes, z should have the same length as the number of columns in x. You can have row names and column names for x as long as these labels are in the rownames(x) and colnames(x) slots and not an actual row or column in x.

Okay, it is still erroring. I attached the error message and an image of my code (it is the last line that errors). I am checking with a collaborator to see if it is okay to send you the data itself or a simulated version. Thank you so much for your help so far!
code

Error_Message

Hi @astarr97, I'm sorry this error is occurring for you. It would be great for us to figure out what it is so we can add a check to ensure we don't crash R! A few other questions for you:

  1. Which version of R and celda are you using. You can run sessionInfo() and send us the results. If you are not using the latest version, I would highly recommend updating it.
  2. Do you have any NA's in your matrix or names vector? Or characters or negative numbers in your matrix?
  3. What is the storage mode of you matrix. You can check this by running storage.mode(counts_mat)
  4. If you can't send us the data directly, can you show us the first few entries in the matrix and names vector?

counts_mat[1:5,1:5]
names_vector[1:5]

Hello, here is the output of sessionInfo():
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets
[6] methods base

other attached packages:
[1] MASS_7.3-53 celda_1.2.4 optparse_1.6.6

loaded via a namespace (and not attached):
[1] ggrepel_0.9.1 Rcpp_1.0.6
[3] lattice_0.20-41 digest_0.6.27
[5] assertthat_0.2.1 foreach_1.5.1
[7] SingleCellExperiment_1.8.0 R6_2.5.0
[9] GenomeInfoDb_1.22.1 plyr_1.8.6
[11] stats4_3.6.1 httr_1.4.2
[13] ggplot2_3.3.3 pillar_1.4.7
[15] zlibbioc_1.32.0 rlang_0.4.10
[17] rstudioapi_0.13 data.table_1.13.6
[19] S4Vectors_0.24.4 Matrix_1.3-2
[21] combinat_0.0-8 BiocParallel_1.20.1
[23] Rtsne_0.15 stringr_1.4.0
[25] RcppEigen_0.3.3.9.1 RCurl_1.98-1.2
[27] munsell_0.5.0 uwot_0.1.10
[29] DelayedArray_0.12.3 compiler_3.6.1
[31] pkgconfig_2.0.3 BiocGenerics_0.32.0
[33] tidyselect_1.1.0 SummarizedExperiment_1.16.1
[35] tibble_3.0.5 gridExtra_2.3
[37] GenomeInfoDbData_1.2.2 enrichR_2.1
[39] IRanges_2.20.2 codetools_0.2-18
[41] matrixStats_0.57.0 dendextend_1.14.0
[43] viridisLite_0.3.0 withr_2.4.0
[45] crayon_1.3.4 dplyr_1.0.3
[47] MCMCprecision_0.4.0 bitops_1.0-6
[49] MAST_1.12.0 grid_3.6.1
[51] gtable_0.3.0 lifecycle_0.2.0
[53] DBI_1.1.1 magrittr_2.0.1
[55] pROC_1.17.0.1 scales_1.1.1
[57] stringi_1.5.3 XVector_0.26.0
[59] reshape2_1.4.4 viridis_0.5.1
[61] doParallel_1.0.16 getopt_1.20.3
[63] ggdendro_0.1.22 ellipsis_0.3.1
[65] generics_0.1.0 vctrs_0.3.6
[67] rjson_0.2.20 RColorBrewer_1.1-2
[69] iterators_1.0.13 tools_3.6.1
[71] Biobase_2.46.0 glue_1.4.2
[73] purrr_0.3.4 abind_1.4-5
[75] parallel_3.6.1 yaml_2.2.1
[77] colorspace_2.0-0 BiocManager_1.30.10
[79] GenomicRanges_1.38.0

Here is the output of running > storage.mode(count_mat)
[1] "integer"

And here are the first few entries of count_mat:
[,1] [,2] [,3] [,4] [,5]
V1 0 0 0 0 0
V2 0 0 0 0 0
V3 0 0 0 0 0
V4 1 3 0 1 0
V5 0 0 1 2 1
And names_vector:
[1] MARS_13 MARS_4 MARS_11 MARS_7 MARS_3
26 Levels: MARS_0 MARS_1 MARS_10 MARS_11 MARS_12 ... MARS_9

Thank you for including the commands, definitely saved me some time.

Best, Alex.

Okay, it looks like you are running an older version and chances are this bug has already been fixed. The latest version is in Bioconductor 3.12 and needs R 4.0.2 to run:

https://www.bioconductor.org/packages/release/bioc/html/celda.html

Great I think that did it. Is this expected output with the z argument included? It is running much faster than the previous version and says that it was "Generating UMAP" so I am concerned the z argument didn't take for some reason (see attached output).

Best, Alex.
output_expected

Yes, this is correct. It still generates a UMAP for visualization but not for clustering. Just let us know if you have other issues and make sure you read over the new vignette as there are several new plotting utilities to help in the assessment of contamination:

https://www.bioconductor.org/packages/release/bioc/vignettes/celda/inst/doc/decontX.pdf

Thank you so much for developing and maintaining this software!

As a final question, is there any way to get decontX to deal with contamination that is present in all cell clusters? There is a "hidden" cluster (we didn't collect these cells, but they still contaminate our data) that has a few genes that are so highly expressed they contaminate our data.

Hi @astarr97, this is a good suggestion and something we are actively working on. We hope to push an update in another month or so that will allow you to estimate a contamination distribution directly from empty droplets.

Hi @astarr97, we have just pushed a new version of decontX to the master branch that allows you to give the raw matrix as a second parameter. The ambient RNA distribution is estimated from the empty drops which should contain counts from your "hidden" population.

Note that you will have to reinstall celda from GitHub and build the vignette locally to see an example:

library(devtools)
install_github("campbio/celda", build_vignettes = TRUE)
vignette("decontX")