Running zinbwave on cluster

Question

Closed this issue 4 years ago · 4 comments

Hi, I'm trying to run zinbwave on a large scRNA-seq dataset (> 5K cells). It runs out of memory on my laptop, so I'm trying to run it on a cluster instead. The error I get is:
Error in serialize(data, node$con, xdr = FALSE) : ignoring SIGPIPE signal
Error: failed to stop ‘SOCKcluster’ cluster: error writing to connection

The traceback is:

traceback()
15: stop(paste(strwrap(txt, exdent = 2), collapse = "\n"), call. = FALSE)
14: value[3L]
13: tryCatchOne(expr, names, parentenv, handlers[[1L]])
12: tryCatchList(expr, classes, parentenv, handlers)
11: tryCatch({
parallel::stopCluster(bpbackend(x))
}, error = function(err) {
txt <- sprintf("failed to stop %s cluster: %s", sQuote(class(bpbackend(x))[[1]]),
conditionMessage(err))
stop(paste(strwrap(txt, exdent = 2), collapse = "\n"), call. = FALSE)
})
10: bpstop(BPPARAM)
9: bpstop(BPPARAM)
8: bplapply(seq(n), function(i) {
solveRidgeRegression(x = getV_mu(m)[P[i, ], , drop = FALSE],
y = L[i, P[i, ]] - Xbeta_mu[i, P[i, ]], epsilon = getEpsilon_gamma_mu(m),
family = "gaussian")
}, BPPARAM = BPPARAM)
7: bplapply(seq(n), function(i) {
solveRidgeRegression(x = getV_mu(m)[P[i, ], , drop = FALSE],
y = L[i, P[i, ]] - Xbeta_mu[i, P[i, ]], epsilon = getEpsilon_gamma_mu(m),
family = "gaussian")
}, BPPARAM = BPPARAM)
6: unlist(bplapply(seq(n), function(i) {
solveRidgeRegression(x = getV_mu(m)[P[i, ], , drop = FALSE],
y = L[i, P[i, ]] - Xbeta_mu[i, P[i, ]], epsilon = getEpsilon_gamma_mu(m),
family = "gaussian")
}, BPPARAM = BPPARAM))
5: matrix(unlist(bplapply(seq(n), function(i) {
solveRidgeRegression(x = getV_mu(m)[P[i, ], , drop = FALSE],
y = L[i, P[i, ]] - Xbeta_mu[i, P[i, ]], epsilon = getEpsilon_gamma_mu(m),
family = "gaussian")
}, BPPARAM = BPPARAM)), nrow = NCOL(getV_mu(m)))
4: zinbInitialize(m, Y, nb.repeat = nb.repeat.initialize, BPPARAM = BPPARAM)
3: .local(Y, ...)
2: zinbFit(counts(sce), K = nlabels)
1: zinbFit(counts(sce), K = nlabels)

The session info is below:

sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /share/software/user/open/openblas/0.2.19/lib/libopenblasp-r0.2.19.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] zinbwave_1.0.0 SingleCellExperiment_1.0.0
[3] SummarizedExperiment_1.8.1 DelayedArray_0.4.1
[5] matrixStats_0.53.1 Biobase_2.38.0
[7] GenomicRanges_1.30.3 GenomeInfoDb_1.14.0
[9] IRanges_2.12.0 S4Vectors_0.16.0
[11] BiocGenerics_0.24.0 BiocInstaller_1.28.0

loaded via a namespace (and not attached):
[1] pcaPP_1.9-73 Rcpp_0.12.16 compiler_3.4.0
[4] XVector_0.18.0 iterators_1.0.9 bitops_1.0-6
[7] tools_3.4.0 zlibbioc_1.24.0 digest_0.6.15
[10] bit_1.1-12 memoise_1.1.0 RSQLite_2.1.0
[13] annotate_1.56.2 lattice_0.20-35 pspline_1.0-18
[16] foreach_1.4.4 Matrix_1.2-14 DBI_0.8
[19] mvtnorm_1.0-7 GenomeInfoDbData_1.0.0 copula_0.999-18
[22] genefilter_1.60.0 glmnet_2.0-16 bit64_0.9-7
[25] locfit_1.5-9.1 grid_3.4.0 ADGofTest_0.3
[28] AnnotationDbi_1.40.0 survival_2.41-3 XML_3.98-1.11
[31] BiocParallel_1.12.0 limma_3.34.9 blob_1.1.1
[34] edgeR_3.20.9 codetools_0.2-15 splines_3.4.0
[37] stabledist_0.7-1 softImpute_1.4 xtable_1.8-2
[40] numDeriv_2016.8-1 gsl_1.9-10.3 RCurl_1.95-4.10

Thank you.

Answer 1 · 2018-04-27T23:35:20.000Z

Searching for the error message suggested that it is caused by excessive memory use, so I requested 48 GB and ran zinbFit again. This time I got the following error:

Error: 'bplapply' receive data failed:
error reading from connection

Error in serialize(data, node$con, xdr = FALSE) : ignoring SIGPIPE signal

Answer 2 · 2018-04-28T01:43:52.000Z

Hi,

How many cores are you using for the parallelization and what is the value of K?

I can run zinbwave on my desktop (8 cores, 32 GB ram) with up to 10k cells so I'm a bit surprised that you are running out of memory.

For some reason, I do observe very large memory usage when using many CPUs, so I would suggest trying with fewer cores to see what happens.
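As a rough sketch of what "fewer cores" could look like, the worker count can be set explicitly through the BPPARAM argument that appears in the traceback above. Here `sce` and `nlabels` are the objects from the original post, and the value of 2 workers is only an illustrative assumption, not a setting recommended in this thread.

```r
library(zinbwave)
library(BiocParallel)

# `sce` and `nlabels` come from the original poster's session.
# workers = 2 is an arbitrary illustrative value; adjust it to the
# memory available on the node.
bp <- MulticoreParam(workers = 2)

fit <- zinbFit(counts(sce), K = nlabels, BPPARAM = bp)
```

If memory use still grows too large, lowering the worker count further is the natural next thing to try.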

We recommend applying zinbwave to the thousand or so most variable genes, both for speed and because it leads to better results.
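One possible way to do this filtering is sketched below. It assumes `sce` is the SingleCellExperiment from the original post, and it ranks genes by the variance of log1p-transformed counts. That ranking criterion and the cutoff of exactly 1000 are assumptions made for illustration, since the reply does not say how the genes should be selected.

```r
library(SingleCellExperiment)
library(matrixStats)

# `sce` comes from the original poster's session.
n_top <- 1000  # "the thousand or so" genes mentioned above

# Variance of log-transformed counts for each gene (one gene per row).
log_counts <- log1p(as.matrix(counts(sce)))
gene_vars <- rowVars(log_counts)

# Keep the n_top most variable genes (or all genes if there are fewer).
keep <- order(gene_vars, decreasing = TRUE)[seq_len(min(n_top, nrow(sce)))]
sce_hvg <- sce[keep, ]
```

The filtered object `sce_hvg` (a hypothetical name) would then be passed to zinbFit in place of `sce`.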

Finally, if nothing works, try running zinbwave in serial mode. You can do that by adding the option BPPARAM=BiocParallel::SerialParam(). This should give you a more informative error if it fails. And if it doesn't fail, we will know that it's a parallelization issue.
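Applied to the call shown in the traceback, the serial run would look roughly like this (again, `sce` and `nlabels` are the objects from the original post):

```r
library(zinbwave)

# Run without parallel workers so that any error is raised directly
# in the main R process instead of inside a worker connection.
fit <- zinbFit(counts(sce), K = nlabels,
               BPPARAM = BiocParallel::SerialParam())
```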

Hope this helps.

Answer 3 · 2018-07-27T20:21:28.000Z

Hi,

I also wanted to run zinbwave on a cluster. However, I am having trouble installing the packages zinbwave, scRNAseq, and SummarizedExperiment. When I ran biocLite(""), it appeared to install. But when I ran library(""), I got an error message saying "no such package".

Answer 4 · 2018-08-14T21:07:46.000Z

Hi @rtian30 ,

When you say biocLite("") and library(""), do you mean biocLite("zinbwave") and library(zinbwave)? Are you sure that there wasn't an error or a warning from biocLite? If library() says there is no such package, it means the package could not be installed.
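A quick way to check whether the packages actually ended up in a library that R can see is sketched below, using only base R. The package names are the three mentioned in the previous comment.

```r
pkgs <- c("zinbwave", "scRNAseq", "SummarizedExperiment")

# Library directories that this R session searches for packages.
.libPaths()

# TRUE for each package found in those directories, FALSE otherwise.
installed <- pkgs %in% rownames(installed.packages())
setNames(installed, pkgs)
```

If a package shows FALSE here, the biocLite output from the installation attempt is the place to look for the underlying error or warning.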