gagneurlab/OUTRIDER

controlForConfounders uses as much memory as available on machine


Thank you for this very helpful package.

I've noticed this issue when moving my workflow over to a high-performance computing cluster. Specifically, I run the exact same code that runs locally (on a machine with 8G of RAM), but whenever I run it on the cluster, it gets killed by the job manager for using too much memory (even when I provide 100G of RAM). In my most recent attempt, the code used 215.965G of memory when I ran the controlForConfounders function; just before running that function, my memory usage was only 3.47G. The code only completes successfully on this cluster when I max out the memory on the node.

I have tried using the "ulimit -v" setting in bash before starting R to control the memory usage, but this does not change the behavior.
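
For reference, a minimal sketch of how I snapshot R's own memory usage around the call (note that base R's gc() only sees the master R process, not forked workers, so it may miss most of the blow-up):

gc(reset = TRUE)                    # reset the "max used" counters
ods <- controlForConfounders(ods)   # the step that blows up on the cluster
gc()                                # "max used" column shows the peak since reset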

Please let me know if I can provide any further information:

sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.10 (Santiago)

Matrix products: default
BLAS/LAPACK: /~/miniconda/installDir/lib/R/lib/libRblas.so

locale:
[1] C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] OUTRIDER_0.99.31 data.table_1.11.8
[3] SummarizedExperiment_1.10.1 DelayedArray_0.6.6
[5] matrixStats_0.54.0 GenomicFeatures_1.32.3
[7] AnnotationDbi_1.42.1 Biobase_2.40.0
[9] GenomicRanges_1.32.7 GenomeInfoDb_1.16.0
[11] IRanges_2.14.12 S4Vectors_0.18.3
[13] BiocGenerics_0.26.0 BiocParallel_1.14.2

loaded via a namespace (and not attached):
[1] bitops_1.0-6 bit64_0.9-7 RColorBrewer_1.1-2
[4] progress_1.2.0 PRROC_1.3.1 httr_1.3.1
[7] tools_3.5.1 backports_1.1.2 R6_2.3.0
[10] KernSmooth_2.23-15 rpart_4.1-13 Hmisc_4.1-1
[13] DBI_1.0.0 lazyeval_0.2.1 colorspace_1.3-2
[16] nnet_7.3-12 tidyselect_0.2.4 gridExtra_2.3
[19] prettyunits_1.0.2 DESeq2_1.20.0 bit_1.1-14
[22] compiler_3.5.1 htmlTable_1.12 plotly_4.8.0
[25] rtracklayer_1.40.6 caTools_1.17.1.1 scales_1.0.0
[28] checkmate_1.8.5 genefilter_1.62.0 stringr_1.3.1
[31] digest_0.6.17 Rsamtools_1.32.3 foreign_0.8-71
[34] XVector_0.20.0 base64enc_0.1-3 pkgconfig_2.0.2
[37] htmltools_0.3.6 htmlwidgets_1.3 rlang_0.2.2
[40] rstudioapi_0.8 RSQLite_2.1.1 BBmisc_1.11
[43] bindr_0.1.1 jsonlite_1.5 gtools_3.8.1
[46] acepack_1.4.1 dplyr_0.7.6 RCurl_1.95-4.11
[49] magrittr_1.5 GenomeInfoDbData_1.1.0 Formula_1.2-3
[52] Matrix_1.2-14 Rcpp_0.12.19 munsell_0.5.0
[55] reticulate_1.10 stringi_1.2.4 zlibbioc_1.26.0
[58] gplots_3.0.1 plyr_1.8.4 grid_3.5.1
[61] blob_1.1.1 gdata_2.18.0 crayon_1.3.4
[64] lattice_0.20-35 Biostrings_2.48.0 splines_3.5.1
[67] annotate_1.58.0 hms_0.4.2 locfit_1.5-9.1
[70] knitr_1.20 pillar_1.3.0 ggpubr_0.1.8
[73] geneplotter_1.58.0 biomaRt_2.36.1 XML_3.98-1.12
[76] glue_1.3.0 latticeExtra_0.6-28 pcaMethods_1.72.0
[79] tidyr_0.8.1 gtable_0.2.0 purrr_0.2.5
[82] assertthat_0.2.0 ggplot2_3.0.0 xtable_1.8-3
[85] viridisLite_0.3.0 survival_2.42-6 tibble_1.4.2
[88] GenomicAlignments_1.16.0 memoise_1.1.0 bindrcpp_0.2.2
[91] cluster_2.0.7-1

Hi Garret,

Thanks for trying it out.

Did you pass a BPPARAM object?
If you didn't, it will start as many workers as there are cores on your machine.
This can result in massive overhead due to the way MulticoreParam works.
A smaller number of workers will likely finish in the same time with less overhead.

Please try initializing BPPARAM before calling the function.

BPPARAM <- MulticoreParam(workers = 20)
ods <- controlForConfounders(ods, BPPARAM = BPPARAM)
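
If it helps, here is a small sketch (using BiocParallel's multicoreWorkers() and register()) to inspect the default worker count and set a session-wide backend:

library(BiocParallel)

# number of workers BiocParallel picks by default on this machine
multicoreWorkers()

# register a smaller backend once; every BiocParallel-aware call
# in the session then uses it without an explicit BPPARAM
register(MulticoreParam(workers = 4))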

Thank you for your prompt reply. I get the same problem regardless of my BPPARAM worker count; I tried workers=1, 2, and 20.

For the run with 20 workers, I went ahead and requested 400G of RAM so the job could complete successfully and give a complete profile of the RAM usage. The code did complete correctly, but the peak RAM utilization was 232 gigabytes. I know this much is not required, since I ran the same data through the same code on my local Windows desktop with 8G of RAM. For your testing purposes: I am using Anaconda's latest (non-Microsoft) version of R, and I use the following to ensure that the BLAS libraries do not use too many resources (I actually find that Anaconda's R sometimes segfaults when it hits multithreaded BLAS libraries):

library(RhpcBLASctl)
blas_set_num_threads(1)   # force single-threaded BLAS

Otherwise, I am pretty much just following the vignette with my own data.

EDIT: Also, this means my title for this issue is incorrect; the machine had more than 400G of RAM and only 232G was used. Feel free to edit the issue title if you want.

Hey Garrett,

I looked into this issue and found that MulticoreParam keeps accumulating memory. I guess this is because the machine has too much free space and does not see the need for garbage collection.

To speed up the autoencoder fit, I start the cluster with bpstart(BPPARAM). This keeps the threads alive over multiple parallelized steps.
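
For illustration, the reuse pattern looks roughly like this (a minimal BiocParallel sketch, not OUTRIDER's actual code):

library(BiocParallel)

bp <- MulticoreParam(workers = 4)
bpstart(bp)                                  # start the workers once
res1 <- bplapply(1:8, sqrt, BPPARAM = bp)    # reuse them here ...
res2 <- bplapply(1:8, log, BPPARAM = bp)     # ... and here
bpstop(bp)                                   # shut them down to free their memory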

I created a bioc-dev branch with a switch to not keep the cluster running. This will reduce the speed a bit, but it should be the solution to the memory leak.

Could you install OUTRIDER from GitHub and run your pipeline with the new parameter startBPPARAM=FALSE?

devtools::install_github("gagneurlab/OUTRIDER", ref="bioc-dev", force=TRUE)

# quick way
ods <- makeExampleOutriderDataSet()
ods <- OUTRIDER(ods, startBPPARAM=FALSE, BPPARAM=MulticoreParam(20))

# manual way
ods <- estimateSizeFactors(ods)
ods <- controlForConfounders(ods, startBPPARAM=FALSE, BPPARAM=MulticoreParam(20))

This should resolve your memory problem. If you watch the threads with htop on the cluster, you can see how they accumulate memory and don't free it. But with startBPPARAM=FALSE you always get a fresh thread and therefore no accumulation. I guess this has to be addressed in MulticoreParam and not in OUTRIDER.

Thank you for your help here. This did work, although I was going the manual route, which required changes to the subsequent function calls as well. This is what ended up working for me:

ods <- controlForConfounders(ods, startBPPARAM=FALSE, BPPARAM=MulticoreParam(1))
ods <- fit(ods, BPPARAM=MulticoreParam(1))
ods <- computePvalues(ods, BPPARAM=MulticoreParam(1))
ods <- computeZscores(ods)
res <- results(ods, all=TRUE)

Running fit or computePvalues without the BPPARAM argument would run into the memory issue at those steps. I guess we will keep our fingers crossed that MulticoreParam gets this corrected.

Thanks again!
Garrett

Sorry to ask a potentially unrelated question (I will open a new issue if you prefer), but I had a question about your experience with results changing as the cohort grows. We had a cohort of ~50 samples, all processed using the same chemistry/protocol/etc., and ran OUTRIDER. We then added ~10 more samples, and the existing results for two of our previous samples changed dramatically. For example, one sample had no genes with FDR<0.05 & |Z-score|>3 in the first round, and now it has over 500 genes meeting those criteria. Have you ever seen this before? I would value any thoughts you might have. I saw the downsampling analysis in your paper, so maybe this is just due to an increase in power, but it seems odd to go from no outliers detected to >500 when our cohort size did not change dramatically.

Also do you suggest using "sampleExclusionMask" to exclude these dramatic outlier samples from the analyses? I initially thought this might make sense, but when I exclude these samples and run OUTRIDER, I get even more samples with a large number of aberrant genes identified. So for now I only use "sampleExclusionMask" to mask out any samples that are from related individuals (e.g., we have a few trios).

@WGarrettJenkinson
I guess it's too late for a response anyway, but for completeness I will answer as much as I can and close this one. If you need more information on this, please open a new issue.

We did see some changes when playing around with the number of samples, but never such dramatic changes (from 0 to >500 outliers). What we do see is a higher number of outliers as the number of samples increases, which is due to the increase in power.
Without the data in front of me, it is hard to judge where the problem comes from. I would compare the volcano plots of each run to see how the values change. Maybe also compare the Z-scores/p-values from both runs to see whether they just increase or change randomly: the former would indicate an increase in power/sensitivity, the latter would suggest problems with the model.
From our experience, we only suggest using sampleExclusionMask on replicates, or on samples with the same variant/gene, to increase the power to detect the same event again. Having more samples to estimate the underlying distribution is generally helpful; in our experience, even extreme samples improve the performance.
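
For the Z-score/p-value comparison suggested above, a quick hypothetical sketch (ods_50, ods_60, and "SAMPLE_1" are placeholders for the two fitted data sets and the affected sample):

library(OUTRIDER)

# ods_50 / ods_60: hypothetical fitted OutriderDataSets from both runs
common <- intersect(rownames(ods_50), rownames(ods_60))
samp <- "SAMPLE_1"   # hypothetical ID of the affected sample

plot(-log10(pValue(ods_50)[common, samp]),
     -log10(pValue(ods_60)[common, samp]),
     xlab="-log10 P (50 samples)", ylab="-log10 P (60 samples)")
abline(0, 1, lty=2)   # genes far off the diagonal changed the most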

Hope this helps, even with such a delay.

Best,
Christian

Hello! Sorry to wake an old thread, but did the hotfix ever make it into master? We need it too. Or was the code reworked so the issue no longer arises? Thanks.

We did not push this hotfix into the master branch, as the issue did not come up again. Also, I'm not sure whether this should rather be raised directly with BiocParallel on Bioconductor, or with the backend you are using, or whether it is an OUTRIDER-only problem.

Also, as I understood it, Garrett just limited the number of threads to 1, which disables parallel computing altogether.

Does your job survive if you register or pass a SerialParam object? How big is your matrix?

Since I'm not really sure whether the hotfix works, as Garrett did not try it, it's up to you. So @lsmainzer, if you want to give it a try, I could set up a branch with it.

Hi, sorry if it wasn't clear: my comment on Dec 18, 2018 indicated that I had tested the fix and that it worked when I modified my function calls as listed in that comment (using just 1 core, as you note). As a matter of fact, we have stayed on that old dev commit because we knew it was tested and working, but it would be nice to have this on master so we could move off the old installation. I do not recall testing SerialParam, so I too would be curious whether that works for @lsmainzer (which would save me setting up a new installation off of master to test it).

Thanks @WGarrettJenkinson for chipping in on this old thread. As you mentioned, you used only 1 thread, e.g. MulticoreParam(1), so you basically used SerialParam as the backend and turned multithreading off. Hence your code ran successfully, but it did not test the real problem of the memory peaks/leaks.

From the MulticoreParam documentation, this is also true on Linux with a single worker:

When MulticoreParam is created/used in Windows it defaults to SerialParam which is the equivalent of using a single worker.

So in theory you should be fine switching to the newest version and just running register(SerialParam()) before any work with OUTRIDER.
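
That is, something along these lines:

library(BiocParallel)
library(OUTRIDER)

register(SerialParam())   # make SerialParam the session-wide default backend
ods <- OUTRIDER(ods)      # no BPPARAM needed; the registered backend is used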

Just to see whether the cluster re-initiation fixes the memory issue, could you monitor your memory consumption while increasing the thread count?

# run it for different numbers of threads
numThreads <- 3
ods <- controlForConfounders(ods, startBPPARAM=FALSE, BPPARAM=MulticoreParam(numThreads))
ods <- fit(ods, BPPARAM=MulticoreParam(numThreads))
ods <- computePvalues(ods, BPPARAM=MulticoreParam(numThreads))
ods <- computeZscores(ods)
res <- results(ods, all=TRUE)

Thanks for the follow-up and information. I just ran your suggested code with three threads on my installation with the patch from this issue, and my memory utilization never exceeded 8GB. So I think it is safe to say that the fix from this issue is tested and working on our system. Let me know if I can provide any further information.

If I get a chance, I will try your suggested register(SerialParam()) on a new installation, but that will take more time/effort, since I need a parallel R installation with the latest OUTRIDER (I don't want to mess with our current working installation).

As an update, I tested register(SerialParam()) and it worked with the newest release of OUTRIDER (memory never went above 3GB). So we will plan to move to the newer implementation, which did give slightly different results from the older version. I checked, and the concordance was strong ~99% of the time, although there were a few isolated cases of large discordance; I suppose that is to be expected with an autoencoder and the corresponding optimization algorithm.


Just wanted to give some quick feedback here as well:
I experienced the same problem on my HPC environment.
However, explicitly setting ods <- OUTRIDER(ods, BPPARAM=SerialParam()) fixed the issue for me.

I did not try your hotfix version, though...