morrowcj/remotePARTS

fitGLS_* max out CPU use on Linux

Closed this issue · 3 comments

When running fitGLS_opt on Linux, CPU use is maxed out (on an application server with 64 cores), or more than 80 CPUs are in use (on an application server with 112 cores). The issue already occurs when computing an example from the vignette (3,000 data points), and the computation succeeds only on the 112-core server, where it takes almost an hour. For comparison, the very same vignette is computed on a Windows machine (64 cores, 1023 GB RAM) in less than 5 minutes using <3% CPU.
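As a side note, part of the CPU saturation might come from implicit BLAS/OpenMP multithreading rather than from remotePARTS itself. Below is a minimal sketch of capping those thread pools at the start of the session, assuming the optional RhpcBLASctl package is installed; this is a generic workaround, not a remotePARTS feature.

## Generic workaround sketch (not part of remotePARTS): cap implicit
## BLAS/OpenMP threading before running the heavy fits.
Sys.setenv(OMP_NUM_THREADS = "4", OPENBLAS_NUM_THREADS = "4")

if (requireNamespace("RhpcBLASctl", quietly = TRUE)) {
  RhpcBLASctl::blas_set_num_threads(4)  # cap BLAS threads
  RhpcBLASctl::omp_set_num_threads(4)   # cap OpenMP threads
}

## ...then run fitGLS_opt() / the vignette example as usual.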

Furthermore, when the ncores argument is passed to fitGLS_partition, it is ignored on Linux and all available cores are used (not even all minus 1). For bigger datasets (~240,000 pixels after taking every 3rd pixel and excluding NAs) this leads to the following error:

tryCatch(parallel:::.workRSOCK,error=function(e)parallel:::.slaveRSOCK)() --args MASTER=localhost PORT=11509 OUT=/dev/null SETUPTIMEOUT=120 TIMEOUT=259200 XDR=TRUE SETUPSTRATEGY=parallel

The following arguments are passed to the fitGLS_partition function (though the same error is observed with some tweaks to the passed arguments):

fitGLS_partition(formula = Est ~ 1,
                 # formula0 = Est ~ 1,
                 data = gvDM,
                 partmat = GV.pm,
                 covar_FUN = "covar_exp",
                 covar.pars = list(range = 4.805467),
                 distm_FUN = "distm_km",
                 ncross = 5,
                 ncores = 20,
                 debug = TRUE, # FALSE
                 progressbar = TRUE)

remotePARTS is run in a Docker container created from this image: https://github.com/erfea/R.Docker4remotePARTS. The exact specification of the system architecture and setup is available here: https://github.com/morrowcj/remotePARTS/files/11434853/serverSetup_remotePARTS.txt

Limiting the number of cores available to the container leads to the use of all assigned cores and the same error on each node.
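For reference, this is the generic check I use to see how many cores the R session actually detects inside the container; it assumes the optional parallelly package, and as far as I understand parallel::detectCores() reports the host's cores regardless of cgroup limits, while parallelly::availableCores() respects them.

## Generic diagnostic, not remotePARTS-specific (parallelly is an optional package):
parallel::detectCores()        # host cores; typically ignores container cgroup limits
parallelly::availableCores()   # respects cgroup / scheduler limits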

At the same time, in the Windows environment, where limiting CPU use works well, the same bigger dataset generates the following error: Error in unserialize(socklist[[n]]) : error reading from connection (with the same fitGLS_partition setup as above).
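As far as I understand, that unserialize() error usually means a PSOCK worker died or its connection broke (often because a worker ran out of memory on a large partition). A minimal, remotePARTS-independent sanity check that a cluster of that size can be created and reached at all:

## Generic PSOCK sanity check, independent of remotePARTS:
cl <- parallel::makeCluster(20)                    # same worker count as the ncores I pass
print(parallel::clusterEvalQ(cl, Sys.getpid()))    # reach every worker once
parallel::stopCluster(cl)                          # tear the workers down again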

I am wondering whether anyone else has come across similar behaviour.

I am happy to share the data and run some more tests if needed.
Thank you and cheers!
Kasia

I ran some more tests on Win and Linux.

On both operating systems I am able to successfully execute fitGLS_partition when no multicore argument is passed.
e.g.,

GV.GLS <- fitGLS_partition(formula = Est ~ 1,
                           data = gvDM,
                           distm_FUN = "distm_km",
                           partmat = GV.pm,
                           covar_FUN = "covar_exppow",
                           covar.pars = list(range = 4.805467, shape = 0.4),
                           ncross = 5,
                           nugget = NA,
                           debug = TRUE) # FALSE

Sure, the CPU use spikes at times, but on Linux I can cap it within a Docker container. Not ideal, but it does the job :)

When I try to recreate an example from the MC_GLSpart {remotePARTS} documentation (with my data)

fitGLS_partition(formula = Est ~ 1, partmat = GV.pm, data = gvDM, nugget = 0,
                 ncores = 20, parallel = TRUE, debug = FALSE)

on Win I get the following error:

Error in array(NA, dim = c(npairs, p, p), dimnames = list(NULL, names(object[[1]]$partGLS$coefficients),  : 
length of 'dimnames' [2] not equal to array extent
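For what it is worth, that message is a generic base-R complaint that the supplied dimnames do not match the array dimensions, so it looks like the coefficient names collected from the partitions have a different length than expected. A standalone reproduction of the same message, just to illustrate what R is checking (nothing to do with my data):

## Standalone illustration of the same base-R error (not remotePARTS-specific):
array(NA, dim = c(2, 3, 3),
      dimnames = list(NULL, c("a", "b"), NULL))
## Error in array(...) : length of 'dimnames' [2] not equal to array extent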

Whereas on Linux a different error is generated for the parallel processes (shown in the attached screenshot).

Even when the R process is aborted and R is quit, the CPUs on Linux stay busy until the respective Docker container is killed.
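A generic pattern that avoids orphaned PSOCK workers in user code is shown below; it obviously does not help when the cluster is created inside the package, so treat it as a sketch for reference only (run_parallel is a hypothetical helper, not part of remotePARTS).

## Generic sketch: make sure workers cannot outlive the call in my own code.
run_parallel <- function(X, FUN, ncores = 2) {
  cl <- parallel::makeCluster(ncores)
  on.exit(parallel::stopCluster(cl), add = TRUE)  # workers are stopped even on error
  parallel::parLapply(cl, X, FUN)
}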

I will look into it a bit more. Just wanted to give you an update.

Thank you and cheers,
k

@Erfea, I was able to identify a couple of bugs whose fixes may solve this issue:

  1. fitGLS_opt() did not have a multicore option at all. I have added one.
  2. fitGLS_partition() had a few bugs where all available cores would be used for a particular task. The result was that ncores was not respected for all operations. I believe I have fixed that.

I tested on 3 different Ubuntu versions, Mac, and Windows, so I don't think it is an OS-specific problem.

These problems are addressed by PR #21. I'm going to close this issue for now, but we can reopen it if you find additional errors or if you find that this patch does not solve it.

@morrowcj Thank you for pushing all the updates! Indeed, fitGLS_partition() now works without any error. It still maxes out the CPU for a split second every now and then, but that is something I have been observing across many packages.
Awesome! Once more, thank you for all your hard work.
P.S. I am updating the Docker image to r-base:4.3.1. Once done, it will be available in my GitHub repo and as kelewinska/rparts on Docker Hub.