Can't register backend after upgrading to 0.9.0
bhayete-empress opened this issue · 3 comments
When I run
register_dopar_cmq(
  n_jobs = parallelJobs,
  fail_on_error = FALSE,
  verbose = TRUE,
  log_worker = TRUE,
  timeout = self$timeout,       # how long to wait on MQ side
  template = list(
    timeout = self$timeout,     # how long to wait on SLURM side
    memory = self$memReq,       # the amount of memory per node
    partition = self$partition, # the slurm partition to use
    cores = self$nCores,        # how many cores to use per job
    r_path = r_path,            # set the R path for parallel jobs
    max_calls_worker = 1
  )
)
I get the following error:
Error in fill_template(private$template, opts, required = c("master", :
Template values required but not provided: partition, timeout, r_path
All of these values are set and caused no problems before the upgrade. I upgraded because the older version didn't support max_calls_worker (bug 110, which I also ran into). Now it doesn't even register the backend. Even if I leave max_calls_worker at its default in the template file, I still get this error, i.e., I can no longer run clustermq. The only changes I made were adding max_calls_worker to the template file and to the template argument of register_dopar_cmq, and upgrading the package to 0.9.0 from GitHub. What might be going on?
For reference, my template file looks like this:
#!/bin/bash
#SBATCH --job-name={{ job_name }}
#SBATCH --output={{ log_file | /dev/null }}
#SBATCH --error={{ log_file | /dev/null }}
#SBATCH --mem-per-cpu={{ memory | 15000 }}
#SBATCH --partition={{ partition }} #intentionally no default - be cognizant of where you are running!
#SBATCH --array=1-{{ n_jobs }}
#SBATCH --cpus-per-task={{ cores | 1 }}
#SBATCH --time={{ timeout }}
#SBATCH --max_calls_worker={{ max_calls_worker | 1 }} #refresh to avoid stalls, as in https://github.com/mschubert/clustermq/issues/110
##SBATCH --log_file="/path/to.file.%a"
CMQ_AUTH={{ auth }} {{ r_path }} --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'
Here's a fully-reproducible example using the same template file as above:
library(clustermq)
library(foreach)
options(
  clustermq.scheduler = "SLURM",
  clustermq.template = '/fsx/home/bhayete/Projects/scPipeline/inst/templates/slurmMq.tmpl',
  clustermq.data.warning = 5000 # megabytes
)
r_path <- file.path(R.home("bin"), "R") # set R to the binary path in R.home()
register_dopar_cmq(
  n_jobs = 5,
  fail_on_error = FALSE,
  verbose = TRUE,
  log_worker = TRUE,
  template = list(
    partition = 'compute-spot', # the slurm partition to use
    timeout = 100,
    memory = 100,
    cores = 1,                  # how many cores to use per job
    r_path = r_path,            # set the R path for parallel jobs
    max_calls_worker = 1
  )
)
results <- foreach(
  i = 1:100,
  .packages = c((.packages())),
  .export = c(ls())
) %dopar% {
  return(rnorm(1))
}
I was able to further simplify my sample code. The issue seems to be that the template argument to register_dopar_cmq and hence to the Q function is not parsed and passed onwards.
library(foreach)
library(clustermq)
options(
  clustermq.scheduler = "SLURM",
  clustermq.template = '/fsx/home/bhayete/Projects/scPipeline/inst/templates/slurmMq.2.tmpl',
  clustermq.data.warning = 5000 # megabytes
)
clustermq::register_dopar_cmq(
  n_jobs = 2,
  memory = 1024,
  template = list(timeout = 100)
) # this accepts same arguments as `Q`
x = foreach(i=1:300) %dopar% sqrt(i) # this will be executed as jobs
This gives the error:
> x = foreach(i=1:300) %dopar% sqrt(i) # this will be executed as jobs
Error in fill_template(private$template, opts, required = c("master", :
Template values required but not provided: timeout, r_path
Filling template values
This was caused by the workers function not passing the template argument to Pool$add. It is now fixed in the current git version (see linked commits).
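To make the failure mode concrete, here is a minimal base-R sketch of how a required `{{ key }}` placeholder ends up unfilled when a wrapper drops the template list. Everything in it is an illustrative assumption: `fill_template_sketch`, `workers_buggy` and `workers_fixed` are hypothetical names, and the logic is a simplified stand-in, not clustermq's actual implementation.

```r
# Hypothetical stand-in for template filling; NOT clustermq's actual code.
fill_template_sketch <- function(template, values) {
  # Locate every {{ key }} or {{ key | default }} placeholder.
  hits <- unique(regmatches(template, gregexpr("\\{\\{[^}]*\\}\\}", template))[[1]])
  missing <- character(0)
  for (hit in hits) {
    inner <- substr(hit, 3, nchar(hit) - 2)            # strip the braces
    parts <- trimws(strsplit(inner, "|", fixed = TRUE)[[1]])
    key <- parts[1]
    if (!is.null(values[[key]])) {
      value <- as.character(values[[key]])              # explicitly supplied
    } else if (length(parts) > 1) {
      value <- parts[2]                                 # fall back to the default
    } else {
      missing <- c(missing, key)                        # required but absent
      next
    }
    template <- gsub(hit, value, template, fixed = TRUE)
  }
  if (length(missing) > 0) {
    stop("Template values required but not provided: ",
         paste(missing, collapse = ", "))
  }
  template
}

# Two hypothetical wrappers: one forgets to forward `template`, one forwards it.
workers_buggy <- function(tmpl, template = list()) {
  fill_template_sketch(tmpl, list())       # `template` is silently dropped
}
workers_fixed <- function(tmpl, template = list()) {
  fill_template_sketch(tmpl, template)     # `template` is passed onwards
}

# A shortened template in the style of the one posted above.
tmpl <- paste(
  "#SBATCH --mem-per-cpu={{ memory | 15000 }}",
  "#SBATCH --partition={{ partition }}",
  "#SBATCH --time={{ timeout }}",
  "{{ r_path }} --no-save --no-restore",
  sep = "\n"
)
opts <- list(partition = "compute-spot", timeout = 100, r_path = "/usr/bin/R")

# The buggy wrapper fails even though all values were supplied by the caller.
err <- tryCatch(workers_buggy(tmpl, template = opts),
                error = function(e) conditionMessage(e))
cat(err, "\n")

# The fixed wrapper fills required values and keeps defaults for the rest.
filled <- workers_fixed(tmpl, template = opts)
cat(filled, "\n")
```

In this sketch, placeholders with a default (such as memory) never trigger the error, which is consistent with only the default-less values being listed in the message reported above.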
Using max_calls_worker (--> see #322)
> For reference, my template file looks like this:
> #SBATCH --max_calls_worker={{ max_calls_worker | 1 }}
max_calls_worker should be an argument to Q, not a template argument (this is handled by the clustermq package; Slurm does not know anything about workers and their calls).
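As a rough illustration of that separation, the sketch below splits a mixed argument list into Q-level arguments and scheduler template values. `split_cmq_args` is a hypothetical helper written for this example, not part of clustermq, and the list of Q-level names only contains the one argument discussed in this thread. The corresponding `#SBATCH --max_calls_worker=...` line would also be dropped from the template file.

```r
# Hypothetical helper: separate Q-level arguments from scheduler template values.
split_cmq_args <- function(args, q_level = c("max_calls_worker")) {
  is_q <- names(args) %in% q_level
  list(q_args = args[is_q], template = args[!is_q])
}

# The values from the reproducible example, with max_calls_worker mixed in.
mixed <- list(
  partition = "compute-spot",
  timeout = 100,
  memory = 100,
  cores = 1,
  max_calls_worker = 1
)
parts <- split_cmq_args(mixed)
str(parts)

# With clustermq and foreach loaded, the call would then look roughly like this
# (left commented out because it needs a configured scheduler):
# do.call(clustermq::register_dopar_cmq,
#         c(list(n_jobs = 5), parts$q_args, list(template = parts$template)))
```

This only shows where each value is meant to go; whether max_calls_worker is honored depends on the installed version.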
The following should work, but is broken with the current CRAN and git version:
table(unlist(clustermq::Q(function(x) { Sys.sleep(x==1); Sys.getpid() }, x=1:4, n_jobs=2)))
# workers do 1 and 3 tasks, respectively
table(unlist(clustermq::Q(function(x) { Sys.sleep(x==1); Sys.getpid() }, x=1:4, n_jobs=2, max_calls_worker=2)))
# both workers should do 2 tasks: >> BROKEN ON 0.9.0