mschubert/clustermq

Can't register backend after upgrading to 0.9.0

bhayete-empress opened this issue · 3 comments

When I run:

  register_dopar_cmq(
    n_jobs = parallelJobs,
    fail_on_error = FALSE,
    verbose = TRUE,
    log_worker = TRUE,
    timeout = self$timeout, # how long to wait on MQ side
    template = list(
      timeout = self$timeout, # how long to wait on SLURM side
      memory = self$memReq, # the amount of memory per node
      partition = self$partition, # the slurm partition to use
      cores = self$nCores, # how many cores to use per job
      r_path = r_path, # set the R path for parallel jobs
      max_calls_worker = 1
    )
  )

I get the following error:

Error in fill_template(private$template, opts, required = c("master", :
Template values required but not provided: partition, timeout, r_path

All of these values are set and caused no problems before the upgrade. I upgraded because the older version didn't support max_calls_worker (bug 110, which I also ran into). Now the backend doesn't even register. Even if I leave max_calls_worker at its default in the template file, I still get this error, i.e., I can no longer run clustermq. The only changes I've made are adding max_calls_worker to the template file and to the template argument of register_dopar_cmq, and upgrading the package to 0.9.0 from GitHub. What might be going on?

For reference, my template file looks like this:

#!/bin/bash

#SBATCH --job-name={{ job_name }}
#SBATCH --output={{ log_file | /dev/null }}
#SBATCH --error={{ log_file | /dev/null }}
#SBATCH --mem-per-cpu={{ memory | 15000 }}
#SBATCH --partition={{ partition }} #intentionally no default - be cognizant of where you are running!
#SBATCH --array=1-{{ n_jobs }}
#SBATCH --cpus-per-task={{ cores | 1 }}
#SBATCH --time={{ timeout }}
#SBATCH --max_calls_worker={{ max_calls_worker | 1 }} #refresh to avoid stalls, as in https://github.com/mschubert/clustermq/issues/110
##SBATCH --log_file="/path/to.file.%a"

CMQ_AUTH={{ auth }} {{ r_path }} --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

Here's a fully reproducible example using the same template file as above:

library(clustermq)
library(foreach)
options(
  clustermq.scheduler = "SLURM",
  clustermq.template = '/fsx/home/bhayete/Projects/scPipeline/inst/templates/slurmMq.tmpl',
  clustermq.data.warning = 5000 # megabytes
)
# set R to the binary path in R.home()
r_path <- file.path(R.home("bin"), "R")
register_dopar_cmq(
  n_jobs = 5,
  fail_on_error = FALSE,
  verbose = TRUE,
  log_worker = TRUE,
  template = list(
    partition = 'compute-spot', # the slurm partition to use
    timeout = 100,      
    memory = 100,      
    cores = 1, # how many cores to use per job
    r_path = r_path, # set the R path for parallel jobs
    max_calls_worker = 1
  )
)
  
results <- foreach(
  i = 1:100,
  .packages = c((.packages())),
  .export = c(ls())
) %dopar% {
  return(rnorm(1))
}

I was able to further simplify my sample code. The issue seems to be that the template argument to register_dopar_cmq and hence to the Q function is not parsed and passed onwards.

library(foreach)
library(clustermq)
options(
  clustermq.scheduler = "SLURM",
  clustermq.template = '/fsx/home/bhayete/Projects/scPipeline/inst/templates/slurmMq.2.tmpl',
  clustermq.data.warning = 5000 # megabytes
)

clustermq::register_dopar_cmq(n_jobs=2, memory=1024,
                              template = list(
                                timeout = 100
                              )) # this accepts same arguments as `Q`
x = foreach(i=1:300) %dopar% sqrt(i) # this will be executed as jobs

gives the error

> x = foreach(i=1:300) %dopar% sqrt(i) # this will be executed as jobs
Error in fill_template(private$template, opts, required = c("master",  : 
  Template values required but not provided: timeout, r_path

Filling template values

This was caused by the workers function not passing the template argument to Pool$add. It is now fixed in the current git version (see linked commits).
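
To illustrate why a dropped argument surfaces as "Template values required but not provided", here is a minimal sketch in base R. It is a toy stand-in, not clustermq's actual code: fill_template_sketch, add_workers_buggy and add_workers_fixed are hypothetical names, and the three-line template is a cut-down version of the one posted above.

```r
# Toy stand-in for template filling; NOT clustermq's implementation.
fill_template_sketch <- function(template, values) {
  # Collect every {{ key }} or {{ key | default }} placeholder.
  placeholders <- unique(
    regmatches(template, gregexpr("\\{\\{[^}]*\\}\\}", template))[[1]]
  )
  missing <- character()
  for (ph in placeholders) {
    inner <- trimws(substr(ph, 3, nchar(ph) - 2))
    parts <- trimws(strsplit(inner, "|", fixed = TRUE)[[1]])
    key <- parts[1]
    if (!is.null(values[[key]])) {
      fill <- as.character(values[[key]])   # user-supplied value
    } else if (length(parts) > 1) {
      fill <- parts[2]                      # default given in the template
    } else {
      missing <- c(missing, key)            # no value and no default
      next
    }
    template <- gsub(ph, fill, template, fixed = TRUE)
  }
  if (length(missing) > 0) {
    stop("Template values required but not provided: ",
         paste(missing, collapse = ", "))
  }
  template
}

tmpl <- paste(
  "#SBATCH --partition={{ partition }}",
  "#SBATCH --time={{ timeout }}",
  "#SBATCH --cpus-per-task={{ cores | 1 }}",
  sep = "\n"
)

# Hypothetical wrappers: the buggy one forgets to forward `template`.
add_workers_buggy <- function(template = list()) fill_template_sketch(tmpl, list())
add_workers_fixed <- function(template = list()) fill_template_sketch(tmpl, template)

user_values <- list(partition = "compute-spot", timeout = 100)
err <- tryCatch(add_workers_buggy(user_values), error = conditionMessage)
filled <- add_workers_fixed(user_values)
print(err)
cat(filled, "\n")
```

In the buggy wrapper the user's values never reach the filler, so every placeholder without a default is reported as missing even though it was supplied, which matches the symptom reported above.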

Using max_calls_worker (--> see #322)

For reference, my template file looks like this:

#SBATCH --max_calls_worker={{ max_calls_worker | 1 }}

max_calls_worker should be an argument to Q, not a template argument (this is handled by the clustermq package; Slurm does not know anything about workers and their calls).
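
Applied to the example from the report, that would look roughly like the sketch below. It assumes that register_dopar_cmq forwards its arguments to Q (as the comment in the report says), that the #SBATCH --max_calls_worker line has been removed from the template file, and that the installed version handles max_calls_worker correctly. The names template_values and q_args are only for illustration.

```r
# Values consumed by the SLURM template file via {{ ... }} placeholders.
template_values <- list(
  partition = "compute-spot",
  timeout = 100,
  memory = 100,
  cores = 1,
  r_path = file.path(R.home("bin"), "R")
)

# Arguments handled by clustermq itself: max_calls_worker sits here,
# next to n_jobs, rather than inside `template`.
q_args <- list(
  n_jobs = 5,
  max_calls_worker = 1,
  template = template_values
)

# Only attempt registration where clustermq is installed and sbatch exists.
if (requireNamespace("clustermq", quietly = TRUE) && nzchar(Sys.which("sbatch"))) {
  do.call(clustermq::register_dopar_cmq, q_args)
}
```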

The following should work, but is broken with the current CRAN and git versions:

table(unlist(clustermq::Q(function(x) { Sys.sleep(x==1); Sys.getpid() }, x=1:4, n_jobs=2)))
# workers do 1 and 3 tasks, respectively
table(unlist(clustermq::Q(function(x) { Sys.sleep(x==1); Sys.getpid() }, x=1:4, n_jobs=2, max_calls_worker=2)))
# both workers should do 2 tasks: >> BROKEN ON 0.9.0