earthlab/rslurm

Trouble submitting jobs with multithreaded R function

naglemi opened this issue · 3 comments

First, let me thank you for making this package available!

I'm trying to use slurm_apply to submit jobs in which each row of parameters goes to a separate node, rather than a separate core, because the multithreading is done internally by my R function (which cannot easily be rewritten for this purpose). This means I need to set the cpus-per-task directive (the equivalent of the cpus_per_node option in slurm_apply) to the number of threads (2x the number of cores), so that each job reserves all of the threads it will use internally. The reasoning behind setting cpus-per-task this way is described in further detail under the "Multithreaded Jobs" header of the documentation here: https://researchcomputing.princeton.edu/slurm

Everything works as expected if I submit only a single job (one row of parameters). However, if I submit multiple jobs, then the number of jobs is divided over the cpus-per-task option, as described in the documentation for slurm_apply. Thus, for example, if I submit 48 rows of parameters then instead of getting 48 nodes (multithreaded internally in my code over 24 cores), I only get two nodes requested.
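
To spell out the arithmetic as I understand it (this is just my reading of the documented behavior, not the actual rslurm internals):

n_rows        <- 48   # rows of parameters submitted
cpus_per_node <- 24   # threads my function uses internally on each node
ceiling(n_rows / cpus_per_node)
# [1] 2   -> only 2 nodes are requested, instead of the 48 I was hoping for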

Here is my rslurm command. The pars data frame has 38 rows, and I wish to have each row submitted as a job to a separate node. My R code will then handle the multithreading over the cores on each node internally, as described. However, the 38 rows are being divided over the 24 threads on each node (set via the cpus_per_node option), so only two nodes are requested.

slurm_apply(MTMCSKAT_workflow,
            pars,
            jobname = opt$job_id,
            nodes = nrow(pars),
            cpus_per_node = 24,
            submit = FALSE,
            slurm_options = list(time = opt$time),
            preschedule_cores = FALSE)

If I understand correctly, this is the expected behavior, and rslurm as I have it configured does not support jobs that multithread within the R function being submitted. Is that correct?

Is there a way for me to override this behavior, or properly configure rslurm to avoid it? Please let me know if any data or further information would be helpful, or if any of this is unclear.

Thanks for your attention to this!

Hi, thanks for submitting an issue. You're correct that this isn't supported for slurm_apply as configured... at least I'm pretty sure of that! 😊 I will look into possible workarounds for you and get back to you shortly.

So, I looked more closely and confirmed that the behavior you described is hard-coded into slurm_apply and related functions. It's a beneficial feature for minimizing the number of nodes allocated to a job, but obviously in your case it's a hindrance. Sorry about that!

I will leave your issue open for the moment and hopefully when I have time I will add support for multithreading. I might have to think about the best way to do that (an optional argument passed to slurm_apply or a different function entirely).
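
Just to make the idea concrete, a purely hypothetical interface (nothing like this exists in rslurm today, and the argument name here is made up) might look something like:

slurm_apply(MTMCSKAT_workflow, pars,
            nodes = nrow(pars),
            cpus_per_node = 24,        # CPUs reserved per task, for your internal threads
            max_rows_per_node = 1)     # hypothetical argument: run only one row per node

where max_rows_per_node would decouple how many rows run concurrently on a node from how many CPUs are reserved there.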

For now I think your best bet is to write the submission script manually or modify the source code of slurm_apply locally. If you come up with a good general solution please feel free to write a pull request! Thanks again for your comment.
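
For example, here is a rough sketch of the manual route (completely untested; every path, time limit, and option is a placeholder to adapt to your cluster, and it assumes MTMCSKAT_workflow takes the columns of pars as named arguments, the way slurm_apply passes them):

# Save the parameters where the array tasks can read them
saveRDS(pars, "pars.rds")

# Worker script: each Slurm array task runs one row of pars
writeLines(c(
  '# source() or library() whatever defines MTMCSKAT_workflow here',
  'pars <- readRDS("pars.rds")',
  'i <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID"))',
  'res <- do.call(MTMCSKAT_workflow, as.list(pars[i, , drop = FALSE]))',
  'saveRDS(res, sprintf("result_%d.rds", i))'
), "worker.R")

# Submission script: one task per node, with all 24 CPUs reserved for that task
writeLines(c(
  "#!/bin/bash",
  "#SBATCH --nodes=1",
  "#SBATCH --ntasks=1",
  "#SBATCH --cpus-per-task=24",
  "#SBATCH --time=01:00:00",   # placeholder; use whatever opt$time holds
  sprintf("#SBATCH --array=1-%d", nrow(pars)),
  "Rscript worker.R"
), "submit.sh")

system("sbatch submit.sh")

That gives you nrow(pars) independent array tasks, each with a full node's worth of CPUs available for your internal multithreading.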

Just revisiting this... ultimately, the functions parallel::mclapply and parallel::mcmapply underlie slurm_map and slurm_apply, respectively. The developers of parallel strongly discourage using these functions with third-party multithreaded code such as yours; the rationale is described in the documentation for parallel::mcfork.

So, because using the parallel package with functions that themselves implement multithreading is discouraged, I won't be implementing a fix for this in rslurm, sorry about that! Again, I'm sure it's possible to work around this locally, but do that at your own risk ⚠️ 😆 good luck!