earthlab/rslurm

Issue with mcmapply

mhesselbarth opened this issue · 1 comments

I keep running into an issue where not all params submitted to the Slurm scheduler return a result. I am trying to run a function on about 600 different input parameters. While most run without an error, some seem to fail. However, when running locally, the parameters don't seem to cause an issue.

I use the following options in the slurm_apply call (besides others that I don't think are relevant here):

nodes = 50, cpus_per_node = 10, processes_per_node = 10,
preschedule_cores = FALSE, job_array_task_limit = NULL,
slurm_options = list("partition" = "generic", "mem-per-cpu" = "5G")

My understanding is that 50 cluster nodes are used, each with 10 cores, and that each core calculates one row of the params data.frame.
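To make the split concrete, here is a rough base-R sketch of how 600 rows would be divided under these settings. It assumes an even, contiguous split of rows across nodes, which is my reading of how rslurm chunks params and not something confirmed in this thread. rslurm itself is not called, and the variable names are only illustrative.

```r
# Sketch only: mimics the row split with base R under the assumption of
# an even, contiguous split; rslurm is not called here.
n_params <- 600        # rows in the params data.frame (from the post)
nodes <- 50            # nodes argument
cpus_per_node <- 10    # cpus_per_node argument

# Assumed chunking: each node receives one contiguous block of rows.
chunk_size <- ceiling(n_params / nodes)
chunk_id <- ceiling(seq_len(n_params) / chunk_size)
rows_per_node <- as.integer(table(chunk_id))

print(length(rows_per_node))   # number of chunks
print(unique(rows_per_node))   # rows handled by each node

# Rows per node beyond the number of cores must wait for a free core.
rows_waiting <- max(0, chunk_size - cpus_per_node)
print(rows_waiting)
```

Under that assumption each node handles 12 rows with 10 worker processes, so a core does not map to exactly one row: some rows only start once an earlier one has finished.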

However, in some slurm_*.out files I see the following line: "1 parallel function call did not deliver a result". These seem to correspond to the missing rows. Interestingly, none of the jobs fail; they all have the status COMPLETED.
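One way to pin down which rows are affected is to look for empty entries in the collected results. The sketch below assumes the results are available as a list with one element per params row (for example what get_slurm_out returns with outtype = "raw") and that a call whose worker died shows up as NULL. The list here is a made-up stand-in, not real output.

```r
# Stand-in for a raw results list; the values are invented for illustration.
# Assumption: a call that did not deliver a result appears as NULL.
results <- list(1.5, NULL, 2.7, NULL, 3.1)

# Indices of entries without a result; these map back to rows of params.
missing_idx <- which(vapply(results, is.null, logical(1)))
print(missing_idx)

# Count of calls that did deliver a result.
n_ok <- length(results) - length(missing_idx)
print(n_ok)
```

Note that this cannot distinguish a lost result from a function that legitimately returns NULL, so it only helps if the function always returns a value.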

Oops, sorry, never mind. The issue was that there was not enough RAM for each CPU, and the Slurm system gave no error message about it. So the process was probably killed without a message.
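For anyone hitting the same thing, a quick back-of-the-envelope check of the memory budget may help. This is only arithmetic on the settings from the post. It assumes the memory requested via mem-per-cpu is pooled per node-level job and shared by the worker processes, which depends on how the cluster is configured. The halved process count is a hypothetical alternative, not something tested here.

```r
# Arithmetic sketch on the settings from the post; nothing is submitted.
mem_per_cpu_gb <- 5     # "mem-per-cpu" = "5G"
cpus_per_node <- 10     # cpus_per_node argument

# Assumed total memory available to one node-level job.
mem_per_node_gb <- mem_per_cpu_gb * cpus_per_node
print(mem_per_node_gb)

# Hypothetical alternative: keep the allocation but run fewer processes,
# so each process has more memory on average.
processes_per_node <- 5
mem_per_process_gb <- mem_per_node_gb / processes_per_node
print(mem_per_process_gb)
```

The other obvious option is to keep 10 processes per node and request a larger mem-per-cpu value in slurm_options.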