mschubert/clustermq

SLURM starts jobs, but they don't finish

mhesselbarth opened this issue · 5 comments

Hello,

I am currently having the problem that jobs are sent to the workers, but they never seem to actually start and are then canceled when they hit the time limit. The code itself should be okay, since it runs without problems using e.g. the future package, and all it does is return the node name (fx <- function(x) {Sys.sleep(30); Sys.info()["nodename"]}).

My first guess is that the workers cannot communicate because they can't find zeromq. I tried setting LD_LIBRARY_PATH to the zeromq installation, but this didn't help (setenv ('LD_LIBRARY_PATH', 'home/mhessel/zeromq-4.0.3/')).

Worker log

2021-04-16 08:40:25.777142 | Master: tcp://gl-login2.arc-ts.umich.edu:7313
2021-04-16 08:40:25.798204 | WORKER_UP to: tcp://gl-login2.arc-ts.umich.edu:7313
slurmstepd: error: *** JOB 19291379 ON gl3031 CANCELLED AT 2021-04-16T08:42:39 DUE TO TIME LIMIT ***

SSH log

> clustermq:::ssh_proxy(ctl=51896, job=50915)
master ctl listening at: tcp://127.0.0.1:51896
forwarding local network from: tcp://gl-login2.arc-ts.umich.edu:7313
sent PROXY_UP to master ctl
received common data:function (x) {    Sys.sleep(30)    Sys.info()["nodename"]}
setting up qsys: SLURM
sent PROXY_READY to master ctl
received: PROXY_CMDqsys$submit_jobs(job_name = "clustermq", service = "short", mem_cpu = 512, walltime = "00:02:00", log_file = "clustermq.log", n_jobs = 3, log_worker = TRUE, verbose = TRUE)
Submitting 3 worker jobs (ID: clustermq) ...
received: PROXY_STOPTRUE
shutting down and cleaning up
Master: [247.2s 0.0% CPU]; Worker: [avg NA% CPU, max NA Mb]

Thank you very much

This looks less like a library issue, more like a network (SSH) forwarding issue.

Can you tell me:

  • Does your code work if you run it on your login node instead of via SSH?
  • Which version of clustermq are you using?
  • Did this work before? If yes, what changed? (e.g. package update from version X to version Y)

Hey,

Interesting that this might be an SSH issue.

  • Yes, the code does run on the login node.
> fx(5)
                    nodename
"gl-login1.arc-ts.umich.edu"
  • I am using clustermq_0.8.95.1
  • I have used clustermq before, but on a different HPC. I have never used clustermq on the HPC I am currently using, and I am not aware of anybody else having done so.

Does your code work if you run it on your login node instead of via SSH?

fx(5)

I meant with Q(...) 😄

That makes a lot more sense, sorry 😅

Hmm... this doesn't work, and clustermq gets stuck at this step:

Submitting 3 worker jobs (ID: clustermq) ...
Running 3 calculations (0 objs/0 Mb common; 1 calls/chunk) ...

Which is the same step where it gets stuck when using SSH.

Ok, that makes it easier because now we know the issue is a connection problem from the workers to the login node, and not related to SSH.

Your login node likely has multiple network interfaces, and if a worker tries to connect to Sys.info()["nodename"] it resolves to the wrong interface.
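As a rough way to check this, the sketch below prints the node name and the address it resolves to when run on the login node. It assumes a Unix-like system (utils::nsl is only available there), and the result on a compute node may differ, so ideally compare the output from both.

```r
# Name reported by the login node; per the explanation above, this is what
# workers try to connect to when no other host is configured
node <- Sys.info()[["nodename"]]
print(node)

# Address that name resolves to from this machine (NULL if it cannot be resolved).
# utils::nsl() exists only on Unix-alikes.
addr <- utils::nsl(node)
print(addr)
```

If the printed address belongs to an interface that compute nodes cannot reach, that would be consistent with the workers hanging after WORKER_UP.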

You likely need to set options(clustermq.host="<interface that accepts worker connections>").
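A minimal sketch of how that could look in a session on the login node. The host value is kept as a placeholder because the right interface is site-specific, and the x = 1:3 input in the commented-out call is only an assumption chosen to match the 3 worker jobs in the log above.

```r
# Placeholder from this thread: replace with the interface or address on the
# login node that compute nodes can actually reach (not known here)
master_host <- "<interface that accepts worker connections>"

# Set the option before submitting, so it is in effect when workers are started
options(clustermq.host = master_host)

# Same test function as in the original report
fx <- function(x) { Sys.sleep(30); Sys.info()["nodename"] }

# Re-run the test directly on the login node (needs clustermq and SLURM access):
# clustermq::Q(fx, x = 1:3, n_jobs = 3)
```

If the workers now return node names instead of running into the time limit, the option can go into ~/.Rprofile so it is set for every session.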