RZNevada -- concurrent job runs fail

Question

dawson6 opened this issue 3 years ago · 2 comments

Version 7.0.5 of ATS uses Slurm options to run concurrent jobs. This works on alastor, genie, etc.

On rznevada this fails. ATS can run jobs one after another (using the --sequential command-line option), but when two or more jobs are started concurrently, they fail with:

srun --exclusive --mpibind=off --distribution=block --nodes=1-2 --cpus-per-task=1 --ntasks=2


0: Fri Jul 23 10:59:54 2021: [PE_0]:inet_listen_socket_setup:inet_setup_listen_socket: bind failed port 1371 listen_sock = 3 Address already in use

0: Fri Jul 23 10:59:54 2021: [PE_0]:_pmi_inet_listen_socket_setup:socket setup failed

0: Fri Jul 23 10:59:54 2021: [PE_0]:_pmi_init:_pmi_inet_listen_socket_setup (full) returned -1

1: Fri Jul 23 10:59:54 2021: [PE_1]:inet_listen_socket_setup:inet_setup_listen_socket: bind failed port 1371 listen_sock = 3 Address already in use

1: Fri Jul 23 10:59:54 2021: [PE_1]:_pmi_inet_listen_socket_setup:socket setup failed

1: Fri Jul 23 10:59:54 2021: [PE_1]:_pmi_init:_pmi_inet_listen_socket_setup (full) returned -1

Answer 1 · 2021-07-23T18:16:42.000Z

OK, I was able to reproduce this outside of ATS with my 'mpi' test app.
I allocated 2 nodes (62 CPUs per node) and hit the same issue when I ran:
srun --exclusive --mpibind=off --nodes=1-2 --ntasks=32 --cpus-per-task=1 ./a.out job1 &
srun --exclusive --mpibind=off --nodes=1-2 --ntasks=32 --cpus-per-task=1 ./a.out job2 &
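For anyone repeating this, the two-step reproducer above can be wrapped in a small script that also records each step's exit status. This is only a sketch: the function name `run_concurrent` and the `LAUNCHER` variable are hypothetical helpers introduced here (they are not part of ATS or Slurm), and it assumes it is run inside an existing allocation with the test binary passed as the first argument.

```shell
#!/bin/sh
# Sketch: start two job steps concurrently and report each exit status.
# LAUNCHER is a hypothetical override hook (not an ATS or Slurm feature);
# it defaults to srun and allows a dry run with LAUNCHER=echo.

run_concurrent() {
    app=$1
    launcher=${LAUNCHER:-srun}

    # Same srun options as the reproducer in this thread.
    $launcher --exclusive --mpibind=off --nodes=1-2 --ntasks=32 --cpus-per-task=1 "$app" job1 &
    pid1=$!
    $launcher --exclusive --mpibind=off --nodes=1-2 --ntasks=32 --cpus-per-task=1 "$app" job2 &
    pid2=$!

    # "wait PID" returns the exit status of that background job.
    wait "$pid1"; rc1=$?
    wait "$pid2"; rc2=$?

    echo "job1 exit=$rc1 job2 exit=$rc2"
    # Succeed only if both steps succeeded.
    [ "$rc1" -eq 0 ] && [ "$rc2" -eq 0 ]
}
```

After sourcing the file, `run_concurrent ./a.out` launches both steps. On a system showing this problem, a nonzero exit status would be expected for at least one step, alongside the PMI "bind failed" messages.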

Answer 2 · 2021-07-23T18:26:18.000Z

If I leave off the --exclusive option, then this does run, but the jobs queue up and effectively run sequentially.