RZNevada -- concurrent job runs fail
dawson6 opened this issue · 2 comments
ATS 7.0.5 uses Slurm options to run concurrent jobs. This works on alastor, genie, etc.
On rznevada this fails. ATS can run jobs one after another (using the --sequential command line option), but when two or more jobs are started concurrently, the jobs fail with the following error:
srun --exclusive --mpibind=off --distribution=block --nodes=1-2 --cpus-per-task=1 --ntasks=2
0: Fri Jul 23 10:59:54 2021: [PE_0]:inet_listen_socket_setup:inet_setup_listen_socket: bind failed port 1371 listen_sock = 3 Address already in use
0: Fri Jul 23 10:59:54 2021: [PE_0]:_pmi_inet_listen_socket_setup:socket setup failed
0: Fri Jul 23 10:59:54 2021: [PE_0]:_pmi_init:_pmi_inet_listen_socket_setup (full) returned -1
1: Fri Jul 23 10:59:54 2021: [PE_1]:inet_listen_socket_setup:inet_setup_listen_socket: bind failed port 1371 listen_sock = 3 Address already in use
1: Fri Jul 23 10:59:54 2021: [PE_1]:_pmi_inet_listen_socket_setup:socket setup failed
1: Fri Jul 23 10:59:54 2021: [PE_1]:_pmi_init:_pmi_inet_listen_socket_setup (full) returned -1
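To check a captured log for this failure signature, a couple of small shell helpers along these lines could be used. This is only a sketch: the function names and the log file path are hypothetical, and the patterns are taken directly from the error text above.

```shell
# Hypothetical helpers for scanning captured srun output for the
# "bind failed ... Address already in use" error shown above.

# Print the number of lines that report a failed bind.
count_bind_failures() {
    # grep -c exits non-zero when nothing matches, so mask that status.
    grep -c 'bind failed port [0-9][0-9]* .*Address already in use' "$1" || true
}

# Print each distinct port number that failed to bind, one per line.
failed_ports() {
    grep -o 'bind failed port [0-9][0-9]*' "$1" | awk '{print $4}' | sort -u
}

# Example (job1.log is an assumed file name):
#   count_bind_failures job1.log
#   failed_ports job1.log
```

If both concurrent jobs report the same port, that supports the idea that they are competing for one listen socket on the node.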
OK, able to reproduce outside of ATS with my 'mpi' test app.
I allocated 2 nodes (62 cpus per node) and hit the same issue when I ran this:
srun --exclusive --mpibind=off --nodes=1-2 --ntasks=32 --cpus-per-task=1 ./a.out job1 &
srun --exclusive --mpibind=off --nodes=1-2 --ntasks=32 --cpus-per-task=1 ./a.out job2 &
If I leave off the --exclusive option then this does run, but the jobs queue up and effectively run sequentially.
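For anyone repeating the reproduction, a small wrapper like the one below may make it easier to capture each job's output and exit status. It is a sketch only: `run_pair` and the `LOGDIR` variable are hypothetical names, and the wrapper does not work around the failure, it just runs the same launch command twice in the background and waits for both.

```shell
# Hypothetical wrapper: run the given launch command twice concurrently,
# appending "job1" / "job2" as the final argument, and report exit codes.
run_pair() {
    logdir=${LOGDIR:-.}          # assumed location for the log files

    "$@" job1 > "$logdir/job1.log" 2>&1 &
    pid1=$!
    "$@" job2 > "$logdir/job2.log" 2>&1 &
    pid2=$!

    # Collect each exit status without aborting under `set -e`.
    rc1=0; wait "$pid1" || rc1=$?
    rc2=0; wait "$pid2" || rc2=$?

    echo "job1 exit=$rc1 job2 exit=$rc2"
}

# Example, using the srun options from the report above:
#   run_pair srun --exclusive --mpibind=off --nodes=1-2 --ntasks=32 --cpus-per-task=1 ./a.out
```

The same call with --exclusive removed gives the queued, effectively sequential behavior for comparison.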