Singularity multicontainer mpi
mmiesch opened this issue · 30 comments
Hi @vsoch! I'm taking you up on your offer to "ping me directly on github" with Singularity issues (I'm the one from this recent post).
So, here's the issue: I managed to reproduce the error I'm having with just GNU and OpenMPI, so it has nothing to do with Intel.
I posted a basic Singularity definition file here - scroll down to the bottom. It installs the GNU compilers and OpenMPI, and it also installs a hello_world_mpi application in /usr/local/bin. The Singularity container built from this is publicly available on Sylabs Cloud at library://jcsda/public/multicon_test:latest.
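For reference, the %post section of that recipe does roughly the following (a sketch of the general shape, not the exact file - the OpenMPI download URL and the name of the hello world source are assumptions; see the linked recipe for the real thing):
apt-get update && apt-get install -y build-essential gfortran wget
# build OpenMPI 3.1.2 from source so it matches the host version
wget https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.2.tar.gz
tar xzf openmpi-3.1.2.tar.gz && cd openmpi-3.1.2
./configure --prefix=/usr/local && make -j4 && make install && ldconfig
# compile the hello world (it just reports its rank, size, and hostname)
mpicc -O2 -o /usr/local/bin/hello_world_mpi hello_world_mpi.c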
When I invoke mpirun in the container (what I call solo-container mode: one container), I get some warnings about MPI being "unable to find any relevant network interfaces...", but apart from that it works:
singularity exec -e multicon_test.sif mpirun -np 4 /usr/local/bin/hello_world_mpi
[...]
Hello from rank 3 of 4 running on ip-172-31-87-130
Hello from rank 0 of 4 running on ip-172-31-87-130
Hello from rank 1 of 4 running on ip-172-31-87-130
Hello from rank 2 of 4 running on ip-172-31-87-130
When I invoke mpirun from outside the container, using a host OpenMPI 3.1.2 built with the same GNU 7.4 compiler suite (I call this multi-container mode, since each MPI task fires up its own container), I get this (again omitting the warnings):
mpirun -np 4 singularity exec -e multicon_test.sif hello_world_mpi
Hello from rank 0 of 1 running on ip-172-31-87-130
Hello from rank 0 of 1 running on ip-172-31-87-130
Hello from rank 0 of 1 running on ip-172-31-87-130
Hello from rank 0 of 1 running on ip-172-31-87-130
All four MPI tasks think they are rank 0 and that the total number of tasks is 1.
Do you mind running it to see if you see the same thing? Have you ever gotten this hybrid MPI model to work with Singularity?
I have the same version of Slurm PMI2 installed inside and outside the container - it doesn't seem to help (though, admittedly, I didn't configure the external OpenMPI to use it when I built it some time ago).
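For what it's worth, a quick way to check what each side supports (a sketch - the grep is just to narrow the output):
# on the host: which PMI plugins srun offers
srun --mpi=list
# inside the container: whether its OpenMPI was built with PMI/PMIx support
singularity exec multicon_test.sif ompi_info | grep -i pmi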
I'm grateful for any thoughts you have.
hey @mmiesch - I can definitely at least try to reproduce your case. I'm doing this from our cluster with SLURM.
First, pulling the container.
singularity pull library://jcsda/public/multicon_test:latest
I don't have MPI loaded (we use modules), but just for kicks and giggles I'm going to run it anyway.
$ singularity exec -e multicon_test_latest.sif mpirun -np 4 /usr/local/bin/hello_world_mpi
Hello from rank 0 of 4 running on sh02-01n58.int
Hello from rank 2 of 4 running on sh02-01n58.int
Hello from rank 1 of 4 running on sh02-01n58.int
Hello from rank 3 of 4 running on sh02-01n58.int
Note that I don't see any errors about networking. I have:
$ singularity --version
singularity version 3.5.3-1.1.el7
Now I think I'd need to load a module to interact with MPI from the outside.
$ module load openmpi/3.1.2
I don't have that script on my host, so I'll copy it from the container:
singularity exec -e multicon_test_latest.sif cp /usr/local/bin/hello_world_mpi hello_world_mpi
And I want to run this copy the same way (but pointing at the binary outside the container), just to be sure it behaves like the previous run and I'm not blindly introducing a bug:
$ singularity exec -e multicon_test_latest.sif mpirun -np 4 hello_world_mpi
Hello from rank 2 of 4 running on sh02-01n58.int
Hello from rank 0 of 4 running on sh02-01n58.int
Hello from rank 1 of 4 running on sh02-01n58.int
Hello from rank 3 of 4 running on sh02-01n58.int
Okay, now let's add the wrapper on top of that - this time mpirun is on the outside.
mpirun -np 4 singularity exec -e multicon_test_latest.sif hello_world_mpi
And actually this is interesting - this time I'm told that there aren't enough slots on the system:
$ mpirun -np 4 singularity exec -e multicon_test_latest.sif hello_world_mpi
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
singularity
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
And this is totally spot on, because I only have one!
$ nproc
1
We can simplify the case even further - just remove the container and use mpirun.
$ mpirun -np 4 hello_world_mpi
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
hello_world_mpi
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
Same. So the first interesting question is: why did the first mpirun, inside the container, work? Why does it think I have 4 slots inside the container but not outside? This already seems strange or buggy to me, because the container should also see nproc as 1 and give the same error message, but it doesn't - it "shows" 4 processes. Sorry to sidetrack, but have you encountered this?
Okay, so it looks like if I remove the -e, it correctly doesn't work inside the container either. What environment might be being passed in that is causing this?
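One way to narrow that down would be to diff what the container sees with and without -e (a sketch - the file names are arbitrary):
singularity exec multicon_test_latest.sif env | sort > env_full.txt
singularity exec -e multicon_test_latest.sif env | sort > env_clean.txt
# anything only in env_full.txt is leaking in from the host
diff env_clean.txt env_full.txt | grep -E 'SLURM|OMPI|PMI'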
Thanks @vsoch - you can get rid of that slots problem with this:
mkdir $HOME/.openmpi
echo "rmaps_base_oversubscribe = 1" >> $HOME/.openmpi/mca-params.conf
If you remove the -e, what do you get?
Leaving out the -e makes sense because the parallel process manager will often set runtime variables. For example, Slurm's srun sets the environment variable SLURM_NTASKS to the total number of tasks. But when I do that with Intel, it really gets confused.
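To illustrate the first point, something like this (a sketch) shows what srun injects into each task's environment:
# each task reports the same SLURM_NTASKS but a different SLURM_PROCID
srun --ntasks=2 env | grep -E '^SLURM_(NTASKS|PROCID)='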
Do you have a testing command to run with Slurm, with -N? Or is that already going beyond the issue?
Sure, if you have slurm on your host system you can try:
srun --ntasks=4 --mpi=pmi2 singularity exec -e multicon_test_latest.sif hello_world_mpi
You might also want to set this outside the container:
export SLURM_EXPORT_ENV=ALL
And - sorry - without the -e!
okay, trying this!
export SLURM_EXPORT_ENV=ALL
Without the -e:
$ srun --ntasks=4 --mpi=pmi2 singularity exec multicon_test_latest.sif hello_world_mpi
srun: job 64661755 queued and waiting for resources
srun: job 64661755 has been allocated resources
Hello from rank 1 of 4 running on sh02-01n08.int
Hello from rank 2 of 4 running on sh02-01n12.int
Hello from rank 3 of 4 running on sh02-01n17.int
Hello from rank 0 of 4 running on sh02-01n08.int
Is that buggy? It looks okay to me. So what you are trying to do is wrap this additionally in an mpirun command?
Nope - that looks correct - thanks for checking. Yes, so leaving out the -e works for OpenMPI. So my problem is back with Intel. To check that you'd need Intel MPI installed on your host, or something compatible, like MPICH.
This one?
----------------------------------------------------------------------------
impi:
----------------------------------------------------------------------------
Description:
Intel® MPI Library is a multi-fabric message passing library that
implements the Message Passing Interface, version 3.1 (MPI-3.1)
specification.
Versions:
impi/2017.u2
impi/2018.u1
impi/2018
impi/2019
What can I try next?
Yes! You have it! Load impi/2019 and then pull this Singularity container: library://jcsda/public/jedi-intel19-impi-hpc-app.sif. This has the Intel runtime libraries in it. Then run this:
srun --ntasks=4 --mpi=pmi2 singularity exec jedi-intel19-impi-hpc-app.sif hello_world_mpi
You can be an honorary Jedi if this works (or even if it doesn't)!
That's what they call me, Rubber Duck Jedi!
$ module load impi/2019
$ singularity pull library://jcsda/public/jedi-intel19-impi-hpc-app
$ srun --ntasks=4 --mpi=pmi2 singularity exec jedi-intel19-impi-hpc-app_latest.sif hello_world_mpi
The first try didn't find my script (which is in $PWD - that's weird):
$ srun --ntasks=4 --mpi=pmi2 singularity exec jedi-intel19-impi-hpc-app_latest.sif mpirun hello_world_mpi
srun: job 64666770 queued and waiting for resources
srun: job 64666770 has been allocated resources
slurmstepd: error: mpi/pmi2: invalid PMI1 init command: `error'
slurmstepd: error: mpi/pmi2: invalid PMI1 init command: `error'
slurmstepd: error: mpi/pmi2: invalid PMI1 init command: `error'
slurmstepd: error: mpi/pmi2: invalid PMI1 init command: `error'
Actually there is quite a bit more:
[mpiexec@sh02-01n06.int] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:121): unable to run bstrap_proxy (pid 35374, exit code 65280)
[mpiexec@sh02-01n06.int] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@sh02-01n06.int] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@sh02-01n06.int] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:755): error waiting for event
[mpiexec@sh02-01n06.int] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1926): error setting up the boostrap proxies
[mpiexec@sh02-01n06.int] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:121): unable to run bstrap_proxy (pid 35375, exit code 65280)
[mpiexec@sh02-01n06.int] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@sh02-01n06.int] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@sh02-01n06.int] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:755): error waiting for event
[mpiexec@sh02-01n06.int] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1926): error setting up the boostrap proxies
[mpiexec@sh02-01n09.int] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:121): unable to run bstrap_proxy (pid 86234, exit code 65280)
[mpiexec@sh02-01n09.int] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@sh02-01n09.int] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@sh02-01n09.int] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:755): error waiting for event
[mpiexec@sh02-01n09.int] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1926): error setting up the boostrap proxies
srun: error: sh02-01n09: task 3: Exited with exit code 255
srun: Terminating job step 64666770.0
[mpiexec@sh02-01n08.int] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:121): unable to run bstrap_proxy (pid 113050, exit code 65280)
[mpiexec@sh02-01n08.int] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@sh02-01n08.int] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@sh02-01n08.int] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:755): error waiting for event
[mpiexec@sh02-01n08.int] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1926): error setting up the boostrap proxies
srun: error: sh02-01n06: tasks 0-1: Exited with exit code 255
srun: error: sh02-01n08: task 2: Exited with exit code 255
Slurm seems unhappy, so maybe I should start with just interacting with the container directly?
module load impi/2019
singularity exec jedi-intel19-impi-hpc-app_latest.sif mpirun hello_world_mpi
It seems to be hanging - I wonder if there is some library issue? Maybe I should add -e...
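One thing worth checking (a sketch, assuming ldd is present in the image) is whether mpirun inside the container resolves all of its libraries:
singularity exec jedi-intel19-impi-hpc-app_latest.sif sh -c 'ldd $(command -v mpirun) | grep "not found"'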
For the (Han) solo-container mode, specify the path to the executable that was compiled inside the container, and here you want to use -e because the container has everything it needs. Also, srun is not installed inside the container, so there you need to use mpirun:
singularity exec -e jedi-intel19-impi-hpc-app.sif mpirun -np 4 /opt/jedi/bin/hello_world_mpi
Also, for the multi-container mode you should use the hello world inside the container, because that was compiled with the Intel compilers and Intel MPI. You can try leaving out the pmi bit:
srun -n 4 singularity exec jedi-intel19-impi-hpc-app.sif /opt/jedi/bin/hello_world_mpi
That seems ok too! Note that this is on an interactive node.
$ singularity exec -e /scratch/users/vsochat/jedi-intel19-impi-hpc-app_latest.sif mpirun -np 4 /opt/jedi/bin/hello_world_mpi
Hello from rank 3 of 4 running on sh02-01n47.int
Hello from rank 2 of 4 running on sh02-01n47.int
Hello from rank 0 of 4 running on sh02-01n47.int
Hello from rank 1 of 4 running on sh02-01n47.int
And for the non-Han-solo (multi-container) mode using srun:
$ srun -n 4 singularity exec jedi-intel19-impi-hpc-app_latest.sif /opt/jedi/bin/hello_world_mpi
srun: job 64670692 queued and waiting for resources
Still waiting! Our cluster is super busy and I don't get special treatment for being a research dinosaur :)
I wait with bated breath...
That's the correct response for the solo-container mode.
Wait, the second command needs mpirun, correct? I don't see it in your previous message - I'll try running the job with it added.
And it's missing the -e too! Unless it's supposed to be different?
No - it's different. Here the srun command launches 4 MPI tasks, and each task launches its own container and runs the application. So - four containers. The -e is left out here so the srun outside can talk to the MPI inside.
You can try it with the -e. I suspect you'll find what I was finding - that every process thinks it is rank zero, which means MPI is not properly initialized.
okay! I wound up writing a wrapper so I could load the libraries.
#!/bin/bash
module load impi/2019
singularity exec jedi-intel19-impi-hpc-app_latest.sif /opt/jedi/bin/hello_world_mpi
Then I did:
$ srun -n 4 run_job.sh
srun: job 64672810 queued and waiting for resources
srun: job 64672810 has been allocated resources
Hello from rank 2 of 4 running on sh02-01n06.int
Hello from rank 1 of 4 running on sh02-01n06.int
Hello from rank 0 of 4 running on sh02-01n03.int
Hello from rank 3 of 4 running on sh02-01n09.int
I think the first attempt was possibly hanging because the host node didn't have the libraries loaded. Does that look right too?
Trying now with -e...
Interesting - that does look correct! That's more than what I was able to do. I have tried submitting scripts like this to sbatch, but not to srun. I'll give it another go tomorrow - I've shut down the Amazon node where I do these runs for the night and have moved on to Miller (or rather, microbrew) time. Good to know that somebody got it to work and that there is nothing pathological about the container itself. Perhaps my issue is with my environment.
Anyway - thanks again - this is valuable information.
Reproduced! This is with -e added:
$ srun -n 4 run_job.sh
srun: job 64673470 queued and waiting for resources
srun: job 64673470 has been allocated resources
Hello from rank 0 of 1 running on sh02-01n09.int
Hello from rank 0 of 1 running on sh02-01n06.int
Hello from rank 0 of 1 running on sh02-01n06.int
Hello from rank 0 of 1 running on sh02-01n03.int
The -e cleans the environment, so probably some setting from the host MPI library is now missing.
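A quick way to see exactly what -e strips in this mode (a sketch - the grep pattern is just what I'd expect a PMI2 setup to pass):
# without -e: the Slurm/PMI variables reach the application
srun -n 1 singularity exec jedi-intel19-impi-hpc-app_latest.sif env | grep -E 'PMI|SLURM_PROCID|SLURM_NTASKS'
# with -e: those variables should be gone, so MPI cannot wire up
srun -n 1 singularity exec -e jedi-intel19-impi-hpc-app_latest.sif env | grep -E 'PMI|SLURM_PROCID|SLURM_NTASKS'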
@mmiesch of course! Sometimes it just helps to reproduce, and then insights come from that. I'm also getting ready for dinner (cauliflower, avocado, carrots, oh my!) so have a good evening! 🥑 🥕 💮