vsoch/vsoch.github.io

Singularity multicontainer mpi

mmiesch opened this issue · 30 comments

Hi @vsoch! I'm taking you up on your offer to "ping me directly on github" with Singularity issues (I'm the one from this recent post).

So, here's the issue - I managed to reproduce the error I'm having just with gnu and openmpi - so it has nothing to do with intel.

I posted a basic singularity file here - scroll down to the bottom. It installs gnu compilers and openmpi. It also installs a hello_world_mpi application in /usr/local/bin. The singularity container created from this is available publicly on sylabs cloud at library://jcsda/public/multicon_test:latest.

When I invoke mpirun in the container (what I call a solo-container mode - one container), I get some warnings about MPI being "unable to find any relevant network interfaces..." but, apart from that, it works:

singularity exec -e multicon_test.sif mpirun -np 4 /usr/local/bin/hello_world_mpi
[...]
Hello from rank 3 of 4 running on ip-172-31-87-130
Hello from rank 0 of 4 running on ip-172-31-87-130
Hello from rank 1 of 4 running on ip-172-31-87-130
Hello from rank 2 of 4 running on ip-172-31-87-130

When I invoke mpirun outside of the container, built with the same gnu 7.4 compiler suite and openmpi version 3.1.2 (I call this multi-container mode since each MPI task fires up its own container), I get this (again, omitting the warnings):

mpirun -np 4 singularity exec -e multicon_test.sif hello_world_mpi
Hello from rank 0 of 1 running on ip-172-31-87-130
Hello from rank 0 of 1 running on ip-172-31-87-130
Hello from rank 0 of 1 running on ip-172-31-87-130
Hello from rank 0 of 1 running on ip-172-31-87-130

All four MPI tasks think they are rank 0 and the total number of tasks is 1.
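For what it's worth, the multi-container model relies on the MPI outside the container being compatible with the one inside, so a quick sanity check is that the versions line up (a sketch, assuming the image is in the working directory; both should report 3.1.2 in my setup):

mpirun --version                                        # host OpenMPI
singularity exec -e multicon_test.sif mpirun --version  # OpenMPI inside the container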

Do you mind running it to see if you see the same thing? Have you ever gotten this hybrid MPI model to work with Singularity?

I have the same version of slurm PMI2 installed inside and outside the container - it doesn't seem to help (though, admittedly, I didn't configure the external openmpi to use it when I built it some time ago).
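One quick way to see what the host slurm actually supports (assuming slurm is on your PATH) is to list its MPI plugin types; pmi2 should appear in the output:

srun --mpi=list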

I'm grateful for any thoughts you have.

vsoch commented

hey @mmiesch - I can definitely at least try to reproduce your case. I'm doing this from our cluster with SLURM.

First, pulling the container.

singularity pull  library://jcsda/public/multicon_test:latest

I don't have mpi loaded (we use modules) but just for kicks and giggles I'm going to run it anyway.

$ singularity exec -e multicon_test_latest.sif mpirun -np 4 /usr/local/bin/hello_world_mpi
Hello from rank 0 of 4 running on sh02-01n58.int
Hello from rank 2 of 4 running on sh02-01n58.int
Hello from rank 1 of 4 running on sh02-01n58.int
Hello from rank 3 of 4 running on sh02-01n58.int

Note that I don't see any errors about networking. I have:

$ singularity  --version
singularity version 3.5.3-1.1.el7

Now I think I'd need to load a module to interact with mpi from the outside.

$ module load openmpi/3.1.2

I don't have that script on my host so I'll copy it from the container:

singularity exec -e multicon_test_latest.sif cp /usr/local/bin/hello_world_mpi hello_world_mpi

And I want to run this copied file (now outside the container) the same way, just to be sure it acts the same as the previous run and I'm not blindly introducing a bug:

$ singularity exec -e multicon_test_latest.sif mpirun -np 4 hello_world_mpi
Hello from rank 2 of 4 running on sh02-01n58.int
Hello from rank 0 of 4 running on sh02-01n58.int
Hello from rank 1 of 4 running on sh02-01n58.int
Hello from rank 3 of 4 running on sh02-01n58.int

okay now let's try to add the wrapper on top of that - now mpirun is on the outside.

mpirun -np 4 singularity exec -e multicon_test_latest.sif hello_world_mpi

And actually this is interesting - this time I'm told that there aren't enough slots on the system:

$ mpirun -np 4 singularity exec -e multicon_test_latest.sif hello_world_mpi
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
  singularity

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------

And this is totally spot on because I only have one!

$ nproc
1

We can simplify the case even further - just remove the container and use mpirun.

$ mpirun -np 4 hello_world_mpi 
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
  hello_world_mpi

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------

Same. So the first interesting question is: why was the first mpirun in the container able to work? Why does it think I have 4 slots inside the container, but not outside? This already seems strange or buggy to me, because the container should also see nproc as 1 and give the same error message, but it doesn't - it "shows" 4 processes. Sorry to sidetrack, but have you encountered this?

vsoch commented

okay, so it looks like if I remove the -e it correctly doesn't work in the container. What in the environment might be getting passed through that causes this?
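One way to check might be to diff what the container sees with and without the cleaned environment (a rough sketch, assuming nothing beyond the pulled image):

singularity exec multicon_test_latest.sif env | sort > env_passed.txt
singularity exec -e multicon_test_latest.sif env | sort > env_cleaned.txt
diff env_passed.txt env_cleaned.txt | grep -iE 'OMPI|SLURM|PMI'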

Thanks @vsoch - you can get rid of that slots problem with this:

mkdir $HOME/.openmpi
echo "rmaps_base_oversubscribe = 1" >> $HOME/.openmpi/mca-params.conf 
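Or, if your OpenMPI build supports it, you can get the same effect per run without the config file:

mpirun --oversubscribe -np 4 singularity exec -e multicon_test.sif hello_world_mpi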

If you remove the -e what do you get?

It makes sense to leave the -e out because the parallel process manager often sets runtime variables. For example, slurm's srun sets the environment variable SLURM_NTASKS to the total number of tasks. But when I do that with intel, it really gets confused.
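To see this concretely, you can have the launcher run env and filter for its own variables (assuming OpenMPI's mpirun and slurm's srun on the host; the exact names differ between MPI implementations):

mpirun -np 1 env | grep -E '^OMPI_'      # what OpenMPI's mpirun injects into each process
srun --ntasks=1 env | grep -E '^SLURM_'  # what slurm's srun injects, including SLURM_NTASKS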

vsoch commented

Do you have a testing command to run with slurm, with -N ? Or is that already going beyond the issue?

Sure, if you have slurm on your host system you can try:

srun --ntasks=4 --mpi=pmi2 singularity exec -e multicon_test_latest.sif hello_world_mpi

You might also want to set this outside the container:

export SLURM_EXPORT_ENV=ALL

And - sorry - without the -e!

vsoch commented

okay, trying this!

vsoch commented

export SLURM_EXPORT_ENV=ALL

Without the -e:

$ srun --ntasks=4 --mpi=pmi2 singularity exec multicon_test_latest.sif hello_world_mpi
srun: job 64661755 queued and waiting for resources
srun: job 64661755 has been allocated resources
Hello from rank 1 of 4 running on sh02-01n08.int
Hello from rank 2 of 4 running on sh02-01n12.int
Hello from rank 3 of 4 running on sh02-01n17.int
Hello from rank 0 of 4 running on sh02-01n08.int

Is that buggy? It looks okay to me. So what you are trying to do is wrap this additionally in an mpirun command?

Nope - that looks correct - thanks for checking. Yes, so leaving out the -e works for openmpi. So my problem is back with intel. To check that you'd need Intel MPI installed on your host, or something compatible, like mpich.

vsoch commented

This one?

----------------------------------------------------------------------------
  impi:
----------------------------------------------------------------------------
    Description:
      Intel® MPI Library is a multi-fabric message passing library that
      implements the Message Passing Interface, version 3.1 (MPI-3.1)
      specification.

     Versions:
        impi/2017.u2
        impi/2018.u1
        impi/2018
        impi/2019

vsoch commented

What can I try next?

Yes! You have it! Load impi/2019 and then pull this singularity container: library://jcsda/public/jedi-intel19-impi-hpc-app.sif. This has the intel runtime libraries in it. Then run this:

srun --ntasks=4 --mpi=pmi2 singularity exec jedi-intel19-impi-hpc-app.sif hello_world_mpi

You can be an honorary Jedi if this works (or even if it doesn't)!

vsoch commented

That's what they call me, Rubber Duck Jedi!

$ module load impi/2019
$ singularity pull library://jcsda/public/jedi-intel19-impi-hpc-app
$  srun --ntasks=4 --mpi=pmi2 singularity exec jedi-intel19-impi-hpc-app_latest.sif hello_world_mpi

The first try didn't find the executable (which is in my $PWD, which is weird). Trying again with mpirun in front:

$ srun --ntasks=4 --mpi=pmi2 singularity exec jedi-intel19-impi-hpc-app_latest.sif mpirun hello_world_mpi 
srun: job 64666770 queued and waiting for resources
srun: job 64666770 has been allocated resources
slurmstepd: error: mpi/pmi2: invalid PMI1 init command: `error'
slurmstepd: error: mpi/pmi2: invalid PMI1 init command: `error'
slurmstepd: error: mpi/pmi2: invalid PMI1 init command: `error'
slurmstepd: error: mpi/pmi2: invalid PMI1 init command: `error'

Actually there is quite a bit more:

[mpiexec@sh02-01n06.int] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:121): unable to run bstrap_proxy (pid 35374, exit code 65280)
[mpiexec@sh02-01n06.int] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@sh02-01n06.int] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@sh02-01n06.int] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:755): error waiting for event
[mpiexec@sh02-01n06.int] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1926): error setting up the boostrap proxies
[mpiexec@sh02-01n06.int] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:121): unable to run bstrap_proxy (pid 35375, exit code 65280)
[mpiexec@sh02-01n06.int] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@sh02-01n06.int] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@sh02-01n06.int] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:755): error waiting for event
[mpiexec@sh02-01n06.int] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1926): error setting up the boostrap proxies
[mpiexec@sh02-01n09.int] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:121): unable to run bstrap_proxy (pid 86234, exit code 65280)
[mpiexec@sh02-01n09.int] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@sh02-01n09.int] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@sh02-01n09.int] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:755): error waiting for event
[mpiexec@sh02-01n09.int] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1926): error setting up the boostrap proxies
srun: error: sh02-01n09: task 3: Exited with exit code 255
srun: Terminating job step 64666770.0
[mpiexec@sh02-01n08.int] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:121): unable to run bstrap_proxy (pid 113050, exit code 65280)
[mpiexec@sh02-01n08.int] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@sh02-01n08.int] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@sh02-01n08.int] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:755): error waiting for event
[mpiexec@sh02-01n08.int] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1926): error setting up the boostrap proxies

srun: error: sh02-01n06: tasks 0-1: Exited with exit code 255
srun: error: sh02-01n08: task 2: Exited with exit code 255

Slurm seems unhappy, so maybe I should start with just interacting with the container?

module load impi/2019
singularity exec jedi-intel19-impi-hpc-app_latest.sif mpirun hello_world_mpi

It seems to be hanging - I wonder if there is some library issue? Maybe I should add -e...

For the (Han) solo-container mode, specify the path to the executable that was compiled inside the container, and here you do want to use -e because the container has everything it needs. Also, srun is not installed inside the container, so there you need to use mpirun:

singularity exec -e jedi-intel19-impi-hpc-app.sif mpirun -np 4 /opt/jedi/bin/hello_world_mpi

Also, for the multi-container mode you should use the hello_world_mpi inside the container, because that was compiled with the intel compilers and intel mpi. You can try leaving out the pmi bit:

srun -n 4 singularity exec jedi-intel19-impi-hpc-app.sif /opt/jedi/bin/hello_world_mpi

vsoch commented

That seems ok too! Note that this is on an interactive node.

$ singularity exec -e /scratch/users/vsochat/jedi-intel19-impi-hpc-app_latest.sif mpirun -np 4 /opt/jedi/bin/hello_world_mpi
Hello from rank 3 of 4 running on sh02-01n47.int
Hello from rank 2 of 4 running on sh02-01n47.int
Hello from rank 0 of 4 running on sh02-01n47.int
Hello from rank 1 of 4 running on sh02-01n47.int

and for the non-Han-solo container mode using srun:

$ srun -n 4 singularity exec jedi-intel19-impi-hpc-app_latest.sif /opt/jedi/bin/hello_world_mpi
srun: job 64670692 queued and waiting for resources

Still waiting! Our cluster is super busy and I don't get special treatment for being a research dinosaur :)

I wait with bated breath...

That's the correct response for the solo-container mode.

vsoch commented

Wait, the second command needs mpirun, correct? I don't see it in your previous message - I'll try running the job with it added.

vsoch commented

And it's missing the -e too! Unless it's supposed to be different?

No - it's different. Here the srun command launches 4 MPI tasks and each task launches its own container and runs the application. So - four containers. The -e is left out here so the srun outside can talk to the mpi inside.

You can try it with the -e. I suspect you'll find what I was finding - that every process thinks it is rank zero, which means mpi is not properly initialized.

vsoch commented

okay! I wound up writing a wrapper so I could load the libraries.

#!/bin/bash
module load impi/2019
singularity exec jedi-intel19-impi-hpc-app_latest.sif /opt/jedi/bin/hello_world_mpi
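Note that the wrapper needs to be executable before srun can launch it:

chmod +x run_job.sh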

Then I did:

$ srun -n 4 run_job.sh 
srun: job 64672810 queued and waiting for resources
srun: job 64672810 has been allocated resources
Hello from rank 2 of 4 running on sh02-01n06.int
Hello from rank 1 of 4 running on sh02-01n06.int
Hello from rank 0 of 4 running on sh02-01n03.int
Hello from rank 3 of 4 running on sh02-01n09.int

I think the first try was possibly hanging because the libraries weren't loaded on the host node. Does that look right too?

vsoch commented

Trying now with -e...

Interesting - that does look correct! That's more than what I was able to do. I have tried submitting scripts like this to sbatch but not to srun. I'll give it another go tomorrow - I've shut down the amazon node where I do these runs for the night and moved on to Miller (or, rather, microbrew) time. Good to know that somebody got it to work and that there is nothing pathological about the container itself. Perhaps my issue is with my environment.
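For reference, an sbatch version of the same run might look roughly like this (an untested sketch; account, partition, and time options will depend on the cluster):

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --time=00:05:00

module load impi/2019
srun singularity exec jedi-intel19-impi-hpc-app.sif /opt/jedi/bin/hello_world_mpi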

Anyway - thanks again - this is valuable information.

vsoch commented

Reproduced! This is with -e added:

$ srun -n 4 run_job.sh 
srun: job 64673470 queued and waiting for resources
srun: job 64673470 has been allocated resources
Hello from rank 0 of 1 running on sh02-01n09.int
Hello from rank 0 of 1 running on sh02-01n06.int
Hello from rank 0 of 1 running on sh02-01n06.int
Hello from rank 0 of 1 running on sh02-01n03.int

-e cleans the environment, so probably some setting from the host environment is now missing.
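A way to confirm that (a sketch, run from the same allocation): compare what the container sees with and without -e. The cleaned run should be missing the PMI/SLURM wire-up variables, which is consistent with every rank falling back to a world of size 1:

srun -n 1 singularity exec jedi-intel19-impi-hpc-app.sif env | grep -E 'PMI|SLURM|I_MPI'
srun -n 1 singularity exec -e jedi-intel19-impi-hpc-app.sif env | grep -E 'PMI|SLURM|I_MPI'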

vsoch commented

@mmiesch of course! Sometimes it just helps to reproduce, and then insights come from that. I'm also getting ready for dinner (cauliflower, avocado, carrots, oh my!) so have a good evening! 🥑 🥕 💮

vsoch commented

hey @mmiesch let me know if you need any more help! I'll be around. if not, feel free to close this issue.

Thanks again @vsoch - my problems aren't completely solved but they may be specific to my environment.