mpi4py bug
aowen87 opened this issue · 5 comments
Description
Hello,
I'm working on an LLNL project that uses maestro to manage ML workflows, and we've recently encountered an odd bug. If the following conditions are met, the job will hang indefinitely:
- We pass a p-gen file to maestro that imports mpi4py directly or indirectly (through another imported module).
- Maestro launches a job using more than one processor.
- The job being launched also imports mpi4py.
Reproducer
I've included files to reproduce the issue below: one yaml file, one python script that maestro will launch with srun, and three parameter generation files. One of the parameter generation files works fine because it doesn't import mpi4py; the other two import mpi4py directly or indirectly and cause the job to hang.
Here are commands to reproduce each scenario:
This works: `maestro run -p param_gen.py mpi_bug.yaml`
This causes the job to hang: `maestro run -p mpi_param_gen.py mpi_bug.yaml`
This causes the job to hang: `maestro run -p kosh_param_gen.py mpi_bug.yaml`
Files to reproduce:
mpi_bug.yaml:

```yaml
batch:
  bank: wbronze
  host: rzgenie
  queue: pdebug
  type: slurm

description:
  description: Reproduces mpi4py bug
  name: bug_demo

env:
  variables:
    nodes: 1
    procs: 4
    walltime: '00:10:00'
    script: /path/to/hello_world.py

study:
  - description: Launch a simple script using srun
    name: hello_world
    run:
      cmd: "#SBATCH --ntasks $(procs)\n\n $(LAUNCHER) python $(script)"
      nodes: $(nodes)
      procs: $(procs)
      walltime: $(walltime)
```
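For reference, the step's cmd renders into the batch script handed to sbatch. Assuming $(LAUNCHER) expands to an srun invocation sized by the step's nodes/procs (the exact flags below are an assumption), the rendered script looks roughly like:

```bash
#SBATCH --ntasks 4

 srun -N 1 -n 4 python /path/to/hello_world.py
```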
hello_world.py:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
print(f"{rank}: hello world!")
```
param_gen.py:

```python
from maestrowf.datastructures.core import ParameterGenerator

def get_custom_generator(*args, **kw_args):
    p_gen = ParameterGenerator()
    return p_gen
```
mpi_param_gen.py:

```python
from maestrowf.datastructures.core import ParameterGenerator
# Note: by default, merely importing mpi4py.MPI initializes MPI at import
# time, so the pgen run itself picks up MPI state.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def get_custom_generator(*args, **kw_args):
    p_gen = ParameterGenerator()
    return p_gen
```
kosh_param_gen.py:

```python
from maestrowf.datastructures.core import ParameterGenerator
#
# Kosh relies on mpi4py. This also causes maestro to hang.
#
import kosh

def get_custom_generator(*args, **kw_args):
    p_gen = ParameterGenerator()
    return p_gen
```
Well thanks for the detailed reproducer on this! Will take a look and see if we can get this sorted out for you.
@aowen87 I've a few more questions for you on this:
- What's the environment in which you're running this (login node/batch job/something else)?
- How's it being launched, i.e. are you launching maestro with mpirun/exec/srun?
I'm running this from the login node, and the commands I'm using are the exact commands shown above (no srun/mpirun/etc. just maestro).
Ok, @aowen87, I finally made some headway here. Part of the problem appears to be that pgen importing mpi4py ends up setting a bunch of MPI-related env vars, which confuse the batch job in an unintuitive way: the default slurm configuration treats a missing `--export=[opts]` in the sbatch headers as `--export=ALL`, which exports every env var in the current environment when calling sbatch. This is why you don't have to specify the virtualenv python that maestro is installed into in your job steps; however, it also lets some MPI-set things through. I was able to get a version with mpi4py in both pgen and the job step to work just fine by hardwiring the `--export=NONE` option in maestro. So a potential solution here is adding some hooks in the spec to control this option, along with a few more potential options for controlling it:
- explicitly purge some env vars from what's passed to slurm, using `--export` to reset what's left (see the sketch below)
- try perturbing the subprocess env so it doesn't inherit as much from the parent python process (an initial trial didn't work here..)
- add some user hooks for automatically injecting a bashrc to source in steps, or letting you, the user, do that explicitly (have seen the latter use the env block to define the path to said rc file)
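A minimal sketch of that first option, with a hypothetical scrub step in front of the sbatch call (the helper names and prefix list are illustrative assumptions, not maestro's actual code):

```python
import os
import subprocess

# Hypothetical filter: which variables count as "MPI-set" is an assumption
# and would need tuning per MPI implementation and site.
MPI_PREFIXES = ("PMI_", "OMPI_", "MPICH_", "I_MPI_")

def scrubbed_env():
    """Return a copy of os.environ without MPI-related variables."""
    return {k: v for k, v in os.environ.items()
            if not k.startswith(MPI_PREFIXES)}

def submit(batch_script):
    # Passing env= replaces the environment the child inherits, so slurm's
    # implicit --export=ALL only forwards the scrubbed set.
    return subprocess.run(["sbatch", batch_script],
                          env=scrubbed_env(), check=True)
```

Equivalently, forcing `--export=NONE` (the workaround described above) keeps the login-node environment out of the job entirely.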
Great info! Thanks for digging into this!