mpi4py bug
aowen87 opened this issue · 5 comments
Description
Hello,
I'm working on an LLNL project that uses maestro to manage ML workflows, and we've recently encountered an odd bug. If the following conditions are met, the job will hang indefinitely:
- We pass a p-gen file to maestro that imports mpi4py directly or indirectly (through another imported module).
- Maestro launches a job using more than one processor.
- The job being launched also imports mpi4py.
Reproducer
I've included files to reproduce the issue below: one yaml file, one python script that maestro will launch with srun, and three parameter generation files. One of the parameter generation files works fine because it doesn't import mpi4py; the other two import mpi4py directly or indirectly and cause the job to hang.
Here are commands to reproduce each scenario:
This works: `maestro run -p param_gen.py mpi_bug.yaml`
This causes the job to hang: `maestro run -p mpi_param_gen.py mpi_bug.yaml`
This causes the job to hang: `maestro run -p kosh_param_gen.py mpi_bug.yaml`
Files to reproduce:
mpi_bug.yaml:

```yaml
batch:
  bank: wbronze
  host: rzgenie
  queue: pdebug
  type: slurm

description:
  description: Reproduces mpi4py bug
  name: bug_demo

env:
  variables:
    nodes: 1
    procs: 4
    walltime: '00:10:00'
    script: /path/to/hello_world.py

study:
  - description: Launch a simple script using srun
    name: hello_world
    run:
      cmd: "#SBATCH --ntasks $(procs)\n\n $(LAUNCHER) python $(script)"
      nodes: $(nodes)
      procs: $(procs)
      walltime: $(walltime)
```
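For reference, the step's cmd renders into the batch script handed to sbatch. Assuming $(LAUNCHER) expands to an srun invocation sized by the step's nodes/procs (the exact flags below are an assumption), the rendered script looks roughly like:

```bash
#SBATCH --ntasks 4

 srun -N 1 -n 4 python /path/to/hello_world.py
```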
hello_world.py:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
print(f"{rank}: hello world!")
```
param_gen.py:

```python
from maestrowf.datastructures.core import ParameterGenerator

def get_custom_generator(*args, **kw_args):
    p_gen = ParameterGenerator()
    return p_gen
```
mpi_param_gen.py:

```python
from maestrowf.datastructures.core import ParameterGenerator
# Note: by default, merely importing mpi4py.MPI initializes MPI at import
# time, so the pgen run itself picks up MPI state.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def get_custom_generator(*args, **kw_args):
    p_gen = ParameterGenerator()
    return p_gen
```
kosh_param_gen.py:

```python
from maestrowf.datastructures.core import ParameterGenerator
#
# Kosh relies on mpi4py. This also causes maestro to hang.
#
import kosh

def get_custom_generator(*args, **kw_args):
    p_gen = ParameterGenerator()
    return p_gen
```
Well thanks for the detailed reproducer on this! Will take a look and see if we can get this sorted out for you.
@aowen87 I've a few more questions for you on this:
- What's the environment in which you're running this (login node/batch job/something else)?
- How's it being launched, i.e. are you launching maestro with mpirun/exec/srun?
I'm running this from the login node, and the commands I'm using are the exact commands shown above (no srun/mpirun/etc. just maestro).
Ok, @aowen87, I finally made some headway here. Part of the problem appears to be that pgen importing mpi4py ends up setting a bunch of MPI-related env vars, which confuse the batch job in an unintuitive way: the default slurm configuration treats a missing `--export=[opts]` in the sbatch headers as `--export=ALL`, which exports every env var in the current environment when calling sbatch. This is why you don't have to specify the virtualenv python that maestro is installed into in your job steps; however, it also lets some MPI-set things through. I was able to get a version with mpi4py in both pgen and the job step to work just fine by hardwiring the `--export=NONE` option in maestro. So a potential solution here is adding some hooks in the spec to control this option, along with a few more potential options for controlling it:
- explicitly purge some env vars from what's passed to slurm, using `--export` to reset what's left (see the sketch below)
- try perturbing the subprocess env so it doesn't inherit as much from the parent python process (an initial trial didn't work here..)
- add some user hooks for automatically injecting a bashrc to source in steps, or letting you, the user, do that explicitly (have seen the latter use the env block to define the path to said rc file)
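A minimal sketch of that first option, with a hypothetical scrub step in front of the sbatch call (the helper names and prefix list are illustrative assumptions, not maestro's actual code):

```python
import os
import subprocess

# Hypothetical filter: which variables count as "MPI-set" is an assumption
# and would need tuning per MPI implementation and site.
MPI_PREFIXES = ("PMI_", "OMPI_", "MPICH_", "I_MPI_")

def scrubbed_env():
    """Return a copy of os.environ without MPI-related variables."""
    return {k: v for k, v in os.environ.items()
            if not k.startswith(MPI_PREFIXES)}

def submit(batch_script):
    # Passing env= replaces the environment the child inherits, so slurm's
    # implicit --export=ALL only forwards the scrubbed set.
    return subprocess.run(["sbatch", batch_script],
                          env=scrubbed_env(), check=True)
```

Equivalently, forcing `--export=NONE` (the workaround described above) keeps the login-node environment out of the job entirely.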
Great info! Thanks for digging into this!