assertion botched on Open MPI
We also found a problem similar to the one described in #14,
but with Open MPI and a different error.
The error can be reproduced on two different clusters.
Cluster A:
malloc: unknown:0: assertion botched
free: called with unallocated block argument
Aborting...[A-c7-048-16:05963] *** Process received signal ***
[A-c7-048-16:05963] Signal: Aborted (6)
[A-c7-048-16:05963] Signal code: (-6)
Cluster B:
user@B0:~/Projects/mpibash/MPI-Bash/examples$ mpirun -n 8 ./testmpi
[B13:19987] *** Process received signal ***
malloc: unknown:0: assertion botched
Let's try to fix this issue first then move on to #14 as most of the clusters I use have Open MPI rather than MVAPICH installed. To start, what version of Open MPI and Bash are you using? Also, please confirm that the Bash source against which you're building MPI-Bash corresponds to the same Bash version you're using as your shell.
— Scott
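The version check requested above could be scripted roughly as follows. This is only a sketch: it assumes the Bash source tree sits in a directory named bash-X.Y.Z, and the function names (strip_build_suffix, binary_version, srcdir_version, check_match) are hypothetical helpers, not part of MPI-Bash or Bash.

```shell
#!/bin/sh
# Sketch: compare the version a Bash binary reports with the version implied
# by the name of a Bash source directory. Assumes the directory is named
# bash-X.Y.Z; all function names here are hypothetical.

# Print X.Y.Z from a version string such as "4.4.12(1)-release".
strip_build_suffix() {
    # Drop everything from the first character that is not a digit or a dot.
    printf '%s\n' "${1%%[!0-9.]*}"
}

# Print the version reported by the Bash binary at path $1.
binary_version() {
    strip_build_suffix "$("$1" -c 'printf "%s" "$BASH_VERSION"')"
}

# Print the version implied by a directory name such as /home/me/bash-4.4.18.
srcdir_version() {
    dir=${1%/}          # tolerate one trailing slash
    base=${dir##*/}     # keep only the last path component
    printf '%s\n' "${base#bash-}"
}

# Report whether the two versions agree; returns nonzero on a mismatch.
check_match() {
    bin_v=$(binary_version "$1")
    src_v=$(srcdir_version "$2")
    if [ "$bin_v" = "$src_v" ]; then
        echo "match: $bin_v"
    else
        echo "mismatch: binary is $bin_v, source tree is $src_v"
        return 1
    fi
}
```

After sourcing the file, a call would look like `check_match /path/to/bash /path/to/bash-X.Y.Z`, where the first argument is the shell that will load MPI-Bash and the second is the directory passed to --with-bashdir.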
Alright. I'll focus on machine A.
Open MPI
$ module list
Currently Loaded Modulefiles:
1) open_mpi/1.6.5(1.6:default)
bash
$ ~/bash/bin/bash --version
GNU bash, version 4.4.12(1)-release (x86_64-unknown-linux-gnu)
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
I also tried compiling Bash 4.4.18, with no luck.
MPI-Bash was built with this ./configure call:
./configure --with-bashdir=$HOME/bash-4.4.18 --prefix=$HOME/mpibash CC=mpicc
My submission script:
$ cat submit.sh
#BSUB -J TEST
#BSUB -n 96
#BSUB -W 00:45
[...]
mpirun ~/bash/bin/bash testmpi
I submitted the script using the default /bin/bash shell.
$ /bin/bash --version
GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Could the older /bin/bash above be the problem?
All right; I can reproduce the problem. From what I can tell, the InfiniBand driver in particular is really unhappy about forks or something that MPI-Bash is doing. However, I'm having trouble diagnosing what that is. My thinking is that the "right" thing to do is to provide some scripts to enable a static build of MPI-Bash (i.e., a new bash executable that's linked to MPI). Alas, a quick test of that approach isn't working, either.
I don't know what to suggest. I fear MPI-Bash might be semi-permanently broken on modern cluster installations.
Sorry about that,
— Scott