Fatal error in internal_Init_thread: Other MPI error
Closed this issue · 10 comments
I got this error on my M2 Mac with Julia 1.10.2:
Fatal error in internal_Init_thread: Other MPI error, error stack:
internal_Init_thread(67)...........: MPI_Init_thread(argc=0x0, argv=0x0, required=2, provided=0x16db94160) failed
MPII_Init_thread(234)..............:
MPID_Init(67)......................:
init_world(171)....................: channel initialization failed
MPIDI_CH3_Init(84).................:
MPID_nem_init(314).................:
MPID_nem_tcp_init(175).............:
MPID_nem_tcp_get_business_card(397):
GetSockInterfaceAddr(370)..........: gethostbyname failed, bogon (errno 0)
Can you please show the output of the command MPI.versioninfo() and the code you're running? It's hard to make any guess about what's going on without that basic information.
The output of MPI.versioninfo() is:
binary: MPICH_jll
abi: MPICH
Package versions
MPI.jl: 0.20.18
MPIPreferences.jl: 0.1.10
MPICH_jll: 4.1.2+0
Library information:
libmpi: /Users/gelongqing/.julia/artifacts/f99c980548677ee7ea55b4fb5a14c9036e7ce0b6/lib/libmpi.12.dylib
libmpi dlpath: /Users/gelongqing/.julia/artifacts/f99c980548677ee7ea55b4fb5a14c9036e7ce0b6/lib/libmpi.12.dylib
MPI version: 4.0.0
Library version:
MPICH Version: 4.1.2
MPICH Release date: Wed Jun 7 15:22:45 CDT 2023
MPICH ABI: 15:1:3
MPICH Device: ch3:nemesis
MPICH configure: --prefix=/workspace/destdir --build=x86_64-linux-musl --host=aarch64-apple-darwin20 --enable-shared=yes --enable-static=no --with-device=ch3 --disable-dependency-tracking --enable-fast=all,O3 --docdir=/tmp --mandir=/tmp --disable-opencl FFLAGS=-fallow-argument-mismatch FCFLAGS=-fallow-argument-mismatch
MPICH CC: cc -fno-common -DNDEBUG -DNVALGRIND -O3
MPICH CXX: c++ -DNDEBUG -DNVALGRIND -O3
MPICH F77: gfortran -fallow-argument-mismatch -O3
MPICH FC: gfortran -fallow-argument-mismatch -O3
The bug appears even when I just run MPI.Init() with mpiexecjl --project -n 2 julia ./test.jl. The exact same code worked yesterday. The only change is that I opened my newly bought Mac... Maybe the error is caused by VSCode's synchronization?
Well, when I cut off the internet, the code works... It seems that the other Mac is considered as a processor?
Ok, this seems similar to the error reported at idaholab/moose#23610, for which there's a suggested workaround at https://mooseframework.inl.gov/help/troubleshooting.html
gethostbyname failed, localhost (errno 3)
This is a fairly common occurrence which happens when your internal network stack / route, is not correctly configured for the local loopback device. Thankfully, there is an easy fix:
- Obtain your hostname:
    $ hostname
    mycoolname
- Linux & Macintosh: Add the results of hostname to your /etc/hosts file. Like so:
    $ sudo vi /etc/hosts
    127.0.0.1 localhost
    # The following lines are desirable for IPv6 capable hosts
    ::1 localhost ip6-localhost ip6-loopback
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters
    127.0.0.1 mycoolname # <--- add this line to the end of your hosts file
  Everyone's hosts file is different. But the results of adding the necessary line described above will be the same.
- Macintosh only, 2nd method:
    sudo scutil --set HostName mycoolname
  We have received reports where the second method sometimes does not work.
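The hosts-file step quoted above can be sketched as a pair of small shell helpers. This is only an illustration under stated assumptions: the function names hosts_has_name and hosts_line_for are hypothetical (they are not part of MPI.jl, MPICH, or MOOSE), and the helpers only read a file and print a line. Nothing here edits /etc/hosts by itself.

```shell
# Hypothetical helpers, shown for illustration only.

# hosts_has_name FILE NAME
# Succeeds (exit status 0) if FILE contains a non-comment entry whose
# name fields include NAME; fails (exit status 1) otherwise.
hosts_has_name() {
    awk -v name="$2" '
        { sub(/#.*/, "") }    # strip comments before looking at fields
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit(found ? 0 : 1) }
    ' "$1"
}

# hosts_line_for NAME
# Prints the loopback entry that the quoted workaround adds for NAME.
hosts_line_for() {
    printf '127.0.0.1 %s\n' "$1"
}

# Example use, left commented out because it modifies a system file:
#   name=$(hostname)
#   hosts_has_name /etc/hosts "$name" || hosts_line_for "$name" | sudo tee -a /etc/hosts
```

Checking first avoids appending the same entry every time the snippet is run.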
This also looks similar to pmodels/mpich#6547. pmodels/mpich#6547 (comment) suggested exporting the environment variables
MPIR_CVAR_OFI_SKIP_IPV6=0
FI_PROVIDER=tcp
as an alternative workaround that doesn't require changing the hostname configuration. That bug was reportedly fixed in MPICH by pmodels/mpich#6558, which first appeared in v4.2.0, but you're using v4.1.2.
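Rather than exporting those two variables for the whole session, they could be set for a single launch only. A minimal sketch, assuming the same mpiexecjl command line used earlier in this thread; the wrapper name with_mpich_tcp_workaround is hypothetical, and whether the variables help at all depends on the MPICH build.

```shell
# Hypothetical wrapper: run the given command with the two variables
# suggested in pmodels/mpich#6547 set only for that command, so the
# calling shell's environment is left unchanged.
with_mpich_tcp_workaround() {
    env MPIR_CVAR_OFI_SKIP_IPV6=0 FI_PROVIDER=tcp "$@"
}

# Example use (assumes mpiexecjl is on PATH), left commented out:
#   with_mpich_tcp_workaround mpiexecjl --project -n 2 julia ./test.jl
```

Scoping the variables to one command makes it easy to compare a run with and without the workaround.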
Unfortunately, this workaround failed for me, but the first one works. Thanks a lot for your reply!
Can you try reverting your changes to /etc/hosts and updating MPICH_jll to v4.2.0 (]add MPICH_jll@v4.2.0, if ]up doesn't upgrade it automatically)? That should also do the trick.
It doesn't work.
Alright, I'll open a PR to add a known issue to the documentation, but strictly speaking this is not a bug in MPI.jl but rather in your MPI/system configuration, as other independent projects have experienced it as well.
I see. Thanks again!