JuliaParallel/MPI.jl

Fatal error in internal_Init_thread: Other MPI error

I got this error on my M2 Mac with Julia 1.10.2:

Fatal error in internal_Init_thread: Other MPI error, error stack:
internal_Init_thread(67)...........: MPI_Init_thread(argc=0x0, argv=0x0, required=2, provided=0x16db94160) failed
MPII_Init_thread(234)..............: 
MPID_Init(67)......................: 
init_world(171)....................: channel initialization failed
MPIDI_CH3_Init(84).................: 
MPID_nem_init(314).................: 
MPID_nem_tcp_init(175).............: 
MPID_nem_tcp_get_business_card(397): 
GetSockInterfaceAddr(370)..........: gethostbyname failed, bogon (errno 0)

Can you please show the output of the command MPI.versioninfo() and the code you're running? It's hard to guess what's going on without this basic information.

The output of MPI.versioninfo() is

  binary:  MPICH_jll
  abi:     MPICH

Package versions
  MPI.jl:             0.20.18
  MPIPreferences.jl:  0.1.10
  MPICH_jll:          4.1.2+0

Library information:
  libmpi:  /Users/gelongqing/.julia/artifacts/f99c980548677ee7ea55b4fb5a14c9036e7ce0b6/lib/libmpi.12.dylib
  libmpi dlpath:  /Users/gelongqing/.julia/artifacts/f99c980548677ee7ea55b4fb5a14c9036e7ce0b6/lib/libmpi.12.dylib
  MPI version:  4.0.0
  Library version:  
    MPICH Version:      4.1.2
    MPICH Release date: Wed Jun  7 15:22:45 CDT 2023
    MPICH ABI:          15:1:3
    MPICH Device:       ch3:nemesis
    MPICH configure:    --prefix=/workspace/destdir --build=x86_64-linux-musl --host=aarch64-apple-darwin20 --enable-shared=yes --enable-static=no --with-device=ch3 --disable-dependency-tracking --enable-fast=all,O3 --docdir=/tmp --mandir=/tmp --disable-opencl FFLAGS=-fallow-argument-mismatch FCFLAGS=-fallow-argument-mismatch
    MPICH CC:           cc   -fno-common  -DNDEBUG -DNVALGRIND -O3
    MPICH CXX:          c++   -DNDEBUG -DNVALGRIND -O3
    MPICH F77:          gfortran -fallow-argument-mismatch  -O3
    MPICH FC:           gfortran -fallow-argument-mismatch  -O3

The bug appears even when I just run MPI.Init() with mpiexecjl --project -n 2 julia ./test.jl, and exactly the same code worked yesterday. The only change is that I started up my newly bought Mac... Maybe the error is caused by VSCode synchronization?

Well, when I disconnect from the internet, the code works... It seems that the other Mac is being treated as a processor?

Ok, this seems similar to the error reported at idaholab/moose#23610, for which there's a suggested workaround at https://mooseframework.inl.gov/help/troubleshooting.html

gethostbyname failed, localhost (errno 3)

This is a fairly common occurrence which happens when your internal network stack / route is not correctly configured for the local loopback device. Thankfully, there is an easy fix:

  • Obtain your hostname:
    $ hostname
    mycoolname
  • Linux & Macintosh: Add the result of hostname to your /etc/hosts file, like so:
    $ sudo vi /etc/hosts
    
    127.0.0.1  localhost
    
    # The following lines are desirable for IPv6 capable hosts
    ::1        localhost ip6-localhost ip6-loopback
    ff02::1    ip6-allnodes
    ff02::2    ip6-allrouters
    
    127.0.0.1  mycoolname  # <--- add this line to the end of your hosts file
    
    Everyone's hosts file is different, but the result of adding the line described above will be the same.
  • Macintosh only, 2nd method:
    sudo scutil --set HostName mycoolname
    
    We have received reports where the second method sometimes does not work.
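
The first method above can be checked without editing anything. The following is a minimal sketch, not part of the MOOSE instructions: `missing_hosts_line` is a hypothetical helper name, and the assumption is that a plain `127.0.0.1  <name>` entry is what you want. It only reads the hosts file and prints the line that would need to be added.

```shell
#!/bin/sh
# Hypothetical helper (not from the MOOSE docs): print the line to add to a
# hosts file when the given name has no entry there yet.
# Prints nothing if the name is already listed. Never modifies the file.
missing_hosts_line() {
    name="$1"
    hosts_file="${2:-/etc/hosts}"
    # Strip comments, then look for the name as a whole field after the address.
    if sed 's/#.*//' "$hosts_file" | awk -v n="$name" '
        { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit (found ? 0 : 1) }'; then
        return 0
    fi
    # Assumption: mapping the name to the IPv4 loopback address is sufficient.
    printf '127.0.0.1  %s\n' "$name"
}
```

Calling `missing_hosts_line "$(hostname)"` prints the suggested entry when the current hostname is missing from /etc/hosts, and prints nothing when it is already there. Adding the line still requires editing the file with sudo as described above.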

This also looks similar to pmodels/mpich#6547. A comment there (pmodels/mpich#6547 (comment)) suggests exporting the environment variables

MPIR_CVAR_OFI_SKIP_IPV6=0
FI_PROVIDER=tcp

as an alternative workaround that avoids changing the hostname configuration. That bug was reportedly fixed in MPICH by pmodels/mpich#6558, which first appeared in v4.2.0, but you're using v4.1.2.
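
As a sketch of how those two variables could be set for a single shell session, assuming a POSIX shell: the launch command is the one quoted earlier in this thread and is left commented out here, since it depends on your project and on mpiexecjl being installed.

```shell
#!/bin/sh
# Export the two variables suggested in the MPICH issue comment so that
# processes started from this shell inherit them.
export MPIR_CVAR_OFI_SKIP_IPV6=0
export FI_PROVIDER=tcp

# Then launch the job from the same shell, for example with the command
# used earlier in this thread:
# mpiexecjl --project -n 2 julia ./test.jl
```

The exports only last for the current shell session, so nothing in the system configuration is changed.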

Unfortunately, this workaround failed for me, but the first one works. Thanks a lot for your reply!

Can you try to revert your changes to /etc/hosts and update MPICH_jll to v4.2.0 (]add MPICH_jll@v4.2.0, if ]up doesn't upgrade it automatically)? That should also do the trick.

It doesn't work.

Alright, I'll open a PR to add a known issue to the documentation. Strictly speaking, though, this is not a bug in MPI.jl but rather in your MPI/system configuration, since other independent projects have run into it as well.

I see. Thanks again!