MPICH segfault
alexandrebouchard opened this issue · 18 comments
Thanks for the fantastic project, it has been instrumental to our distributed parallel tempering work!
Recently, though, our CI tests started failing in larger test runs due to the MPICH bundled with MPI.jl segfaulting. The only information given by MPICH is
signal (11): Segmentation fault
MPIDI_CH3U_Recvq_FDU_or_AEP.cold.8 at /home/runner/.julia/artifacts/438b71472ff15106624c9ec1bf26706227e678ea/lib/libmpi.so (unknown line)
We contacted the MPICH team first and they recommended:
It'll be helpful if you can build mpich with debug symbols (configure --enable-g=dbg) and try to obtain a stack backtrace. One way is via a coredump.
Thus, I was wondering whether by any chance MPI.jl includes options for JLL-provided versions of the binaries with debug symbols? Alternatively, are the scripts used to bundle the JLL binaries available in another repository?
Thank you in advance for your time!
Nice, I used to work on MCMC, so I'm glad to see this getting used in that area!
I don't think we include the debug symbols in the JLLs. The JLL binaries are built using the recipe in Yggdrasil. However, it might be easier if you use a system or locally-built MPI. See https://juliaparallel.org/MPI.jl/latest/configuration/ for details on how to do this. If you're using GitHub Actions, you might be able to reuse our unit tests:
MPI.jl/.github/workflows/UnitTests.yml
Lines 170 to 216 in 6d513bb
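For reference, a minimal sketch of pointing MPI.jl at a locally-built MPICH via MPIPreferences (this assumes an MPI.jl 0.20-style setup; the library_names value below is a placeholder, adjust it to wherever your debug build installs libmpi):
using MPIPreferences
# Switch MPI.jl from the bundled MPICH_jll to a system/locally-built MPI.
# Make sure the debug build's libmpi is on the library search path (e.g. via
# LD_LIBRARY_PATH), or pass its full path in library_names.
MPIPreferences.use_system_binary(; library_names=["libmpi"])
After running this once in the project environment, restart Julia so MPI.jl picks up the new preference.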
Awesome, thanks! Indeed, building MPICH locally seems the best approach (thankfully, we found a way to replicate the issue locally!)
And thank you also for pointing to the GitHub workflows; the MPI matrix will be super helpful for improving our CI.
One thing I forgot to mention is that Hui Zhou from the MPICH group inferred from the error message that MPICH ch3 was used rather than the more recent ch4. Looking at the Yggdrasil recipe, it does indeed seem that ch3 is used (https://github.com/JuliaPackaging/Yggdrasil/blob/fa61398f401eda74898ad2849b874417f5a5217c/M/MPICH/build_tarballs.jl#L74). Perhaps worth upgrading, unless there were other issues with ch4?
If that helps, in the next few days I may be able to do a debug build of MPICH for you, if you tell me your Base.BinaryPlatforms.triplet(Base.BinaryPlatforms.HostPlatform()).
If there are no contraindications to switching to ch4, we can build a new version (4.1.1 was released yesterday) with it, but I don't know whether that has any limitations.
This is a really kind offer, Mosè, thank you so much! My BinaryPlatform is "x86_64-apple-darwin-libgfortran4-cxx11-julia_version+1.8.1".
To this I will add that, if it is simpler to do it for Linux, my colleague @miguelbiron, who is also working on the same project, has the BinaryPlatform "x86_64-linux-gnu-libgfortran5-cxx11-libstdcxx30-julia_version+1.8.5"
There you go:
- https://github.com/giordano/MPICH_debug_ch3_jll.jl
- https://github.com/giordano/MPICH_debug_ch4_jll.jl
I hope the names of the repositories are self-explanatory 🙂
Hold on..... I did it wrong, need to rebuild them 🤦
@alexandrebouchard @miguelbiron ok, now you should be able to ]add those repositories to get debug builds of MPICH, with either ch3 or ch4. Note that they don't come with the source code, so you'll want to check out the source code of MPICH 4.1.0 and use set substitute-path to remap the directories in GDB/LLDB. It'd be good to know if the issue goes away with ch4; we may switch to it in JuliaPackaging/Yggdrasil#6355
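For example, something along these lines should pull in a debug build (a sketch; swap in the ch3 URL to compare the two devices):
using Pkg
# Install the debug ch4 build of MPICH from the repository above
# (use the MPICH_debug_ch3_jll URL instead to test ch3).
Pkg.add(url="https://github.com/giordano/MPICH_debug_ch4_jll.jl")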
Thank you so much! BinaryBuilder is amazing...
..and I will make sure to come back here later to report on the ch3 vs ch4 question
Thanks again for all your help on this. We have identified the cause of the problem! If you are curious, here is what happened:
- We had an async send operation for which we did not need to wait for completion (synchronization is done via tags, and only the async recv needed a wait). As a result, we did not capture the return value of MPI.Isend(...).
- Consequently, MPICH used more than a hard-coded limit of 2^16 request objects. There seems to be no bounds check; instead it segfaults with the error message
  signal (11): Segmentation fault
  MPIDI_CH3U_Recvq_FDU_or_AEP.cold.8
  (sometimes; sometimes it does not crash, presumably when you get "lucky" and malloc lands far enough away from other important memory locations!)
The solution is simple: capture the return value and call free on it (a fuller sketch follows below):
dummy_request = MPI.Isend(...)
MPI.free(dummy_request)
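For concreteness, here is a self-contained sketch of the fixed pattern (the ring-exchange setup and names are illustrative rather than our actual code, and it assumes the MPI.jl 0.20-style keyword API): the send request is freed right away since tag-based matching means we never wait on it, while the receive request is waited on.
using MPI
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nprocs = MPI.Comm_size(comm)
# Illustrative ring exchange: send to the next rank, receive from the previous one.
dest = mod(rank + 1, nprocs)
source = mod(rank - 1, nprocs)
sendbuf = fill(Float64(rank), 4)
recvbuf = similar(sendbuf)
send_req = MPI.Isend(sendbuf, comm; dest=dest, tag=0)
recv_req = MPI.Irecv!(recvbuf, comm; source=source, tag=0)
MPI.free(send_req)   # never waited on; freeing lets MPICH reclaim the request slot
MPI.Wait(recv_req)   # only the receive must complete before recvbuf is used
MPI.Finalize()
Run it under mpiexec with at least two ranks, e.g. mpiexec -n 4 julia ring.jl (file name illustrative).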
A few follow up items in case you are interested:
- I could not find a place in the documentation that mentions that request objects have to be freed explicitly. Would it be helpful to create an issue or PR to document this requirement?
- While debugging, we also got to try ch3 vs ch4. Although this distinction ended up having no impact on this particular issue, @miguelbiron reported an anecdotal 2x speedup when going from ch3 to ch4 on point-to-point-communication-dominated test cases.
Huh, TIL that is actually a valid thing to do!
- I could not find a place in the documentation that mentions that request objects have to be freed explicitly. Would it be helpful to create an issue or PR to document this requirement?
We do actually attach a finalizer to Request objects, so they should get freed by the GC:
Lines 181 to 184 in 6d513bb
I'm not sure why that didn't happen in this case?
But yes, a note in the docs somewhere would be helpful (I think it only applies to Isend, so maybe there?)
Interesting... I think what is going on here is that we were very careful to make the inner loop of the code non-allocating, so the GC might not actually get called and hence finalize the stale requests...
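To illustrate the point with a toy, MPI-free sketch (hypothetical names): a finalizer only runs when the GC actually collects the object, so in a loop that never triggers a collection the cleanup never happens.
mutable struct Handle
    id::Int
end
const finalized = Ref(0)
function churn(n)
    for i in 1:n
        h = Handle(i)                        # stand-in for a request we drop on the floor
        finalizer(_ -> (finalized[] += 1), h)
    end
end
churn(10)
println("before GC: ", finalized[])  # likely still 0: nothing has forced a collection
GC.gc()                              # a full collection runs the pending finalizers
println("after GC:  ", finalized[])  # should now report 10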
Might this apply also to Irecv, Isend, and perhaps others too?
Ah, I think you're right:
For a request representing a nonblocking point-to-point or a persistent point-to-point operation, it is permitted (although strongly discouraged) to call MPI_REQUEST_FREE when the request is active. In this special case, MPI_REQUEST_FREE will only mark the request for freeing and MPI will actually do the freeing stage of the associated operation later.
Ah, I didn't know about the "strongly discouraged" bit! Based on this, maybe our use of Isend somehow falls into a rare corner case...
Now that I think about it, a corollary of free() being called by the finalizer is that free() may be called twice on the same request with our current approach. So maybe it's a bad idea, and we should just wait for both outgoing and incoming requests (even though technically this algorithm would only have to wait for the latter).
Free first checks whether the request is null, so it's safe to call it twice.
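For completeness, a sketch of the more conservative variant discussed above, waiting on both requests instead of freeing the send (same illustrative ring-exchange setup, MPI.jl 0.20-style API assumed):
using MPI
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nprocs = MPI.Comm_size(comm)
dest = mod(rank + 1, nprocs)
source = mod(rank - 1, nprocs)
sendbuf = fill(Float64(rank), 4)
recvbuf = similar(sendbuf)
send_req = MPI.Isend(sendbuf, comm; dest=dest, tag=0)
recv_req = MPI.Irecv!(recvbuf, comm; source=source, tag=0)
# Waiting completes and releases both requests, so no explicit MPI.free is
# needed and a double free cannot occur.
MPI.Wait(send_req)
MPI.Wait(recv_req)
MPI.Finalize()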