pmodels/mpich

hang in init with high ppn


MPI_Init hangs on Aurora. It reproduces reliably with nodes=700, ppn=96. The backtrace suggests it is stuck in PMIx_Fence, possibly while sharing shm file handles. Adding back a full PMIx_Fence barrier during init works around the problem.

The top suggestion is to try the latest OpenPMIx release (v5) to see whether the issue still reproduces.

Note: until the PMIx_Fence issue is resolved, a workaround is to call PMI_Barrier at init, which prevents the PMIx_Fence leak.