hang in init with high ppn
raffenet commented
`MPI_Init` hangs on Aurora. Reliably reproducible with nodes=700, ppn=96. The backtrace suggests it is stuck in `PMIx_Fence`, possibly in shm file handle sharing. Adding back the full `PMIx_Fence` barrier during init works around the problem.
hzhou commented
The top suggestion is to try the latest OpenPMIx release (v5) to see whether the issue still reproduces.
hzhou commented
Note: while the `PMIx_Fence` issue is not resolved, a workaround is to do a `PMI_Barrier` at init, which prevents the `PMIx_Fence` leak.