This is a Python + mpi4py reproducer for the mlx5 error reported in Case 293224.
The MLX5 error occurs (pretty consistently) on Perlmutter's GPU nodes. What I have found the bug is reliably tripped when:
- Using 70 (or more) GPU nodes
- All sources and dependencies (including the Conda environtment) are on $CFS, which is our large GPFS filesystem
Substituting the DVS mount (/dvs_ro/
) stops this reproducer from tripping the
error (possibly completely, or at least supressing the probability). Installing
to $SCRATCH (our Lustre filesystem) also seems to prevent the error.
We note that this app also sometimes hangs. This may be a separate issue.
ssh perlmutter
#create dir at $CFS/nvendor/<your-username>
cd $CFS/nvendor/<your-username>
git clone https://github.com/JBlaschke/diffBragg-Reproducer
cd diffBragg-Reproducer
./setup.sh
This will create a conda environment called diffbragg
at $(pwd)/diffbragg
.
If you'd like to test on different filesystems, you can istall this repo in another place and follow the same instructions.
salloc -C gpu -n 308 -c 2 -t 60 --gpus-per-task 1 -A nvendor_g -q early_science
source env.sh
srun -u -N77 -n308 -c2 python -u -X faulthandler pmUsrIssue5.py
This reproducer can hang, as a workaround set:
export MPICH_COLL_OPT_OFF=1
mlx5: nid003244: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 20009232 00000000 00000300
00003c40 92083204 000180b8 0085a0e0
MPICH ERROR [Rank 256] [job id 126699.1] [Wed Oct 20 12:32:36 2021] [nid003244] - Abort(70891919) (rank 256 in comm 0): Fatal error in PMPI_Gatherv: Other MPI error, error stack:
PMPI_Gatherv(415)..............: MPI_Gatherv failed(sbuf=0x55ee4a4ebea0, scount=88, MPI_BYTE, rbuf=(nil), rcnts=(nil), displs=(nil), datatype=MPI_BYTE, root=0, comm=MPI_COMM_WORLD) failed
MPIR_CRAY_Gatherv(353).........:
MPIC_Recv(197).................:
MPIC_Wait(71)..................:
MPIR_Wait_impl(41).............:
MPID_Progress_wait(186)........:
MPIDI_Progress_test(80)........:
MPIDI_OFI_handle_cq_error(1059): OFI poll failed (ofi_events.c:1061:MPIDI_OFI_handle_cq_error:Input/output error - local protection error)
You can view the full output in example-error.out
We note that this app also sometimes hangs for reasons we don't understand. It may be related to a different problem.