pmodels/mpich

MPICH 4.2.0 causes an HPX application to hang

I was running an HPX application (OctoTiger) using the MPICH OFI netmod on Expanse (an InfiniBand machine) with 32 nodes. It works fine with MPICH 4.1.2 but hangs with MPICH 4.2.0. The HPX application only uses MPI_Isend, MPI_Irecv, MPI_Test, and a single communicator. Multiple threads can call those functions simultaneously. It can generate many pending MPI_Isend and MPI_Irecv operations, with an upper bound of 8192 pending communications. GDB suggests that some MPI_Isend calls never complete.

I am using binary search to investigate which commit is causing the issue. Currently, I have narrowed it down to [4.1.2, v4.2.0b1]. To speed up this process, do you have any particular suspicious commits in mind?

Do you use multiple VCIs?

This is with the old MPI parcelport so no. It only uses one communicator and one VCI.

Is XPMEM enabled? If it is, does MPIR_CVAR_CH4_XPMEM_ENABLE=0 bypass the issue?

How do I tell whether XPMEM is enabled? I saw the following lines in the config log.

configure: RUNNING CONFIGURE FOR ch4:shm:xpmem
checking xpmem.h usability... no
checking xpmem.h presence... no
checking for xpmem.h... no

mpichversion gave me this:

MPICH Version: 4.2a1
MPICH Release date: unreleased development copy
MPICH ABI: 0:0:0
MPICH Device: ch4:ofi
MPICH configure: --prefix=/home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/mpich-master-mpden2dfihxy6z463kc3cvkcrw73meui --disable-silent-rules --enable-shared --with-pm=no --enable-romio --without-ibverbs --enable-wrapper-rpath=yes --with-yaksa=/home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/yaksa-0.3-pxq2cqqtymhqromeazzx3l5xdu3zrzsn --with-hwloc=/home/jackyan1/opt/hwloc/2.9.1 --disable-fortran --with-slurm=yes --with-slurm-include=/cm/shared/apps/slurm/current/include --with-slurm-lib=/cm/shared/apps/slurm/current/lib --with-pmi=slurm --without-cuda --without-hip --with-device=ch4:ofi --with-libfabric=/home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/libfabric-1.21.0-rcsccb4mm2fje6jcwqlezedwfb5mnd6z --enable-libxml2 --enable-thread-cs=per-vci --with-datatype-engine=auto
MPICH CC: /home/jackyan1/workspace/spack/lib/spack/env/gcc/gcc -O2
MPICH CXX: /home/jackyan1/workspace/spack/lib/spack/env/gcc/g++ -O2
MPICH F77: /home/jackyan1/workspace/spack/lib/spack/env/gcc/gfortran
MPICH FC: /home/jackyan1/workspace/spack/lib/spack/env/gcc/gfortran

So that means you don't have XPMEM. I don't have any more guesses. Let me know how your bisect goes.

Here is the current progress.

The following commit works.

commit 2f3a6ec5963fce27fb34ff79147d7fffa707c1ac (HEAD)
Merge: 38552cfab a5b80cbe2
Author: Ken Raffenetti <raffenet@users.noreply.github.com>
Date:   Mon Feb 27 10:26:48 2023 -0600

    Merge pull request #6417 from raffenet/ch4-shm-gpu
    
    ch4/shm: Tweaks to GPU communication settings
    
    Approved-by: Hui Zhou <hzhou321@anl.gov>

The following commit hangs.

commit 359c8683bff711209dde9fc54b5349887815e0cb (HEAD)
Merge: da4c21f58 a3917f55b
Author: Ken Raffenetti <raffenet@users.noreply.github.com>
Date:   Tue Apr 25 16:46:44 2023 -0500

    Merge pull request #6479 from raffenet/ofi-rma-req
    
    ch4/ofi: Remove unneeded RMA request allocation
    
    Approved-by: Hui Zhou <hzhou321@anl.gov>

@hzhou Through bisecting, I found that this PR (#6487) causes the problem. Still figuring out why.

Okay I roughly know what is happening:

I used GDB to inspect a hanging process. I can see that some threads are in MPI_Isend, while others are in MPI_Test.

I set a breakpoint on MPIDI_NM_progress. It was never triggered.

I traced the execution of MPIDI_progress_test for a sending thread. The thread always enters MPIDI_POSIX_progress_send, sets made_progress to 1, and returns, so it never reaches MPIDI_NM_progress.

Even though MPIDI_POSIX_progress_send sets made_progress to 1, it is not really making progress. MPIDU_genq_shmem_pool_cell_alloc always fails with no cell available.

Here is one sample of the backtrace:

#0  MPIDU_genq_shmem_queue_dequeue (pool=0x135f2c0, queue=0x15554acce0c0, cell=0x155539473fb0) at ./src/mpid/common/genq/mpidu_genq_shmem_queue.h:309
#1  0x00001555503e758f in MPIDU_genq_shmem_pool_cell_alloc (pool=0x135f2c0, cell=0x155539473fb0, block_idx=1, src_buf=0x798bc000) at ./src/mpid/common/genq/mpidu_genq_shmem_pool.h:38
#2  0x00001555503e7760 in MPIDI_POSIX_eager_send (grank=0, msg_hdr=0x77f3818, am_hdr=0x77f3820, am_hdr_sz=48, buf=0x798bc000, count=1291, datatype=1275068685, offset=0, src_vci=0, dst_vci=0, bytes_sent=0x1555394740d0) at src/mpid/ch4/shm/posix/eager/iqueue/iqueue_send.h:58
#3  0x00001555503d9f5a in MPIDI_POSIX_eager_send (grank=0, msg_hdr=0x77f3818, am_hdr=0x77f3820, am_hdr_sz=48, buf=0x798bc000, count=1291, datatype=1275068685, offset=0, src_vci=0, dst_vci=0, bytes_sent=0x1555394740d0) at ./src/mpid/ch4/shm/posix/eager/include/posix_eager_impl.h:18
#4  0x00001555503daba3 in MPIDI_POSIX_do_am_isend (grank=0, msg_hdr=0x77f3818, am_hdr=0x77f3820, am_hdr_sz=48, data=0x798bc000, count=1291, datatype=1275068685, sreq=0x698f4c8, issue_deferred=true, src_vci=0, dst_vci=0) at ./src/mpid/ch4/shm/src/../posix/posix_am.h:310
#5  0x00001555503db297 in MPIDI_POSIX_progress_send (vci=0, made_progress=0x155539474298) at ./src/mpid/ch4/shm/src/../posix/posix_progress.h:117
#6  0x00001555503db3a1 in MPIDI_POSIX_progress (vci=0, made_progress=0x155539474298) at ./src/mpid/ch4/shm/src/../posix/posix_progress.h:148
#7  0x00001555503db510 in MPIDI_SHM_progress (vci=0, made_progress=0x155539474298) at ./src/mpid/ch4/shm/src/shm_progress.h:18
#8  0x00001555503dc384 in MPIDI_progress_test (state=0x1555394742f0) at ./src/mpid/ch4/src/ch4_progress.h:125
#9  0x00001555503dcba3 in MPID_Progress_test (state=0x0) at ./src/mpid/ch4/src/ch4_progress.h:218
#10 0x00001555503dcbd7 in MPIDI_OFI_retry_progress () at src/mpid/ch4/netmod/ofi/util.c:15
#11 0x000015554fe0c453 in MPIDI_OFI_send_normal (buf=0x7cf1efc0, count=523, datatype=1275068685, cq_data=1, dst_rank=3, tag=0, comm=0x155550877710 <MPIR_Comm_direct+1712>, context_offset=0, addr=0x1100028, vci_src=0, vci_dst=0, request=0x155539474938, dt_contig=1, data_sz=523, 
    dt_ptr=0x0, dt_true_lb=0, type=0) at ./src/mpid/ch4/netmod/include/../ofi/ofi_send.h:285
#12 0x000015554fe0d719 in MPIDI_OFI_send (buf=0x7cf1efc0, count=523, datatype=1275068685, dst_rank=3, tag=0, comm=0x155550877710 <MPIR_Comm_direct+1712>, context_offset=0, addr=0x1100028, vci_src=0, vci_dst=0, request=0x155539474938, noreq=0, syncflag=0, err_flag=MPIR_ERR_NONE)
    at ./src/mpid/ch4/netmod/include/../ofi/ofi_send.h:438
#13 0x000015554fe0dc6c in MPIDI_NM_mpi_isend (buf=0x7cf1efc0, count=523, datatype=1275068685, rank=3, tag=0, comm=0x155550877710 <MPIR_Comm_direct+1712>, attr=0, addr=0x1100028, request=0x155539474938) at ./src/mpid/ch4/netmod/include/../ofi/ofi_send.h:484
#14 0x000015554fe10f6d in MPIDI_isend (buf=0x7cf1efc0, count=523, datatype=1275068685, rank=3, tag=0, comm=0x155550877710 <MPIR_Comm_direct+1712>, attr=0, av=0x1100028, req=0x155539474938) at ./src/mpid/ch4/src/ch4_send.h:31
#15 0x000015554fe110b6 in MPID_Isend (buf=0x7cf1efc0, count=523, datatype=1275068685, rank=3, tag=0, comm=0x155550877710 <MPIR_Comm_direct+1712>, attr=0, request=0x155539474938) at ./src/mpid/ch4/src/ch4_send.h:60
#16 0x000015554fe11bfa in internal_Isend (buf=0x7cf1efc0, count=523, datatype=1275068685, dest=3, tag=0, comm=-2080374783, request=0x78e5db38) at src/binding/c/pt2pt/isend.c:98
#17 0x000015554fe11cea in PMPI_Isend (buf=0x7cf1efc0, count=523, datatype=1275068685, dest=3, tag=0, comm=-2080374783, request=0x78e5db38) at src/binding/c/pt2pt/isend.c:155

So the thread gets into the infinite while loop of MPIDI_OFI_retry_progress and never returns. It holds the runtime lock and thus blocks all the other threads of the same process.

I got lost in the shared memory send/receive logic, so I don't know why there is no cell available even though the application is consistently invoking the MPI progress engine. The HPX process should always have one pre-posted receive available for incoming messages.

Thanks, @JiakunYan for tracing the bug!

I guess the first thing to fix is that MPIDI_POSIX_progress_send should not set made_progress if the send is deferred. However, I don't quite see how MPIDU_genq_shmem_pool_cell_alloc continues to fail. The cell should be freed and made available when the receiver process progresses.

@JiakunYan Try this patch - #7174

It works! Thanks!

Fixed by #7174