charmplusplus/charm

MPI_Iprobe runtime error on Cray machines


Hello, I'm running into MPI_Iprobe runtime errors on HPE/Cray (Slingshot 11 interconnect) while running Quinoa (native Charm++). This is similar to #3701. The application runs fine on fewer than 40 nodes. At 40 nodes, I get the following errors:

Running as 5120 OS processes: ../quinoa/build/gnu/Main/inciter -c mmtriplepoint3d.q -i meshes/triplept3d_4blk_247m.exo -b -r 5000 -l 5000 -v
charmrun> srun -n 5120 -c 2 ../quinoa/build/gnu/Main/inciter -c mmtriplepoint3d.q -i meshes/triplept3d_4blk_247m.exo -b -r 5000 -l 5000 -v
Charm++> Running on MPI version: 3.1
Charm++> level of thread support used: MPI_THREAD_SINGLE (desired: MPI_THREAD_SINGLE)
Charm++> Running in non-SMP mode: 5120 processes (PEs)
Converse/Charm++ Commit ID: 
Charm++> MPI timer is synchronized
Isomalloc> Synchronized global address space.
CharmLB> Load balancer assumes all CPUs are same.
Quinoa> Load balancing off
Charm++> Running on 40 hosts (2 sockets x 64 cores x 2 PUs = 256-way SMP)
Charm++> cpu topology info is gathered in 0.116 seconds.

...
Application-specific output
...

aborting job:
Fatal error in PMPI_Iprobe: Other MPI error, error stack:
PMPI_Iprobe(126).......: MPI_Iprobe(src=MPI_ANY_SOURCE, tag=1375, comm=0x84000001, flag=0x7ffc090e5e30, status=0x7ffc090e5e40) failed
MPID_Iprobe(257).......: 
MPIDI_iprobe_safe(118).: 
MPIDI_iprobe_unsafe(42): 
(unknown)(): Other MPI error
MPICH ERROR [Rank 1340] [job id 8992043.0] [Wed Nov 22 12:14:37 2023] [nid001629] - Abort(404857871) (rank 1340 in comm 0): Fatal error in PMPI_Iprobe: Other MPI error, error stack:
PMPI_Iprobe(126).......: MPI_Iprobe(src=MPI_ANY_SOURCE, tag=1375, comm=0x84000001, flag=0x7ffdea939110, status=0x7ffdea939120) failed
MPID_Iprobe(257).......: 
MPIDI_iprobe_safe(118).: 
MPIDI_iprobe_unsafe(42): 
(unknown)(): Other MPI error

------------- Processor 3464 Exiting: Caught Signal ------------
Reason: Aborted
  [3464] Stack Traceback:
  [3464:0] inciter 0x769f77 
  [3464:1] libc.so.6 0x1497abd4cd50 
  [3464:2] libc.so.6 0x1497abd4ccbb gsignal
  [3464:3] libc.so.6 0x1497abd4e355 abort
  [3464:4] libfabric.so.1 0x1497aaee20e6 
  [3464:5] libfabric.so.1 0x1497aaef5cf1 
  [3464:6] libfabric.so.1 0x1497aaeb1111 
  [3464:7] libmpi_gnu_91.so.12 0x1497ad42afa6 
  [3464:8] libmpi_gnu_91.so.12 0x1497ad455dc6 
  [3464:9] libmpi_gnu_91.so.12 0x1497ad45693c PMPI_Test
  [3464:10] inciter 0x768272 
  [3464:11] inciter 0x768a95 
  [3464:12] inciter 0x768b73 CmiInterSendNetworkFunc(int, int, int, char*, int)
  [3464:13] inciter 0x7a1849 CldNodeEnqueue
  [3464:14] inciter 0x664bdb CkSendMsgNodeBranch
  [3464:15] libquinoa_inciter.so 0x1497b7c6381a inciter::CProxyElement_Partitioner::addMesh(int, std::unordered_map<int, std::tuple<std::vector<unsigned long, std::allocator<unsigned long> >, std::unordered_map<unsigned long, std::array<double, 3ul>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::array<double, 3ul> > > >, std::unordered_map<int, std::vector<unsigned long, std::allocator<unsigned long> >, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::vector<unsigned long, std::allocator<unsigned long> > > > >, std::unordered_map<int, std::vector<unsigned long, std::allocator<unsigned long> >, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::vector<unsigned long, std::allocator<unsigned long> > > > >, std::vector<unsigned long, std::allocator<unsigned long> > >, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::tuple<std::vector<unsigned long, std::allocator<unsigned ------------- Processor 3540 Exiting: Caught Signal ------------
  [3464:16] libquinoa_inciter.so 0x1497b7c6433b inciter::Partitioner::distribute(std::unordered_map<int, std::tuple<std::vector<unsigned long, std::allocator<unsigned long> >, std::unordered_map<int, std::vector<unsigned long, std::allocator<unsigned long> >, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::vector<unsigned long, std::allocator<unsigned long> > > > >, std::unordered_map<int, std::vector<unsigned long, std::allocator<unsigned long> >, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::vector<unsigned long, std::allocator<unsigned long> > > > >, std::vector<unsigned long, std::allocator<unsigned long> > >, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::tuple<std::vector<unsigned long, std::allocator<unsigned long> >, std::unordered_map<int, std::vector<unsigned long, std::allocator<unsigned long> >, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, std::vector<unsigned long, std::allocator<uMPICH ERROR [Rank 3041] [job id 8992043.0] [Wed Nov 22 12:14:37 2023] [nid001642] - Abort(874643087) (rank 3041 in comm 0): Fatal error in PMPI_Test: Other MPI error, error stack:
  [3464:17] libquinoa_inciter.so 0x1497b7c6519f inciter::Partitioner::partition(int)
  [3464:18] libquinoa_inciter.so 0x1497b7c652e7 inciter::CkIndex_Partitioner::_call_partition_marshall2(void*, void*)
  [3464:19] inciter 0x662574 CkDeliverMessageFree
  [3464:20] inciter 0x667868 _processHandler(void*, CkCoreState*)
  [3464:21] inciter 0x73f361 CsdScheduleForever
  [3464:22] inciter 0x73f5e5 CsdScheduler
MPICH ERROR [Rank 3464] [job id 8992043.0] [Wed Nov 22 12:14:37 2023] [nid001646] - Abort(1) (rank 3464 in comm 496): application called MPI_Abort(comm=0x84000001, 1) - process 3464
  [3464:23] inciter 0x769eaa ConverseInit
  [3464:24] inciter 0x65bbdc charm_main
  [3464:25] libc.so.6 0x1497abd3729d __libc_start_main
  [3464:26] inciter 0x51596a _start

I built Charm++ using CMake (https://github.com/quinoacomputing/quinoa-tpl/). The modules I have loaded are:

  1) craype-x86-rome
  2) libfabric/1.15.2.0
  3) craype-network-ofi
  4) perftools-base/23.05.0
  5) xpmem/2.5.2-2.4_3.45__gd0f7936.shasta
  6) gcc/12.2.0
  7) craype/2.7.21
  8) cray-dsmml/0.2.2
  9) cray-mpich/8.1.26
 10) cray-libsci/23.05.1.4
 11) PrgEnv-gnu/8.4.0
 12) cmake/3.22.3
 13) cray-hdf5-parallel/1.12.2.3
 14) cray-netcdf-hdf5parallel/4.9.0.3

Any help is appreciated! Thank you!

The suggestion by @ZwFink and @ericjbohm to set MPI_POST_RECV to 1 here did not help.
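One more question: on Slingshot 11, the libfabric CXI provider can exhaust hardware match-list resources under heavy MPI_ANY_SOURCE traffic, which I understand can surface as "Other MPI error" aborts inside MPI_Iprobe at scale. Would moving tag matching out of hardware be worth trying? I have not tested this; the variables below are from the libfabric fi_cxi(7) documentation, and the values are illustrative:

```shell
# Untested workaround candidates for CXI match-entry exhaustion at scale,
# set in the job script before srun (variables from fi_cxi(7)):
export FI_CXI_RX_MATCH_MODE=software   # or "hybrid"; default is "hardware"
export FI_CXI_DEFAULT_CQ_SIZE=131072   # larger completion queue (value illustrative)
```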