StanfordLegion/legion

S3D freeze at large node counts

Opened this issue · 41 comments

As suspected by Mike in our meeting today, we are seeing freezes in S3D at large node counts. This is a preliminary report because I just encountered the issue and am still getting data.

What I know so far:

  • My most recent job on the 24.03.0 rc branch froze at 2048 and 8192 nodes, but ran at 4096 nodes
  • A previous job with my hacked up branch mentioned in #1653 (comment) froze at 8192 nodes but worked at all other node counts. Edit: I have now seen freezes at 2048 and 4096 nodes as well.
  • To the best of my knowledge, we have never been able to run a job past 8192 nodes. When Seshu tried (I can't remember if this was February 2024 or November 2023), he hit symptoms that looked like a freeze. It's possible these are all the same symptoms, but we didn't investigate at the time so it's hard to be sure

I have backtraces from the 2048 node run I just did. Don't blame me for not having -ll:force_kthreads; I was not expecting the freeze and wasn't prepared to collect backtraces, so they are what they are. From an initial scan I haven't seen anything interesting, so I will likely need to rerun with -ll:force_kthreads.

http://sapling.stanford.edu/~eslaught/s3d_freeze_debug2/bt2048/

Do these runs have the same bug that @syamajala found with incomplete partitions being marked complete? If so, that can lead to all sorts of undefined behavior in the runtime, including hangs, so I would make sure we don't have that bug in these runs at all before doing any more debugging.

The branch @elliottslaughter is running does not have those partitions. I have fixed the issue with partitions incorrectly being marked as complete, but it has not solved my problem. I will open a separate issue for that soon.

For this issue I am running the same code as in #1653. In the initial investigation into that issue I ran a number of checks, including -lg:partcheck and -lg:safe_ctrlrepl 1. I do not have any reason to believe those results would have changed since the application code is unmodified.

My freeze appears to be sensitive to -ll:force_kthreads. When I run with that flag, I have not been able to replicate a freeze at lower than 8192 nodes. I set up a new job attempting to get backtraces at that node count. I'll reply to this issue again if I get them.

I got a run to freeze at 4096 nodes with -ll:force_kthreads. I collected three sets of backtraces, about 10 minutes apart. Because the collection process is finicky they don't all have the same number of nodes, but this should allow a direct comparison of at least the earlier node numbers:

One pattern I see in the backtraces is threads like this:

Thread 3 (Thread 0x7ffe5d20f740 (LWP 81080) "s3d.x"):
#0  0x00007fffed8f19d5 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00007fffdd874207 in ofi_genlock_lock (lock=<optimized out>) at ./include/ofi_lock.h:359
#2  cxip_send_common (txc=0x6754838, tclass=512, buf=0x7ffe75a1f278, len=912, desc=<optimized out>, data=data@entry=0, dest_addr=74, tag=0, context=0x7ffe75a1f250, flags=83886080, tagged=false, triggered=false, trig_thresh=0, trig_cntr=0x0, comp_cntr=0x0) at prov/cxi/src/cxip_msg.c:5510
#3  0x00007fffdd875001 in cxip_send (fid_ep=<optimized out>, buf=<optimized out>, len=<optimized out>, desc=<optimized out>, dest_addr=<optimized out>, context=<optimized out>) at prov/cxi/src/cxip_msg.c:5874
#4  0x00007fffe0f72136 in gasnetc_ofi_am_send_medium () from /ccs/home/eslaught/frontier/gb2024-subrank3/legion/language/build/lib/librealm.so.1
#5  0x00007fffe0fefd91 in gasnetc_AM_CommitRequestMediumM () from /ccs/home/eslaught/frontier/gb2024-subrank3/legion/language/build/lib/librealm.so.1
#6  0x00007fffe00d7628 in Realm::XmitSrcDestPair::push_packets (this=0x7ffd04031d50, immediate_mode=<optimized out>, work_until=...) at /autofs/nccs-svm1_home1/eslaught/frontier/gb2024-subrank3/legion/runtime/realm/gasnetex/gasnetex_internal.cc:2092
#7  0x00007fffe00d925e in Realm::GASNetEXInjector::do_work (this=<optimized out>, work_until=...) at /autofs/nccs-svm1_home1/eslaught/frontier/gb2024-subrank3/legion/runtime/realm/gasnetex/gasnetex_internal.cc:2725
#8  0x00007fffdff6d778 in Realm::BackgroundWorkManager::Worker::do_work (this=0x7ffe5d20d520, max_time_in_ns=<optimized out>, interrupt_flag=0x0) at /autofs/nccs-svm1_home1/eslaught/frontier/gb2024-subrank3/legion/runtime/realm/bgwork.cc:599
#9  0x00007fffdff6e051 in Realm::BackgroundWorkThread::main_loop (this=0x1fb2cdb0) at /autofs/nccs-svm1_home1/eslaught/frontier/gb2024-subrank3/legion/runtime/realm/bgwork.cc:103
#10 0x00007fffe007e541 in Realm::KernelThread::pthread_entry (data=0x1fdad8b0) at /autofs/nccs-svm1_home1/eslaught/frontier/gb2024-subrank3/legion/runtime/realm/threads.cc:831
#11 0x00007fffed8e96ea in start_thread () from /lib64/libpthread.so.0
#12 0x00007fffe912150f in clone () from /lib64/libc.so.6

Otherwise I don't see much going on.

This looks to me like the network locked up. There are multiple threads on pretty much every single node trying to push active messages into the network and they are all spinning on some lock in OFI inside a call to gex_AM_CommitRequestMedium2. If we can't send active messages it's not terribly surprising that things are going to hang. Now part of what might be going wrong here is that the code before frame 6 in Realm is using gex_AM_MaxRequestMedium to figure out how many medium active messages we can commit into the network without blocking and if we can't send them then the background worker threads will go off and do something else important (e.g. like handle incoming active messages). However, what it looks like is happening here is we got an indication from GASNet that we're going to be able to send these messages right away as the request call has succeeded, but then when we go to commit the messages, the actual commit code is blocking internally. Now I don't think this is necessarily GASNet's fault directly since it looks like OFI is actually the one that is blocking and not GASNet, but I think we might need to have GASNet modify the call to gex_AMRequestMedium to make sure it handles all the cases where OFI might block in the commit call. @PHHargrove @bonachea do you see something else going on here that I'm missing in these backtraces?

Since we're discussing a potential network hang, I will just include the network-related environment variables here for posterity:

export FI_MR_CACHE_MONITOR=memhooks
export FI_CXI_RX_MATCH_MODE=software
export GASNET_OFI_DEVICE_0=cxi2
export GASNET_OFI_DEVICE_1=cxi1
export GASNET_OFI_DEVICE_2=cxi3
export GASNET_OFI_DEVICE_3=cxi0
export GASNET_OFI_DEVICE_TYPE=Node
export GASNET_OFI_RECEIVE_BUFF_SIZE=8M
export MPICH_OFI_NIC_POLICY=BLOCK

(I think that GASNET_OFI_DEVICE_0 et al. are actually superfluous here because we are running 1 rank/node.)

And for @PHHargrove @bonachea's benefit, this is GASNet 2023.9.0 without memory kinds enabled.

FWIW, I spot-checked a few more of the threads in the same processes across the backtraces at different times and the threads are not making any progress. The this pointers of the objects at all the frames in each thread remain the same across the several minutes of checkpoints, so they are very much stuck and not going anywhere, happily spin-waiting on their OFI locks.

Please try with GASNET_OFI_RECEIVE_BUFF_SIZE=recv (instead of 8M).
The MetaHipMer team has pretty clear evidence that the issue with FI_MULTI_RECV was not fixed in the SlingShot 2.1.0 release. So, this work-around to disable it may still be necessary.

Is this going to exacerbate the issue with the number of receive buffers and "unexpected" messages like we saw with the GASNET_OFI_RECEIVE_BUFF_SIZE=single workaround we had previously tested, or do you expect recv to avoid those problems?

@elliottslaughter wrote:

Is this going to exacerbate the issue with the number of receive buffers and "unexpected" messages

With use of CXI's buggy multi-recv feature disabled via GASNET_OFI_RECEIVE_BUFF_SIZE=recv, the conduit needs to post orders of magnitude more recv buffers in order to soak the same approximate volume of incoming AM messages before libfabric tips over into unexpected messages. The conduit adjusts the default value of GASNET_OFI_NUM_RECEIVE_BUFFS to help accomplish this (currently, the default changes from 8 to 450) and you also have the option to manually override GASNET_OFI_NUM_RECEIVE_BUFFS.

If the application is currently hitting the known bugs in multi-recv that arise under heavy load, then turning off that feature is probably the only surgical alternative from a correctness perspective. The only other (more intrusive) possibility that comes to mind is changing the communication pattern in the hopes of lowering the chances of breaking multi-recv.

like we saw with the GASNET_OFI_RECEIVE_BUFF_SIZE=single workaround we had previously tested, or do you expect recv to avoid those problems?

The GASNET_OFI_RECEIVE_BUFF_SIZE=recv setting activates a slightly improved version of the now-deprecated GASNET_OFI_RECEIVE_BUFF_SIZE=single setting (details here). Both seek to avoid the known bugs in the CXI provider implementation of FI_MULTI_RECV. The main difference is =recv entirely removes FI_MULTI_RECV from the arguments GASNet passes to libfabric, whereas =single took a less direct approach of leaving multi-recv enabled but tuning the buffer parameters to induce every buffer to be "single shot".

@lightsighter wrote:

the code before frame 6 in Realm is using gex_AM_MaxRequestMedium to figure out how many medium active messages we can commit into the network without blocking and if we can't send them then the background worker threads will go off and do something else important (e.g. like handle incoming active messages). However, what it looks like is happening here is we got an indication from GASNet that we're going to be able to send these messages right away as the request call has succeeded, but then when we go to commit the messages, the actual commit code is blocking internally. Now I don't think this is necessarily GASNet's fault directly since it looks like OFI is actually the one that is blocking and not GASNet, but I think we might need to have GASNet modify the call to gex_AMRequestMedium to make sure it handles all the cases where OFI might block in the commit call.

First a clarification: gex_AM_MaxRequestMedium() is a query of maximum message size limit (not count) and in particular is independent of current network state. Attempts to inject messages with payloads over that size are simply erroneous; gex_AM_MaxRequestMedium() is not in any way relevant to whether or not that injection might block.

Setting that aside, you are correct that Realm::XmitSrcDestPair::push_packets() might be asking GASNet to avoid blocking at injection. That request is controlled by GEX_FLAG_IMMEDIATE, and that flag is conditionally set by Realm::XmitSrcDestPair::push_packets() based on a caller argument (which is optimized away in the backtrace above, but appears to be hard-coded to true at this particular call-site).

Regardless of whether or not GEX_FLAG_IMMEDIATE is passed to gex_AM_PrepareRequestMedium() (or FPAM injection), this is a "best effort" GASNet feature, and does not (cannot) provide a hard guarantee that the underlying network library will not block during injection. Currently I'd characterize GASNet's "best effort" here as "pretty good" on ibv-conduit and "not so great" on ofi-conduit. My understanding of the current state of IMMEDIATE injection in ofi-conduit is that FPAM injection should faithfully detect cases where GASNet's conduit-level metadata (buffer descriptors) are running scarce, and should correctly return failure IFF the underlying fi_inject/fi_send call to the libfabric provider returns -FI_EAGAIN. However ofi-conduit does not attempt to model/predict the resource state of the underlying provider, who might choose to block without returning -FI_EAGAIN.

Moreover, it's worth noting that ofi-conduit does not currently have a native implementation of NPAM at all (GASNET_NATIVE_NP_ALLOC_REQ_MEDIUM==undef). Instead it currently uses a "reference implementation" of NPAM over FPAM, such that the Prepare/Commit calls in ofi-conduit currently just invoke FPAM underneath (as partially revealed in the stack trace). One important consequence of this "reference implementation" that is particularly critical here is that this reference implementation always strips off any GEX_FLAG_IMMEDIATE flags given to gex_AM_PrepareRequestMedium() (there are fundamental reasons for this arising from the NPAM vs FPAM interface difference). But the upshot here is that even if Realm is passing GEX_FLAG_IMMEDIATE to gex_AM_PrepareRequestMedium(), ofi-conduit is currently always ignoring that flag to NPAM injection.
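For readers following along, the NPAM Prepare/Commit pattern under discussion looks roughly like the sketch below. This is an illustration only: the prototypes are abbreviated from my reading of the GASNet-EX spec, and tm, peer, handler_idx, and fill_payload() are placeholder names invented for the sketch rather than Realm or GASNet identifiers.

#include <gasnetex.h>

/* Placeholder payload serialization, invented for this sketch. */
static size_t fill_payload(void *buf, size_t max_len) {
  (void) buf;
  return max_len < 64 ? max_len : 64;
}

static void try_send_medium(gex_TM_t tm, gex_Rank_t peer, gex_AM_Index_t handler_idx) {
  gex_AM_SrcDesc_t sd = gex_AM_PrepareRequestMedium(
      tm, peer,
      NULL,                /* let GASNet supply the payload buffer */
      1, 512,              /* least/most payload sizes chosen for the sketch */
      NULL,                /* lc_opt: unused when GASNet supplies the buffer */
      GEX_FLAG_IMMEDIATE,  /* "best effort" request not to block */
      0);                  /* numargs */
  if (sd == GEX_AM_SRCDESC_NO_OP) {
    /* Immediate injection refused: go do other useful work
       (e.g. poll incoming AMs) and retry later. */
    return;
  }
  size_t len = fill_payload(gex_AM_SrcDescAddr(sd), gex_AM_SrcDescSize(sd));
  /* With ofi-conduit's reference NPAM-over-FPAM, this commit is where the
     injection actually happens, and where the backtraces above show the
     thread blocking inside the CXI provider. */
  gex_AM_CommitRequestMedium0(sd, handler_idx, len);
}

The point for this discussion is that when ofi-conduit strips GEX_FLAG_IMMEDIATE, the NO_OP branch above is never taken and any blocking happens inside the commit call instead.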

Coincidentally, we've just been awarded funding that we hope can be used to improve both of these sub-optimal behaviors in ofi-conduit, but that work is still in its early stages and won't be ready for users for some time.

Regardless of whether or not GEX_FLAG_IMMEDIATE is passed to gex_AM_PrepareRequestMedium() (or FPAM injection), this is a "best effort" GASNet feature, and does not (cannot) provide a hard guarantee that the underlying network library will not block during injection.

Ok, I see where Realm is running a separate polling thread that always pulls active messages off the wire and that should be sufficient to ensure forward progress even if all the background worker threads get stuck in a commit call. I thought we had gotten rid of the polling thread in the GASNetEX module, but apparently not.

The MetaHipMer team has pretty clear evidence that the issue with FI_MULTI_RECV was not fixed in the SlingShot 2.1.0 release. So, this work-around to disable it may still be necessary.

So to make sure I'm clear: this bug is sufficient to prevent forward progress? I would kind of hope that even if we were running short of buffers for sending and blocking in the commit calls, the active message polling thread could eventually pull a bunch of messages off the wire, drain the network, and thereby free up resources for doing the sends again, which would ensure forward progress. Is that not possible here, and something else is going wrong in OFI?

But the upshot here is that even if Realm is passing GEX_FLAG_IMMEDIATE to gex_AM_PrepareRequestMedium(), ofi-conduit is currently always ignoring that flag to NPAM injection.

How is that an upshot? 😅

Coincidentally, we've just been awarded funding that we hope can be used to improve both of these sub-optimal behaviors in ofi-conduit, but that work is still in its early stages and won't be ready for users for some time.

That is promising to hear!

Thread partially continued in email...

Several people have been asking me to run experiments, so I am going to report the results of those experiments here.

Experiment 1: GASNET_OFI_RECEIVE_BUFF_SIZE=recv

My initial experiment with the default value of GASNET_OFI_NUM_RECEIVE_BUFFS (450, per #1657 (comment)) worked up to 2048 nodes and then froze at 4096 nodes. I tried doubling GASNET_OFI_NUM_RECEIVE_BUFFS to 900, and that rerun also froze at 4096 nodes.

It might help to get some guidance from @bonachea as to how much space I am actually allocating. For posterity, here is the most recent set of environment variables that I ran with:

export FI_MR_CACHE_MONITOR=memhooks
export FI_CXI_RX_MATCH_MODE=software
export GASNET_OFI_DEVICE_0=cxi2
export GASNET_OFI_DEVICE_1=cxi1
export GASNET_OFI_DEVICE_2=cxi3
export GASNET_OFI_DEVICE_3=cxi0
export GASNET_OFI_DEVICE_TYPE=Node
export GASNET_OFI_RECEIVE_BUFF_SIZE=recv
export GASNET_OFI_NUM_RECEIVE_BUFFS=$(( 450 * 2 )) # default is 450 when in recv mode
export MPICH_OFI_NIC_POLICY=BLOCK

If I'm allocating (say) the maximum size of a medium active message, I could be quite a bit more aggressive about this setting, if that's what we think we need to do. I'll probably jump to 450 * 10 for my next experiment unless someone jumps in to tell me otherwise.

Experiment 2: GASNet 2023.3.0

This is the version that @syamajala previously ran to 8192 nodes, as mentioned in the top comment on this issue. I am not aware of any technical reason why GASNet 2023.3.0 would work, but just to cover all our bases, I am running it. So far, it has been successful up to 4096 nodes. (This does not necessarily mean anything because the freezes are non-deterministic anyway, so we won't really know if it makes any meaningful difference until I can get a new 8192 node job through.)

Experiment 3: Tracing 1 Timestep

The code in the top comment of this issue traces 10 timesteps at a time. This is a shot in the dark, but to the extent that the tracing code (especially when first capturing a trace) induces network communication, my thought was that reducing the number of timesteps per trace might reduce that. (At the very least, fences between different replays of the trace would slow things down.)

So far it's working up to 4096 nodes, which again doesn't mean much in a nondeterministic world. But so far so good.

I should also point out that I am hitting the following warning in my runs:

*** WARNING: ofi-conduit failed to configure FI_CXI_* environment variables due to prior conflicting settings. This may lead to unstable behavior and/or degraded performance. If you did not intentionally set FI_CXI_RX_MATCH_MODE='software', FI_CXI_RDZV_THRESHOLD='16384', or FI_CXI_RDZV_GET_MIN then this condition may have resulted from initializing MPI prior to initialization of GASNet. For more information on that scenario, please see "Limits to MPI interoperability" in the ofi-conduit README.

You can see my variables here: #1657 (comment) . I am indeed setting FI_CXI_RX_MATCH_MODE, but not the other two. I do not know if this matters.

(S3D is a hybrid code and I believe that we do initialize MPI first.)

I think it's worth noting that Frontier was upgraded to Slingshot 2.1.2 earlier this week (Mar 18 2024 if directory timestamps are to be believed). This upgrade included replacing the installed libfabric CXI provider, portions of Cray MPI and other components of the system stack. If you've observed changes in network behavior relative to runs preceding Mar 18 2024, then the new Slingshot stack should be high on the list of suspects (not that we can do much about it).

Additionally, I'd like to ensure we are not "chasing phantoms" here. Because we are encountering problems, please ensure you've rebuilt all objects and executables from scratch since Mar 18, especially GASNet and anything making MPI calls.

I should also point out that I am hitting the following warning in my runs:

*** WARNING: ofi-conduit failed to configure FI_CXI_* environment variables due to prior conflicting settings. This may lead to unstable behavior and/or degraded performance. If you did not intentionally set FI_CXI_RX_MATCH_MODE='software', FI_CXI_RDZV_THRESHOLD='16384', or FI_CXI_RDZV_GET_MIN then this condition may have resulted from initializing MPI prior to initialization of GASNet. For more information on that scenario, please see "Limits to MPI interoperability" in the ofi-conduit README.

You can see my variables here: #1657 (comment) . I am indeed setting FI_CXI_RX_MATCH_MODE, but not the other two. I do not know if this matters.

First, I should note for the record that all FI_* envvar settings control the (closed-source) HPE libfabric CXI provider that sits underneath GASNet (and Cray MPI) on HPE Cray EX. Our knowledge of how these settings operate is mostly constrained to system documentation and closed-box experimentation. That being said, in the absence of conflicting FI_ settings, ofi-conduit over CXI defaults to setting:

FI_CXI_RX_MATCH_MODE=hybrid 
FI_CXI_RDZV_THRESHOLD=256 
FI_CXI_RDZV_GET_MIN=256

This is NOT an endorsement by GASNet of guaranteed stability for this combination of CXI provider settings or suitability for all use cases; IIUC it's just a "better than nothing" default that we've found to be reasonable (especially for "pure GASNet" use cases where Cray MPI is not in use). ofi-conduit issues this warning if it finds one of these settings is already set, partially because they globally affect CXI provider behavior and MPI is known to set some of these even if the user did not (so the warning alerts you to the current settings in use).

So in your case the appearance of the warning is expected (because you are explicitly setting export FI_CXI_RX_MATCH_MODE=software), but the content reveals that FI_CXI_RDZV_THRESHOLD='16384' has additionally been set by someone (presumably by MPI or some other part of your software stack, given you did not set it explicitly). This "combination of stealthy settings from disparate sources" is the other reason ofi-conduit prints this warning, to help clarify what's actually being used by the CXI provider.

Here is some documentation from the fi_cxi man page (emphasis added):

FI_CXI_RDZV_THRESHOLD
Message size threshold for rendezvous protocol.

FI_CXI_RDZV_GET_MIN
Minimum rendezvous Get payload size. A Send with length less than or equal to FI_CXI_RDZV_THRESHOLD plus FI_CXI_RDZV_GET_MIN will be performed using the eager protocol. Larger Sends will be performed using the rendezvous protocol with FI_CXI_RDZV_THRESHOLD bytes of payload sent eagerly and the remainder of the payload read from the source using a Get. FI_CXI_RDZV_THRESHOLD plus FI_CXI_RDZV_GET_MIN must be less than or equal to FI_CXI_OFLOW_BUF_SIZE.

FI_CXI_RX_MATCH_MODE
Specify the receive message matching mode to be utilized. FI_CXI_RX_MATCH_MODE=hardware | software | hybrid

  • hardware - Message matching is fully offloaded, if resources become exhausted flow control will be performed and existing unexpected message headers will be onloaded to free resources.
  • software - Message matching is fully onloaded.
  • hybrid - Message matching begins fully offloaded, if resources become exhausted hardware will transition message matching to a hybrid of hardware and software matching.

For both "hybrid" and "software" modes, care should be taken to minimize the threshold for rendezvous processing (i.e. FI_CXI_RDZV_THRESHOLD + FI_CXI_RDZV_GET_MIN). When running in software endpoint mode the environment variables FI_CXI_REQ_BUF_SIZE and FI_CXI_REQ_BUF_MIN_POSTED are used to control the size and number of the eager request buffers posted to handle incoming unmatched messages.

The docs above go out of their way to emphasize that the value (FI_CXI_RDZV_THRESHOLD + FI_CXI_RDZV_GET_MIN) should be "minimized" in FI_CXI_RX_MATCH_MODE=software mode. The rationale is presumably that larger eager messages greatly exacerbate the data volume of any unexpected messages, thus consuming the available overflow buffers even faster. I'm not sure that FI_CXI_RDZV_THRESHOLD=16384 constitutes "minimized", and FI_CXI_RDZV_GET_MIN is using an undocumented CXI default value of "who knows". So in this case the warning may be alerting you to a genuine mis-configuration problem (at least a configuration counter-indicated by HPE documentation). I think it's worth at least one small-scale run where you manually set all three "related" values together, eg:

export FI_CXI_RX_MATCH_MODE=software
export FI_CXI_RDZV_THRESHOLD=256 
export FI_CXI_RDZV_GET_MIN=256
# plus your other settings

and consult the generated warning message to confirm all three values "survived" to configure CXI. Once that's confirmed, then perhaps a larger run to see if it impacts the deadlocks.
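As a concrete aside on the rule quoted from the man page, here is a tiny sketch of the protocol-selection arithmetic (the function name and structure are invented for illustration and are not libfabric API):

#include <stdbool.h>
#include <stddef.h>

/* Per the fi_cxi text quoted above: sends of at most THRESHOLD + GET_MIN
   bytes are fully eager; larger sends go rendezvous, with THRESHOLD bytes
   sent eagerly and the remainder read from the source via a Get. */
static bool send_is_fully_eager(size_t send_len,
                                size_t rdzv_threshold,  /* FI_CXI_RDZV_THRESHOLD */
                                size_t rdzv_get_min) {  /* FI_CXI_RDZV_GET_MIN */
  return send_len <= rdzv_threshold + rdzv_get_min;
}

With FI_CXI_RDZV_THRESHOLD=16384 every send up to at least 16 KiB travels eagerly (and can therefore land as an "unexpected" message), whereas with the 256/256 settings suggested above that cutoff drops to 512 bytes, which is presumably what the documentation means by keeping the rendezvous threshold "minimized".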

I tried doubling GASNET_OFI_NUM_RECEIVE_BUFFS to 900, and that rerun also froze at 4096 nodes. It might help to get some guidance from @bonachea as to how much space I am actually allocating.

The reason we currently default GASNET_OFI_NUM_RECEIVE_BUFFS=450 in GASNET_OFI_RECEIVE_BUFF_SIZE=recv mode is documented here, specifically to work around what we believe to be a not-yet-mentioned bug in the CXI provider at high PPN (eg 128 ppn) that results in a hang or crash very early in GASNet startup, even at small node scales. Assuming S3D is running at less than 8ish PPN, it's probably safe to greatly increase GASNET_OFI_NUM_RECEIVE_BUFFS. However, in order to avoid wasting allocation, I'd start with a smaller-scale run to validate that your chosen GASNET_OFI_NUM_RECEIVE_BUFFS value makes it past Realm initialization.

Based on reading the code and GASNet trace outputs, the per-process memory impact of GASNET_OFI_NUM_RECEIVE_BUFFS in GASNET_OFI_RECEIVE_BUFF_SIZE=recv mode should very roughly be GASNET_OFI_NUM_RECEIVE_BUFFS * (GASNET_OFI_MAX_MEDIUM + am_header_sz + metadata_sz) bytes, where GASNET_OFI_MAX_MEDIUM defaults to 8KiB and (am_header_sz + metadata_sz) is under 100 bytes. So the defaults:

GASNET_OFI_MAX_MEDIUM=8192 
GASNET_OFI_RECEIVE_BUFF_SIZE=recv 
GASNET_OFI_NUM_RECEIVE_BUFFS=450

entail a total of about 3.5 MiB of GASNet-level receive buffers at each process. This excludes other sources of ofi-conduit buffer memory consumption (notably send buffers, which are controlled independently).
Using a GASNet-only hello world (no MPI) with GASNET_OFI_RECEIVE_BUFF_SIZE=recv and GASNET_OFI_MAX_MEDIUM=8192, I was just able to safely raise GASNET_OFI_NUM_RECEIVE_BUFFS=7700 with 8PPN on Frontier, for 60MiB of buffers per process, but GASNET_OFI_NUM_RECEIVE_BUFFS=7800 was "too much" at 8PPN and resulted in fi_mr_enable failed for aux_seg: -28(No space left on device). MPI is believed to consume some of the precious device resources that are the limiting factor here, so YMMV.
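To make that arithmetic easy to reproduce, here is a trivial sketch of the estimate (the 100-byte header/metadata overhead is the rough figure quoted above, not an exact constant, and the buffer counts are just sample values):

#include <stdio.h>

int main(void) {
  const double max_medium = 8192.0;   /* GASNET_OFI_MAX_MEDIUM default */
  const double overhead   = 100.0;    /* approx. am_header_sz + metadata_sz */
  const long sample_counts[] = { 450, 900, 7700 };
  for (unsigned i = 0; i < sizeof sample_counts / sizeof sample_counts[0]; i++) {
    double mib = sample_counts[i] * (max_medium + overhead) / (1024.0 * 1024.0);
    printf("GASNET_OFI_NUM_RECEIVE_BUFFS=%-5ld -> ~%.1f MiB of recv buffers per process\n",
           sample_counts[i], mib);
  }
  return 0;
}

This is consistent (to rounding) with the ~3.5 MiB figure for the default of 450 and the ~60 MiB figure for 7700 quoted above.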

Thanks. I'll run these tests and get back to you. I do need to rebuild my software.

I'll just note that we have given up on the 8 PPN configuration of S3D and run exclusively in a 1 PPN configuration, which we intend to do for the foreseeable future (as we have other, unrelated issues at 8 PPN). Therefore, I think we could push these values even higher (if I understand correctly).

Amendment to previous post:
I believe the bug4478 h/w recv resource limitations I mentioned only apply in FI_CXI_RX_MATCH_MODE=hardware and FI_CXI_RX_MATCH_MODE=hybrid, which makes sense because the entire point of FI_CXI_RX_MATCH_MODE=software is to avoid relying on h/w offloaded recv. So that limit may be totally irrelevant in this use case.

we have given up on the 8 PPN configuration of S3D and run exclusively in a 1 PPN configuration, which we intend to do for the foreseeable future (as we have other, unrelated issues at 8 PPN). Therefore, I think we could push these values even higher (if I understand correctly).

In FI_CXI_RX_MATCH_MODE=software mode with 1PPN and the following settings:

FI_CXI_RX_MATCH_MODE=software 
FI_CXI_RDZV_THRESHOLD=256 
FI_CXI_RDZV_GET_MIN=256
GASNET_OFI_RECEIVE_BUFF_SIZE=recv 
GASNET_OFI_MAX_MEDIUM=8192

I'm able to crank Frontier's GASNET_OFI_NUM_RECEIVE_BUFFS up over 1000000 for a whopping 7.7 GiB of recv buffer space per process (i.e. per node) without encountering the init-time problem. (Again, this is in a GASNet-only program with no MPI resource consumption.) I'm not recommending that much buffer space in production (because it's basically guaranteed to incur heavy TLB performance penalties), but the point is that in FI_CXI_RX_MATCH_MODE=software mode it seems you can freely crank up GASNET_OFI_NUM_RECEIVE_BUFFS until you exhaust main memory.

Here is my status report as of tonight. I rebuilt the code from an entirely fresh checkout. This is the version that traces 1 timestep at a time (similar to Experiment 3 in #1657 (comment), but note this is a fresh build now), because that version had the shortest startup time.

Experiment 4: GASNET_OFI_RECEIVE_BUFF_SIZE=recv and GASNET_OFI_NUM_RECEIVE_BUFFS=10000

Expand for full variables
export FI_MR_CACHE_MONITOR=memhooks
export FI_CXI_RX_MATCH_MODE=software
export FI_CXI_RDZV_THRESHOLD=256
export FI_CXI_RDZV_GET_MIN=256
export GASNET_OFI_DEVICE_0=cxi2
export GASNET_OFI_DEVICE_1=cxi1
export GASNET_OFI_DEVICE_2=cxi3
export GASNET_OFI_DEVICE_3=cxi0
export GASNET_OFI_DEVICE_TYPE=Node
export GASNET_OFI_RECEIVE_BUFF_SIZE=recv
export GASNET_OFI_NUM_RECEIVE_BUFFS=10000
export MPICH_OFI_NIC_POLICY=BLOCK

Results:

  • From 1 to 4096 nodes everything worked perfectly
  • At 8192 nodes, the job timed out without ever finishing a time step

For posterity, the contents of the GASNet warning printed in these runs confirm that our configuration is getting through:

*** WARNING: ofi-conduit failed to configure FI_CXI_* environment variables due to prior conflicting settings. This may lead to unstable behavior and/or degraded performance. If you did not intentionally set FI_CXI_RX_MATCH_MODE='software', FI_CXI_RDZV_THRESHOLD='256', or FI_CXI_RDZV_GET_MIN='256' then this condition may have resulted from initializing MPI prior to initialization of GASNet. For more information on that scenario, please see "Limits to MPI interoperability" in the ofi-conduit README.

Experiment 5: GASNET_OFI_RECEIVE_BUFF_SIZE=recv and GASNET_OFI_NUM_RECEIVE_BUFFS=100000

Increasing by another order of magnitude just to see what would happen.

Expand for full variables
export FI_MR_CACHE_MONITOR=memhooks
export FI_CXI_RX_MATCH_MODE=software
export FI_CXI_RDZV_THRESHOLD=256 # suggested: https://github.com/StanfordLegion/legion/issues/1657#issuecomment-2016929847
export FI_CXI_RDZV_GET_MIN=256 # suggested: https://github.com/StanfordLegion/legion/issues/1657#issuecomment-2016929847
export GASNET_OFI_DEVICE_0=cxi2
export GASNET_OFI_DEVICE_1=cxi1
export GASNET_OFI_DEVICE_2=cxi3
export GASNET_OFI_DEVICE_3=cxi0
export GASNET_OFI_DEVICE_TYPE=Node
export GASNET_OFI_RECEIVE_BUFF_SIZE=recv
export GASNET_OFI_NUM_RECEIVE_BUFFS=100000
export MPICH_OFI_NIC_POLICY=BLOCK

Results:

  • From 1 to 4096 nodes worked perfectly
  • At 8192 nodes, the job timed out without ever finishing a time step

Experiment 6: GASNET_OFI_RECEIVE_BUFF_SIZE=8M

This was my original configuration for Experiment 3, but I decided to rerun it in the new build.

Expand for full variables
export FI_MR_CACHE_MONITOR=memhooks
export FI_CXI_RX_MATCH_MODE=software
export GASNET_OFI_DEVICE_0=cxi2
export GASNET_OFI_DEVICE_1=cxi1
export GASNET_OFI_DEVICE_2=cxi3
export GASNET_OFI_DEVICE_3=cxi0
export GASNET_OFI_DEVICE_TYPE=Node
export GASNET_OFI_RECEIVE_BUFF_SIZE=8M
export MPICH_OFI_NIC_POLICY=BLOCK

Results:

  • From 1 to 4096 nodes everything worked perfectly
  • At 8192 nodes, the job timed out without ever finishing a time step

Experiment 7: GASNET_OFI_RECEIVE_BUFF_SIZE=64M

I figured if memory is the issue, why not go up? So I ran a test at 64 MB.

Expand for full variables
export FI_MR_CACHE_MONITOR=memhooks
export FI_CXI_RX_MATCH_MODE=software
export GASNET_OFI_DEVICE_0=cxi2
export GASNET_OFI_DEVICE_1=cxi1
export GASNET_OFI_DEVICE_2=cxi3
export GASNET_OFI_DEVICE_3=cxi0
export GASNET_OFI_DEVICE_TYPE=Node
export GASNET_OFI_RECEIVE_BUFF_SIZE=64M
export MPICH_OFI_NIC_POLICY=BLOCK

Results:

  • From 1 to 4096 nodes everything worked perfectly
  • At 8192 nodes, I hit the error below:
srun: error: Task launch for StepId=1794522.0 failed on node frontier04727: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

Retrying this job resulted in a job killed by node failure.

I guess I'll keep trying?

Experiment 8: GASNET_OFI_RECEIVE_BUFF_SIZE=64M and RDZV variables to 256

It wasn't obvious to me if @bonachea's advice above would apply to this mode, so I figured I'd try.

Expand for full variables
export FI_MR_CACHE_MONITOR=memhooks
export FI_CXI_RX_MATCH_MODE=software
export FI_CXI_RDZV_THRESHOLD=256
export FI_CXI_RDZV_GET_MIN=256
export GASNET_OFI_DEVICE_0=cxi2
export GASNET_OFI_DEVICE_1=cxi1
export GASNET_OFI_DEVICE_2=cxi3
export GASNET_OFI_DEVICE_3=cxi0
export GASNET_OFI_DEVICE_TYPE=Node
export GASNET_OFI_RECEIVE_BUFF_SIZE=64M
export MPICH_OFI_NIC_POLICY=BLOCK

Results:

  • From 1 to 4 nodes worked fine
  • 8 nodes timed out
  • 16 nodes crashed with the error below:
s3d.x: /autofs/nccs-svm1_home1/eslaught/frontier/gb2024-subrank5-trace1/legion/runtime/realm/utils.cc:118: uint32_t Realm::crc32c_accumulate(uint32_t, const void*, size_t): Assertion `accum == accum_sw' failed.

Experiment 9: GASNET_OFI_RECEIVE_BUFF_SIZE=8M with -lg:inorder -ll:force_kthreads

All this had me wondering if we were on the right track. So I figured I'd go back and try to capture a backtrace that tells us exactly what Legion is doing at the point of the freeze. Or maybe -lg:inorder will slow things down enough that it doesn't happen. It seems crazy, but the application does actually run at scale with -lg:inorder.

Expand for full variables
export FI_MR_CACHE_MONITOR=memhooks
export FI_CXI_RX_MATCH_MODE=software
export GASNET_OFI_DEVICE_0=cxi2
export GASNET_OFI_DEVICE_1=cxi1
export GASNET_OFI_DEVICE_2=cxi3
export GASNET_OFI_DEVICE_3=cxi0
export GASNET_OFI_DEVICE_TYPE=Node
export GASNET_OFI_RECEIVE_BUFF_SIZE=8M
export MPICH_OFI_NIC_POLICY=BLOCK

Results:

  • From 1 to 4096 nodes worked, although the application is about 60% slower in this configuration. Fortunately, the slowdown holds steady as I scale out, so it seems plausible I could take this out all the way.
  • At 8192 nodes, the job timed out without ever finishing a time step. Unfortunately I was not awake at the time to collect backtraces; I'll set it up to be able to do that.

Discussion

Obviously, the runs haven't worked yet and I am not sure I have validated our hypotheses. I am waiting on some runs that would help firm this up. But perhaps it's worth going back to the drawing board to consider what else could be going on.

I was asked to do runs with libfabric logging enabled. The results are below.

This is still the same application configuration as #1657 (comment)

Experiment 10: GASNET_OFI_RECEIVE_BUFF_SIZE=8M (with CXI logging)

Expand for full variables
export FI_MR_CACHE_MONITOR=memhooks
export FI_CXI_RX_MATCH_MODE=software
export FI_LOG_LEVEL=warn
export FI_LOG_PROV=cxi
export GASNET_OFI_DEVICE_0=cxi2
export GASNET_OFI_DEVICE_1=cxi1
export GASNET_OFI_DEVICE_2=cxi3
export GASNET_OFI_DEVICE_3=cxi0
export GASNET_OFI_DEVICE_TYPE=Node
export GASNET_OFI_RECEIVE_BUFF_SIZE=8M
export MPICH_OFI_NIC_POLICY=BLOCK

Logs: http://sapling.stanford.edu/~eslaught/ammonia.run5-trace1-logs/pwave_x_8192_hept/run/

Application output is under the respective out_*.txt files. Legion logging is under run_*.log.

Note that like before, 1 to 4096 nodes ran to completion while 8192 nodes froze.

Experiment 11: GASNET_OFI_RECEIVE_BUFF_SIZE=recv and GASNET_OFI_NUM_RECEIVE_BUFFS=100000 (with CXI logging)

Expand for full variables
export LD_LIBRARY_PATH=$DEV/build/$MECHANISM:$KERNEL_PATH:$DEV/legion/language/build/lib:$LD_LIBRARY_PATH
export FI_MR_CACHE_MONITOR=memhooks
export FI_CXI_RX_MATCH_MODE=software
export FI_LOG_LEVEL=warn
export FI_LOG_PROV=cxi
export FI_CXI_RDZV_THRESHOLD=256
export FI_CXI_RDZV_GET_MIN=256
export GASNET_OFI_DEVICE_0=cxi2
export GASNET_OFI_DEVICE_1=cxi1
export GASNET_OFI_DEVICE_2=cxi3
export GASNET_OFI_DEVICE_3=cxi0
export GASNET_OFI_DEVICE_TYPE=Node
export GASNET_OFI_RECEIVE_BUFF_SIZE=recv
export GASNET_OFI_NUM_RECEIVE_BUFFS=100000
export MPICH_OFI_NIC_POLICY=BLOCK

Logs: http://sapling.stanford.edu/~eslaught/ammonia.run5-trace1-logs-buff-size-recv-100k/pwave_x_8192_hept/run/

Application output is under the respective out_*.txt files. Legion logging is under run_*.log.

Note that like before, 1 to 4096 nodes ran to completion while 8192 nodes froze.

There are too many experiments to track at this point, so I have moved the tracking into a spreadsheet here:

https://docs.google.com/spreadsheets/d/1OoDW9ie4uewUWaGHKQTOC8bIavGzQfxPNHE2EwJIY5Q/edit?usp=sharing

To follow up on this, we did eventually reach a set of variables that allow S3D to run up to 8,192 nodes on Frontier. Having said that, it's not entirely obvious that these variables are actually necessary or sufficient.

They may not be necessary because Seshu has been running recently without them, and got up to 8,192 nodes.

They may not be sufficient because we have still seen issues: Seshu in #1683 and myself in #1696

I'm not sure what else to say. There is more to do to figure out what actually is necessary and sufficient for running on these networks.

In the Realm meeting today I think we did identify a potential forward progress issue in the GASNetEX module. My earlier claim, from investigating the Realm code, that Realm has an independent progress thread for pulling messages off of the wire is not actually correct:

Ok, I see where Realm is running a separate polling thread that always pulls active messages off the wire and that should be sufficient to ensure forward progress even if all the background worker threads get stuck in a commit call.

Rather, my first assertion, that we had gotten rid of this progress thread and instead relied on the background worker threads to service incoming active messages, is actually right:

I thought we had gotten rid of the polling thread in the GASNetEX module

Realm's GASNet1 networking module does have such a progress thread, but the GASNetEX module has indeed removed this polling progress thread.

Unfortunately this model of forward progress is only sound if calls to gex_AM_PrepareRequest* with the GEX_FLAG_IMMEDIATE flag will precisely predict whether the subsequent call to gex_AM_CommitRequest* will block or not. As we've previously established in this issue, GASNet cannot guarantee precision in this regard for reasons having to do with the details of the networking stack:

Regardless of whether or not GEX_FLAG_IMMEDIATE is passed to gex_AM_PrepareRequestMedium() (or FPAM injection), this is a "best effort" GASNet feature, and does not (cannot) provide a hard guarantee that the underlying network library will not block during injection. Currently I'd characterize GASNet's "best effort" here as "pretty good" on ibv-conduit and "not so great" on ofi-conduit. My understanding of the current state of IMMEDIATE injection in ofi-conduit is that FPAM injection should faithfully detect cases where GASNet's conduit-level metadata (buffer descriptors) are running scarce, and should correctly return failure IFF the underlying fi_inject/fi_send call to the libfabric provider returns -FI_EAGAIN. However ofi-conduit does not attempt to model/predict the resource state of the underlying provider, who might choose to block without returning -FI_EAGAIN.

What this means is that it is possible for all the background worker threads in Realm to get stuck trying to send active messages, leaving no remaining threads available to poll incoming active messages and drain the network. The fact that we see this much more commonly on Slingshot than InfiniBand aligns with the assertion that GASNet's answers about whether an active message injection will block are much more precise on InfiniBand than on Slingshot.

If that is the case, I suspect we'll need to alter the architecture of the Realm GASNetEX module to look more like the GASNet1 module in order to guarantee forward progress.
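To make the proposed architecture concrete, here is a minimal sketch of what a dedicated progress thread could look like (illustrative only: shutting_down and the thread setup are invented here, and the actual design in the GASNet1 module may differ):

#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <gasnetex.h>   /* for gasnet_AMPoll(); header name per your GASNet build */

static atomic_bool shutting_down;   /* set elsewhere at shutdown (invented flag) */

static void *am_progress_thread(void *arg) {
  (void) arg;
  while (!atomic_load(&shutting_down)) {
    /* Keep draining incoming active messages even if every background
       worker thread is stuck inside a blocking commit call. */
    gasnet_AMPoll();
    sched_yield();   /* a real implementation would rate-limit or block more cleverly */
  }
  return NULL;
}

static pthread_t progress_tid;

static void start_progress_thread(void) {
  pthread_create(&progress_tid, NULL, am_progress_thread, NULL);
}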

@manopapad for determining priorities for addressing this issue
@SeyedMir to check on similar properties for the UCX network module
@JiakunYan @streichler for general interest

I checked with the UCX team and we should be fine with UCX: ucp_worker_progress will not block the calling thread. Of course, this is not applicable/related to Slingshot in any way, as UCX does not have support for Slingshot.

However ofi-conduit does not attempt to model/predict the resource state of the underlying provider, who might choose to block without returning -FI_EAGAIN.

@bonachea Do you suspect that ofi communication calls (fi_send, fi_recv, etc.) can be blocked indefinitely if the user does not call the progress function (fi_cq_read*)?

@JiakunYan Together, @bonachea and I have looked previously at the libfabric documentation and we determined that it allows for the (unfortunate) behavior you describe. However, we do not know the behavior of any particular providers (such as cxi) nor would there be any guarantee that the behavior would not change in a future release of libfabric.

@PHHargrove @bonachea Thanks! To be honest, I would be very surprised if any provider actually behaves in this way in practice. If this is the case, it seems to me the only way to avoid this issue is to always have dedicated progress threads. It would put most MPI programs at the risk of deadlock.

@JiakunYan
I believe that at least one class of problematic conditions we identified is one in which a sending process/thread was blocked in fi_send*() because the remote process was not processing completion records as needed to complete receives and/or was not providing the necessary buffer resources. A table in the fi_domain manpage actually appears to require "retry" (and prohibit FI_EAGAIN) for the lack of a receive buffer at the target (for RDM endpoints with RM enabled).

@PHHargrove Thanks! I see what you mean. I don't understand why they have this weird "retry" requirement. It would still be good to make sure that is what actually happens (stuck inside libfabric, instead of the infinite gasnetex send-retry-progress loop).

FWIW, I did see libfabric progress being called in @elliottslaughter's freeze backtrace.

Thread 3 (Thread 0x7ffe5d20f740 (LWP 81080) "s3d.x"):
#0  0x00007fffdd8877df in cxip_evtq_progress (evtq=evtq@entry=0x67885c0) at prov/cxi/src/cxip_evtq.c:388
#1  0x00007fffdd8591c9 in cxip_ep_progress (fid=<optimized out>) at prov/cxi/src/cxip_ep.c:184
#2  0x00007fffdd85e969 in cxip_util_cq_progress (util_cq=0x40b6810) at prov/cxi/src/cxip_cq.c:112
#3  0x00007fffdd83a301 in ofi_cq_readfrom (cq_fid=0x40b6810, buf=<optimized out>, count=64, src_addr=0x0) at prov/util/src/util_cq.c:232
#4  0x00007fffe0f6f62c in gasnetc_ofi_tx_poll () from /ccs/home/eslaught/frontier/gb2024-subrank3/legion/language/build/lib/librealm.so.1
#5  0x00007fffe0f7211d in gasnetc_ofi_am_send_medium () from /ccs/home/eslaught/frontier/gb2024-subrank3/legion/language/build/lib/librealm.so.1
#6  0x00007fffe0fefd91 in gasnetc_AM_CommitRequestMediumM () from /ccs/home/eslaught/frontier/gb2024-subrank3/legion/language/build/lib/librealm.so.1
#7  0x00007fffe00d7628 in Realm::XmitSrcDestPair::push_packets (this=0x7ffd04031d50, immediate_mode=<optimized out>, work_until=...) at /autofs/nccs-svm1_home1/eslaught/frontier/gb2024-subrank3/legion/runtime/realm/gasnetex/gasnetex_internal.cc:2092
#8  0x00007fffe00d925e in Realm::GASNetEXInjector::do_work (this=<optimized out>, work_until=...) at /autofs/nccs-svm1_home1/eslaught/frontier/gb2024-subrank3/legion/runtime/realm/gasnetex/gasnetex_internal.cc:2725
#9  0x00007fffdff6d778 in Realm::BackgroundWorkManager::Worker::do_work (this=0x7ffe5d20d520, max_time_in_ns=<optimized out>, interrupt_flag=0x0) at /autofs/nccs-svm1_home1/eslaught/frontier/gb2024-subrank3/legion/runtime/realm/bgwork.cc:599
#10 0x00007fffdff6e051 in Realm::BackgroundWorkThread::main_loop (this=0x1fb2cdb0) at /autofs/nccs-svm1_home1/eslaught/frontier/gb2024-subrank3/legion/runtime/realm/bgwork.cc:103
#11 0x00007fffe007e541 in Realm::KernelThread::pthread_entry (data=0x1fdad8b0) at /autofs/nccs-svm1_home1/eslaught/frontier/gb2024-subrank3/legion/runtime/realm/threads.cc:831
#12 0x00007fffed8e96ea in start_thread () from /lib64/libpthread.so.0
#13 0x00007fffe912150f in clone () from /lib64/libc.so.6

(http://sapling.stanford.edu/~eslaught/s3d_freeze_debug2/bt4096-3/bt_frontier00001_81054.log line 322)

Here are the two earlier backtraces for the same thread on the same node:

Thread 3 (Thread 0x7ffe5d20f740 (LWP 81080) "s3d.x"):
#0  0x00007fffed8f19d5 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00007fffdd874207 in ofi_genlock_lock (lock=<optimized out>) at ./include/ofi_lock.h:359
#2  cxip_send_common (txc=0x6754838, tclass=512, buf=0x7ffe75a1f278, len=912, desc=<optimized out>, data=data@entry=0, dest_addr=74, tag=0, context=0x7ffe75a1f250, flags=83886080, tagged=false, triggered=false, trig_thresh=0, trig_cntr=0x0, comp_cntr=0x0) at prov/cxi/src/cxip_msg.c:5510
#3  0x00007fffdd875001 in cxip_send (fid_ep=<optimized out>, buf=<optimized out>, len=<optimized out>, desc=<optimized out>, dest_addr=<optimized out>, context=<optimized out>) at prov/cxi/src/cxip_msg.c:5874
#4  0x00007fffe0f72136 in gasnetc_ofi_am_send_medium () from /ccs/home/eslaught/frontier/gb2024-subrank3/legion/language/build/lib/librealm.so.1
#5  0x00007fffe0fefd91 in gasnetc_AM_CommitRequestMediumM () from /ccs/home/eslaught/frontier/gb2024-subrank3/legion/language/build/lib/librealm.so.1
#6  0x00007fffe00d7628 in Realm::XmitSrcDestPair::push_packets (this=0x7ffd04031d50, immediate_mode=<optimized out>, work_until=...) at /autofs/nccs-svm1_home1/eslaught/frontier/gb2024-subrank3/legion/runtime/realm/gasnetex/gasnetex_internal.cc:2092
#7  0x00007fffe00d925e in Realm::GASNetEXInjector::do_work (this=<optimized out>, work_until=...) at /autofs/nccs-svm1_home1/eslaught/frontier/gb2024-subrank3/legion/runtime/realm/gasnetex/gasnetex_internal.cc:2725
#8  0x00007fffdff6d778 in Realm::BackgroundWorkManager::Worker::do_work (this=0x7ffe5d20d520, max_time_in_ns=<optimized out>, interrupt_flag=0x0) at /autofs/nccs-svm1_home1/eslaught/frontier/gb2024-subrank3/legion/runtime/realm/bgwork.cc:599
#9  0x00007fffdff6e051 in Realm::BackgroundWorkThread::main_loop (this=0x1fb2cdb0) at /autofs/nccs-svm1_home1/eslaught/frontier/gb2024-subrank3/legion/runtime/realm/bgwork.cc:103
#10 0x00007fffe007e541 in Realm::KernelThread::pthread_entry (data=0x1fdad8b0) at /autofs/nccs-svm1_home1/eslaught/frontier/gb2024-subrank3/legion/runtime/realm/threads.cc:831
#11 0x00007fffed8e96ea in start_thread () from /lib64/libpthread.so.0
#12 0x00007fffe912150f in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7ffe5d20f740 (LWP 81080) "s3d.x"):
#0  0x00007fffed8f19d5 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00007fffdd85e954 in ofi_genlock_lock (lock=0x40b6928) at ./include/ofi_lock.h:359
#2  cxip_util_cq_progress (util_cq=0x40b6810) at prov/cxi/src/cxip_cq.c:109
#3  0x00007fffdd83a301 in ofi_cq_readfrom (cq_fid=0x40b6810, buf=<optimized out>, count=64, src_addr=0x0) at prov/util/src/util_cq.c:232
#4  0x00007fffe0f6f62c in gasnetc_ofi_tx_poll () from /ccs/home/eslaught/frontier/gb2024-subrank3/legion/language/build/lib/librealm.so.1
#5  0x00007fffe0f7211d in gasnetc_ofi_am_send_medium () from /ccs/home/eslaught/frontier/gb2024-subrank3/legion/language/build/lib/librealm.so.1
#6  0x00007fffe0fefd91 in gasnetc_AM_CommitRequestMediumM () from /ccs/home/eslaught/frontier/gb2024-subrank3/legion/language/build/lib/librealm.so.1
#7  0x00007fffe00d7628 in Realm::XmitSrcDestPair::push_packets (this=0x7ffd04031d50, immediate_mode=<optimized out>, work_until=...) at /autofs/nccs-svm1_home1/eslaught/frontier/gb2024-subrank3/legion/runtime/realm/gasnetex/gasnetex_internal.cc:2092
#8  0x00007fffe00d925e in Realm::GASNetEXInjector::do_work (this=<optimized out>, work_until=...) at /autofs/nccs-svm1_home1/eslaught/frontier/gb2024-subrank3/legion/runtime/realm/gasnetex/gasnetex_internal.cc:2725
#9  0x00007fffdff6d778 in Realm::BackgroundWorkManager::Worker::do_work (this=0x7ffe5d20d520, max_time_in_ns=<optimized out>, interrupt_flag=0x0) at /autofs/nccs-svm1_home1/eslaught/frontier/gb2024-subrank3/legion/runtime/realm/bgwork.cc:599
#10 0x00007fffdff6e051 in Realm::BackgroundWorkThread::main_loop (this=0x1fb2cdb0) at /autofs/nccs-svm1_home1/eslaught/frontier/gb2024-subrank3/legion/runtime/realm/bgwork.cc:103
#11 0x00007fffe007e541 in Realm::KernelThread::pthread_entry (data=0x1fdad8b0) at /autofs/nccs-svm1_home1/eslaught/frontier/gb2024-subrank3/legion/runtime/realm/threads.cc:831
#12 0x00007fffed8e96ea in start_thread () from /lib64/libpthread.so.0
#13 0x00007fffe912150f in clone () from /lib64/libc.so.6

It's going around a loop here somewhere. Not sure if that tells us anything or not. I wouldn't necessarily rule out part of the problem also being in the Slingshot driver/hardware not behaving as we would expect either.

[edit:] Given that this=0x7ffd04031d50 for the call to push_packets is the same across all three backtraces I'm very confident that we're in the same call to gasnetc_AM_CommitRequestMediumM.

What the backtraces tell me is that Realm is not stuck in ofi send but in the gasnetex try-send-progress loop.

#define OFI_INJECT_RETRY_IMM(lock, fxn, poll_type, imm, label)\
    do {\
        GASNETC_OFI_LOCK_EXPR(lock, fxn);\
        if (ret == -FI_EAGAIN) {\
          if (imm) goto label; \
          GASNETI_SPIN_DOWHILE(ret == -FI_EAGAIN, {\
            GASNETC_OFI_POLL_SELECTIVE(poll_type);\
            GASNETC_OFI_LOCK_EXPR(lock, fxn);\
          });\
        } \
    }while(0)

#define GASNETC_OFI_POLL_SELECTIVE(type) do {\
    if (type == OFI_POLL_ALL) {\
        gasnetc_ofi_am_recv_poll_cold(1);\
        gasnetc_AMPSHMPoll(0);\
    }\
    else {\
        gasnetc_AMPSHMPoll(1);\
    }\
    gasnetc_ofi_am_recv_poll_cold(0);\
    gasnetc_ofi_tx_poll();\
}while(0)

The GASNETC_OFI_POLL_SELECTIVE here may not be comprehensive enough to make progress on all resources that ofi/cxi needs to make the send happen (e.g. poll the right completion queue, post buffers to the right endpoint).

@JiakunYan

Tracing down the code connected to your previous posting:

The backtraces show a call to gasnetc_AM_CommitRequestMediumM().
So, as a "request" call, we know isreq==1 in gasnetc_medium_commit() (called from gasnetc_ofi_am_send_medium() but not shown in the backtrace, presumably due to inlining).

Quoting from gasnetc_medium_commit():

    if (isreq) {
        ep = gasnetc_ofi_request_epfd;
        am_dest = gasnetc_fabric_addr(REQ, dest);
        poll_type = OFI_POLL_ALL;
    }

So, type == OFI_POLL_ALL in the code you've quoted.
There is no more "comprehensive" option than polling "ALL".

@lightsighter wrote (in part)

I wouldn't necessarily rule out part of the problem also being in the Slingshot driver/hardware not behaving as we would expect either

I agree that the "flow-control gets stuck" behavior we've seen and/or speculated about when speaking w/ HPE would likely lead to the observed lack of progress.

@PHHargrove Thanks for the explanation! I don't want to trouble you with too many questions. I will just poke around one last possibility.

The progress of a sending process also depends on its target process having receive buffers posted and its completion queue polled, so only the sending process doing "comprehensive" progress is not enough. I see gasnetex is using two sets of endpoints and completion queues (one for requests and one for replies). If all threads in process A are sending requests to process B (so A only polls the request ep/cq) while process B is sending replies to process A (so B only polls the reply ep/cq), could this lead to a deadlock?

Just found it also polls the reply ep/cq when it is sending requests. Nevermind.

I agree that the "flow-control gets stuck" behavior we've seen and/or speculated about when speaking w/ HPE would likely lead to the observed lack of progress.

@PHHargrove Do you think that if Realm had a separate progress thread here that did nothing but poll for incoming active messages and pull them off the wire would it make a difference or would it not matter?

I agree that the "flow-control gets stuck" behavior we've seen and/or speculated about when speaking w/ HPE would likely lead to the observed lack of progress.

@PHHargrove Do you think that if Realm had a separate progress thread here that did nothing but poll for incoming active messages and pull them off the wire would it make a difference or would it not matter?

I think it would, at best, reduce the probability of the Slingshot network stack getting into a "bad state". The same is true of using various environment variables to increase the buffering in the network stack. I would not care to speculate whether that is "good enough" or not.

If the concern is that all the threads get stuck in a pattern such as the one shown in the stack traces (and assuming that a Slingshot bug is not the root cause), then perhaps there is a better approach than dedicating a progress/polling thread.

Imagine you maintain a counting semaphore with an initial value one less than the number of threads. Before an operation which might block, you "try down" the semaphore. If that succeeds, then there must be at least one thread not attempting potentially blocking operations. So, you proceed with the (potentially) blocking operation and "up" the semaphore on completion.

In the event you fail the "try down", you know that the caller is the "last" thread not to be in a potentially blocking operation. In this case that thread could just begin alternating poll/try-down until obtaining the semaphore indicates at least one other thread is now not in a potentially blocking operation. Alternatively, a failure to obtain the semaphore might be handled in the same manner as "immediate failure" (a return of GEX_AM_SRCDESC_NO_OP) of gex_AM_Prepare*(), though I don't actually know exactly what Realm does in that case.
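A minimal sketch of that scheme, using a plain atomic counter in place of a semaphore (all of the names here are invented for illustration; this is not Realm or GASNet code):

#include <stdatomic.h>
#include <stdbool.h>

/* Allow at most (num_worker_threads - 1) threads into potentially blocking
   injection calls, so at least one thread always remains free to poll. */
static atomic_int injection_slots;   /* initialize to num_worker_threads - 1 */

static bool try_enter_injection(void) {
  int avail = atomic_load(&injection_slots);
  while (avail > 0) {
    if (atomic_compare_exchange_weak(&injection_slots, &avail, avail - 1))
      return true;                   /* we hold a slot; it is safe to maybe block */
    /* on failure 'avail' has been reloaded with the current value; re-check */
  }
  return false;                      /* we would be the last non-blocked thread */
}

static void leave_injection(void) {
  atomic_fetch_add(&injection_slots, 1);
}

static void inject_or_poll(void) {
  if (try_enter_injection()) {
    /* ... the potentially blocking gex_AM_CommitRequest* path goes here ... */
    leave_injection();
  } else {
    /* Last free thread: drain the wire instead (e.g. poll incoming AMs),
       then come back and retry the injection, as described above. */
  }
}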

Alternatively, a failure to obtain the semaphore might be handled in the same manner as "immediate failure" (a return of GEX_AM_SRCDESC_NO_OP) of gex_AM_Prepare*(), though I don't actually know exactly what Realm does in that case.

Realm should come back around and call gasnet_AMPoll here in that case:
https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/gasnetex/gasnetex_internal.cc?ref_type=heads#L2955

It's not clear to me if that is equivalent to the polling loop already happening in gasnetc_AM_CommitRequestMediumM() (e.g. the loop referenced above) or whether gasnet_AMPoll does some additional kind of polling that might be necessary for forward progress.

Alternatively, a failure to obtain the semaphore might be handled in the same manner as "immediate failure" (a return of GEX_AM_SRCDESC_NO_OP) of gex_AM_Prepare*(), though I don't actually know exactly what Realm does in that case.

Realm should come back around and call gasnet_AMPoll here in that case: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/gasnetex/gasnetex_internal.cc?ref_type=heads#L2955

It's not clear to me if that is equivalent to the polling loop already happening in gasnetc_AM_CommitRequestMediumM() (e.g. the loop referenced above) or whether gasnet_AMPoll does some additional kind of polling that might be necessary for forward progress.

I don't believe there should be a difference, based on the dive I took into the code on Wed.

That is what I would expect as well. That suggests that the fact that Realm does not have an independent progress thread is not actually a forward progress issue. It might cause some performance hiccups but we should not hang as a result since GASNet is going to do polling inside the active message commit to ensure forward progress as necessary.

FWIW, I'm reasonably confident the failure to return from fi_cq_readfrom that @elliottslaughter observed on Frontier is a Slingshot software bug. I found enough evidence in the backtraces to indicate that some progress calls were neither returning nor actually making progress. This is on top of the FI_MULTI_RECV issues that @PHHargrove has already isolated and reported.

We were never able to isolate a reproducer we could report for the progress hang (which I'd still like us to do), but I'm aware of at least one internal ticket with a lot of similar characteristics and an MPI reproducer that was very recently fixed. I have no way to predict the timeline for that actually making it to the DoE systems though, and without a reproducer no way to promise that it actually is the same issue.

I don't think we ever validated that processing the receives actually would have allowed progress to resume. It seems to me like figuring out how to test that would be a good first step.

I'm a little worried that we're trying to architect a solution based on reasoning about a software bug that is not functioning as expected or in a reasonable way. I'd be more interested in Realm or GASNet detecting when progress has stopped or slowed to a crawl and giving as much information as they can about what they were doing. Even once the software bugs are fixed, the Legion communication patterns tend to be different enough from standard MPI applications that users might need to change some of the default CXI provider settings based on their workload to get reasonable performance.