StanfordLegion/legion

GASNet debug assert failure in Stencil on Perlmutter


Ok, I got an assertion failure with a debug GASNet and release Legion:

*** FATAL ERROR: Assertion failure (proc 103): in gasnetc_ofi_handle_am() at anguage/gasnet/GASNet-2023.9.0/ofi-conduit/gasnet_ofi.c:1725: isreq == header->isreq
   op1 :           1 (0x00000001) == isreq
   op2 :           0 (0x00000000) == header->isreq

Full log here.

Originally posted by @rupanshusoi in #1449 (comment)

I moved this issue here from #1449 (comment) because this appears to be a different underlying root cause.

Summary:

  • Perlmutter
  • Regent Stencil (modified version that includes a nested control replicated task)
  • Original symptoms look like: #1449 (comment)
  • Does NOT reproduce with debug GASNet + debug Legion
  • Does reproduce with debug GASNet + release Legion (with assert failure at top of this issue)

I haven't seen this one before so maybe @bonachea @PHHargrove can comment?

This assertion failure is one of the three known manifestations of "the FI_MULTI_RECV bug".
Specifically, the provider has delivered a message buffer to us that is all zeros, which happens to result in a detectable inconsistency in our header.
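
To illustrate why an all-zero buffer only sometimes trips this assertion, here is a minimal sketch. The header layout, `HEADER`, and `check_header` are hypothetical names invented for this example and are not GASNet's actual wire format; the only point is that a zeroed `isreq` field disagrees with the expected value when a request was expected, as in the op1/op2 values above.

```python
import struct

# Hypothetical 4-byte header layout, for illustration only (NOT GASNet's real format):
#   byte 0    : isreq flag (1 = request, 0 = reply)
#   byte 1    : handler index
#   bytes 2-3 : payload length
HEADER = struct.Struct("<BBH")


def check_header(buf: bytes, expected_isreq: int) -> bool:
    """Return True when the header's isreq flag matches what the receive path expects."""
    isreq, _handler, _nbytes = HEADER.unpack_from(buf)
    return isreq == expected_isreq


zeroed = bytes(64)  # models a message buffer delivered as all zeros

print(check_header(zeroed, expected_isreq=1))  # False: request expected, mismatch detected
print(check_header(zeroed, expected_isreq=0))  # True: reply expected, zeros look consistent
```

Under these assumptions the corruption is only caught when a request was expected, which is consistent with this assertion being just one of several ways the same underlying problem can show up.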

Please let me know immediately if this has occurred when running with GASNET_OFI_RECEIVE_BUFF_SIZE=recv, since that would not use FI_MULTI_RECV and therefore be a new/different issue.
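
For anyone scripting a rerun with that setting, a minimal sketch follows. Only the variable name `GASNET_OFI_RECEIVE_BUFF_SIZE` and the value `recv` come from this thread; the helper names and the launch command are placeholders to be replaced with your own job launch line.

```python
import os
import subprocess


def env_with_recv_buffers(base_env=None):
    """Return a copy of the environment with GASNET_OFI_RECEIVE_BUFF_SIZE=recv set."""
    env = dict(os.environ if base_env is None else base_env)
    env["GASNET_OFI_RECEIVE_BUFF_SIZE"] = "recv"
    return env


def rerun(cmd):
    """Launch cmd (a list of arguments) with the modified environment; return its exit code."""
    return subprocess.run(cmd, env=env_with_recv_buffers()).returncode


# Hypothetical usage; substitute the real launch command for the failing run:
# rerun(["srun", "-n", "128", "./stencil"])
```

The original environment is copied rather than modified, so other runs launched from the same script are unaffected.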

Regarding "Does NOT reproduce with debug GASNet + debug Legion":
That could be either a timing or "chance" issue. In extensive work with the MetaHipMer team, this failure mode (one of three believed to be related to multi-recv buffer handling in the provider) probably accounted for at most 1% of their bad runs.

Please tell me if the failing run took place before or after the Perlmutter maintenance of March 20.
If it was after, then this is evidence that the issue is still present in SlingShot 2.1.2.

As far as I'm aware, this run did not use GASNET_OFI_RECEIVE_BUFF_SIZE=recv. And it was after the March 20 maintenance.