mochi-hpc/mochi-ssg

weird behavior with SSG's use of margo_forward_timeout

Closed this issue · 2 comments

In GitLab by @shanedsnyder on Dec 16, 2020, 10:30

In some testing, we've seen evidence of problems with how SSG uses margo_forward_timeout. When testing at scale, SSG's default timeout of 2 seconds has not been sufficient, yet when we raise the timeout and time the SSG RPCs, they appear to complete in under 2 seconds.

We should investigate whether there are bugs in the margo_forward_timeout call, and should also consider whether to use the plain margo_forward within SSG (or to come up with more flexible timeout values).
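
For reference, the comparison described above (time a call, then check the measured time against the timeout that was in effect) could be sketched roughly as follows. This is only an illustrative sketch and is not taken from SSG or Margo: `get_timeout_ms`, `timed_call`, `simulated_rpc`, and the idea of reading the timeout from an environment variable are all hypothetical names and assumptions. `simulated_rpc` is a stand-in that just sleeps; a real caller would wrap the actual Margo forward call there.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Assumed default, mirroring the 2-second value mentioned in this issue. */
#define DEFAULT_TIMEOUT_MS 2000.0

/* Hypothetical helper: read a timeout (in ms) from the named environment
 * variable, falling back to the default if it is unset or not a positive
 * number. */
static double get_timeout_ms(const char *env_name)
{
    const char *s = getenv(env_name);
    if (s == NULL || *s == '\0') return DEFAULT_TIMEOUT_MS;
    char *end = NULL;
    double v = strtod(s, &end);
    if (end == s || *end != '\0' || v <= 0.0) return DEFAULT_TIMEOUT_MS;
    return v;
}

/* Monotonic wall-clock time in milliseconds. */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
}

/* Hypothetical callback type: returns 0 on success, -1 on timeout. */
typedef int (*timed_call_fn)(void *arg, double timeout_ms);

/* Run fn, record how long it took, and report when it claims a timeout.
 * A reported timeout with an elapsed time well below the limit would be
 * the suspicious case described in this issue. */
static int timed_call(timed_call_fn fn, void *arg, double timeout_ms,
                      double *elapsed_ms)
{
    assert(fn != NULL && elapsed_ms != NULL);
    double start = now_ms();
    int ret = fn(arg, timeout_ms);
    *elapsed_ms = now_ms() - start;
    if (ret != 0)
        fprintf(stderr, "timeout reported after %.1f ms (limit %.1f ms)\n",
                *elapsed_ms, timeout_ms);
    return ret;
}

/* Stand-in for an RPC: arg points to a double giving the simulated
 * duration in ms. Sleeps for min(duration, timeout) and reports a
 * timeout if the duration exceeds the limit. */
static int simulated_rpc(void *arg, double timeout_ms)
{
    double duration_ms = *(double *)arg;
    int timed_out = duration_ms > timeout_ms;
    double sleep_ms = timed_out ? timeout_ms : duration_ms;
    struct timespec ts;
    ts.tv_sec = (time_t)(sleep_ms / 1000.0);
    ts.tv_nsec = (long)((sleep_ms - ts.tv_sec * 1000.0) * 1.0e6);
    nanosleep(&ts, NULL);
    return timed_out ? -1 : 0;
}
```

A caller would obtain the limit with something like `get_timeout_ms("SSG_SKETCH_TIMEOUT_MS")` (a made-up variable name), pass it to `timed_call`, and log the elapsed time next to the return code, so that a timeout reported well before the limit stands out.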

In GitLab by @shanedsnyder on Mar 18, 2021, 14:15

We only observed this issue on Summit (POWER architecture), and it turns out that a problem with Argobots mutexes was causing it. More details here:

https://lists.argobots.org/pipermail/discuss/2021-January/000094.html

In any case, this issue is resolved in Argobots (at the very least, on the master branch).

In GitLab by @shanedsnyder on Mar 18, 2021, 14:15

closed