etascale/argodsm

Rare coherence failure when prefetching a page unused until after synchronization

Closed this issue · 5 comments

Issue

A rare coherence failure can occur when a remote ArgoDSM page is prefetched to a node at one point, but not accessed until after another node has released changes made to the page. During synchronization, ArgoDSM may falsely consider the prefetched page to be in "single writer" state and keep it on the node (without performing a self-downgrade on it) instead of correctly self-invalidating it. This error does not correct itself unless the page is evicted from the node cache, and subsequent reads of the page will not observe changes made by remote nodes, even after global synchronization.
Note that this issue only occurs if prefetch (DUAL_LOAD) is enabled.

How to reproduce

It is possible to reliably reproduce this error by initializing ArgoDSM with specific argo_size and cache_size parameters along with specific code to trigger the offending prefetch (to ensure that a page is prefetched but not used). I have been able to reproduce this with a cache_size smaller than argo_size/2 in the following code:
https://github.com/lundgren87/paratest/blob/prefetch_fail/simpletest.cpp

This specific code must be executed on exactly two ArgoDSM nodes (the same result is obtained with HW or SW nodes) in order to reproduce this failure (other values of argo_size and cache_size may reproduce the issue with other node counts), as follows:
mpirun -n 2 ./simpletest

ArgoDSM details

  • In the example above, 20 ArgoDSM pages are allocated during initialization, pages 0-9 on "node 0" and 10-19 on "node 1". Node 0 initializes pages 9-12, which triggers remote loads of pages 10 and 12, and prefetch of pages 11 and 13. Page 13 is not accessed during initialization by node 0. Node 1 meanwhile initializes pages 13-16.
  • At the first synchronization point, node 0 considers page 13 to be in "single writer" state even though it has never accessed the page, and the page is therefore kept on the node (without being self-downgraded, since it is in state CLEAN).
  • After the first synchronization point, node 1 writes additional changes to page 13.
  • At the second synchronization point, the changes made by node 1 should be made visible to node 0. However, node 0 does not recognize that any writer has been added to page 13 and again keeps it in the cache instead of self-invalidating it.
  • After the second synchronization point, node 0 attempts to check the writes made to pages 9-16, but when it reaches page 13 it reads the stale copy in the local node cache instead of remotely loading the page containing the latest updates from node 1, resulting in missed writes. A sketch of this access pattern is shown below.
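For reference, here is a minimal sketch of that access pattern, assuming the argo::init / argo::conew_array / argo::barrier / argo::node_id API and a 4 KiB page granularity; the sizes and page indices mirror the list above, and the linked simpletest.cpp remains the authoritative reproducer:

#include <cstddef>
#include "argo.hpp"

int main(){
    constexpr std::size_t page_size  = 4096;            // assumed ArgoDSM page granularity
    constexpr std::size_t argo_size  = 20 * page_size;  // 20 pages, 10 homed per node
    constexpr std::size_t cache_size = 8 * page_size;   // smaller than argo_size/2

    argo::init(argo_size, cache_size);
    char* mem = argo::conew_array<char>(argo_size);
    const auto nid = argo::node_id();                    // 0 or 1 when run on two nodes

    // Node 0 touches pages 9-12 (remote loads of 10 and 12, prefetch of 11 and 13);
    // node 1 initializes pages 13-16. Page 13 is prefetched by node 0 but never used here.
    if(nid == 0){
        for(std::size_t p = 9; p <= 12; ++p) mem[p * page_size] = 1;
    } else {
        for(std::size_t p = 13; p <= 16; ++p) mem[p * page_size] = 1;
    }

    argo::barrier();                                     // first synchronization point

    if(nid == 1) mem[13 * page_size] = 2;                // node 1 writes to page 13 again

    argo::barrier();                                     // second synchronization point

    // Node 0 now reads pages 9-16; the stale copy of page 13 kept in its cache
    // hides the value written by node 1 after the first barrier.
    if(nid == 0){
        for(std::size_t p = 9; p <= 16; ++p){
            volatile char value = mem[p * page_size];
            (void)value;
        }
    }

    argo::codelete_array(mem);
    argo::finalize();
}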

As @pekemark points out in #26, there is an odd difference between

if(isPowerOf2((tempsharer)&invid) && tempsharer != id && prevsharer == 0){ //Other private. but may not have loaded page yet.
and
if(isPowerOf2((tempsharer)&invid) && prevsharer == 0){ //Other private. but may not have loaded page yet.

Is this related to this problem?

When inspecting the body of the if-statement, it seems like the missing condition is handled another way. So this is probably not the problem. Someone should double-check in case I got something wrong. (And do try to see if adding the condition solves the issue.)


tl;dr - This involves a few things that are "wrong" but cannot cause any real problems, and fixing them does not fix the issue. This is as much an explanation as it is a sanity check, so feel free to let me know if I have misunderstood something.

As far as I understand, the following applies:
prevsharer - id if we are already a sharer of this page, else 0
tempsharer - all sharers registered at the homenode before adding workrank, or 0 if workrank is already a sharer

This means that the first line should read:
If there is any private sharer other than me, and I am not the only sharer, and I was previously not a sharer.
Which I believe translates to (correct me if I'm wrong):
P>S classification: the previously single private sharer must be notified of the classification change.

However, looking into the function:

unsigned long isPowerOf2(unsigned long x){
    unsigned long retval = ((x & (x - 1)) == 0); //Checks if x is power of 2 (or zero)
    return retval;
}

0 is also considered a valid power of 2, so the first line actually reads:
If there is any private sharer other than me OR no sharer at all, and I am not the only sharer, and I was previously not a sharer.
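To make this concrete, here is a small self-contained check of the common part of both conditions (the one-hot id/invid encoding is my reading of the code, and the node numbers are made up):

#include <cstdio>

// isPowerOf2 as defined above: 0 is also reported as a power of 2.
unsigned long isPowerOf2(unsigned long x){
    return ((x & (x - 1)) == 0);
}

int main(){
    unsigned long id    = 1UL << 0;   // pretend we are node 0 (assumed one-hot encoding)
    unsigned long invid = ~id;
    unsigned long prevsharer = 0;     // we were not a sharer of this page before

    // Case 1: node 1 is the only (private) sharer -> the P>S case is detected.
    unsigned long tempsharer = 1UL << 1;
    std::printf("%d\n", (int)(isPowerOf2(tempsharer & invid) && prevsharer == 0)); // prints 1

    // Case 2: no sharer registered at all -> the condition also fires,
    // because isPowerOf2(0) returns 1.
    tempsharer = 0;
    std::printf("%d\n", (int)(isPowerOf2(tempsharer & invid) && prevsharer == 0)); // prints 1
}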

tempsharer != id checks for the case where we were previously not a sharer and have just added ourselves to the homenode directory, but this case should not happen, as MPI_Get_accumulate returns the content of the target buffer as it was before the BOR operation, and thus id should never be part of tempsharer.
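For reference, this is the MPI mechanism being relied on (a sketch, not the actual ArgoDSM call site; sharer_window, sharer_offset and the one-hot id encoding are illustrative names):

#include <mpi.h>

// Sketch of a sharer registration using MPI_Get_accumulate with MPI_BOR.
// The result buffer receives the directory word as it was *before* our bit
// is OR'd in, so our own id can never show up in tempsharer here.
void register_sharer(MPI_Win sharer_window, int homenode,
                     MPI_Aint sharer_offset, int workrank,
                     unsigned long& tempsharer){
    unsigned long id = 1UL << workrank;   // our sharer bit (assumed encoding)

    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, homenode, 0, sharer_window);
    MPI_Get_accumulate(&id, 1, MPI_UNSIGNED_LONG,
                       &tempsharer, 1, MPI_UNSIGNED_LONG,
                       homenode, sharer_offset, 1, MPI_UNSIGNED_LONG,
                       MPI_BOR, sharer_window);
    MPI_Win_unlock(homenode, sharer_window);
}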

This, in addition to what @pekemark replied, means that fixing this does not solve the error (I have also verified this to be 100% sure).

Related to this, should isPowerOf2 include a check that excludes 0,

return (x != 0) && ((x & (x - 1)) == 0);

provided that this does not break additional functionality? (I can see one case that may need an additional check for 0).
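Spelled out, the suggested variant of the function would be (assuming no caller depends on 0 being reported as a power of 2, which is the one case mentioned above that may need an extra check):

unsigned long isPowerOf2(unsigned long x){
    unsigned long retval = (x != 0) && ((x & (x - 1)) == 0); //0 is no longer considered a power of 2
    return retval;
}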

Okay, good analysis @lundgren87. I had missed that tempsharer is the state before the BOR operation. Then there are two reasons this difference doesn't matter, which is good to know for eliminating one of the copies in the future. Fixing the misleading power-of-2 check could be another issue, but I don't think it's critical.

It appears that this bug may be a side-effect of #29. I have been able to reproduce it in all versions of the code for specific problem sizes, and the common denominator is that it manifests when reading a page that has been falsely kept in the cache across self-invalidation after another node has written to it (the new writer is not recognized).

Allowing the Pyxis directory to scale with argo_size removes this issue for all instances of prefetching tested.

Theory: I believe that there is a fundamental issue in having one classification index represent multiple pages that are physically allocated on different nodes. The Pyxis directory on each node will at times be different (when a sharer/writer registers itself, it only does so on the homenode, not on all nodes that own pages corresponding to the classification index). When loading or writing to a page that is not present in the cache, only the local Pyxis directory is checked to determine whether we need to add ourselves as sharer/writer remotely. If we have the "wrong" version of the classification index represented on our node (a version of the index fetched from another remote node), I suspect that this registration may be falsely skipped, leading to the problem above.
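A purely hypothetical illustration of that aliasing (the real index calculation lives in swdsm.cpp and may differ; the directory size and modulo scheme here are made up):

#include <cstddef>
#include <cstdio>

// Hypothetical: the directory is sized after the cache rather than argo_size,
// so distinct pages, homed on different nodes, can map to the same entry.
constexpr std::size_t dir_entries = 8;

std::size_t classification_index(std::size_t page){
    return page % dir_entries;   // assumed aliasing scheme, for illustration only
}

int main(){
    // In the example above, page 5 is homed on node 0 and page 13 on node 1,
    // yet both map to entry 5. A sharer/writer registering for page 13 updates
    // entry 5 only on page 13's homenode, so node 0's local copy of entry 5 can
    // go stale, and a later lookup on node 0 may wrongly skip the remote registration.
    std::printf("page 5  -> index %zu\n", classification_index(5));
    std::printf("page 13 -> index %zu\n", classification_index(13));
}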