StanfordLegion/legion

cuda_arrays test failure

SeyedMir opened this issue · 4 comments

Enabling registration of dynamically allocated buffers (passed to Realm attach) leads to failure of the cuda_arrays test. The failure is rooted in another issue that needs to be resolved independently. As per guidance from @streichler, the test is disabled until the issue is resolved.

What exactly is the "another issue that needs to be resolved independently"?

An external instance described by an ExternalCudaArrayResource should never be in a memory that is associated
with a RemoteWriteChannel (i.e. the thing that asks a network module to do RMAs to/from instances). If Realm is
going to offer the "dynamic fbmem" memory to the network module for opportunistic memory registration, it needs
to create a different memory that can be suggested for external cuda arrays.
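
A minimal sketch of that separation, assuming a toy model rather than Realm's actual code: `MemoryDesc`, `ResourceKind`, and `suggest_memory` are hypothetical names invented here for illustration. The idea is only that the memory suggested for an external CUDA array is never one that has been offered to the network module for RMA.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT Realm types.
enum class ResourceKind { CudaMemory, CudaArray };

struct MemoryDesc {
  std::string name;
  bool rma_registered;  // offered to the network module for RMA
  bool for_arrays;      // set aside for external CUDA array instances
};

// Suggest a memory for an external resource of the given kind.
// Arrays only go to a memory that is set aside for them and is not
// RMA-registered; linear memory goes to any memory not set aside for arrays.
std::optional<std::string> suggest_memory(const std::vector<MemoryDesc>& mems,
                                          ResourceKind kind) {
  for (const auto& m : mems) {
    if (kind == ResourceKind::CudaArray) {
      if (m.for_arrays && !m.rma_registered) return m.name;
    } else {
      if (!m.for_arrays) return m.name;
    }
  }
  // No suitable memory: the caller can turn this into a clear error.
  return std::nullopt;
}
```

With only an RMA-registered "dynamic fbmem" entry in the list, a `CudaArray` request returns `std::nullopt`, which is the point where a descriptive error could be reported instead of a silent failure.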

There are no active users of ExternalCudaArrayResource instances right now, so it seems like a fine short-term tradeoff
to have that be broken when enabling RMA for ExternalCudaMemoryResource instances, which are much (well, infinitely)
more common.

Did we put some kind of check into Realm to make sure that users who try to use ExternalCudaArrayResource get a reasonable error message?

If I understand it correctly, ExternalCudaArrayResource is used for attaching application-managed fbmem. Why do we need it for dynamic fbmem?