cuda_arrays test failure

Question


SeyedMir opened this issue 7 months ago · 4 comments

Enabling registration of dynamically allocated buffers (passed to Realm attach) causes the cuda_arrays test to fail. The failure is rooted in another issue that needs to be resolved independently. As per guidance from @streichler, the test is disabled until that issue is resolved.

Answer 1 · 2024-02-08T22:38:06.000Z

What exactly is the "another issue that needs to be resolved independently"?

Answer 2 · 2024-02-08T22:58:42.000Z

An external instance described by an ExternalCudaArrayResource should never be in a memory that is associated
with a RemoteWriteChannel (i.e. the thing that asks a network module to do RMAs to/from instances). If Realm is
going to offer the "dynamic fbmem" memory to the network module for opportunistic memory registration, it needs
to create a different memory that can be suggested for external cuda arrays.

There are no active users of ExternalCudaArrayResource instances right now, so it seems like a fine short-term tradeoff
to have that be broken when enabling RMA for ExternalCudaMemoryResource instances, which are much (well, infinitely)
more common.
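
To make the constraint above concrete, here is a minimal sketch of the selection rule, assuming a simplified model of memories. None of these names are Realm's actual API: `MemoryModel`, `ResourceKind`, `may_host`, and `suggest_memory` are hypothetical and only illustrate the idea that a memory exposed to the network module for RMA must never be suggested for an external CUDA array, and that a clear error is raised when no other memory exists.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical model only; these types do not exist in Realm.
enum class ResourceKind { CudaMemory, CudaArray };

struct MemoryModel {
  std::string name;
  // True if a network module may perform RMA to/from instances in this memory.
  bool has_remote_write_channel;
};

// A CUDA array is opaque to the network module, so it must not be placed
// in a memory that is registered for remote writes. Linear CUDA memory is
// allowed in either kind of memory in this simplified model.
bool may_host(const MemoryModel& mem, ResourceKind kind) {
  return kind == ResourceKind::CudaMemory || !mem.has_remote_write_channel;
}

// Returns the first candidate memory that may host the resource, or throws
// a descriptive error when every candidate is registered for remote writes.
const MemoryModel& suggest_memory(const std::vector<MemoryModel>& mems,
                                  ResourceKind kind) {
  for (const MemoryModel& m : mems) {
    if (may_host(m, kind)) return m;
  }
  throw std::runtime_error(
      "no suitable memory for external CUDA array: every candidate memory "
      "is registered for remote writes");
}
```

With a candidate list containing an RMA-registered "dynamic fbmem" entry and a separate non-RMA entry, the sketch suggests the first for linear CUDA memory and the second for CUDA arrays. With only the RMA-registered entry, a CUDA array request fails with the error message instead of silently landing in the wrong memory.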

Answer 3 · 2024-02-08T23:25:06.000Z

Did we put some kind of check into Realm to make sure that users who try to use ExternalCudaArrayResource get a reasonable error message?

Answer 4 · 2024-02-09T00:03:12.000Z

If I understand correctly, ExternalCudaArrayResource is used for attaching application-managed fbmem, so why do we need it for dynamic fbmem?