RDMA protocol
eliekozah opened this issue · 9 comments
I'm fixing a bug with our provider's use of RDMA. Does RDMA universally lack FI_CONTEXT support, or is this absence specific to AWS OFI NCCL's implementation?
This absence is specific to Aws-ofi-nccl's RDMA protocol implementation. Context was omitted only for convenience in our first implementation, and can be added in the future.
Hi @rauteric,
Thank you for the clarification. Our team is considering contributing a fix to introduce FI_CONTEXT
support with RDMA and would love to get your thoughts on any implementation preferences or directions your team might have. This would help us tailor our contribution to fit into the project.
Hi @eliekozah-cornelisnetworks,
Thank you for considering a contribution. In general, I think FI_CONTEXT
support can be implemented similarly to how we implement it for the SENDRECV protocol (https://github.com/aws/aws-ofi-nccl/blob/master/src/nccl_ofi_sendrecv.c), by having a fi_context
member of the request data structure.
The main obstacle we have currently to supporting FI_CONTEXT
is that right now the context we use can be shared between multiple simultaneous Libfabric operations. The particular example I can find is RDMA write operations for individual rails:
aws-ofi-nccl/src/nccl_ofi_rdma.c
Lines 4226 to 4230 in 2adb8f5
req
). To support FI_CONTEXT
these would need to be split either into separate fi_context
objects inside the request structure, or preferably, subrequests of the parent request structure.
There may be other examples of this in the current code, but that's the only one I can find at the moment.
Hi @rauteric,
I see you removed the check of provider requiring FI_CONTEXT and provider support of RMA here:
aws-ofi-nccl/src/nccl_ofi_rdma.c
Lines 5927 to 5931 in 1006b3f
Any specific reason behind this change?
Hi, this was mostly just a refactoring.
The previous code set hints->mode = FI_CONTEXT
for both RDMA and SENDRECV protocols (since SENDRECV protocol supports it), so we needed some logic to check that the provider didn't enable it when using RDMA protocol.
Since a recent refactoring, the hints->mode
parameter is set separately for the two protocols. The RDMA protocol does not include FI_CONTEXT
in hints->mode
(which means it does not support it), so it no longer needs the extra check.
Similarly, the check for FI_RMA
support from provider is not needed since it is requested as a capability bit in the hints:
aws-ofi-nccl/src/nccl_ofi_rdma.c
Lines 5763 to 5764 in 1006b3f
Hi @eliekozah-cornelisnetworks,
Thank you for considering a contribution. In general, I think
FI_CONTEXT
support can be implemented similarly to how we implement it for the SENDRECV protocol (https://github.com/aws/aws-ofi-nccl/blob/master/src/nccl_ofi_sendrecv.c), by having afi_context
member of the request data structure.The main obstacle we have currently to supporting
FI_CONTEXT
is that right now the context we use can be shared between multiple simultaneous Libfabric operations. The particular example I can find is RDMA write operations for individual rails:aws-ofi-nccl/src/nccl_ofi_rdma.c
Lines 4226 to 4230 in 2adb8f5
will all use the same context (
req
). To supportFI_CONTEXT
these would need to be split either into separatefi_context
objects inside the request structure, or preferably, subrequests of the parent request structure.
There may be other examples of this in the current code, but that's the only one I can find at the moment.
Hi @rauteric,
just to confirm, by "rail" do you mean NIC? and does your implementation of RDMA always assume that there are multiple NICs in the network?
just to confirm, by "rail" do you mean NIC?
Yes. The RDMA protocol will group multiple NICs together and expose them to NCCL as a single NIC per GPU. The individual NICs are referred to as "rails" in the code.
and does you implementation of RDMA always assume that there are multiple NICs in the network?
The protocol has some code to detect the number of NICs on the node and their proximity to the GPUs using hwloc. It should work even if there is only one NIC per node, or only one NIC per GPU. However, we don’t regularly test these configurations.
Thanks, @rauteric for clarifying. One more question about the req_lock
mutex lock field in the request structure. So you are trying to protect updating critical fields from access of other threads. Are these other threads created by the NCCL library or you assume they should be created by the libfabric provider? (I assume this is done using the POSIX pthread_create
function).
req_lock
protects access from threads created by NCCL or the application. We don't need to protect access to plugin data by threads created by the Libfabric provider, since the Libfabric provider shouldn't have access to any plugin data.