aws/aws-ofi-nccl

RDMA protocol

eliekozah opened this issue · 9 comments

I'm fixing a bug with our provider's use of RDMA. Does RDMA universally lack FI_CONTEXT support, or is this absence specific to AWS OFI NCCL's implementation?

This absence is specific to aws-ofi-nccl's RDMA protocol implementation. FI_CONTEXT support was omitted only for convenience in our first implementation and can be added in the future.

Hi @rauteric,

Thank you for the clarification. Our team is considering contributing a fix to introduce FI_CONTEXT support with the RDMA protocol and would love to get your thoughts on any implementation preferences or directions your team might have. This would help us tailor our contribution to fit the project.

Hi @eliekozah-cornelisnetworks,

Thank you for considering a contribution. In general, I think FI_CONTEXT support can be implemented similarly to how we implement it for the SENDRECV protocol (https://github.com/aws/aws-ofi-nccl/blob/master/src/nccl_ofi_sendrecv.c), by having a fi_context member of the request data structure.
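
For illustration, a minimal sketch of that approach (the struct and field names here are hypothetical; the real request type lives in src/nccl_ofi_sendrecv.c):

#include <rdma/fabric.h>

/* Hypothetical request layout. Placing the fi_context first lets the
 * completion handler cast the CQ entry's op_context back to the request. */
typedef struct example_req {
    struct fi_context ctx;  /* handed to Libfabric as the operation context */
    int done;               /* ... remaining request state ... */
} example_req_t;

/* Post operations with &req->ctx as the context argument, e.g.
 * fi_send(ep, buf, len, desc, dest_addr, &req->ctx), and on completion
 * recover the request:
 * example_req_t *req = (example_req_t *)cq_entry.op_context; */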

The main obstacle to supporting FI_CONTEXT right now is that the context we use can be shared between multiple simultaneous Libfabric operations. The particular example I can find is the RDMA write operations for the individual rails:

rc = fi_writedata(comm_rail->local_ep, send_data->buff + xfer_info->offset,
                  xfer_info->msg_size, desc, send_data->wdata,
                  comm_rail->remote_addr,
                  send_data->remote_buff + xfer_info->offset,
                  send_data->remote_mr_key[rail_id], req);
will all use the same context (req). To support FI_CONTEXT, these would need to be split either into separate fi_context objects inside the request structure or, preferably, into subrequests of the parent request structure.

There may be other examples of this in the current code, but that's the only one I can find at the moment.
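
As a hedged sketch of the first option (an array of per-rail contexts inside the parent request; MAX_NUM_RAILS and the field names are illustrative, not the plugin's actual definitions):

#include <rdma/fabric.h>

#define MAX_NUM_RAILS 4  /* hypothetical upper bound on rails per device */

typedef struct example_rdma_req {
    struct fi_context rail_ctx[MAX_NUM_RAILS]; /* one context per in-flight write */
    int writes_outstanding;                    /* parent completes when this reaches 0 */
    /* ... rest of the parent request state ... */
} example_rdma_req_t;

/* Each per-rail fi_writedata() would then pass &req->rail_ctx[rail_id]
 * instead of req itself as its final (context) argument. */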

Hi @rauteric,

I see you removed the check for the provider requiring FI_CONTEXT, as well as the check for provider RMA support, here:

struct fi_info *info_list;
/* Retrieve NIC info list from topology */
info_list = nccl_ofi_topo_next_info_list(&data_iter);

Any specific reason behind this change?

Hi, this was mostly just a refactoring.

The previous code set hints->mode = FI_CONTEXT for both the RDMA and SENDRECV protocols (since the SENDRECV protocol supports it), so we needed extra logic to check that the provider didn't require it when using the RDMA protocol.

Since a recent refactoring, the hints->mode parameter is set separately for the two protocols. The RDMA protocol does not include FI_CONTEXT in hints->mode (meaning the plugin does not claim to support it), so the extra check is no longer needed.
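
A rough sketch of that split (the protocol enum and surrounding code are hypothetical; only the hints->mode handling reflects the description above):

struct fi_info *hints = fi_allocinfo();
if (hints == NULL)
    return -FI_ENOMEM;

if (protocol == EXAMPLE_PROTOCOL_SENDRECV) {
    /* SENDRECV supports FI_CONTEXT, so advertise that the plugin can
     * satisfy providers that require a per-operation fi_context. */
    hints->mode = FI_CONTEXT;
} else {
    /* RDMA: FI_CONTEXT is simply left out of hints->mode, so providers
     * that require it are never matched; no extra check is needed. */
    hints->mode = 0;
}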

Similarly, the check for FI_RMA support from the provider is not needed, since FI_RMA is requested as a capability bit in the hints:

/* Primary Capabilities */
hints->caps = FI_MSG | FI_RMA | FI_HMEM;
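
Because fi_getinfo() only returns providers that can satisfy every capability bit in hints->caps, a provider lacking FI_RMA is filtered out up front. A minimal usage sketch (the Libfabric version number here is illustrative):

struct fi_info *info = NULL;
int rc = fi_getinfo(FI_VERSION(1, 18), NULL, NULL, 0, hints, &info);
if (rc != 0)
    return rc;  /* no provider matched the requested capabilities */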

Hi @rauteric,

Just to confirm, by "rail" do you mean NIC? And does your implementation of RDMA always assume that there are multiple NICs in the network?

Just to confirm, by "rail" do you mean NIC?

Yes. The RDMA protocol will group multiple NICs together and expose them to NCCL as a single NIC per GPU. The individual NICs are referred to as "rails" in the code.
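
As an illustrative sketch of that grouping (all names here are hypothetical, not the plugin's actual types):

#include <rdma/fabric.h>

struct example_rail {
    struct fid_ep *ep;      /* endpoint on one physical NIC (rail) */
    fi_addr_t remote_addr;  /* peer address on the same rail */
};

struct example_device {
    int num_rails;                  /* physical NICs grouped together */
    struct example_rail rails[4];   /* exposed to NCCL as a single NIC */
};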

And does your implementation of RDMA always assume that there are multiple NICs in the network?

The protocol has some code to detect the number of NICs on the node and their proximity to the GPUs using hwloc. It should work even if there is only one NIC per node, or only one NIC per GPU. However, we don’t regularly test these configurations.
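
A rough, hypothetical sketch of such detection with hwloc (not the plugin's actual code; it simply counts the OpenFabrics OS devices visible on the node):

#include <hwloc.h>

static int count_ofi_nics(void)
{
    hwloc_topology_t topo;
    hwloc_obj_t obj = NULL;
    int count = 0;

    if (hwloc_topology_init(&topo) != 0)
        return -1;
    /* I/O objects such as OS devices are filtered out by default. */
    hwloc_topology_set_io_types_filter(topo, HWLOC_TYPE_FILTER_KEEP_IMPORTANT);
    if (hwloc_topology_load(topo) != 0) {
        hwloc_topology_destroy(topo);
        return -1;
    }

    while ((obj = hwloc_get_next_osdev(topo, obj)) != NULL) {
        if (obj->attr->osdev.type == HWLOC_OBJ_OSDEV_OPENFABRICS)
            count++;
    }

    hwloc_topology_destroy(topo);
    return count;
}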

Thanks, @rauteric, for clarifying. One more question, about the req_lock mutex field in the request structure: it looks like you are protecting updates to critical fields from concurrent access by other threads. Are these other threads created by the NCCL library, or do you assume they are created by the Libfabric provider? (I assume this is done using the POSIX pthread_create function.)

req_lock protects against concurrent access by threads created by NCCL or the application. We don't need to protect plugin data from threads created by the Libfabric provider, since the Libfabric provider shouldn't have access to any plugin data.
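
As an illustrative sketch of that pattern (field and function names are hypothetical): a per-request mutex guards the fields that NCCL/application threads may touch concurrently.

#include <pthread.h>

typedef struct example_req {
    pthread_mutex_t req_lock;  /* guards the mutable fields below */
    int ncompls;               /* completions observed so far */
    int total_compls;          /* completions needed to finish */
} example_req_t;

/* Called from whichever thread processes a completion; returns nonzero
 * once the request has fully completed. */
static int example_req_on_completion(example_req_t *req)
{
    pthread_mutex_lock(&req->req_lock);
    int done = (++req->ncompls == req->total_compls);
    pthread_mutex_unlock(&req->req_lock);
    return done;
}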