NCCL RDMA expects fi_cq_data_entry, but OPX provider fills CQ with fi_cq_tagged_entry
lsavers opened this issue · 2 comments
Commit "rdma: switch to untagged send/recv" (12ed337) removed the use of tagged entries for the NCCL RDMA protocol. Even though the RDMA protocol now uses untagged send/recv operations, the CQ format attribute is still initialized based on the capability flags reported by the provider.
aws-ofi-nccl/src/nccl_ofi_ofiutils.c
Lines 266 to 270 in d459367
OPX has FI_TAGGED set in the provider's capabilities, so the CQ format is set to FI_CQ_FORMAT_TAGGED and OPX then fills the CQ with fi_cq_tagged_entry structures. Should the spots that were changed from fi_cq_tagged_entry to fi_cq_data_entry go back to fi_cq_tagged_entry so they can handle a CQ filled with that type?
aws-ofi-nccl/src/nccl_ofi_rdma.c
Line 1113 in d459367
aws-ofi-nccl/src/nccl_ofi_rdma.c
Line 1281 in d459367
aws-ofi-nccl/src/nccl_ofi_rdma.c
Line 1389 in d459367
aws-ofi-nccl/src/nccl_ofi_rdma.c
Line 1740 in d459367
Or should the cq_attr.format be set based on a different condition?
aws-ofi-nccl/src/nccl_ofi_ofiutils.c
Line 266 in d459367
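To make the mismatch concrete, here is a minimal sketch of what happens when a buffer written with the tagged layout is walked with the data-entry stride. The `demo_*` structs are local mirrors of libfabric's `fi_cq_data_entry` and `fi_cq_tagged_entry` so that the example builds without libfabric; the helper names are hypothetical, and the sketch assumes the provider writes entries back to back using the tagged layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of fi_cq_data_entry (illustration only). */
struct demo_cq_data_entry {
    void *op_context;
    uint64_t flags;
    size_t len;
    void *buf;
    uint64_t data;
};

/* Local mirror of fi_cq_tagged_entry: same fields plus a trailing tag. */
struct demo_cq_tagged_entry {
    void *op_context;
    uint64_t flags;
    size_t len;
    void *buf;
    uint64_t data;
    uint64_t tag;
};

/* Hypothetical stand-in for a provider writing `count` tagged entries. */
static void demo_fill_tagged(unsigned char *raw, size_t count,
                             int *contexts, uint64_t tag)
{
    for (size_t i = 0; i < count; i++) {
        struct demo_cq_tagged_entry e;
        memset(&e, 0, sizeof(e));
        e.op_context = &contexts[i];
        e.data = i;
        e.tag = tag;
        memcpy(raw + i * sizeof(e), &e, sizeof(e));
    }
}

/* Read op_context of entry `idx`, assuming entries are `stride` bytes apart. */
static void *demo_context_at(const unsigned char *raw, size_t idx, size_t stride)
{
    void *ctx;
    memcpy(&ctx, raw + idx * stride, sizeof(ctx));
    return ctx;
}

/* Returns 1 when the second entry is misread through the data-entry stride. */
static int demo_stride_mismatch(void)
{
    int contexts[2] = {0, 0};
    unsigned char raw[2 * sizeof(struct demo_cq_tagged_entry)];

    demo_fill_tagged(raw, 2, contexts, 0x1234);

    /* Entry 0 starts at offset 0 under both layouts, so it reads correctly. */
    assert(demo_context_at(raw, 0, sizeof(struct demo_cq_data_entry)) ==
           (void *)&contexts[0]);
    /* With the tagged stride, entry 1 is found where it was written. */
    assert(demo_context_at(raw, 1, sizeof(struct demo_cq_tagged_entry)) ==
           (void *)&contexts[1]);
    /* With the data stride, the read lands inside entry 0 instead. */
    return demo_context_at(raw, 1, sizeof(struct demo_cq_data_entry)) !=
           (void *)&contexts[1];
}
```

In this sketch a single completion still reads correctly because the shared fields come first; the problem only shows up from the second entry onward, where the two layouts disagree on the stride.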
Hello,
Interestingly enough, this exact problem came up during the review for this code: #361 (comment). The conclusion was that FI_TAGGED is a "primary capability", which means it should not be enabled unless specifically requested by the application.
As Raghu noted, here is the relevant text from the Libfabric spec (https://ofiwg.github.io/libfabric/v1.22.0/man/fi_getinfo.3.html):
Capabilities may be grouped into three general categories: primary, secondary, and primary modifiers. Primary capabilities must explicitly be requested by an application, and a provider must enable support for only those primary capabilities which were selected. Primary modifiers are used to limit a primary capability, such as restricting an endpoint to being send-only. If no modifiers are specified for an applicable capability, all relevant modifiers are assumed. See above definitions for details.
My reading is that, since the plugin does not request the FI_TAGGED capability, the OPX provider should not enable it.
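As a rough sketch of that reading (not OPX or plugin code), the provider would enable only the primary capabilities that are both supported and requested, and a capability-based CQ format check would then pick the data format for an application that never asked for FI_TAGGED. The `DEMO_*` bits and `demo_*` helpers are hypothetical placeholders; real code would use FI_MSG, FI_RMA, FI_TAGGED, and the FI_CQ_FORMAT_* values from the libfabric headers.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder capability bits for this sketch only. */
#define DEMO_CAP_MSG    (UINT64_C(1) << 0)
#define DEMO_CAP_RMA    (UINT64_C(1) << 1)
#define DEMO_CAP_TAGGED (UINT64_C(1) << 2)
#define DEMO_PRIMARY_CAPS (DEMO_CAP_MSG | DEMO_CAP_RMA | DEMO_CAP_TAGGED)

enum demo_cq_format { DEMO_CQ_FORMAT_DATA, DEMO_CQ_FORMAT_TAGGED };

/* Hypothetical provider-side helper: enable a primary capability only
 * when it is both supported and explicitly requested. */
static uint64_t demo_enabled_primary_caps(uint64_t supported, uint64_t requested)
{
    return supported & requested & DEMO_PRIMARY_CAPS;
}

/* Hypothetical application-side check: choose the CQ format from the
 * capabilities the provider reports as enabled. */
static enum demo_cq_format demo_pick_cq_format(uint64_t enabled_caps)
{
    return (enabled_caps & DEMO_CAP_TAGGED) ? DEMO_CQ_FORMAT_TAGGED
                                            : DEMO_CQ_FORMAT_DATA;
}

/* Walk through the scenario from this thread with the placeholder bits. */
static void demo_scenario(void)
{
    uint64_t supported = DEMO_CAP_MSG | DEMO_CAP_RMA | DEMO_CAP_TAGGED;
    uint64_t requested = DEMO_CAP_MSG | DEMO_CAP_RMA; /* no tagged request */

    /* Reporting every supported capability leads to the tagged format. */
    assert(demo_pick_cq_format(supported) == DEMO_CQ_FORMAT_TAGGED);

    /* Filtering by the request leads to the data format instead. */
    uint64_t enabled = demo_enabled_primary_caps(supported, requested);
    assert(demo_pick_cq_format(enabled) == DEMO_CQ_FORMAT_DATA);
}
```

With the filtering in place, the capability-based format check needs no change on the application side, which matches the direction suggested above.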
Thanks for the prompt response and background information. I will look to update OPX to treat FI_TAGGED as a primary capability.