aws/aws-ofi-nccl

Unable to force FI_HMEM to be used and FI_OPT_CUDA_API_PERMITTED is not respected by config scripts

tmh97 opened this issue · 4 comments

tmh97 commented
  1. Setting FI_OPT_CUDA_API_PERMITTED to false or 0 doesn't seem to make any difference to the config scripts; they still behave as if FI_OPT_CUDA_API_PERMITTED were set to 1.
  2. I am unable to force usage of the FI_HMEM implementation of my libfabric provider.

I would greatly appreciate some clarity about how FI_OPT_CUDA_API_PERMITTED is used, its relationship to FI_HMEM, and its relationship to GPUDirect/GDRCopy, if any.

rauteric commented

Hi Thomas,

  1. I'm a bit unclear on what you mean by "make a difference to the config scripts". The plugin sets the option FI_OPT_CUDA_API_PERMITTED to false at runtime via the fi_setopt call to Libfabric (v1.18 and later); see the sketch after this list. Which config scripts are you referring to?

  2. FI_HMEM indicates support for Libfabric directly accessing device memory. By default the plugin first tries to find an HMEM-capable provider by requesting

    hints->caps = FI_TAGGED | FI_MSG | FI_HMEM | FI_REMOTE_COMM;

    so it should find one if available (a rough sketch of this selection is at the end of this comment). What problem did you encounter in using your FI_HMEM implementation?
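
To make point 1 concrete, here is a minimal sketch (not the plugin's actual code) of how a Libfabric 1.18+ application can forbid CUDA API usage on an endpoint. The disable_cuda_api helper name and its error handling are purely illustrative:

    #include <stdbool.h>
    #include <rdma/fabric.h>
    #include <rdma/fi_endpoint.h>

    /* Illustrative sketch: ask the provider never to call the CUDA API on
     * this endpoint. 'ep' is an already-opened endpoint. */
    static int disable_cuda_api(struct fid_ep *ep)
    {
        bool permitted = false;

        /* Returns -FI_EOPNOTSUPP if the provider cannot guarantee that it
         * avoids the CUDA API when handling CUDA (FI_HMEM) buffers. */
        return fi_setopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_CUDA_API_PERMITTED,
                         &permitted, sizeof(permitted));
    }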

Attempting to answer your last question: FI_OPT_CUDA_API_PERMITTED prohibits Libfabric from making calls to the CUDA API, which NCCL forbids. FI_HMEM indicates support for data transfer to/from device memory, i.e., GPUDirect for GPUs. (A provider used with NCCL can't use the CUDA API, even if it supports FI_HMEM.) Finally, gdrcopy is an optional NVIDIA library that improves the performance of GPU memory copies; the Libfabric EFA provider will use it if available.
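
And to illustrate point 2, here is a rough sketch of how an application asks Libfabric for an HMEM-capable provider. The exact hints the plugin builds differ; the endpoint type and error handling here are assumptions for illustration:

    #include <stddef.h>
    #include <rdma/fabric.h>

    /* Rough sketch: request a provider that can handle device (HMEM)
     * buffers. Returns NULL if none is available; a real caller would then
     * retry with hints->caps minus FI_HMEM and fall back to host memory. */
    static struct fi_info *find_hmem_provider(void)
    {
        struct fi_info *hints = fi_allocinfo();
        struct fi_info *info = NULL;

        if (!hints)
            return NULL;

        /* Same capability bits quoted above; FI_EP_RDM is assumed here. */
        hints->caps = FI_TAGGED | FI_MSG | FI_HMEM | FI_REMOTE_COMM;
        hints->ep_attr->type = FI_EP_RDM;

        if (fi_getinfo(FI_VERSION(1, 18), NULL, NULL, 0, hints, &info) != 0)
            info = NULL;

        fi_freeinfo(hints);
        return info; /* caller releases it with fi_freeinfo() */
    }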

tmh97 commented

@rauteric Thanks a million for the clarification, this was extremely helpful!

If you would be so kind as to answer a few follow ups:

  • Is the OFI plugin able to utilize the chosen libfabric provider's FI_HMEM support even if the provider does not support GPUDirect? (My provider does not use the CUDA API explicitly in its implementation of FI_HMEM.)
  • For some reason, the 'nccl_message_transfer' test exercises my libfabric provider's FI_HMEM support, but NVIDIA's nccl-tests (e.g., all_reduce_perf) do not. Are you aware of any difference between your unit tests and NVIDIA's nccl-tests that could cause this?

Thanks for your time!

rauteric commented

Hello. If I understand correctly, your Libfabric provider supports FI_HMEM (i.e., using GPU memory directly in Libfabric APIs) but does not support GPUDirect (i.e., the network device writing directly to GPU memory). In this case, as long as the provider does not make any CUDA calls, the plugin should be able to use your provider's FI_HMEM implementation. I'm also not aware of any difference between the unit tests and NCCL/nccl-tests in this regard.

If you run nccl-tests with NCCL_DEBUG=TRACE, it should give some helpful info to determine why the plugin is not choosing the FI_HMEM implementation of your provider.
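
For example, something along these lines (the binary path and test parameters are placeholders; adapt them to your launcher and cluster):

    NCCL_DEBUG=TRACE NCCL_DEBUG_SUBSYS=INIT,NET ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

The NET/OFI lines in that output should show which provider the plugin selected and whether an HMEM-capable path was chosen.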

tmh97 commented

@rauteric Thanks for taking the time to answer my questions, Eric; this has been quite helpful.