CUDA-aware MPI documentation needs some additional details
maxpkatz opened this issue · 1 comment
maxpkatz commented
The section on CUDA-aware MPI could use a little more detail. Some of the changes I want to make:
- Emphasize that GPUDirect RDMA is not always the right choice for sending messages, especially for large, bandwidth-bound messages where the staging path can perform better.
- Standardize on "GPUDirect" versus "staging" as the terminology for the two paths a message can take, and note that not all vendors use these terms (IBM in particular), so readers should be prepared for some confusion
- Describe how to disable the GPUDirect path
- Describe how to use Spectrum MPI's environment variable PAMI_CUDA_AWARE_THRESH to control the crossover point between the GPUDirect and staging paths
- Explain a little bit about what Spectrum MPI (SMPI) is doing under the hood when CUDA-aware MPI is enabled (specifically, when you're using the PAMI backend), and address some of the requirements this places on the application, such as not making any CUDA calls before MPI_Init(). This addresses the problem noted in #78, which I will still resolve separately with a contribution to the known issues page.
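The MPI_Init() requirement in the last item could be sketched roughly as below. This is an illustrative ordering only, not something prescribed by the issue: the round-robin device selection is an assumed scheme, and error checking is omitted for brevity.

```c
/* Hedged sketch: with a CUDA-aware MPI built on the PAMI backend, the
 * application should not make any CUDA calls before MPI_Init(), so
 * device selection happens only after MPI is initialized. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    /* Initialize MPI first; no CUDA runtime calls before this point. */
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Only now touch the CUDA runtime, e.g. to pick a GPU per rank
     * (round-robin over visible devices is an assumed scheme). */
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 0)
        cudaSetDevice(rank % ndev);

    MPI_Finalize();
    return 0;
}
```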
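As a hedged illustration of the threshold item above: the crossover could be set in a job script before launch, so that all ranks inherit it. The example value and the assumption that the threshold is a message size in bytes should be verified against the Spectrum MPI documentation.

```shell
# Sketch, assuming PAMI_CUDA_AWARE_THRESH takes a message size in bytes
# (verify the units and the default against the Spectrum MPI docs).
# Messages on one side of this threshold take the GPUDirect path, and
# messages on the other side take the staging path.
export PAMI_CUDA_AWARE_THRESH=65536   # example crossover: 64 KiB
echo "crossover set to ${PAMI_CUDA_AWARE_THRESH} bytes"
```

The variable would typically be exported before the mpirun/jsrun invocation rather than set per rank.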
jczhang07 commented
I am interested in the CUDA-aware MPI documentation. Could someone familiar with the topic finish this issue?