NVIDIA/gdrcopy

Query: Confusion about sudo requirement

tylerjereddy opened this issue · 3 comments

I'm a bit confused about the need for sudo when installing gdrcopy: it seems to be required even for the from-source build. And yet the user-space HPC package manager spack has a gdrcopy package that installs just fine in user space, though I'm not convinced that it actually works via that route. Can you clarify this? Is it likely that the user-space install that spack does simply isn't viable, or is something else going on here?

Hi @tylerjereddy,

GDRCopy consists of 1) a user-space library, 2) benchmark and test applications, and 3) a driver. You need sudo to install the driver. The rest can be installed by normal users into any folder for which they have write permission. One scenario in which you may want to install them separately is when using containers. Inside your container, you will want just 1) and/or 2). The driver should be installed on the bare-metal host. Then, you pass /dev/gdrdrv to your container when you launch it.

I am not sure about spack. If it installs only 1) and/or 2), you may have to install the gdrdrv driver separately. The gdrcopy package on spack seems to be required by ucx and nvshmem. Those libraries have a way to detect whether GDRCopy is working properly on your system. If it is not, ucx and nvshmem will silently switch to different algorithms. So, you will probably not see a failure even if GDRCopy is not properly installed, unless you use GDRCopy directly.
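Because the fallback is silent, a quick way to tell whether the driver side is missing is to look for the /dev/gdrdrv device node directly. Below is a minimal Python sketch of such a check. The function name check_gdrdrv is hypothetical and not part of GDRCopy, and the read/write permission test is an assumption about what the user-space library needs to open the device.

```python
import os
import stat

# Device node named in this thread; created when the gdrdrv driver is loaded.
GDRDRV_DEVICE = "/dev/gdrdrv"


def check_gdrdrv(path=GDRDRV_DEVICE):
    """Return (ok, message) describing whether the device node looks usable.

    Hypothetical helper: it only inspects the device node and never calls
    into the GDRCopy library itself.
    """
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False, f"{path} does not exist; the driver is probably not loaded"
    except PermissionError:
        return False, f"{path} cannot be inspected by this user"

    # A loaded driver exposes a character device, not a regular file.
    if not stat.S_ISCHR(st.st_mode):
        return False, f"{path} exists but is not a character device"

    # Assumption: the user-space library needs read/write access to the node.
    if not os.access(path, os.R_OK | os.W_OK):
        return False, f"{path} exists but is not readable and writable by this user"

    return True, f"{path} is present and accessible"
```

Calling check_gdrdrv() on the host, and again inside the container, shows whether the device node is visible in each place. A passing check only means the node is there; it does not prove that GDRCopy works end to end, which is what the test applications in 2) are for.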

agray3 commented

Hi Pak,

The context is this GROMACS issue: https://gitlab.com/gromacs/gromacs/-/issues/4846. Tyler is trying to use GROMACS with cuFFTMp (and hence NVSHMEM) on an HPC cluster, but is seeing:

WARN: GDRCopy open call failed, falling back to not using GDRCopy 
src/topo/topo.cpp:68: [GPU 7] Peer GPU 0 is not accessible, exiting ...

I'm not 100% sure that the GDRCopy warning and the Peer GPU error are related, but a search of our internal Slack suggests they may well be, so I suggested that he ensure that GDRCopy is properly installed on the system and try again. Please let us know if you have further insight on this. Thanks.

Alan Gray (NVIDIA Devtech)

GDRCopy hasn't been a problem for my config recently, so closing.