Mellanox/nv_peer_memory

Why is nv_peer_memory severely degrading all_reduce_perf results?

EdwardZhang88 opened this issue · 2 comments

I am running benchmark tests with nccl-tests. I have 2 nodes connected via RoCE, and I have also installed nv_peer_memory. However, once I turn on GPU Direct RDMA, the all_reduce_perf bandwidth gets dramatically worse than without GPU Direct RDMA. I am aware that the GPU PCIe topology matters, which is why I am only using GPU0 on both nodes, since GPU0 and the Mellanox HCA are attached to the same CPU.
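For reference, my runs look roughly like the following; the host names, network interface, and binary path below are placeholders rather than my exact values:

```shell
# Hypothetical 2-node, 1-GPU-per-node run of nccl-tests' all_reduce_perf
# (host names, interface name, and paths are placeholders).
mpirun -np 2 -H node1:1,node2:1 \
    -x NCCL_DEBUG=INFO \
    -x NCCL_SOCKET_IFNAME=eth0 \
    -x CUDA_VISIBLE_DEVICES=0 \
    ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1
```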
The GPU topology is:
[screenshot: GPU/NIC PCIe topology]
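For anyone trying to reproduce this, the same connectivity information can be read with `nvidia-smi topo -m` on each node (PIX/PXB means the GPU and NIC share a PCIe switch/root complex; PHB/NODE/SYS means traffic has to cross the CPU):

```shell
# Print the GPU/NIC connectivity matrix on this node.
# PIX/PXB: devices sit under the same root complex (preferred for GPU Direct RDMA).
# PHB/NODE/SYS: traffic traverses the CPU host bridge or the inter-socket link.
nvidia-smi topo -m
```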
Without GPU Direct RDMA (plain RoCE), GPU0 on node 1 <-> GPU0 on node 2:
[screenshot: all_reduce_perf results without GPU Direct RDMA]

With GPU Direct RDMA over RoCE, GPU0 on node 1 <-> GPU0 on node 2:
[screenshot: all_reduce_perf results with GPU Direct RDMA]

According to the suggested system configurations, having a CPU in between the GPU and the Mellanox HCA will yield worse performance, but I never expected it to be this much worse.

At this point, I am wondering if there is any tool that can help debug nv_peer_mem and confirm it really takes effect, or whether there is something I misconfigured.
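The only checks I currently know of are the ones below; the module name and service script come from the nv_peer_mem package, and the NCCL log strings are my best guess at what to look for, so please correct me if there is a better tool:

```shell
# Check on both nodes that the kernel module is actually loaded.
lsmod | grep nv_peer_mem

# The nv_peer_mem package also installs a service/init script.
service nv_peer_mem status

# With NCCL_DEBUG=INFO on the benchmark run (see the mpirun line above),
# the transport lines should mention GDR when it is in use, e.g.
#   ... [send] via NET/IB/0/GDRDMA
# while "No module present for GPU Direct RDMA" means it is not being used.
```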

Here are the details of my environment:
NVIDIA Tesla V100
CUDA 9.0
NCCL 2.2.13
OFED 4.2-1.2.0
Mellanox MT27710 ConnectX-4 Lx
nvidia_peer_memory 1.0-8

I notice that the log says 'No module present for GPU Direct RDMA'. When I check the module status, this is what it looks like. Is this normal?
[screenshot: nv_peer_mem module/service status]

Even after I re-installed nv_peer_mem and the 'No module present for GPU Direct RDMA' message was gone, the performance still doesn't get any better in the GPU Direct RDMA case.
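For the record, I have been forcing GPU Direct RDMA on and off like this to compare. I believe the knob on NCCL 2.2.x is NCCL_IB_CUDA_SUPPORT (newer NCCL releases replaced it with NCCL_NET_GDR_LEVEL), so treat the variable name as version-dependent:

```shell
# A/B comparison: force GPU Direct RDMA on, then off.
# NCCL_IB_CUDA_SUPPORT is (to my understanding) the NCCL 2.2-era knob;
# later releases use NCCL_NET_GDR_LEVEL instead.
mpirun -np 2 -H node1:1,node2:1 -x NCCL_DEBUG=INFO \
    -x NCCL_IB_CUDA_SUPPORT=1 \
    ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1   # GDR forced on

mpirun -np 2 -H node1:1,node2:1 -x NCCL_DEBUG=INFO \
    -x NCCL_IB_CUDA_SUPPORT=0 \
    ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1   # GDR disabled
```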

See this post:
https://devblogs.nvidia.com/benchmarking-gpudirect-rdma-on-modern-server-platforms/

RDMA transfer from the NIC to GPU memory using GPUDirect is slower than RDMA from the NIC to pinned CPU memory followed by a cudaMemcpy from CPU memory to GPU memory.

This is a PCIe peer-to-peer (P2P) issue.

In my setup (ConnectX-5, Quadro P6000, RoCEv2) I get 97.4 Gb/s (with an intermediate step in CPU memory) or 71 Gb/s (GPUDirect).
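One related NCCL knob that may matter for the all_reduce_perf numbers above: to my knowledge NCCL deliberately defaults NCCL_NET_GDR_READ to 0, so sends are staged through a pinned host-memory buffer instead of reading GPU memory directly over PCIe P2P, precisely because P2P reads across the CPU are slow. A quick way to see the asymmetry (host names and paths are placeholders):

```shell
# Sends read GPU memory directly over PCIe P2P (often slower across the CPU root complex).
mpirun -np 2 -H node1:1,node2:1 -x NCCL_NET_GDR_READ=1 \
    ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1

# Default: sends are first staged through a pinned buffer in host memory.
mpirun -np 2 -H node1:1,node2:1 -x NCCL_NET_GDR_READ=0 \
    ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1
```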