Performance issue with NVSHMEM example
covoge opened this issue · 13 comments
Hello,
I am observing significant performance issues running the nvshmem benchmark over InfiniBand. I'm running RHEL 7.9 on two compute nodes, each with 8x NVIDIA A100-SXM4 (40 GB RAM) and 2x AMD EPYC 7352 CPUs (24 cores). The A100s inside a compute node are connected via NVLink; the compute nodes are connected via InfiniBand.
I'm configuring the nvshmem example with nx = ny = 32768. Running it with two GPUs on a single compute node yields the expected result of about half the single-GPU runtime (from ~10 seconds to ~5 seconds). Running the same example with two GPUs on two compute nodes (one GPU per compute node) results in a runtime of ~43 seconds, about 4.3 times slower than the single-GPU version.
I'm using NVSHMEM 2.5.0 and OpenMPI 4.1.1. Do you have any idea what could cause this issue? Did I make a mistake while configuring/installing NVSHMEM?
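For reference, this is roughly how I launch the two configurations. The binary name, flags, and hostfile below reflect my setup and are only meant as a sketch:

# two GPUs on a single compute node (NVLink)
mpirun -np 2 ./jacobi -nx 32768 -ny 32768

# two GPUs on two compute nodes, one GPU per node (InfiniBand); "hosts" is a placeholder hostfile
mpirun -np 2 -npernode 1 -hostfile hosts ./jacobi -nx 32768 -ny 32768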
I'd greatly appreciate any tips or suggestions.
@pazkaI for multinode runs, please use the nvshmem_opt directory, as it is optimized for multinode runs.
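A minimal sketch of switching over, assuming the same Makefile-based build as the nvshmem directory and that the NVSHMEM/CUDA/MPI environment used for the existing build is already set (paths and flags are placeholders):

# build the optimized multi-node variant and launch it the same way as before
cd nvshmem_opt
make
mpirun -np 2 -npernode 1 -hostfile hosts ./jacobi -nx 32768 -ny 32768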
@akhillanger thanks for the suggestion - I tried the nvshmem_opt version with multiple compute nodes and it seems to scale as expected.
Hi! I'm also trying to run nvshmem in a multi-node environment. However, I encountered an error where two GPUs from two different nodes could not access each other.
When I run my application on SLURM using:
mpirun -np $SLURM_NTASKS $APP_PATH $DATA $SLURM_NTASKS
I got the following error report:
mpi_info: rank: 0, nranks: 2
mpi_info: rank: 1, nranks: 2
PE-0, local_gpu_num: 1, local_gpu_id: 0
PE-1, local_gpu_num: 1, local_gpu_id: 0
src/topo/topo.cpp:61: [GPU 1] Peer GPU 0 is not accessible, exiting ...
src/init/init.cpp:276: non-zero status: 3 building transport map failed
src/mem/mem.cpp:nvshmem_malloc:342: nvshmem initialization failed, exiting
src/topo/topo.cpp:61: [GPU 0] Peer GPU 1 is not accessible, exiting ...
src/init/init.cpp:276: non-zero status: 3 building transport map failed
src/mem/mem.cpp:nvshmem_malloc:342: nvshmem initialization failed, exiting
To clarify, these two nodes have an IB connection. Does the problem come from my mpirun command? Or do I have to configure NVSHMEM to tell it to use the IB connection?
It's great to see that you have successfully run NVSHMEM on multiple nodes. Could you please help me solve this problem?
@zzzlxhhh can you share a bit more about your system configuration? What GPUs are you using? Are you using dmabuf or nvidia-peermem to set up GPUDirect RDMA? Working GPUDirect RDMA is a requirement for multi-node NVSHMEM.
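A rough way to confirm the IB side and see what NVSHMEM selects at runtime; treat this as a sketch, since device names and the exact launch line depend on your cluster:

# check that the IB devices are visible and the ports are active
ibv_devinfo | grep -E 'hca_id|state'

# re-run with NVSHMEM debug output to see which transports get selected
NVSHMEM_DEBUG=INFO mpirun -np $SLURM_NTASKS $APP_PATH $DATA $SLURM_NTASKS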
Thanks for your reply. I'm running my application on two nodes, each with 8 A800 GPUs, on the school cluster. It's very tricky to sort out all the dependencies required for NVSHMEM on multiple nodes since I'm not the cluster administrator. I'm going to use the NVIDIA HPC SDK to build a container to run my application.
To clarify, I can run my application successfully on a single node with multiple GPUs.
By the way, does enabling GPUDirect RDMA mean I have to install GDRCopy and nvidia-peermem? Also, before nvidia-peermem, there was nv_peer_mem. Do all NVSHMEM versions support the latest nvidia-peermem, or do some versions of NVSHMEM rely on nv_peer_mem?
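For what it's worth, this is what I plan to check on the compute nodes without admin rights; I'm guessing at the relevant module names here:

# which peer-memory / GDRCopy kernel modules are loaded, if any?
lsmod | grep -E 'nvidia_peermem|nv_peer_mem|gdrdrv'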
To update, I am now using the NVIDIA HPC SDK container 22.11-devel-cuda11.8-ubuntu22.04 to run my application, to ensure that all the dependencies are installed properly. However, I still run into trouble.
The SLURM script is as follows:
#!/bin/bash
#SBATCH -p g078t2
#SBATCH -N 2
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --time=10:00
#SBATCH --comment=idmg_bupt
mpirun -n 2 singularity exec --nv -B /usr/local/nvidia \
~/nvsdk_118.sif ./coco.sh
The content of ./coco.sh, which is executed inside the Singularity container, sets up the environment and runs my application:
#!/bin/bash
export LD_LIBRARY_PATH=/usr/local/nvidia/lib:$LD_LIBRARY_PATH
export PATH=/usr/local/nvidia:$PATH
source /usr/share/lmod/lmod/init/bash
module load nvhpc/22.11
~/distSpMM/AdaCo_AE/build/distSpMM com-Amazon 2 32 0 0
However, the error report is:
INFO: Mounting image with FUSE.
INFO: Mounting image with FUSE.
WARNING: underlay of /usr/bin/nvidia-smi required more than 50 (428) bind mounts
WARNING: underlay of /usr/bin/nvidia-smi required more than 50 (428) bind mounts
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:
Local host: gpu8
Device name: mlx5_0
Device vendor ID: 0x02c9
Device vendor part ID: 4123
Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There are more than one active ports on host 'gpu8', but the
default subnet GID prefix was detected on more than one of these
ports. If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI. This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.
Please see this FAQ entry for more details:
http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: gpu8
Local device: mlx5_2
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
mpi_info: rank: 0, nranks: 2
mpi_info: rank: 1, nranks: 2
PE-1, local_gpu_num: 1, local_gpu_id: 0
PE-0, local_gpu_num: 1, local_gpu_id: 0
group_size is set to 0 for rowSpMM
graph_name: com-Amazon, num_GPUs: 2, dim: 32
IO read from bin time: 42.304001 ms
[gpu9:16180] *** Process received signal ***
[gpu9:16180] Signal: Segmentation fault (11)
[gpu9:16180] Signal code: Invalid permissions (2)
[gpu9:16180] Failing at address: 0x14724700e790
[gpu9:16180] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x147247bf0520]
[gpu9:16180] [ 1] [0x14724700e790]
[gpu9:16180] *** End of error message ***
[gpu8:16579] *** Process received signal ***
[gpu8:16579] Signal: Segmentation fault (11)
[gpu8:16579] Signal code: Invalid permissions (2)
[gpu8:16579] Failing at address: 0x147ff9535790
[gpu8:16579] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x147ff99f0520]
[gpu8:16579] [ 1] [0x147ff9535790]
[gpu8:16579] *** End of error message ***
/home/u2022110987/distSpMM/AdaCo_AE/multi_node/coco.sh: line 7: 16579 Segmentation fault (core dumped) ~/distSpMM/AdaCo_AE/build/distSpMM com-Amazon 2 32 0 0
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
/home/u2022110987/distSpMM/AdaCo_AE/multi_node/coco.sh: line 7: 16180 Segmentation fault (core dumped) ~/distSpMM/AdaCo_AE/build/distSpMM com-Amazon 2 32 0 0
[gpu8:16510] 7 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[gpu8:16510] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[gpu8:16510] 5 more processes have sent help message help-mpi-btl-openib.txt / default subnet prefix
[gpu8:16510] 3 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[37160,1],0]
Exit code: 139
--------------------------------------------------------------------------
Could this be because the openmpi-4.1.1 on the host is not the same as the openmpi-4.1.5a1 inside the Singularity container? Should I recompile OpenMPI on the host side?
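For reference, this is how I am comparing the two versions (the image path is the one from my SLURM script above):

mpirun --version                                        # Open MPI on the host
singularity exec --nv ~/nvsdk_118.sif mpirun --version  # Open MPI inside the container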