NCCL Tree allreduce test cannot reach the theoretical bus bandwidth on 2 nodes with 4 nics
ProHuper opened this issue · 0 comments
$ nvidia-smi topo -m
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  CPU Affinity   NUMA Affinity  GPU NUMA ID
GPU0  X     NV18  NV18  NV18  NV18  NV18  NV18  NV18  PIX   SYS   SYS   SYS   SYS   SYS   0-47,96-143    0              N/A
GPU1  NV18  X     NV18  NV18  NV18  NV18  NV18  NV18  SYS   SYS   SYS   SYS   SYS   SYS   0-47,96-143    0              N/A
GPU2  NV18  NV18  X     NV18  NV18  NV18  NV18  NV18  SYS   PIX   SYS   SYS   SYS   SYS   0-47,96-143    0              N/A
GPU3  NV18  NV18  NV18  X     NV18  NV18  NV18  NV18  SYS   SYS   SYS   SYS   SYS   SYS   0-47,96-143    0              N/A
GPU4  NV18  NV18  NV18  NV18  X     NV18  NV18  NV18  SYS   SYS   PIX   SYS   SYS   SYS   48-95,144-191  1              N/A
GPU5  NV18  NV18  NV18  NV18  NV18  X     NV18  NV18  SYS   SYS   SYS   SYS   SYS   SYS   48-95,144-191  1              N/A
GPU6  NV18  NV18  NV18  NV18  NV18  NV18  X     NV18  SYS   SYS   SYS   PIX   SYS   SYS   48-95,144-191  1              N/A
GPU7  NV18  NV18  NV18  NV18  NV18  NV18  NV18  X     SYS   SYS   SYS   SYS   PIX   SYS   48-95,144-191  1              N/A
NIC0  PIX   SYS   SYS   SYS   SYS   SYS   SYS   SYS   X     SYS   SYS   SYS   SYS   SYS
NIC1  SYS   SYS   PIX   SYS   SYS   SYS   SYS   SYS   SYS   X     SYS   SYS   SYS   SYS
NIC2  SYS   SYS   SYS   SYS   PIX   SYS   SYS   SYS   SYS   SYS   X     SYS   SYS   SYS
NIC3  SYS   SYS   SYS   SYS   SYS   SYS   PIX   SYS   SYS   SYS   SYS   X     SYS   SYS
NIC4  SYS   SYS   SYS   SYS   SYS   SYS   SYS   PIX   SYS   SYS   SYS   SYS   X     SYS
NIC5  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_4
NIC3: mlx5_5
NIC4: mlx5_6
NIC5: mlx5_bond_0
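To make the GPU/NIC affinity in the matrix above easier to read, here is a small Python sketch that lists, for each GPU, the NICs it reaches over a PIX link. The rows are transcribed by hand from the `nvidia-smi topo -m` output above, and the helper name `closest_nics` is hypothetical (it is not part of any NVIDIA tool).

```python
# NIC columns (NIC0..NIC5) of each GPU row, transcribed from the
# `nvidia-smi topo -m` output in this issue.
GPU_TO_NIC_LINKS = {
    "GPU0": ["PIX", "SYS", "SYS", "SYS", "SYS", "SYS"],
    "GPU1": ["SYS", "SYS", "SYS", "SYS", "SYS", "SYS"],
    "GPU2": ["SYS", "PIX", "SYS", "SYS", "SYS", "SYS"],
    "GPU3": ["SYS", "SYS", "SYS", "SYS", "SYS", "SYS"],
    "GPU4": ["SYS", "SYS", "PIX", "SYS", "SYS", "SYS"],
    "GPU5": ["SYS", "SYS", "SYS", "SYS", "SYS", "SYS"],
    "GPU6": ["SYS", "SYS", "SYS", "PIX", "SYS", "SYS"],
    "GPU7": ["SYS", "SYS", "SYS", "SYS", "PIX", "SYS"],
}

# NIC legend from the same output.
NIC_NAMES = ["mlx5_0", "mlx5_1", "mlx5_4", "mlx5_5", "mlx5_6", "mlx5_bond_0"]


def closest_nics(gpu, link="PIX"):
    """Return the device names of all NICs connected to `gpu` via `link`."""
    return [
        NIC_NAMES[i]
        for i, kind in enumerate(GPU_TO_NIC_LINKS[gpu])
        if kind == link
    ]


if __name__ == "__main__":
    for gpu in GPU_TO_NIC_LINKS:
        nics = closest_nics(gpu)
        print(gpu, "->", ", ".join(nics) if nics else "no PIX-attached NIC (SYS only)")
```

With this data, GPU0 is PIX-attached only to mlx5_0, while GPU1, GPU3 and GPU5 reach every NIC over SYS. That may be relevant to the single-GPU run, assuming each rank there runs on GPU0.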
Two-node allreduce test with 8 H100 GPUs per node, using 4 NICs: busbw reaches about 309 GB/s, but the theoretical busbw should be 360 GB/s.
$ mpirun --allow-run-as-root --hostfile hosts.txt --oversubscribe -x NCCL_ALGO=Tree -x NCCL_DEBUG=INFO -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_QPS_PER_CONNECTION=2 -x LD_LIBRARY_PATH -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_4,mlx5_5 -np 16 ./all_reduce_perf -b 2M -e 16G -f 2 -n 10 -g 1 -w 10
#
#                                                    out-of-place                      in-place
#       size        count   type redop root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)   (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     2097152       524288  float   sum   -1    118.1   17.75   33.29      0    92.65   22.64   42.44      0
     4194304      1048576  float   sum   -1    104.8   40.01   75.03      0    105.4   39.78   74.59      0
     8388608      2097152  float   sum   -1    140.7   59.60  111.75      0    142.9   58.72  110.10      0
    16777216      4194304  float   sum   -1    231.9   72.33  135.62      0    237.8   70.56  132.29      0
    33554432      8388608  float   sum   -1    412.3   81.39  152.60      0    417.3   80.40  150.75      0
    67108864     16777216  float   sum   -1    663.5  101.14  189.64      0    672.7   99.76  187.05      0
   134217728     33554432  float   sum   -1   1168.2  114.89  215.42      0   1311.3  102.35  191.91      0
   268435456     67108864  float   sum   -1   2130.3  126.01  236.27      0   2130.6  125.99  236.23      0
   536870912    134217728  float   sum   -1   3611.0  148.68  278.77      0   3603.2  149.00  279.37      0
  1073741824    268435456  float   sum   -1   6793.3  158.06  296.36      0   6781.1  158.34  296.89      0
  2147483648    536870912  float   sum   -1    13184  162.89  305.41      0    13129  163.56  306.68      0
  4294967296   1073741824  float   sum   -1    25986  165.28  309.90      0    25893  165.87  311.01      0
Two-node allreduce test with 1 H100 GPU per node, using 4 NICs: busbw reaches only about 50 GB/s, but the theoretical busbw should be 200 GB/s.
$ mpirun --allow-run-as-root --hostfile hosts.txt --oversubscribe -x NCCL_ALGO=Tree -x NCCL_DEBUG=INFO -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_QPS_PER_CONNECTION=2 -x LD_LIBRARY_PATH -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_4,mlx5_5 -np 2 ./all_reduce_perf -b 2M -e 16G -f 2 -n 10 -g 1 -w 10
#
#                                                    out-of-place                      in-place
#       size        count   type redop root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)   (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     2097152       524288  float   sum   -1    113.2   18.53   18.53      0    93.35   22.46   22.46      0
     4194304      1048576  float   sum   -1    154.4   27.16   27.16      0    153.3   27.37   27.37      0
     8388608      2097152  float   sum   -1    231.4   36.24   36.24      0    227.8   36.83   36.83      0
    16777216      4194304  float   sum   -1    420.5   39.90   39.90      0    419.9   39.95   39.95      0
    33554432      8388608  float   sum   -1    812.3   41.31   41.31      0    808.2   41.52   41.52      0
    67108864     16777216  float   sum   -1   1545.1   43.43   43.43      0   1561.3   42.98   42.98      0
   134217728     33554432  float   sum   -1   2973.1   45.14   45.14      0   2970.4   45.19   45.19      0
   268435456     67108864  float   sum   -1   5715.9   46.96   46.96      0   5676.1   47.29   47.29      0
   536870912    134217728  float   sum   -1    11146   48.17   48.17      0    11156   48.12   48.12      0
  1073741824    268435456  float   sum   -1    22062   48.67   48.67      0    21997   48.81   48.81      0
  2147483648    536870912  float   sum   -1    43733   49.10   49.10      0    43697   49.15   49.15      0
  4294967296   1073741824  float   sum   -1    87278   49.21   49.21      0    87197   49.26   49.26      0
  8589934592   2147483648  float   sum   -1   174121   49.33   49.33      0   174234   49.30   49.30      0
 17179869184   4294967296  float   sum   -1   347919   49.38   49.38      0   347833   49.39   49.39      0
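As a sanity check on the numbers in both tables, the following Python sketch recomputes algbw and busbw from the size and time columns. It uses the nccl-tests definition for allreduce, busbw = algbw × 2(n−1)/n with n the number of ranks. The per-NIC figure of 50 GB/s is an assumption, inferred only from the "200 GB/s for 4 NICs" expectation stated above, and the function names are illustrative.

```python
def algbw_gbs(size_bytes, time_us):
    """Algorithm bandwidth in GB/s (1 GB = 1e9 bytes): data size / elapsed time."""
    return size_bytes / (time_us * 1e-6) / 1e9


def allreduce_busbw_gbs(algbw, nranks):
    """Bus bandwidth for allreduce as defined by nccl-tests: algbw * 2(n-1)/n."""
    return algbw * 2 * (nranks - 1) / nranks


# Assumption: 4 NICs are expected to deliver 200 GB/s in total, so 50 GB/s each.
PER_NIC_GBS = 200 / 4

# Last out-of-place row of each run: (size in bytes, time in us, number of ranks).
RUNS = {
    "2 nodes x 8 GPUs": (4294967296, 25986, 16),
    "2 nodes x 1 GPU": (17179869184, 347919, 2),
}

if __name__ == "__main__":
    for name, (size, time_us, nranks) in RUNS.items():
        algbw = algbw_gbs(size, time_us)
        busbw = allreduce_busbw_gbs(algbw, nranks)
        print(f"{name}: algbw={algbw:.2f} GB/s, busbw={busbw:.2f} GB/s")

    # For the 2-rank run busbw equals algbw, so it can be compared directly
    # with the assumed per-NIC rate.
    size, time_us, nranks = RUNS["2 nodes x 1 GPU"]
    ratio = allreduce_busbw_gbs(algbw_gbs(size, time_us), nranks) / PER_NIC_GBS
    print(f"1-GPU run uses about {ratio:.2f}x the assumed per-NIC bandwidth")
```

This reproduces the 309.90 GB/s and 49.38 GB/s figures from the tables. Under the 50 GB/s per-NIC assumption, the single-GPU result is roughly what one NIC alone would deliver.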
The NCCL INFO log shows that GDRDMA used only one NIC: every channel goes through NET/IBext/0.
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 02/0 : 1[0] -> 0[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 03/0 : 1[0] -> 0[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 02/0 : 1[0] -> 0[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 03/0 : 1[0] -> 0[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48049 [0] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 2.
qh100-gpu20:38570:38582 [0] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 2.
qh100-gpu20:38570:38582 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
qh100-gpu19:48036:48049 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
qh100-gpu19:48036:48051 [0] NCCL INFO Connected all rings
qh100-gpu20:38570:38584 [0] NCCL INFO Connected all rings
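To check this on a full log rather than by eye, a short Python sketch like the one below can count which network devices the channels use on each host. It assumes that the number after the plugin name in `NET/IBext/0/GDRDMA` is NCCL's network device index and that the log lines have the shape shown above; the regular expression and the function name `net_devices_used` are illustrative, not part of NCCL.

```python
import re
from collections import Counter

# Matches lines such as:
# qh100-gpu20:38570:38584 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/IBext/0/GDRDMA
CHANNEL_RE = re.compile(
    r"^(?P<host>[\w.-]+):\d+:\d+ \[\d+\] NCCL INFO Channel \d+/\d+ : "
    r"\d+\[\w+\] -> \d+\[\w+\] \[(?:send|receive)\] via NET/(?P<net>\w+)/(?P<dev>\d+)"
)


def net_devices_used(log_text):
    """Count channel connections per (host, network plugin, device index)."""
    used = Counter()
    for line in log_text.splitlines():
        match = CHANNEL_RE.match(line.strip())
        if match:
            key = (match["host"], match["net"], int(match["dev"]))
            used[key] += 1
    return used


# A few lines copied from the log above.
SAMPLE_LOG = """\
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Connected all rings
"""

if __name__ == "__main__":
    for (host, net, dev), count in sorted(net_devices_used(SAMPLE_LOG).items()):
        print(f"{host}: {count} channel connection(s) via NET/{net}/{dev}")
```

If more than one NIC were in use, the counter would contain several device indices per host; for the log above only index 0 appears.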