has nvswitch, but uses 0 nvls channels
MiyazonoKaori opened this issue · 3 comments
MiyazonoKaori commented
The host has NVLink and NVSwitch, but nccl-tests reports 0 nvls channels and the bus bandwidth is only about 10 GB/s. How should I troubleshoot and fix this?
root@user:/home/nccl-tests-master# mpirun --allow-run-as-root -np 8 -x NCCL_DEBUG=INFO ./build/all_reduce_perf -b 128M -e 4096M -f 2
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: user
Local adapter: mlx5_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: user
Local device: mlx5_0
--------------------------------------------------------------------------
# nThread 1 nGpus 1 minBytes 134217728 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 5179 on user device 0 [0x27] NVIDIA A100-SXM4-80GB
# Rank 1 Group 0 Pid 5180 on user device 1 [0x2a] NVIDIA A100-SXM4-80GB
# Rank 2 Group 0 Pid 5181 on user device 2 [0x51] NVIDIA A100-SXM4-80GB
# Rank 3 Group 0 Pid 5182 on user device 3 [0x57] NVIDIA A100-SXM4-80GB
# Rank 4 Group 0 Pid 5183 on user device 4 [0x9e] NVIDIA A100-SXM4-80GB
# Rank 5 Group 0 Pid 5184 on user device 5 [0xa4] NVIDIA A100-SXM4-80GB
# Rank 6 Group 0 Pid 5185 on user device 6 [0xc7] NVIDIA A100-SXM4-80GB
# Rank 7 Group 0 Pid 5186 on user device 7 [0xca] NVIDIA A100-SXM4-80GB
user:5179:5179 [0] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:5179:5179 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:5179:5179 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:5179:5179 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.18.1+cuda12.1
user:5180:5180 [1] NCCL INFO cudaDriverVersion 12020
user:5180:5180 [1] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:5180:5180 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:5180:5180 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:5181:5181 [2] NCCL INFO cudaDriverVersion 12020
user:5181:5181 [2] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:5181:5181 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:5181:5181 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:5182:5182 [3] NCCL INFO cudaDriverVersion 12020
user:5182:5182 [3] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:5182:5182 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:5182:5182 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:5185:5185 [6] NCCL INFO cudaDriverVersion 12020
user:5185:5185 [6] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:5185:5185 [6] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:5185:5185 [6] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:5186:5186 [7] NCCL INFO cudaDriverVersion 12020
user:5186:5186 [7] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:5186:5186 [7] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:5186:5186 [7] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:5183:5183 [4] NCCL INFO cudaDriverVersion 12020
user:5184:5184 [5] NCCL INFO cudaDriverVersion 12020
user:5183:5183 [4] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:5183:5183 [4] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:5183:5183 [4] NCCL INFO NET/Plugin : No plugin found, using internal implementation
user:5184:5184 [5] NCCL INFO Bootstrap : Using ibs85f0:192.168.1.10<0>
user:5184:5184 [5] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
user:5184:5184 [5] NCCL INFO NET/Plugin : No plugin found, using internal implementation
[user:05146] 7 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[user:05146] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[user:05146] 7 more processes have sent help message help-mpi-btl-openib.txt / error in device init
user:5179:5230 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:5179:5230 [0] NCCL INFO NCCL_IB_HCA set to mlx5_2:1,mlx5_0:1
user:5181:5232 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:5181:5232 [2] NCCL INFO NCCL_IB_HCA set to mlx5_2:1,mlx5_0:1
user:5179:5230 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ibs85f0:192.168.1.10<0>
user:5179:5230 [0] NCCL INFO Using network IB
user:5182:5233 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:5182:5233 [3] NCCL INFO NCCL_IB_HCA set to mlx5_2:1,mlx5_0:1
user:5186:5235 [7] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:5186:5235 [7] NCCL INFO NCCL_IB_HCA set to mlx5_2:1,mlx5_0:1
user:5183:5236 [4] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:5183:5236 [4] NCCL INFO NCCL_IB_HCA set to mlx5_2:1,mlx5_0:1
user:5181:5232 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ibs85f0:192.168.1.10<0>
user:5181:5232 [2] NCCL INFO Using network IB
user:5184:5237 [5] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:5184:5237 [5] NCCL INFO NCCL_IB_HCA set to mlx5_2:1,mlx5_0:1
user:5182:5233 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ibs85f0:192.168.1.10<0>
user:5182:5233 [3] NCCL INFO Using network IB
user:5180:5231 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:5180:5231 [1] NCCL INFO NCCL_IB_HCA set to mlx5_2:1,mlx5_0:1
user:5186:5235 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ibs85f0:192.168.1.10<0>
user:5186:5235 [7] NCCL INFO Using network IB
user:5185:5234 [6] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:5185:5234 [6] NCCL INFO NCCL_IB_HCA set to mlx5_2:1,mlx5_0:1
user:5183:5236 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ibs85f0:192.168.1.10<0>
user:5183:5236 [4] NCCL INFO Using network IB
user:5180:5231 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ibs85f0:192.168.1.10<0>
user:5180:5231 [1] NCCL INFO Using network IB
user:5184:5237 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ibs85f0:192.168.1.10<0>
user:5184:5237 [5] NCCL INFO Using network IB
user:5185:5234 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ibs85f0:192.168.1.10<0>
user:5185:5234 [6] NCCL INFO Using network IB
user:5184:5237 [5] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
user:5182:5233 [3] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
user:5184:5237 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000
user:5184:5237 [5] NCCL INFO NVLS multicast support is not available on dev 5
user:5182:5233 [3] NCCL INFO NVLS multicast support is not available on dev 3
user:5186:5235 [7] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
user:5186:5235 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,00000000,ffffffff,00000000
user:5186:5235 [7] NCCL INFO NVLS multicast support is not available on dev 7
user:5179:5230 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
user:5179:5230 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
user:5179:5230 [0] NCCL INFO NVLS multicast support is not available on dev 0
user:5181:5232 [2] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
user:5181:5232 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff
user:5181:5232 [2] NCCL INFO NVLS multicast support is not available on dev 2
user:5180:5231 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
user:5180:5231 [1] NCCL INFO NVLS multicast support is not available on dev 1
user:5185:5234 [6] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
user:5185:5234 [6] NCCL INFO NVLS multicast support is not available on dev 6
user:5183:5236 [4] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
user:5183:5236 [4] NCCL INFO NVLS multicast support is not available on dev 4
user:5179:5230 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4 5 6 7
user:5179:5230 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4 5 6 7
user:5179:5230 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
user:5179:5230 [0] NCCL INFO P2P Chunksize set to 131072
user:5184:5237 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4
user:5184:5237 [5] NCCL INFO P2P Chunksize set to 131072
user:5180:5231 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
user:5180:5231 [1] NCCL INFO P2P Chunksize set to 131072
user:5181:5232 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
user:5181:5232 [2] NCCL INFO P2P Chunksize set to 131072
user:5185:5234 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5
user:5185:5234 [6] NCCL INFO P2P Chunksize set to 131072
user:5186:5235 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6
user:5186:5235 [7] NCCL INFO P2P Chunksize set to 131072
user:5182:5233 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2
user:5182:5233 [3] NCCL INFO P2P Chunksize set to 131072
user:5183:5236 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3
user:5183:5236 [4] NCCL INFO P2P Chunksize set to 131072
user:5184:5237 [5] NCCL INFO Channel 00 : 5[a4000] -> 6[c7000] via SHM/direct/direct
user:5186:5235 [7] NCCL INFO Channel 00 : 7[ca000] -> 0[27000] via SHM/direct/direct
user:5182:5233 [3] NCCL INFO Channel 00 : 3[57000] -> 4[9e000] via SHM/direct/direct
user:5186:5235 [7] NCCL INFO Channel 01 : 7[ca000] -> 0[27000] via SHM/direct/direct
user:5184:5237 [5] NCCL INFO Channel 01 : 5[a4000] -> 6[c7000] via SHM/direct/direct
user:5182:5233 [3] NCCL INFO Channel 01 : 3[57000] -> 4[9e000] via SHM/direct/direct
user:5179:5230 [0] NCCL INFO Channel 00 : 0[27000] -> 1[2a000] via SHM/direct/direct
user:5183:5236 [4] NCCL INFO Channel 00 : 4[9e000] -> 5[a4000] via SHM/direct/direct
user:5181:5232 [2] NCCL INFO Channel 00 : 2[51000] -> 3[57000] via SHM/direct/direct
user:5179:5230 [0] NCCL INFO Channel 01 : 0[27000] -> 1[2a000] via SHM/direct/direct
user:5180:5231 [1] NCCL INFO Channel 00 : 1[2a000] -> 2[51000] via SHM/direct/direct
user:5183:5236 [4] NCCL INFO Channel 01 : 4[9e000] -> 5[a4000] via SHM/direct/direct
user:5185:5234 [6] NCCL INFO Channel 00 : 6[c7000] -> 7[ca000] via SHM/direct/direct
user:5181:5232 [2] NCCL INFO Channel 01 : 2[51000] -> 3[57000] via SHM/direct/direct
user:5180:5231 [1] NCCL INFO Channel 01 : 1[2a000] -> 2[51000] via SHM/direct/direct
user:5185:5234 [6] NCCL INFO Channel 01 : 6[c7000] -> 7[ca000] via SHM/direct/direct
user:5184:5237 [5] NCCL INFO Connected all rings
user:5183:5236 [4] NCCL INFO Connected all rings
user:5186:5235 [7] NCCL INFO Connected all rings
user:5182:5233 [3] NCCL INFO Connected all rings
user:5186:5235 [7] NCCL INFO Channel 00 : 7[ca000] -> 6[c7000] via SHM/direct/direct
user:5186:5235 [7] NCCL INFO Channel 01 : 7[ca000] -> 6[c7000] via SHM/direct/direct
user:5180:5231 [1] NCCL INFO Connected all rings
user:5179:5230 [0] NCCL INFO Connected all rings
user:5181:5232 [2] NCCL INFO Connected all rings
user:5185:5234 [6] NCCL INFO Connected all rings
user:5180:5231 [1] NCCL INFO Channel 00 : 1[2a000] -> 0[27000] via SHM/direct/direct
user:5180:5231 [1] NCCL INFO Channel 01 : 1[2a000] -> 0[27000] via SHM/direct/direct
user:5183:5236 [4] NCCL INFO Channel 00 : 4[9e000] -> 3[57000] via SHM/direct/direct
user:5183:5236 [4] NCCL INFO Channel 01 : 4[9e000] -> 3[57000] via SHM/direct/direct
user:5182:5233 [3] NCCL INFO Channel 00 : 3[57000] -> 2[51000] via SHM/direct/direct
user:5184:5237 [5] NCCL INFO Channel 00 : 5[a4000] -> 4[9e000] via SHM/direct/direct
user:5184:5237 [5] NCCL INFO Channel 01 : 5[a4000] -> 4[9e000] via SHM/direct/direct
user:5182:5233 [3] NCCL INFO Channel 01 : 3[57000] -> 2[51000] via SHM/direct/direct
user:5185:5234 [6] NCCL INFO Channel 00 : 6[c7000] -> 5[a4000] via SHM/direct/direct
user:5185:5234 [6] NCCL INFO Channel 01 : 6[c7000] -> 5[a4000] via SHM/direct/direct
user:5181:5232 [2] NCCL INFO Channel 00 : 2[51000] -> 1[2a000] via SHM/direct/direct
user:5181:5232 [2] NCCL INFO Channel 01 : 2[51000] -> 1[2a000] via SHM/direct/direct
user:5179:5230 [0] NCCL INFO Connected all trees
user:5179:5230 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:5179:5230 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:5186:5235 [7] NCCL INFO Connected all trees
user:5186:5235 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:5186:5235 [7] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:5183:5236 [4] NCCL INFO Connected all trees
user:5183:5236 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:5183:5236 [4] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:5181:5232 [2] NCCL INFO Connected all trees
user:5181:5232 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:5181:5232 [2] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:5180:5231 [1] NCCL INFO Connected all trees
user:5180:5231 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:5180:5231 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:5182:5233 [3] NCCL INFO Connected all trees
user:5182:5233 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:5182:5233 [3] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:5184:5237 [5] NCCL INFO Connected all trees
user:5184:5237 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:5184:5237 [5] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:5185:5234 [6] NCCL INFO Connected all trees
user:5185:5234 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
user:5185:5234 [6] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
user:5184:5237 [5] NCCL INFO comm 0x564d7ea5fc30 rank 5 nranks 8 cudaDev 5 busId a4000 commId 0x854cf9285512ab36 - Init COMPLETE
user:5186:5235 [7] NCCL INFO comm 0x55f350239b90 rank 7 nranks 8 cudaDev 7 busId ca000 commId 0x854cf9285512ab36 - Init COMPLETE
user:5179:5230 [0] NCCL INFO comm 0x564058dcc010 rank 0 nranks 8 cudaDev 0 busId 27000 commId 0x854cf9285512ab36 - Init COMPLETE
#
#                                                              out-of-place                      in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
user:5180:5231 [1] NCCL INFO comm 0x56295fb9c590 rank 1 nranks 8 cudaDev 1 busId 2a000 commId 0x854cf9285512ab36 - Init COMPLETE
user:5183:5236 [4] NCCL INFO comm 0x5626b0ec9f90 rank 4 nranks 8 cudaDev 4 busId 9e000 commId 0x854cf9285512ab36 - Init COMPLETE
user:5182:5233 [3] NCCL INFO comm 0x561edcb35a60 rank 3 nranks 8 cudaDev 3 busId 57000 commId 0x854cf9285512ab36 - Init COMPLETE
user:5181:5232 [2] NCCL INFO comm 0x5586b2a03e40 rank 2 nranks 8 cudaDev 2 busId 51000 commId 0x854cf9285512ab36 - Init COMPLETE
user:5185:5234 [6] NCCL INFO comm 0x55d06b211a20 rank 6 nranks 8 cudaDev 6 busId c7000 commId 0x854cf9285512ab36 - Init COMPLETE
   134217728      33554432     float     sum      -1    25061    5.36    9.37      0    25105    5.35    9.36      0
   268435456      67108864     float     sum      -1    50084    5.36    9.38      0    50148    5.35    9.37      0
   536870912     134217728     float     sum      -1   100124    5.36    9.38      0   100244    5.36    9.37      0
root@user:/home/nccl-tests-master# nvidia-smi
Tue Jun 25 11:06:19 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:27:00.0 Off | 0 |
| N/A 29C P0 56W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:2A:00.0 Off | 0 |
| N/A 27C P0 59W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB On | 00000000:51:00.0 Off | 0 |
| N/A 27C P0 60W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB On | 00000000:57:00.0 Off | 0 |
| N/A 30C P0 59W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM4-80GB On | 00000000:9E:00.0 Off | 0 |
| N/A 29C P0 56W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM4-80GB On | 00000000:A4:00.0 Off | 0 |
| N/A 28C P0 58W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM4-80GB On | 00000000:C7:00.0 Off | 0 |
| N/A 26C P0 57W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM4-80GB On | 00000000:CA:00.0 Off | 0 |
| N/A 30C P0 59W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
root@user:/home/nccl-tests-master# nvidia-smi topo -m
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     NV12  NV12  NV12  NV12  NV12  NV12  NV12  PXB   PXB   PXB   PXB   SYS   SYS   0-31,64-95    0              N/A
GPU1  NV12  X     NV12  NV12  NV12  NV12  NV12  NV12  PXB   PXB   PXB   PXB   SYS   SYS   0-31,64-95    0              N/A
GPU2  NV12  NV12  X     NV12  NV12  NV12  NV12  NV12  SYS   SYS   SYS   SYS   SYS   SYS   0-31,64-95    0              N/A
GPU3  NV12  NV12  NV12  X     NV12  NV12  NV12  NV12  SYS   SYS   SYS   SYS   SYS   SYS   0-31,64-95    0              N/A
GPU4  NV12  NV12  NV12  NV12  X     NV12  NV12  NV12  SYS   SYS   SYS   SYS   SYS   SYS   32-63,96-127  1              N/A
GPU5  NV12  NV12  NV12  NV12  NV12  X     NV12  NV12  SYS   SYS   SYS   SYS   SYS   SYS   32-63,96-127  1              N/A
GPU6  NV12  NV12  NV12  NV12  NV12  NV12  X     NV12  SYS   SYS   SYS   SYS   SYS   SYS   32-63,96-127  1              N/A
GPU7  NV12  NV12  NV12  NV12  NV12  NV12  NV12  X     SYS   SYS   SYS   SYS   SYS   SYS   32-63,96-127  1              N/A
NIC0  PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   X     PIX   PXB   PXB   SYS   SYS
NIC1  PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   PIX   X     PXB   PXB   SYS   SYS
NIC2  PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   PXB   PXB   X     PIX   SYS   SYS
NIC3  PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   PXB   PXB   PIX   X     SYS   SYS
NIC4  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   X     PIX
NIC5  SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   SYS   PIX   X
root@user:/home/nccl-tests-master# systemctl status nvidia-fabricmanager
● nvidia-fabricmanager.service - NVIDIA fabric manager service
Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2024-06-25 10:22:43 UTC; 36min ago
Process: 3418 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=0/SUCCESS)
Main PID: 3429 (nv-fabricmanage)
Tasks: 18 (limit: 629145)
Memory: 21.4M
CGroup: /system.slice/nvidia-fabricmanager.service
└─3429 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg
Jun 25 10:22:19 user systemd[1]: Starting NVIDIA fabric manager service...
Jun 25 10:22:33 user nv-fabricmanager[3429]: Connected to 1 node.
Jun 25 10:22:43 user nv-fabricmanager[3429]: Successfully configured all the available GPUs and NVSwitches to route NVLink traffic.
Jun 25 10:22:43 user systemd[1]: Started NVIDIA fabric manager service.
root@user:/home/nccl-tests-master# nvidia-smi nvlink -s
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-ff8a40d4-abc6-08d1-f939-15848b5d4e05)
Link 0: 25 GB/s
Link 1: 25 GB/s
Link 2: 25 GB/s
Link 3: 25 GB/s
Link 4: 25 GB/s
Link 5: 25 GB/s
Link 6: 25 GB/s
Link 7: 25 GB/s
Link 8: 25 GB/s
Link 9: 25 GB/s
Link 10: 25 GB/s
Link 11: 25 GB/s
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-0c512562-871f-8257-6187-b5ae1b986d5e)
Link 0: 25 GB/s
Link 1: 25 GB/s
Link 2: 25 GB/s
Link 3: 25 GB/s
Link 4: 25 GB/s
Link 5: 25 GB/s
Link 6: 25 GB/s
Link 7: 25 GB/s
Link 8: 25 GB/s
Link 9: 25 GB/s
Link 10: 25 GB/s
Link 11: 25 GB/s
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-f4fb3d24-8773-b6b6-ae47-af7b37f5137d)
Link 0: 25 GB/s
......
sjeaugey commented
NVLS (NVLink SHARP) is a feature introduced with H100. A100 GPUs do not support it, so 0 nvls channels is expected on this machine.
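To confirm this from a saved NCCL_DEBUG=INFO log rather than by eye, a small script can collect the devices that report missing NVLS multicast support and the nvls channel counts. This is only a sketch: the function name summarize_nvls is made up for illustration, and the two patterns are copied from the log lines in this issue, so they may not match other NCCL versions.

```python
import re

# Patterns copied from the NCCL INFO lines in the log above (NCCL 2.18.1).
# Other NCCL versions may word these messages differently.
NVLS_UNAVAILABLE = re.compile(r"NVLS multicast support is not available on dev (\d+)")
NVLS_CHANNELS = re.compile(r"(\d+) coll channels, (\d+) nvls channels")


def summarize_nvls(log_text):
    """Return (sorted devices without NVLS multicast, set of nvls channel counts)."""
    devices = sorted({int(m.group(1)) for m in NVLS_UNAVAILABLE.finditer(log_text)})
    channel_counts = {int(m.group(2)) for m in NVLS_CHANNELS.finditer(log_text)}
    return devices, channel_counts


# Three lines taken from the log in this issue.
sample = """\
user:5184:5237 [5] NCCL INFO NVLS multicast support is not available on dev 5
user:5182:5233 [3] NCCL INFO NVLS multicast support is not available on dev 3
user:5179:5230 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
"""

devices, channel_counts = summarize_nvls(sample)
print("devices without NVLS multicast:", devices)
print("nvls channel counts seen:", channel_counts)
```

On an A100 system every device is expected to show up in the first list, and the only channel count seen should be 0.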
kiskra-nvidia commented
Yes, NVLS won't be available on this platform, but NCCL should still be using regular NVLinks instead of SHM/direct... I see the following in the output:
NCCL_P2P_LEVEL set by environment to LOC
This forces P2P off. Please unset this variable and you should see a considerable speedup...
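The same check can be scripted against a saved NCCL_DEBUG=INFO log: list which NCCL variables were picked up from the environment and which transport each channel ended up on. The sketch below is an illustration only; scan_nccl_log is a made-up helper name, and the patterns are based on the log lines quoted in this issue.

```python
import re
from collections import Counter

# Both patterns are modelled on lines from the log in this issue.
ENV_SET = re.compile(r"NCCL INFO (NCCL_\w+) set by environment to (\S+)")
CHANNEL_VIA = re.compile(r"NCCL INFO Channel \d+ : .* via (\S+)")


def scan_nccl_log(log_text):
    """Return (env overrides seen in the log, count of channels per transport)."""
    env = {}
    transports = Counter()
    for line in log_text.splitlines():
        m = ENV_SET.search(line)
        if m:
            # Some messages end with a period ("... to 0."), so strip it.
            env[m.group(1)] = m.group(2).rstrip(".")
            continue
        m = CHANNEL_VIA.search(line)
        if m:
            transports[m.group(1)] += 1
    return env, transports


# Lines taken from the log in this issue.
sample = """\
user:5179:5230 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
user:5184:5237 [5] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
user:5179:5230 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4 5 6 7
user:5184:5237 [5] NCCL INFO Channel 00 : 5[a4000] -> 6[c7000] via SHM/direct/direct
user:5186:5235 [7] NCCL INFO Channel 00 : 7[ca000] -> 0[27000] via SHM/direct/direct
"""

env, transports = scan_nccl_log(sample)
print("environment overrides:", env)
print("channel transports:", dict(transports))

# The combination reported in this issue: P2P restricted and every channel on SHM.
if env.get("NCCL_P2P_LEVEL") == "LOC" and all(t.startswith("SHM") for t in transports):
    print("All channels use shared memory; try unsetting NCCL_P2P_LEVEL.")
```

After unsetting the variable and rerunning the test, the first dictionary should no longer contain NCCL_P2P_LEVEL and the transport names should no longer start with SHM.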
MiyazonoKaori commented
After running export NCCL_P2P_DISABLE=0, the bandwidth reached 220 GB/s. Thanks!
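For anyone comparing numbers, the algbw and busbw columns in the nccl-tests output earlier in this thread can be reproduced from the size and time columns. The sketch below assumes the all-reduce convention described in the nccl-tests performance notes (algbw = size / time, busbw = algbw × 2(n−1)/n for n ranks); the function name allreduce_bandwidth is made up for illustration.

```python
def allreduce_bandwidth(size_bytes, time_us, n_ranks):
    """Return (algbw, busbw) in GB/s for one all-reduce measurement.

    Assumes the nccl-tests convention for all-reduce:
    algbw = size / time, busbw = algbw * 2 * (n - 1) / n.
    """
    algbw = size_bytes / (time_us * 1e-6) / 1e9
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


# (size in bytes, out-of-place time in microseconds) from the table in this issue.
rows = [
    (134217728, 25061),
    (268435456, 50084),
    (536870912, 100124),
]

results = [allreduce_bandwidth(size, time_us, n_ranks=8) for size, time_us in rows]
for (size, _), (algbw, busbw) in zip(rows, results):
    print(f"{size:>10} B  algbw {algbw:.2f} GB/s  busbw {busbw:.2f} GB/s")
```

With 8 ranks the factor is 2 × 7 / 8 = 1.75, which is why an algbw of about 5.36 GB/s appears as a busbw of about 9.37 GB/s in the original run.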