How weights and hops are calculated
vitduck opened this issue · 1 comments
Hi,
Considering the following output:
======================= ROCm System Management Interface =======================
=========================== Weight between two GPUs ============================
      GPU0  GPU1  GPU2  GPU3
GPU0  0     52    52    52
GPU1  52    0     52    52
GPU2  52    52    0     52
GPU3  52    52    52    0
============================ Hops between two GPUs =============================
      GPU0  GPU1  GPU2  GPU3
GPU0  0     3     3     3
GPU1  3     0     3     3
GPU2  3     3     0     3
GPU3  3     3     3     0
========================== Link Type between two GPUs ==========================
      GPU0  GPU1  GPU2  GPU3
GPU0  0     PCIE  PCIE  PCIE
GPU1  PCIE  0     PCIE  PCIE
GPU2  PCIE  PCIE  0     PCIE
GPU3  PCIE  PCIE  PCIE  0
================================== Numa Nodes ==================================
GPU[0] : (Topology) Numa Node: 0
GPU[0] : (Topology) Numa Affinity: 0
GPU[1] : (Topology) Numa Node: 1
GPU[1] : (Topology) Numa Affinity: 1
GPU[2] : (Topology) Numa Node: 3
GPU[2] : (Topology) Numa Affinity: 3
GPU[3] : (Topology) Numa Node: 2
GPU[3] : (Topology) Numa Affinity: 2
============================= End of ROCm SMI Log ==============================
I could not find any documentation explaining how hops and weights are calculated between AMD GPUs.
From the source code, it seems that they are sums of fixed values assigned to specific hardware.
In the case of NVIDIA, there is a clear hierarchy of topology connection types: NVLINK -> PIX -> PXB -> PHB -> NODE -> SYS -> X
So I can deduce the spatial relationship between GPUs from nvidia-smi.
With rocm-smi, it is not immediately clear to me how to interpret the weights and hops shown above.
Any clarification would be much appreciated.
Topology hops: XGMI/PCIE: GPU -> CPU(0->N) -> GPU(0->N)
https://rocm.docs.amd.com/en/latest/how_to/tuning_guides/mi200.html#hardware-verification-with-rocm has a good overview.
The first block of the output shows the distance between the GPUs similar to what the numactl command outputs for the NUMA domains of a system. The weight is a qualitative measure for the “distance” data must travel to reach one GPU from another one. While the values do not carry a special (physical) meaning, the higher the value the more hops are needed to reach the destination from the source GPU.
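Since the weights are only meaningful relative to each other, one way to use them is to rank peers by weight. Here is a minimal sketch in Python, assuming the matrix has been copied out of the rocm-smi output as plain text. The helper names `parse_matrix` and `peers_by_weight` are made up for illustration and are not part of rocm-smi.

```python
def parse_matrix(text):
    """Parse a square GPU matrix (header row, then one row per GPU)."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    matrix = {}
    for row in rows:
        name, cells = row[0], row[1:]
        matrix[name] = {col: int(cell) for col, cell in zip(header, cells)}
    return matrix


def peers_by_weight(matrix, gpu):
    """Return the other GPUs ordered from lowest to highest weight."""
    others = [(weight, peer) for peer, weight in matrix[gpu].items() if peer != gpu]
    return [peer for _, peer in sorted(others)]


# Weight matrix copied from the output in the question.
WEIGHTS = """
GPU0 GPU1 GPU2 GPU3
GPU0 0 52 52 52
GPU1 52 0 52 52
GPU2 52 52 0 52
GPU3 52 52 52 0
"""

weights = parse_matrix(WEIGHTS)
print(peers_by_weight(weights, "GPU0"))
```

In the output from the question every off-diagonal weight is 52, so no peer is "closer" than any other and the ranking just falls back to name order.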
The second block has a matrix named “Hops between two GPUs”, where 1 means the two GPUs are directly connected with XGMI, 2 means both GPUs are linked to the same CPU socket and GPU communications will go through the CPU, and 3 means both GPUs are linked to different CPU sockets so communications will go through both CPU sockets. In the documentation's example system this number is one for all GPUs, since they are all connected to each other through the Infinity Fabric links. In the output above it is 3 for every pair, with link type PCIE.
You could also dig into how KFD creates these weights, but it's essentially what the documentation states.
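To make the hop values easier to read, the mapping from the documentation can be written as a small lookup. This is only a sketch of the description quoted above, not something rocm-smi provides. `HOP_MEANING` and `describe_hops` are hypothetical names, and treating 0 as "same GPU" is my assumption based on the diagonal of the matrix.

```python
# Meanings for 1-3 come from the documentation excerpt quoted above.
# 0 = "same GPU" is an assumption based on the matrix diagonal.
HOP_MEANING = {
    0: "same GPU",
    1: "directly connected with XGMI",
    2: "same CPU socket, traffic goes through the CPU",
    3: "different CPU sockets, traffic goes through both sockets",
}


def describe_hops(hops):
    """Return a readable description for a hop count."""
    return HOP_MEANING.get(hops, "not covered by the documentation excerpt")


# First row of the hops matrix from the question.
gpu0_hops = {"GPU0": 0, "GPU1": 3, "GPU2": 3, "GPU3": 3}

for peer, hops in gpu0_hops.items():
    print(f"GPU0 -> {peer}: {hops} hops ({describe_hops(hops)})")
```

Any value outside 0 to 3 is reported as not covered rather than guessed at, since the documentation excerpt only describes those cases.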
Hope this helps!