Need help understanding and running likwid on a dual socket system; failure to detect / bench inter-socket link AKA UPI
I'm trying to use likwid to better understand and detect the "inter-socket link" AKA UPI on my system, which is a:
CPU name: Intel(R) Xeon(R) Platinum 8480+
CPU type: Intel SapphireRapids processor
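For context, likwid-topology prints the same CPU name / CPU type header and also shows whether both sockets and their NUMA domains are detected:
$ likwid-topology   # look for "Sockets: 2" in the topology summary and the NUMA domains at the end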
But so far nothing has worked as expected:
I built and installed likwid like this:
$ sudo apt update
$ sudo apt install --yes build-essential git libnuma-dev
$ git clone https://github.com/RRZE-HPC/likwid.git
$ cd likwid
$ make
$ sudo make install
When I ask it to "print available performance groups for current processor":
$ sudo likwid-perfctr -a
Group name Description
--------------------------------------------------------------------------------
MEM_DP Overview of DP arithmetic and main memory performance
BRANCH Branch prediction miss rate/ratio
DATA Load to store ratio
MEM_SP Overview of SP arithmetic and main memory performance
L2CACHE L2 cache miss rate/ratio
CLOCK Power and Energy consumption
ENERGY Power and Energy consumption
FLOPS_HP Half Precision MFLOP/s
L3 L3 cache bandwidth in MBytes/s
HBM_SP Overview of SP arithmetic and main memory performance
MEM Memory bandwidth in MBytes/s
TLB_DATA L1 data TLB miss rate/ratio
DIVIDE Divide unit information
FLOPS_AVX Packed AVX MFLOP/s
FLOPS_DP Double Precision MFLOP/s
L3CACHE L3 cache miss rate/ratio
TLB_INSTR L1 Instruction TLB miss rate/ratio
DDR_HBM Memory bandwidth in MBytes/s for DDR and HBM
MEM_HP Overview of HP arithmetic and main memory performance
L2 L2 cache bandwidth in MBytes/s
HBM_HP Overview of HP arithmetic and main memory performance
HBM_DP Overview of DP arithmetic and main memory performance
HBM HBM bandwidth in MBytes/s
TMA Top down cycle allocation
FLOPS_SP Single Precision MFLOP/s
SPECI2M Memory bandwidth in MBytes/s including SpecI2M
This is where things already get confusing, because I expected output like the following from [1], but UPI is missing in my list above:
$ likwid-perfctr -a
Group name Description
--------------------------------------------------------------------------------
MEM_DP Overview of arithmetic and main memory performance
UPI UPI data traffic
...
Why is UPI missing for me, and how do I get it to show up?
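As a sanity check, I assume one could also grep the raw event list for UPI, since the performance groups are built on top of the architecture's events and counters:
$ sudo likwid-perfctr -e | grep -i upi   # does the event list expose any UPI uncore events at all?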
Also, [1] shows commands to test the performance impact of UPI, e.g.:
# pin domain is the whole node but workgroup specifies 20 threads on first S0 socket and four
# data streams for the data are also on first socket S0
$ likwid-pin -c N:0-39 likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20-0:S0,1:S0,2:S0,3:S0
Cycles: 7995555762
Time: 3.198232e+00 sec
MByte/s: 100055.29
Cycles per update: 0.799556
Cycles per cacheline: 6.396445
# pin domain is still whole node but four data streams are now pinned on second socket S1
$ likwid-pin -c N:0-39 likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20-0:S1,1:S1,2:S1,3:S1
Cycles: 19063612050
Time: 7.625461e+00 sec
MByte/s: 41964.68
Cycles per update: 1.906361
Cycles per cacheline: 15.250890
For this example, [1] says: "In the above example, we can see that the bandwidth is dropped from ~100 GB/s to ~41.9 GB/s. This is almost a 2.5x performance difference."
However, when I run the above commands on my dual-socket system, there curiously appears to be little difference in MByte/s:
$ likwid-pin -c N:0-39 likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20-0:S0,1:S0,2:S0,3:S0 2>&1 | egrep "(Cycles:|Time:|MByte/s:|Cycles per update:|Cycles per cacheline:)"
Cycles: 3371184384
Time: 1.685579e+00 sec
MByte/s: 189845.71
Cycles per update: 0.337118
Cycles per cacheline: 2.696948
$ likwid-pin -c N:0-39 likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20-0:S1,1:S1,2:S1,3:S1 2>&1 | egrep "(Cycles:|Time:|MByte/s:|Cycles per update:|Cycles per cacheline:)"
Cycles: 3350410040
Time: 1.675192e+00 sec
MByte/s: 191022.87
Cycles per update: 0.335041
Cycles per cacheline: 2.680328
How can there be so little difference, given that the UPI link should be working hard to move data between the two sockets? Or how should I modify these commands to make them work as expected?
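For reference, my reading of the -w workgroup string used in these commands (based on the likwid-bench help) is:
# -w S0:20GB:20-0:S1,1:S1,2:S1,3:S1
#   S0:20GB:20            run 20 benchmark threads in affinity domain S0 with a 20 GB working set
#                         (an optional :<chunk>:<stride> suffix selects which hwthreads of the domain are used)
#   -0:S1,1:S1,2:S1,3:S1  place data streams 0..3 of the triad kernel in affinity domain S1
$ likwid-bench -h   # prints the full workgroup syntax
$ likwid-bench -p   # lists the available affinity domains (N, S0, S1, M0, ...)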
Thanks!
[1] https://pramodkumbhar.com/2020/03/architectural-optimisations-using-likwid-profiler/
LIKWID does not offer the UPI group for Intel SapphireRapids. The system used in the tutorial page is an Intel Haswell EP system. Counters, events, and therefore performance groups are architecture-specific. So it's not hidden just for you ;)
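You can see which groups ship with your installation by listing the per-architecture group directories; assuming the default /usr/local prefix, something like:
$ ls /usr/local/share/likwid/perfgroups/        # one subdirectory per architecture short name
$ ls /usr/local/share/likwid/perfgroups/SPR/    # groups available for SapphireRapids
$ ls ~/.likwid/groups/SPR/ 2>/dev/null          # user-defined groups, if you added any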
I checked on two of our SPR nodes:
- In the production system, Linux's NUMA balancing feature is active (/proc/sys/kernel/numa_balancing) and I get similar performance independent of the data location. The Linux kernel detects that all accesses to the data come from the remote socket, so it moves the data over (see the /proc/vmstat sketch after the example below).
- For the test system, I can enable/disable the NUMA balancing feature. When NUMA balancing is off, you get the expected results:
$ likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20:1:2-0:S1,1:S1,2:S1,3:S1 2>&1 | egrep "(Cycles:|Time:|MByte/s:|Cycles per update:|Cycles per cacheline:)"
Cycles: 11011722738
Time: 5.505768e+00 sec
MByte/s: 116241.74
Cycles per update: 0.550586
Cycles per cacheline: 4.404689
$ likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20:1:2-0:S0,1:S0,2:S0,3:S0 2>&1 | egrep "(Cycles:|Time:|MByte/s:|Cycles per update:|Cycles per cacheline:)"
Cycles: 5490815030
Time: 2.745366e+00 sec
MByte/s: 233120.10
Cycles per update: 0.274541
Cycles per cacheline: 2.196326
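If you want to verify that the kernel actually migrates pages during such a run, one (non-LIKWID) way is to compare the automatic NUMA balancing counters in /proc/vmstat before and after the benchmark:
# numa_hint_faults and numa_pages_migrated are cumulative counters maintained by the
# NUMA balancing code; a large increase across the run means pages were moved between nodes
$ grep -E "numa_(hint_faults|pages_migrated)" /proc/vmstat
$ likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20:1:2-0:S1,1:S1,2:S1,3:S1
$ grep -E "numa_(hint_faults|pages_migrated)" /proc/vmstat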
I committed a UPI group for SPR to the master branch. You can reinstall LIKWID from the master branch, or just download the file and put it into ~/.likwid/groups/SPR.
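For example (the file name and raw URL below are assumptions on my side; check the master branch for the actual location of the group file):
# hypothetical path/URL -- adjust to the real UPI group file on the master branch
$ mkdir -p ~/.likwid/groups/SPR
$ wget -O ~/.likwid/groups/SPR/UPI.txt https://raw.githubusercontent.com/RRZE-HPC/likwid/master/groups/SPR/UPI.txt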
$ likwid-perfctr -c S0:0@S1:0 -g UPI likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20:1:2-0:S1,1:S1,2:S1,3:S1
MByte/s: 116132.91
+----------------------------------------+-------------+------------+------------+------------+
| Metric | Sum | Min | Max | Avg |
+----------------------------------------+-------------+------------+------------+------------+
| Received data bandwidth [MByte/s] STAT | 25805.0898 | 12902.5449 | 12902.5449 | 12902.5449 |
| Received data volume [GByte] STAT | 348.5890 | 174.2945 | 174.2945 | 174.2945 |
| Sent data bandwidth [MByte/s] STAT | 77215.5450 | 38607.7725 | 38607.7725 | 38607.7725 |
| Sent data volume [GByte] STAT | 1043.0690 | 521.5345 | 521.5345 | 521.5345 |
| Total data bandwidth [MByte/s] STAT | 103020.6348 | 51510.3174 | 51510.3174 | 51510.3174 |
| Total data volume [GByte] STAT | 1391.6582 | 695.8291 | 695.8291 | 695.8291 |
+----------------------------------------+-------------+------------+------------+------------+
$ likwid-perfctr -c S0:0@S1:0 -g UPI likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20:1:2-0:S0,1:S0,2:S0,3:S0
MByte/s: 230375.63
+----------------------------------------+---------+---------+---------+---------+
| Metric | Sum | Min | Max | Avg |
+----------------------------------------+---------+---------+---------+---------+
| Received data bandwidth [MByte/s] STAT | 55.1544 | 27.5772 | 27.5772 | 27.5772 |
| Received data volume [GByte] STAT | 0.5542 | 0.2771 | 0.2771 | 0.2771 |
| Sent data bandwidth [MByte/s] STAT | 15.7742 | 7.8871 | 7.8871 | 7.8871 |
| Sent data volume [GByte] STAT | 0.1584 | 0.0792 | 0.0792 | 0.0792 |
| Total data bandwidth [MByte/s] STAT | 70.9288 | 35.4644 | 35.4644 | 35.4644 |
| Total data volume [GByte] STAT | 0.7126 | 0.3563 | 0.3563 | 0.3563 |
+----------------------------------------+---------+---------+---------+---------+
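A note on the -c S0:0@S1:0 expression used above: the UPI counters are uncore (socket-scope) counters, so one measuring hwthread per socket is enough. The same group works around any application, e.g. (./your_app is just a placeholder):
$ likwid-perfctr -c S0:0@S1:0 -g UPI ./your_app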
This is without the MarkerAPI. If you want to use the MarkerAPI and NUMA balancing is active, you will probably not see the UPI traffic, because the Linux kernel would already move the data during the warmup phase, so there is no UPI traffic left while the benchmark runs.
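For completeness, a MarkerAPI run would look roughly like this (the application must be instrumented with LIKWID_MARKER_START/STOP regions; ./your_instrumented_app is just a placeholder):
# -m restricts counting to the instrumented regions, so data migrated during the warmup
# phase no longer produces UPI traffic inside the measured region
# -C pins the application and measures; adjust the hwthread list to your run
$ likwid-perfctr -m -C S0:0@S1:0 -g UPI ./your_instrumented_app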
I hope this clarifies it for you. If you have further questions, feel free to ask.
@TomTheBear thanks so much for the comment and explanation which all makes sense to me! :-)
However, when trying out the commands, I'm still getting unexpected results.
Here I disable NUMA balancing and re-run the command lines:
$ cat /proc/sys/kernel/numa_balancing
1
$ sudo sysctl -w kernel.numa_balancing=0; cat /proc/sys/kernel/numa_balancing # disable NUMA balancing
kernel.numa_balancing = 0
0
$ likwid-pin -c N:0-39 likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20-0:S0,1:S0,2:S0,3:S0 2>&1 | egrep "(Cycles:|Time:|MByte/s:|Cycles per update:|Cycles per cacheline:)"
Cycles: 6696678972
Time: 3.348323e+00 sec
MByte/s: 191140.47
Cycles per update: 0.334834
Cycles per cacheline: 2.678672
$ likwid-pin -c N:0-39 likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20-0:S1,1:S1,2:S1,3:S1 2>&1 | egrep "(Cycles:|Time:|MByte/s:|Cycles per update:|Cycles per cacheline:)"
Cycles: 3510586692
Time: 1.755282e+00 sec
MByte/s: 182306.94
Cycles per update: 0.351059
Cycles per cacheline: 2.808469
$ sudo sysctl -w kernel.numa_balancing=1; cat /proc/sys/kernel/numa_balancing # enable NUMA balancing
kernel.numa_balancing = 1
1
But the MByte/s is still unexpectedly similar. Why?
Then I noticed that your commands above do not use likwid-pin, so I tried again without it:
$ sudo sysctl -w kernel.numa_balancing=0; cat /proc/sys/kernel/numa_balancing # disable NUMA balancing
kernel.numa_balancing = 0
0
$ likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20-0:S0,1:S0,2:S0,3:S0 2>&1 | egrep "(Cycles:|Time:|MByte/s:|Cycles per update:|Cycles per cacheline:)"
Cycles: 8593043156
Time: 4.296490e+00 sec
MByte/s: 148958.80
Cycles per update: 0.429652
Cycles per cacheline: 3.437217
$ likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20-0:S1,1:S1,2:S1,3:S1 2>&1 | egrep "(Cycles:|Time:|MByte/s:|Cycles per update:|Cycles per cacheline:)"
Cycles: 6035619464
Time: 3.017794e+00 sec
MByte/s: 106037.72
Cycles per update: 0.603562
Cycles per cacheline: 4.828496
$ sudo sysctl -w kernel.numa_balancing=1; cat /proc/sys/kernel/numa_balancing # enable NUMA balancing
kernel.numa_balancing = 1
1
$ likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20-0:S0,1:S0,2:S0,3:S0 2>&1 | egrep "(Cycles:|Time:|MByte/s:|Cycles per update:|Cycles per cacheline:)"
Cycles: 4378500026
Time: 2.189245e+00 sec
MByte/s: 146169.12
Cycles per update: 0.437850
Cycles per cacheline: 3.502800
$ likwid-bench -t triad_mem_avx512_fma -w S0:20GB:20-0:S1,1:S1,2:S1,3:S1 2>&1 | egrep "(Cycles:|Time:|MByte/s:|Cycles per update:|Cycles per cacheline:)"
Cycles: 5847790280
Time: 2.923883e+00 sec
MByte/s: 109443.50
Cycles per update: 0.584779
Cycles per cacheline: 4.678232
Now I do see a difference in MByte/s, but it appears to have nothing to do with the NUMA balancing setting.
Questions:
- Why does likwid-pin appear -- in the above examples -- to unexpectedly always cause UPI usage / activity, i.e. force MByte/s to be slower?
- Why does enabling NUMA balancing work for you, but not for me?
Thanks for your help so far! :-)