RRZE-HPC/likwid

[BUG] perfctr crashes on a64fx

jdomke opened this issue · 3 comments

jdomke commented

Describe the bug
likwid-perfctr aborts with different "Aborted (core dumped)" errors depending on the runtime of the sleep command:

 $ likwid-perfctr -C 0 -g L2 sleep 1
--------------------------------------------------------------------------------
CPU name:
CPU type:       Fujitsu A64FX
CPU clock:      0.00 GHz
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
malloc(): unaligned tcache chunk detected
[1]+  Aborted                 (core dumped) likwid-perfctr -C 0 -g L2 sleep 1
Aborted (core dumped)
$ likwid-perfctr -C 0 -g L2 sleep 2
--------------------------------------------------------------------------------
CPU name:
CPU type:       Fujitsu A64FX
CPU clock:      0.00 GHz
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Group 1: L2
+------------------+---------+------------+
<<snip>>
|    L1<->L2 data volume [GBytes]    |     0.0020 |
+------------------------------------+------------+

double free or corruption (out)
Aborted (core dumped)

To Reproduce

  • LIKWID command and/or API usage
    ** see above
  • LIKWID version and download source (Github, FTP, package manager, ...)
    ** v5.3.0 tag compiled with GCCARMv8 and ACCESSMODE=direct
  • Operating system
    ** RHEL 8.8 (Ootpa)
  • Does your application use libraries like MPI, OpenMP or Pthreads?
  • In case of Nvidia GPUs, which CUDA version?
  • Are you using the MarkerAPI (CPU code instrumentation) or the NvMarkerAPI (Nvidia GPU code instrumentation)?

To Reproduce with a LIKWID command
Please supply the output of the command with -V 3 added to the command:

  • likwid-perfctr
$ likwid-perfctr -V 3 -C 0 -g L2 sleep 1
DEBUG - [hwloc_init_cpuInfo:367] HWLOC CpuInfo Family 8 Model 1 Stepping 0 Vendor 0x46 Part 0x1 isIntel 0 numHWThreads 24 activeHWThreads 24
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 0 Thread 0 Core 0 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 1 Thread 0 Core 1 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 2 Thread 0 Core 6 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 3 Thread 0 Core 7 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 4 Thread 0 Core 8 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 5 Thread 0 Core 10 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 6 Thread 0 Core 0 Die 0 Socket 1 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 7 Thread 0 Core 1 Die 0 Socket 1 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 8 Thread 0 Core 6 Die 0 Socket 1 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 9 Thread 0 Core 7 Die 0 Socket 1 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 10 Thread 0 Core 8 Die 0 Socket 1 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 11 Thread 0 Core 10 Die 0 Socket 1 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 12 Thread 0 Core 0 Die 0 Socket 2 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 13 Thread 0 Core 5 Die 0 Socket 2 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 14 Thread 0 Core 6 Die 0 Socket 2 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 15 Thread 0 Core 8 Die 0 Socket 2 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 16 Thread 0 Core 10 Die 0 Socket 2 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 17 Thread 0 Core 11 Die 0 Socket 2 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 18 Thread 0 Core 0 Die 0 Socket 3 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 19 Thread 0 Core 5 Die 0 Socket 3 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 20 Thread 0 Core 6 Die 0 Socket 3 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 21 Thread 0 Core 8 Die 0 Socket 3 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 22 Thread 0 Core 10 Die 0 Socket 3 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 23 Thread 0 Core 11 Die 0 Socket 3 inCpuSet 1
DEBUG - [affinity_init:547] Affinity: Socket domains 4
DEBUG - [affinity_init:549] Affinity: CPU die domains 4
DEBUG - [affinity_init:554] Affinity: CPU cores per LLC 12
DEBUG - [affinity_init:557] Affinity: Cache domains 0
DEBUG - [affinity_init:561] Affinity: NUMA domains 4
DEBUG - [affinity_init:562] Affinity: All domains 13
DEBUG - [affinity_addNodeDomain:370] Affinity domain N: 24 HW threads on 24 cores
DEBUG - [affinity_addSocketDomain:401] Affinity domain S0: 6 HW threads on 6 cores
DEBUG - [affinity_addSocketDomain:401] Affinity domain S1: 6 HW threads on 6 cores
DEBUG - [affinity_addSocketDomain:401] Affinity domain S2: 6 HW threads on 6 cores
DEBUG - [affinity_addSocketDomain:401] Affinity domain S3: 6 HW threads on 6 cores
DEBUG - [affinity_addDieDomain:438] Affinity domain D0: 6 HW threads on 6 cores
DEBUG - [affinity_addDieDomain:438] Affinity domain D1: 6 HW threads on 6 cores
DEBUG - [affinity_addDieDomain:438] Affinity domain D2: 6 HW threads on 6 cores
DEBUG - [affinity_addDieDomain:438] Affinity domain D3: 6 HW threads on 6 cores
DEBUG - [affinity_addCacheDomain:474] Affinity domain C0: 6 HW threads on 6 cores
DEBUG - [affinity_addCacheDomain:474] Affinity domain C0: 6 HW threads on 6 cores
DEBUG - [affinity_addCacheDomain:474] Affinity domain C0: 6 HW threads on 6 cores
DEBUG - [affinity_addCacheDomain:474] Affinity domain C0: 6 HW threads on 6 cores
DEBUG - [affinity_addMemoryDomain:504] Affinity domain M0: 6 HW threads on 6 cores
DEBUG - [affinity_addMemoryDomain:504] Affinity domain M1: 6 HW threads on 6 cores
DEBUG - [affinity_addMemoryDomain:504] Affinity domain M2: 6 HW threads on 6 cores
DEBUG - [affinity_addMemoryDomain:504] Affinity domain M3: 6 HW threads on 6 cores
DEBUG - [create_lookups:295] T 0 T2C 0 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:295] T 1 T2C 1 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:295] T 2 T2C 6 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:295] T 3 T2C 7 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:295] T 4 T2C 8 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:295] T 5 T2C 10 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:295] T 6 T2C 0 T2S 1 T2D 1 T2LLC 0 T2M 1
DEBUG - [create_lookups:295] T 7 T2C 1 T2S 1 T2D 1 T2LLC 0 T2M 1
DEBUG - [create_lookups:295] T 8 T2C 6 T2S 1 T2D 1 T2LLC 0 T2M 1
DEBUG - [create_lookups:295] T 9 T2C 7 T2S 1 T2D 1 T2LLC 0 T2M 1
DEBUG - [create_lookups:295] T 10 T2C 8 T2S 1 T2D 1 T2LLC 0 T2M 1
DEBUG - [create_lookups:295] T 11 T2C 10 T2S 1 T2D 1 T2LLC 0 T2M 1
DEBUG - [create_lookups:295] T 12 T2C 0 T2S 2 T2D 2 T2LLC 0 T2M 2
DEBUG - [create_lookups:295] T 13 T2C 5 T2S 2 T2D 2 T2LLC 0 T2M 2
DEBUG - [create_lookups:295] T 14 T2C 6 T2S 2 T2D 2 T2LLC 0 T2M 2
DEBUG - [create_lookups:295] T 15 T2C 8 T2S 2 T2D 2 T2LLC 0 T2M 2
DEBUG - [create_lookups:295] T 16 T2C 10 T2S 2 T2D 2 T2LLC 0 T2M 2
DEBUG - [create_lookups:295] T 17 T2C 11 T2S 2 T2D 2 T2LLC 0 T2M 2
DEBUG - [create_lookups:295] T 18 T2C 0 T2S 3 T2D 3 T2LLC 0 T2M 3
DEBUG - [create_lookups:295] T 19 T2C 5 T2S 3 T2D 3 T2LLC 0 T2M 3
DEBUG - [create_lookups:295] T 20 T2C 6 T2S 3 T2D 3 T2LLC 0 T2M 3
DEBUG - [create_lookups:295] T 21 T2C 8 T2S 3 T2D 3 T2LLC 0 T2M 3
DEBUG - [create_lookups:295] T 22 T2C 10 T2S 3 T2D 3 T2LLC 0 T2M 3
DEBUG - [create_lookups:295] T 23 T2C 11 T2S 3 T2D 3 T2LLC 0 T2M 3
--------------------------------------------------------------------------------
CPU name:	
CPU type:	Fujitsu A64FX
CPU clock:	0.00 GHz
CPU family:	8
CPU model:	1
CPU short:	arm64fx
CPU stepping:	0
CPU features:	FP ASIMD AES PMULL ASIMDRDM SVE 
CPU arch:	armv8
--------------------------------------------------------------------------------
[likwid-pin] Main PID -> hwthread 0 - OK
Executing: sleep 1
DEBUG - [perfmon_addEventSet:2326] Currently 1 groups of 2 active
DEBUG - [perfgroup_readGroup:873] Reading group L2 from /home/domke/CPUStudy_A64FX_2600Mhz/testCompile_llvm/dep/likwid/share/likwid/perfgroups/arm64fx/L2.txt
DEBUG - [perfmon_addEventSet:2385] Eventstring INST_RETIRED:PMC0,CPU_CYCLES:PMC1,L1D_CACHE_REFILL:PMC2,L1D_CACHE_WB:PMC3,L1I_CACHE_REFILL:PMC4
DEBUG - [perfmon_addEventSet:2512] Added event INST_RETIRED for counter PMC0 to group 0
DEBUG - [perfmon_addEventSet:2512] Added event CPU_CYCLES for counter PMC1 to group 0
DEBUG - [perfmon_addEventSet:2512] Added event L1D_CACHE_REFILL for counter PMC2 to group 0
DEBUG - [perfmon_addEventSet:2512] Added event L1D_CACHE_WB for counter PMC3 to group 0
DEBUG - [perfmon_addEventSet:2512] Added event L1I_CACHE_REFILL for counter PMC4 to group 0
DEBUG - [perfmon_setupCountersThread_perfevent:1084] SETUP_PMC [0] Register 0x0 , Flags: 0x8 
DEBUG - [perfmon_setupCountersThread_perfevent:1416] perf_event_open: cpu_id=0 pid=-1 flags=0
DEBUG - [perfmon_setupCountersThread_perfevent:1084] SETUP_PMC [0] Register 0x1 , Flags: 0x11 
DEBUG - [perfmon_setupCountersThread_perfevent:1416] perf_event_open: cpu_id=0 pid=-1 flags=0
DEBUG - [perfmon_setupCountersThread_perfevent:1084] SETUP_PMC [0] Register 0x2 , Flags: 0x3 
DEBUG - [perfmon_setupCountersThread_perfevent:1416] perf_event_open: cpu_id=0 pid=-1 flags=0
DEBUG - [perfmon_setupCountersThread_perfevent:1084] SETUP_PMC [0] Register 0x3 , Flags: 0x15 
DEBUG - [perfmon_setupCountersThread_perfevent:1416] perf_event_open: cpu_id=0 pid=-1 flags=0
DEBUG - [perfmon_setupCountersThread_perfevent:1084] SETUP_PMC [0] Register 0x4 , Flags: 0x1 
DEBUG - [perfmon_setupCountersThread_perfevent:1416] perf_event_open: cpu_id=0 pid=-1 flags=0
--------------------------------------------------------------------------------
DEBUG - [perfmon_startCountersThread_perfevent:1472] RESET_COUNTER [0] Register 0x0 , Flags: 0x0 
DEBUG - [perfmon_startCountersThread_perfevent:1485] START_COUNTER [0] Register 0x0 , Flags: 0x0 
DEBUG - [perfmon_startCountersThread_perfevent:1472] RESET_COUNTER [0] Register 0x0 , Flags: 0x0 
DEBUG - [perfmon_startCountersThread_perfevent:1485] START_COUNTER [0] Register 0x0 , Flags: 0x0 
DEBUG - [perfmon_startCountersThread_perfevent:1472] RESET_COUNTER [0] Register 0x0 , Flags: 0x0 
DEBUG - [perfmon_startCountersThread_perfevent:1485] START_COUNTER [0] Register 0x0 , Flags: 0x0 
DEBUG - [perfmon_startCountersThread_perfevent:1472] RESET_COUNTER [0] Register 0x0 , Flags: 0x0 
DEBUG - [perfmon_startCountersThread_perfevent:1485] START_COUNTER [0] Register 0x0 , Flags: 0x0 
DEBUG - [perfmon_startCountersThread_perfevent:1472] RESET_COUNTER [0] Register 0x0 , Flags: 0x0 
DEBUG - [perfmon_startCountersThread_perfevent:1485] START_COUNTER [0] Register 0x0 , Flags: 0x0 
DEBUG - [perfmon_readCountersThread_perfevent:1559] FREEZE_COUNTER [0] Register 0x5 , Flags: 0x0 
DEBUG - [perfmon_readCountersThread_perfevent:1563] READ_COUNTER [0] Register 0x5 , Flags: 0x7049 
DEBUG - [perfmon_readCountersThread_perfevent:1586] UNFREEZE_COUNTER [0] Register 0x5 , Flags: 0x0 
DEBUG - [perfmon_readCountersThread_perfevent:1559] FREEZE_COUNTER [0] Register 0x6 , Flags: 0x0 
DEBUG - [perfmon_readCountersThread_perfevent:1563] READ_COUNTER [0] Register 0x6 , Flags: 0x10653 
DEBUG - [perfmon_readCountersThread_perfevent:1586] UNFREEZE_COUNTER [0] Register 0x6 , Flags: 0x0 
DEBUG - [perfmon_readCountersThread_perfevent:1559] FREEZE_COUNTER [0] Register 0x7 , Flags: 0x0 
DEBUG - [perfmon_readCountersThread_perfevent:1563] READ_COUNTER [0] Register 0x7 , Flags: 0xE4 
DEBUG - [perfmon_readCountersThread_perfevent:1586] UNFREEZE_COUNTER [0] Register 0x7 , Flags: 0x0 
DEBUG - [perfmon_readCountersThread_perfevent:1559] FREEZE_COUNTER [0] Register 0x8 , Flags: 0x0 
DEBUG - [perfmon_readCountersThread_perfevent:1563] READ_COUNTER [0] Register 0x8 , Flags: 0x58 
DEBUG - [perfmon_readCountersThread_perfevent:1586] UNFREEZE_COUNTER [0] Register 0x8 , Flags: 0x0 
DEBUG - [perfmon_readCountersThread_perfevent:1559] FREEZE_COUNTER [0] Register 0x9 , Flags: 0x0 
DEBUG - [perfmon_readCountersThread_perfevent:1563] READ_COUNTER [0] Register 0x9 , Flags: 0x213 
DEBUG - [perfmon_readCountersThread_perfevent:1586] UNFREEZE_COUNTER [0] Register 0x9 , Flags: 0x0 
DEBUG - [perfmon_stopCountersThread_perfevent:1508] FREEZE_COUNTER [0] Register 0x5 , Flags: 0x0 
DEBUG - [perfmon_stopCountersThread_perfevent:1512] READ_COUNTER [0] Register 0x5 , Flags: 0x952C8 
DEBUG - [perfmon_stopCountersThread_perfevent:1537] RESET_COUNTER [0] Register 0x5 , Flags: 0x0 
DEBUG - [perfmon_stopCountersThread_perfevent:1508] FREEZE_COUNTER [0] Register 0x6 , Flags: 0x0 
DEBUG - [perfmon_stopCountersThread_perfevent:1512] READ_COUNTER [0] Register 0x6 , Flags: 0x1070C7 
DEBUG - [perfmon_stopCountersThread_perfevent:1537] RESET_COUNTER [0] Register 0x6 , Flags: 0x0 
DEBUG - [perfmon_stopCountersThread_perfevent:1508] FREEZE_COUNTER [0] Register 0x7 , Flags: 0x0 
DEBUG - [perfmon_stopCountersThread_perfevent:1512] READ_COUNTER [0] Register 0x7 , Flags: 0xD87 
DEBUG - [perfmon_stopCountersThread_perfevent:1537] RESET_COUNTER [0] Register 0x7 , Flags: 0x0 
DEBUG - [perfmon_stopCountersThread_perfevent:1508] FREEZE_COUNTER [0] Register 0x8 , Flags: 0x0 
DEBUG - [perfmon_stopCountersThread_perfevent:1512] READ_COUNTER [0] Register 0x8 , Flags: 0x4F8 
DEBUG - [perfmon_stopCountersThread_perfevent:1537] RESET_COUNTER [0] Register 0x8 , Flags: 0x0 
DEBUG - [perfmon_stopCountersThread_perfevent:1508] FREEZE_COUNTER [0] Register 0x9 , Flags: 0x0 
DEBUG - [perfmon_stopCountersThread_perfevent:1512] READ_COUNTER [0] Register 0x9 , Flags: 0xF23 
DEBUG - [perfmon_stopCountersThread_perfevent:1537] RESET_COUNTER [0] Register 0x9 , Flags: 0x0 
--------------------------------------------------------------------------------
Group 1: L2
+------------------+---------+------------+
|       Event      | Counter | HWThread 0 |
+------------------+---------+------------+
|   INST_RETIRED   |   PMC0  |     611016 |
|    CPU_CYCLES    |   PMC1  |    1077447 |
| L1D_CACHE_REFILL |   PMC2  |       3463 |
|   L1D_CACHE_WB   |   PMC3  |       1272 |
| L1I_CACHE_REFILL |   PMC4  |       3875 |
+------------------+---------+------------+

+------------------------------------+------------+
|               Metric               | HWThread 0 |
+------------------------------------+------------+
|         Runtime (RDTSC) [s]        |     1.0025 |
|                 CPI                |     1.7634 |
|  L1D<-L2 load bandwidth [MBytes/s] |     0.8843 |
|  L1D<-L2 load data volume [GBytes] |     0.0009 |
| L1D->L2 evict bandwidth [MBytes/s] |     0.3248 |
| L1D->L2 evict data volume [GBytes] |     0.0003 |
|  L1I<-L2 load bandwidth [MBytes/s] |     0.9895 |
|  L1I<-L2 load data volume [GBytes] |     0.0010 |
|    L1<->L2 bandwidth [MBytes/s]    |     2.1986 |
|    L1<->L2 data volume [GBytes]    |     0.0022 |
+------------------------------------+------------+

double free or corruption (out)

jdomke commented

note: using FCC results in similar crashes

jdomke commented

The issue results from having disabled cores in a 24-core version of the A64FX (the chip has all 48 cores, but only 24 are active). Unlike on Intel/AMD, the kernel does not remap the core IDs to be consecutive, so the reported IDs have gaps. Visible here:

DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 0 Thread 0 Core 0 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 1 Thread 0 Core 1 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 2 Thread 0 Core 6 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 3 Thread 0 Core 7 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 4 Thread 0 Core 8 Die 0 Socket 0 inCpuSet 1
DEBUG - [proc_init_nodeTopology:713] PROC Thread Pool PU 5 Thread 0 Core 10 Die 0 Socket 0 inCpuSet 1

for one of the CMGs of the chip.
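
To illustrate why the gaps matter, here is a minimal sketch (not LIKWID code; the function names are hypothetical) that checks whether raw core IDs can safely be used as indices into an array sized by the number of active cores. If some code sizes a table by the core count but indexes it with the raw ID, it would write out of bounds. Whether that is the exact path behind the aborts above is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the largest raw core ID in ids[0..n-1], or -1 if n == 0. */
int max_core_id(const int *ids, size_t n)
{
    int max = -1;
    for (size_t i = 0; i < n; i++) {
        assert(ids[i] >= 0);          /* core IDs are never negative */
        if (ids[i] > max) {
            max = ids[i];
        }
    }
    return max;
}

/* Returns 1 if every raw ID is a valid index into an array with n
 * entries (one entry per active core), 0 otherwise. */
int core_ids_fit_count(const int *ids, size_t n)
{
    return max_core_id(ids, n) < (int)n;
}
```

With the IDs from the log, `{0, 1, 6, 7, 8, 10}`, the largest ID is 10 while only 6 cores are active, so `core_ids_fit_count` returns 0.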

I was able to "fix" this part with

diff --git a/src/topology_proc.c b/src/topology_proc.c
index 398be11f..77fa871a 100644
--- a/src/topology_proc.c
+++ b/src/topology_proc.c
@@ -602,6 +602,7 @@ proc_init_nodeTopology(cpu_set_t cpuSet)
     int (*ownatoi)(const char*);
     ownatoi = &atoi;
     int last_socket = -1;
+    int last_coreid = -1;
     int num_sockets = 0;
     int num_cores_per_socket = 0;
     int num_threads_per_core = 0;
@@ -631,6 +632,7 @@ proc_init_nodeTopology(cpu_set_t cpuSet)
             {
                 num_sockets++;
                 last_socket = packageId;
+                last_coreid = -1;
             }
             fclose(fp);
         }
@@ -639,7 +641,7 @@ proc_init_nodeTopology(cpu_set_t cpuSet)
         if (NULL != (fp = fopen (bdata(file), "r")))
         {
             bstring src = bread ((bNread) fread, fp);
-            hwThreadPool[i].coreId = ownatoi(bdata(src));
+            hwThreadPool[i].coreId = (++last_coreid); //ownatoi(bdata(src));
             if (hwThreadPool[i].packageId == 0)
             {
                 num_cores_per_socket++;

but it only moves the error to other parts of the code. I think LIKWID has severe issues when cores, sockets, cache domains, etc. are not in ideal conditions.
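
A note on the counter used in the patch: it gives every hardware thread its own core ID. That matches this machine (the log shows 24 HW threads on 24 cores), but on a system with several threads per core it would separate siblings. A minimal sketch of a remapping that keeps siblings together could look like the following. The struct and function names are hypothetical stand-ins, not LIKWID's actual types, and this is not the change made in #603.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two topology fields used here. */
typedef struct {
    int packageId;  /* socket (CMG) ID as reported by the kernel */
    int coreId;     /* raw, possibly non-consecutive core ID */
} HwThreadSketch;

/* Fills dense[i] with a per-package core index 0,1,2,... assigned in
 * order of first appearance. Threads that share the same raw coreId on
 * the same package receive the same dense index. O(n^2), which is fine
 * for the small thread counts of a single node. */
void densify_core_ids(const HwThreadSketch *in, int *dense, size_t n)
{
    assert(n == 0 || (in != NULL && dense != NULL));
    for (size_t i = 0; i < n; i++) {
        int next = 0;    /* next unused dense index on this package */
        int found = -1;  /* dense index of an earlier sibling, if any */
        for (size_t j = 0; j < i; j++) {
            if (in[j].packageId != in[i].packageId) {
                continue;
            }
            if (in[j].coreId == in[i].coreId) {
                found = dense[j];
                break;
            }
            if (dense[j] >= next) {
                next = dense[j] + 1;
            }
        }
        dense[i] = (found >= 0) ? found : next;
    }
}
```

For the first CMG above, raw IDs 0, 1, 6, 7, 8, 10 map to 0 through 5, and the numbering restarts at 0 on the next socket.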

Should be fixed with #603.