RRZE-HPC/likwid

Unable to Access Uncore Counters on SPR

Mockingjay1316 opened this issue · 6 comments

Hi,

I am trying to run LIKWID 5.3 on an SPR (Sapphire Rapids) machine with a fairly recent Linux kernel (5.15 according to uname -a). I have followed the steps in the build instructions, including the boot option and capabilities. Secure boot is off.

boot option:

$ sudo dmesg | grep allow_writes
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.15.0-119-generic root=[###] ro msr.allow_writes=on
[    1.422673] Kernel command line: BOOT_IMAGE=/vmlinuz-5.15.0-119-generic root=[###] ro msr.allow_writes=on
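
The dmesg output shows what was passed at boot. As a cross-check, the command line of the running kernel can also be read from /proc/cmdline. The following is a minimal sketch, not part of LIKWID; the function names are made up for illustration, and it only parses the text without verifying that the msr module honours the option.

```python
from pathlib import Path


def kernel_params(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to ''."""
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def msr_writes_enabled(cmdline: str) -> bool:
    """True if the command line contains msr.allow_writes=on."""
    return kernel_params(cmdline).get("msr.allow_writes") == "on"


def running_cmdline() -> str:
    """Command line of the currently running kernel (Linux only)."""
    return Path("/proc/cmdline").read_text()
```

On the machine in question, msr_writes_enabled(running_cmdline()) should return True if the option from the boot log is active.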

capabilities (I have set them all):

$ getcap -r .
./likwid-perfscope cap_sys_rawio,cap_sys_admin=ep
./likwid-topology cap_sys_rawio,cap_sys_admin=ep
./likwid-setFrequencies cap_sys_rawio,cap_sys_admin=ep
./likwid-pin cap_sys_rawio,cap_sys_admin=ep
./likwid-features cap_sys_rawio,cap_sys_admin=ep
./likwid-lua cap_sys_rawio,cap_sys_admin=ep
./likwid-memsweeper cap_sys_rawio,cap_sys_admin=ep
./likwid-bench cap_sys_rawio,cap_sys_admin=ep
./likwid-mpirun cap_sys_rawio,cap_sys_admin=ep
./likwid-powermeter cap_sys_rawio,cap_sys_admin=ep
./likwid-genTopoCfg cap_sys_rawio,cap_sys_admin=ep

msrs:

$ ll /dev/cpu/*/msr
crw----rw- 1 root root 202,  0 Sep  2 17:21 /dev/cpu/0/msr
crw----rw- 1 root root 202, 10 Sep  2 17:21 /dev/cpu/10/msr
crw----rw- 1 root root 202, 11 Sep  2 17:21 /dev/cpu/11/msr
crw----rw- 1 root root 202, 12 Sep  2 17:21 /dev/cpu/12/msr
crw----rw- 1 root root 202, 13 Sep  2 17:21 /dev/cpu/13/msr
crw----rw- 1 root root 202, 14 Sep  2 17:21 /dev/cpu/14/msr
crw----rw- 1 root root 202, 15 Sep  2 17:21 /dev/cpu/15/msr
crw----rw- 1 root root 202, 16 Sep  2 17:21 /dev/cpu/16/msr
crw----rw- 1 root root 202, 17 Sep  2 17:21 /dev/cpu/17/msr
crw----rw- 1 root root 202, 18 Sep  2 17:21 /dev/cpu/18/msr
crw----rw- 1 root root 202, 19 Sep  2 17:21 /dev/cpu/19/msr
crw----rw- 1 root root 202,  1 Sep  2 17:21 /dev/cpu/1/msr
crw----rw- 1 root root 202, 20 Sep  2 17:21 /dev/cpu/20/msr
crw----rw- 1 root root 202, 21 Sep  2 17:21 /dev/cpu/21/msr
crw----rw- 1 root root 202, 22 Sep  2 17:21 /dev/cpu/22/msr
crw----rw- 1 root root 202, 23 Sep  2 17:21 /dev/cpu/23/msr
crw----rw- 1 root root 202, 24 Sep  2 17:21 /dev/cpu/24/msr
crw----rw- 1 root root 202, 25 Sep  2 17:21 /dev/cpu/25/msr
crw----rw- 1 root root 202, 26 Sep  2 17:21 /dev/cpu/26/msr
crw----rw- 1 root root 202, 27 Sep  2 17:21 /dev/cpu/27/msr
crw----rw- 1 root root 202, 28 Sep  2 17:21 /dev/cpu/28/msr
crw----rw- 1 root root 202, 29 Sep  2 17:21 /dev/cpu/29/msr
crw----rw- 1 root root 202,  2 Sep  2 17:21 /dev/cpu/2/msr
crw----rw- 1 root root 202, 30 Sep  2 17:21 /dev/cpu/30/msr
crw----rw- 1 root root 202, 31 Sep  2 17:21 /dev/cpu/31/msr
crw----rw- 1 root root 202,  3 Sep  2 17:21 /dev/cpu/3/msr
crw----rw- 1 root root 202,  4 Sep  2 17:21 /dev/cpu/4/msr
crw----rw- 1 root root 202,  5 Sep  2 17:21 /dev/cpu/5/msr
crw----rw- 1 root root 202,  6 Sep  2 17:21 /dev/cpu/6/msr
crw----rw- 1 root root 202,  7 Sep  2 17:21 /dev/cpu/7/msr
crw----rw- 1 root root 202,  8 Sep  2 17:21 /dev/cpu/8/msr
crw----rw- 1 root root 202,  9 Sep  2 17:21 /dev/cpu/9/msr

I have tried both accessdaemon mode and direct mode, but likwid-perfctr -e lists only core counters and no uncore counters. As a test I also ran likwid-perfctr -C 0 -g MEM ls; the result shows no memory information and ends with an error message (in direct mode):

Group 1: MEM
+-----------------------+----------+------------+
|         Event         |  Counter | HWThread 0 |
+-----------------------+----------+------------+
|   INSTR_RETIRED_ANY   |   FIXC0  |     533551 |
| CPU_CLK_UNHALTED_CORE |   FIXC1  |     870762 |
|  CPU_CLK_UNHALTED_REF |   FIXC2  |     700756 |
|     TOPDOWN_SLOTS     |   FIXC3  |    5224572 |
|      CAS_COUNT_RD     |  MBOX0C0 |      -     |
|      CAS_COUNT_WR     |  MBOX0C1 |      -     |
|      CAS_COUNT_RD     |  MBOX1C0 |      -     |
|      CAS_COUNT_WR     |  MBOX1C1 |      -     |
|      CAS_COUNT_RD     |  MBOX2C0 |      -     |
|      CAS_COUNT_WR     |  MBOX2C1 |      -     |
|      CAS_COUNT_RD     |  MBOX3C0 |      -     |
|      CAS_COUNT_WR     |  MBOX3C1 |      -     |
|      CAS_COUNT_RD     |  MBOX4C0 |      -     |
|      CAS_COUNT_WR     |  MBOX4C1 |      -     |
|      CAS_COUNT_RD     |  MBOX5C0 |      -     |
|      CAS_COUNT_WR     |  MBOX5C1 |      -     |
|      CAS_COUNT_RD     |  MBOX6C0 |      -     |
|      CAS_COUNT_WR     |  MBOX6C1 |      -     |
|      CAS_COUNT_RD     |  MBOX7C0 |      -     |
|      CAS_COUNT_WR     |  MBOX7C1 |      -     |
|      CAS_COUNT_RD     |  MBOX8C0 |      -     |
|      CAS_COUNT_WR     |  MBOX8C1 |      -     |
|      CAS_COUNT_RD     |  MBOX9C0 |      -     |
|      CAS_COUNT_WR     |  MBOX9C1 |      -     |
|      CAS_COUNT_RD     | MBOX10C0 |      -     |
|      CAS_COUNT_WR     | MBOX10C1 |      -     |
|      CAS_COUNT_RD     | MBOX11C0 |      -     |
|      CAS_COUNT_WR     | MBOX11C1 |      -     |
|      CAS_COUNT_RD     | MBOX12C0 |      -     |
|      CAS_COUNT_WR     | MBOX12C1 |      -     |
|      CAS_COUNT_RD     | MBOX13C0 |      -     |
|      CAS_COUNT_WR     | MBOX13C1 |      -     |
|      CAS_COUNT_RD     | MBOX14C0 |      -     |
|      CAS_COUNT_WR     | MBOX14C1 |      -     |
|      CAS_COUNT_RD     | MBOX15C0 |      -     |
|      CAS_COUNT_WR     | MBOX15C1 |      -     |
+-----------------------+----------+------------+
...
ERROR - [./src/includes/perfmon_sapphirerapids.h:perfmon_finalizeCountersThread_sapphirerapids:2222] No such file or directory.
MSR read operation failed

likwid-perfctr -i gives out information like this:

$ likwid-perfctr -i
--------------------------------------------------------------------------------
CPU name:       Intel(R) Xeon(R) Gold 5415+
CPU type:       Intel SapphireRapids processor
CPU clock:      2.90 GHz
CPU family:     6
CPU model:      143
CPU short:      SPR
CPU stepping:   8
CPU features:   FP ACPI MMX SSE SSE2 HTT TM RDTSCP MONITOR VMX EIST TM2 SSSE FMA SSE4.1 SSE4.2 AES AVX RDRAND AVX2 AVX512 RDSEED SSE3
CPU arch:       x86_64
--------------------------------------------------------------------------------
PERFMON version:                        5
PERFMON number of counters:             8
PERFMON width of counters:              48
PERFMON number of fixed counters:       4
--------------------------------------------------------------------------------

Can you shed some light on how to proceed? Thank you!

I haven't tested LIKWID with capabilities for a long time, especially not on SPR. I remember that they were hard to configure correctly. Quick guess: with ACCESSMODE=direct, remove the capabilities from all binaries except likwid-lua. Warning: this is a security issue, as anyone using this interpreter can use the capabilities! If you use ACCESSMODE=accessdaemon, the daemons in sbin require the capabilities and there should be no security problem. Where did you get the info on how to configure the capabilities correctly? I will test it myself and update the docs.

In order to set the capabilities you probably had root privileges. Try an installation with ACCESSMODE=accessdaemon to some user-local prefix, but run the install step with sudo (the user-local prefix just makes it easy to delete the whole installation again). Adjust PATH and LD_LIBRARY_PATH and check again whether it still fails.

Running in debugging mode with -V 3 should give you a better understanding of where it fails.

Thank you for the reply! I just rebuilt with ACCESSMODE=accessdaemon and the problem persists. With -V 3 I got more error messages (thousands of them) like:

...
DEBUG - [access_client_check:562] Device check for dev 199 on socket 0 with accessDaemon failed
...

As a result, MBOX cannot be accessed. The other errors seem to be of a similar type.
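
With thousands of such messages, a per-device summary is easier to read than the raw -V 3 output. Here is a small sketch that counts failed device checks, assuming every relevant line has exactly the shape quoted above; the helper name is hypothetical, and the device numbers are LIKWID-internal IDs that this code does not try to decode.

```python
import re
from collections import Counter

# Matches the debug line quoted above; all other lines are ignored.
FAILED_CHECK = re.compile(
    r"Device check for dev (\d+) on socket (\d+) with accessDaemon failed"
)


def failed_devices(log_text: str) -> Counter:
    """Count failed device checks per (socket, device) pair."""
    counts = Counter()
    for match in FAILED_CHECK.finditer(log_text):
        dev, socket = int(match.group(1)), int(match.group(2))
        counts[(socket, dev)] += 1
    return counts
```

Feeding it the captured debug output (for example from a file the output was redirected to) gives a mapping such as (socket, device) to the number of failures, which shows at a glance whether all uncore devices fail or only some.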

I also examined the syslog:

Sep  4 16:39:06 accessD: AccessDaemon runs with UID 1172040, eUID 0
Sep  4 16:39:08 accessD: Failed to read data from register 0x63a on core 0
Sep  4 16:39:08 accessD: Input/output error
Sep  4 16:39:08 accessD: Failed to read data from register 0x641 on core 0
Sep  4 16:39:08 accessD: Input/output error
Sep  4 16:39:08 accessD: Failed to write data to register 0x8000 on core 0
Sep  4 16:39:08 accessD: Input/output error
Sep  4 16:39:08 accessD: Failed to write data to register 0x8000 on core 0
Sep  4 16:39:08 accessD: Input/output error
Sep  4 16:39:08 accessD: Failed to write data to register 0x8000 on core 0
Sep  4 16:39:08 accessD: Input/output error
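
To see whether these failures come from LIKWID or from the kernel's msr driver itself, the same registers can be read directly through /dev/cpu/N/msr, where the file offset selects the register and each read returns 8 bytes. Below is a minimal sketch for reads only (it deliberately does not write to 0x8000). The function names and the device parameter are made up for illustration, and it must run with enough privileges to open the device.

```python
import os
import struct


def read_msr(cpu: int, register: int, device: str = "/dev/cpu/{cpu}/msr") -> int:
    """Read one 64-bit MSR via the msr device; the file offset selects the register."""
    fd = os.open(device.format(cpu=cpu), os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, register)
    finally:
        os.close(fd)
    if len(raw) != 8:
        raise OSError(f"short read from register {register:#x}")
    return struct.unpack("<Q", raw)[0]


def probe(cpu: int, registers, device: str = "/dev/cpu/{cpu}/msr") -> dict:
    """Map each register to its value, or to the OSError raised while reading it."""
    results = {}
    for reg in registers:
        try:
            results[reg] = read_msr(cpu, reg, device)
        except OSError as err:
            results[reg] = err
    return results
```

Calling probe(0, [0x63a, 0x641]) as root reproduces the two reads from the syslog. If they also fail here with an Input/output error, the problem is probably not in LIKWID: an EIO from the msr device usually means the CPU rejected access to that register, whereas a permission problem would typically show up as a permission error when opening or reading the device.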

Capabilities cap_sys_rawio,cap_sys_admin=ep are given to likwid-accessD.
Am I missing something here?

Have you tried CAP_DAC_OVERRIDE? Based on the answer at this SO entry, the two capabilities you specified are not enough to fully read /dev/mem.
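
Before and after changing the capabilities, it can help to confirm what the running daemon actually holds. One way is to decode the CapEff line of /proc/<pid>/status. The sketch below assumes the bit numbers from linux/capability.h (CAP_DAC_OVERRIDE = 1, CAP_SYS_RAWIO = 17, CAP_SYS_ADMIN = 21); the function name is hypothetical.

```python
# Bit positions as defined in linux/capability.h.
CAP_BITS = {"cap_dac_override": 1, "cap_sys_rawio": 17, "cap_sys_admin": 21}


def effective_caps(status_text: str) -> dict:
    """Report which of the capabilities in CAP_BITS are set in the CapEff mask."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split()[1], 16)
            return {name: bool((mask >> bit) & 1) for name, bit in CAP_BITS.items()}
    raise ValueError("no CapEff line found")
```

Passing the contents of /proc/<pid>/status for the likwid-accessD process shows whether cap_dac_override is effective in addition to the two capabilities already set.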