Unable to Access Uncore Counters on SPR
Mockingjay1316 opened this issue · 6 comments
Hi,
I am trying to run likwid 5.3 on an SPR machine with a fairly new version of Linux (5.15 according to uname -a). I have followed the steps in the build instructions, including the boot option and capabilities. Secure boot is off.
boot option:
$ sudo dmesg | grep allow_writes
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.15.0-119-generic root=[###] ro msr.allow_writes=on
[ 1.422673] Kernel command line: BOOT_IMAGE=/vmlinuz-5.15.0-119-generic root=[###] ro msr.allow_writes=on
capabilities (I have set them all):
$ getcap -r .
./likwid-perfscope cap_sys_rawio,cap_sys_admin=ep
./likwid-topology cap_sys_rawio,cap_sys_admin=ep
./likwid-setFrequencies cap_sys_rawio,cap_sys_admin=ep
./likwid-pin cap_sys_rawio,cap_sys_admin=ep
./likwid-features cap_sys_rawio,cap_sys_admin=ep
./likwid-lua cap_sys_rawio,cap_sys_admin=ep
./likwid-memsweeper cap_sys_rawio,cap_sys_admin=ep
./likwid-bench cap_sys_rawio,cap_sys_admin=ep
./likwid-mpirun cap_sys_rawio,cap_sys_admin=ep
./likwid-powermeter cap_sys_rawio,cap_sys_admin=ep
./likwid-genTopoCfg cap_sys_rawio,cap_sys_admin=ep
msrs:
$ ll /dev/cpu/*/msr
crw----rw- 1 root root 202, 0 Sep 2 17:21 /dev/cpu/0/msr
crw----rw- 1 root root 202, 10 Sep 2 17:21 /dev/cpu/10/msr
crw----rw- 1 root root 202, 11 Sep 2 17:21 /dev/cpu/11/msr
crw----rw- 1 root root 202, 12 Sep 2 17:21 /dev/cpu/12/msr
crw----rw- 1 root root 202, 13 Sep 2 17:21 /dev/cpu/13/msr
crw----rw- 1 root root 202, 14 Sep 2 17:21 /dev/cpu/14/msr
crw----rw- 1 root root 202, 15 Sep 2 17:21 /dev/cpu/15/msr
crw----rw- 1 root root 202, 16 Sep 2 17:21 /dev/cpu/16/msr
crw----rw- 1 root root 202, 17 Sep 2 17:21 /dev/cpu/17/msr
crw----rw- 1 root root 202, 18 Sep 2 17:21 /dev/cpu/18/msr
crw----rw- 1 root root 202, 19 Sep 2 17:21 /dev/cpu/19/msr
crw----rw- 1 root root 202, 1 Sep 2 17:21 /dev/cpu/1/msr
crw----rw- 1 root root 202, 20 Sep 2 17:21 /dev/cpu/20/msr
crw----rw- 1 root root 202, 21 Sep 2 17:21 /dev/cpu/21/msr
crw----rw- 1 root root 202, 22 Sep 2 17:21 /dev/cpu/22/msr
crw----rw- 1 root root 202, 23 Sep 2 17:21 /dev/cpu/23/msr
crw----rw- 1 root root 202, 24 Sep 2 17:21 /dev/cpu/24/msr
crw----rw- 1 root root 202, 25 Sep 2 17:21 /dev/cpu/25/msr
crw----rw- 1 root root 202, 26 Sep 2 17:21 /dev/cpu/26/msr
crw----rw- 1 root root 202, 27 Sep 2 17:21 /dev/cpu/27/msr
crw----rw- 1 root root 202, 28 Sep 2 17:21 /dev/cpu/28/msr
crw----rw- 1 root root 202, 29 Sep 2 17:21 /dev/cpu/29/msr
crw----rw- 1 root root 202, 2 Sep 2 17:21 /dev/cpu/2/msr
crw----rw- 1 root root 202, 30 Sep 2 17:21 /dev/cpu/30/msr
crw----rw- 1 root root 202, 31 Sep 2 17:21 /dev/cpu/31/msr
crw----rw- 1 root root 202, 3 Sep 2 17:21 /dev/cpu/3/msr
crw----rw- 1 root root 202, 4 Sep 2 17:21 /dev/cpu/4/msr
crw----rw- 1 root root 202, 5 Sep 2 17:21 /dev/cpu/5/msr
crw----rw- 1 root root 202, 6 Sep 2 17:21 /dev/cpu/6/msr
crw----rw- 1 root root 202, 7 Sep 2 17:21 /dev/cpu/7/msr
crw----rw- 1 root root 202, 8 Sep 2 17:21 /dev/cpu/8/msr
crw----rw- 1 root root 202, 9 Sep 2 17:21 /dev/cpu/9/msr
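To check the device files independently of LIKWID, a single MSR can be read directly. The following is a minimal Python sketch, not part of LIKWID: read_msr and its dev_template parameter are hypothetical names chosen for illustration. It assumes the msr kernel module is loaded and uses register 0x10 (the time stamp counter) only as a harmless probe. Opening the msr device also requires CAP_SYS_RAWIO in addition to the file permissions, so it most likely has to run as root.

```python
import os
import struct


def read_msr(register, cpu=0, dev_template="/dev/cpu/{cpu}/msr"):
    """Return the 64-bit value of one MSR, read through the msr device file."""
    path = dev_template.format(cpu=cpu)
    fd = os.open(path, os.O_RDONLY)
    try:
        # The msr driver interprets the file offset as the register address.
        raw = os.pread(fd, 8, register)
    finally:
        os.close(fd)
    if len(raw) != 8:
        raise OSError(f"short read of {len(raw)} bytes from {path}")
    return struct.unpack("<Q", raw)[0]


if __name__ == "__main__":
    try:
        print(f"MSR 0x10 on CPU 0: {read_msr(0x10):#x}")
    except OSError as err:
        # A permission error points at file modes or capabilities,
        # an I/O error at the register itself.
        print(f"MSR read failed: {err}")
```

If this read works as root but LIKWID still fails, the problem is more likely in the capability setup or in the specific registers LIKWID touches than in the device files.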
I have tried both accessdaemon mode and direct mode. However, likwid-perfctr -e lists only core counters and no uncore counters. I also tried likwid-perfctr -C 0 -g MEM ls as a test; the result shows no memory data and ends with an error message (in direct mode):
Group 1: MEM
+-----------------------+----------+------------+
| Event | Counter | HWThread 0 |
+-----------------------+----------+------------+
| INSTR_RETIRED_ANY | FIXC0 | 533551 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 870762 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 700756 |
| TOPDOWN_SLOTS | FIXC3 | 5224572 |
| CAS_COUNT_RD | MBOX0C0 | - |
| CAS_COUNT_WR | MBOX0C1 | - |
| CAS_COUNT_RD | MBOX1C0 | - |
| CAS_COUNT_WR | MBOX1C1 | - |
| CAS_COUNT_RD | MBOX2C0 | - |
| CAS_COUNT_WR | MBOX2C1 | - |
| CAS_COUNT_RD | MBOX3C0 | - |
| CAS_COUNT_WR | MBOX3C1 | - |
| CAS_COUNT_RD | MBOX4C0 | - |
| CAS_COUNT_WR | MBOX4C1 | - |
| CAS_COUNT_RD | MBOX5C0 | - |
| CAS_COUNT_WR | MBOX5C1 | - |
| CAS_COUNT_RD | MBOX6C0 | - |
| CAS_COUNT_WR | MBOX6C1 | - |
| CAS_COUNT_RD | MBOX7C0 | - |
| CAS_COUNT_WR | MBOX7C1 | - |
| CAS_COUNT_RD | MBOX8C0 | - |
| CAS_COUNT_WR | MBOX8C1 | - |
| CAS_COUNT_RD | MBOX9C0 | - |
| CAS_COUNT_WR | MBOX9C1 | - |
| CAS_COUNT_RD | MBOX10C0 | - |
| CAS_COUNT_WR | MBOX10C1 | - |
| CAS_COUNT_RD | MBOX11C0 | - |
| CAS_COUNT_WR | MBOX11C1 | - |
| CAS_COUNT_RD | MBOX12C0 | - |
| CAS_COUNT_WR | MBOX12C1 | - |
| CAS_COUNT_RD | MBOX13C0 | - |
| CAS_COUNT_WR | MBOX13C1 | - |
| CAS_COUNT_RD | MBOX14C0 | - |
| CAS_COUNT_WR | MBOX14C1 | - |
| CAS_COUNT_RD | MBOX15C0 | - |
| CAS_COUNT_WR | MBOX15C1 | - |
+-----------------------+----------+------------+
...
ERROR - [./src/includes/perfmon_sapphirerapids.h:perfmon_finalizeCountersThread_sapphirerapids:2222] No such file or directory.
MSR read operation failed
likwid-perfctr -i gives out information like this:
$ likwid-perfctr -i
--------------------------------------------------------------------------------
CPU name: Intel(R) Xeon(R) Gold 5415+
CPU type: Intel SapphireRapids processor
CPU clock: 2.90 GHz
CPU family: 6
CPU model: 143
CPU short: SPR
CPU stepping: 8
CPU features: FP ACPI MMX SSE SSE2 HTT TM RDTSCP MONITOR VMX EIST TM2 SSSE FMA SSE4.1 SSE4.2 AES AVX RDRAND AVX2 AVX512 RDSEED SSE3
CPU arch: x86_64
--------------------------------------------------------------------------------
PERFMON version: 5
PERFMON number of counters: 8
PERFMON width of counters: 48
PERFMON number of fixed counters: 4
--------------------------------------------------------------------------------
Can you shed some light on how to proceed? Thank you!
I haven't tested LIKWID with capabilities for a long time, especially not on SPR. I remember that they were hard to configure correctly. Quick guess: remove the capabilities from all binaries except likwid-lua with ACCESSMODE=direct. Warning: This is a security issue, as anyone using this interpreter can use the capabilities! If you use ACCESSMODE=accessdaemon, the daemons in sbin require the capabilities and there should be no security problem. Where did you get the info on how to configure the capabilities correctly? I will test it myself and update the docs.
In order to set the capabilities you probably had root privileges. Try an installation with ACCESSMODE=accessdaemon to some user-local prefix but with sudo (the user-local prefix is just so that you can easily delete the whole installation again). Adjust PATH and LD_LIBRARY_PATH and check again whether it still does not work.
You should get a better understanding of where it fails with debugging mode -V 3.
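A minimal Python sketch of the "adjust PATH and LD_LIBRARY_PATH, then rerun with -V 3" step. The helper name likwid_env and the prefix ~/likwid-test are hypothetical, and the sketch assumes the installation places binaries in bin and libraries in lib below the prefix. The likwid-perfctr arguments are the ones already used in this thread.

```python
import os
import subprocess


def likwid_env(prefix, base_env=None):
    """Copy base_env with <prefix>/bin and <prefix>/lib first on the search paths."""
    env = dict(os.environ if base_env is None else base_env)
    for var, subdir in (("PATH", "bin"), ("LD_LIBRARY_PATH", "lib")):
        entry = os.path.join(prefix, subdir)
        old = env.get(var, "")
        env[var] = entry + os.pathsep + old if old else entry
    return env


if __name__ == "__main__":
    prefix = os.path.expanduser("~/likwid-test")  # hypothetical install prefix
    cmd = [os.path.join(prefix, "bin", "likwid-perfctr"),
           "-C", "0", "-g", "MEM", "-V", "3", "ls"]
    try:
        subprocess.run(cmd, env=likwid_env(prefix), check=False)
    except OSError as err:
        print(f"could not start {cmd[0]}: {err}")
```

Putting the test prefix first on both paths avoids accidentally mixing the new binaries with libraries from an older installation.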
Thank you for the reply! I just rebuilt with ACCESSMODE=accessdaemon and the problem persists. With -V 3 I got more error messages (thousands of them) like:
...
DEBUG - [access_client_check:562] Device check for dev 199 on socket 0 with accessDaemon failed
...
As a result, MBOX cannot be accessed. The other errors seem to be of a similar type.
I also examined the syslog:
Sep 4 16:39:06 accessD: AccessDaemon runs with UID 1172040, eUID 0
Sep 4 16:39:08 accessD: Failed to read data from register 0x63a on core 0
Sep 4 16:39:08 accessD: Input/output error
Sep 4 16:39:08 accessD: Failed to read data from register 0x641 on core 0
Sep 4 16:39:08 accessD: Input/output error
Sep 4 16:39:08 accessD: Failed to write data to register 0x8000 on core 0
Sep 4 16:39:08 accessD: Input/output error
Sep 4 16:39:08 accessD: Failed to write data to register 0x8000 on core 0
Sep 4 16:39:08 accessD: Input/output error
Sep 4 16:39:08 accessD: Failed to write data to register 0x8000 on core 0
Sep 4 16:39:08 accessD: Input/output error
Capabilities cap_sys_rawio,cap_sys_admin=ep are given to likwid-accessD.
Am I missing something here?
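With thousands of messages, it can help to reduce the daemon's syslog output to the distinct registers that fail. The following Python sketch does that for lines in the format shown above; summarize_failures is a hypothetical helper name, and SAMPLE simply repeats the syslog excerpt from this comment.

```python
import re
from collections import Counter

FAIL_RE = re.compile(
    r"Failed to (read|write) data (?:from|to) register (0x[0-9a-fA-F]+) on core (\d+)"
)


def summarize_failures(lines):
    """Count failed accesses per (operation, register, core)."""
    counts = Counter()
    for line in lines:
        match = FAIL_RE.search(line)
        if match:
            op, reg, core = match.groups()
            counts[(op, int(reg, 16), int(core))] += 1
    return counts


SAMPLE = """\
Sep 4 16:39:08 accessD: Failed to read data from register 0x63a on core 0
Sep 4 16:39:08 accessD: Input/output error
Sep 4 16:39:08 accessD: Failed to read data from register 0x641 on core 0
Sep 4 16:39:08 accessD: Input/output error
Sep 4 16:39:08 accessD: Failed to write data to register 0x8000 on core 0
Sep 4 16:39:08 accessD: Input/output error
Sep 4 16:39:08 accessD: Failed to write data to register 0x8000 on core 0
Sep 4 16:39:08 accessD: Input/output error
Sep 4 16:39:08 accessD: Failed to write data to register 0x8000 on core 0
Sep 4 16:39:08 accessD: Input/output error
"""

if __name__ == "__main__":
    summary = summarize_failures(SAMPLE.splitlines())
    for (op, reg, core), n in sorted(summary.items()):
        print(f"{op:5s} {reg:#06x} core {core}: {n}x")
```

To run it on the real log, replace SAMPLE.splitlines() with the lines of the syslog file.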
Have you tried CAP_DAC_OVERRIDE? Based on the answer at this SO entry, the two capabilities you specified are not enough to fully read /dev/mem.
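To see whether a running process actually holds CAP_DAC_OVERRIDE next to CAP_SYS_RAWIO and CAP_SYS_ADMIN, the effective capability mask in /proc/<pid>/status can be decoded. This is a minimal Python sketch with hypothetical helper names (decode_caps, effective_caps). The bit positions are the ones defined in linux/capability.h. Note that getcap reports the capabilities stored on the file, while CapEff reports what the process holds at runtime.

```python
# Bit positions of the capabilities discussed in this thread.
CAP_BITS = {"CAP_DAC_OVERRIDE": 1, "CAP_SYS_RAWIO": 17, "CAP_SYS_ADMIN": 21}


def decode_caps(cap_hex):
    """Map each capability name to True if its bit is set in the hex mask."""
    mask = int(cap_hex, 16)
    return {name: bool((mask >> bit) & 1) for name, bit in CAP_BITS.items()}


def effective_caps(pid="self"):
    """Read the CapEff line of /proc/<pid>/status and decode it."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("CapEff:"):
                return decode_caps(line.split()[1])
    raise RuntimeError("no CapEff line found")


if __name__ == "__main__":
    try:
        for name, present in effective_caps().items():
            print(f"{name}: {'yes' if present else 'no'}")
    except (OSError, RuntimeError) as err:
        print(f"could not read capabilities: {err}")
```

Passing the PID of the running likwid-accessD process instead of "self" shows what the daemon itself holds.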