RRZE-HPC/likwid

[BUG] Wrong RAPL DRAM energy unit for Sapphire Rapids (SPR)


Hi @TomTheBear,

I was quite caught off guard by the following issue in LIKWID's RAPL measurements for the DRAM domain on Intel Sapphire Rapids. I am surprised no one has noticed it yet, but then I guess the hardware is pretty new. An associated PR implementing the straightforward fix is on its way.

Hope this helps. Let me know if you need more debug info, but I think the issue is quite clear.

Greetings

Christian

Describe the bug

LIKWID uses the same DRAM energy unit of 15.3 uJ on Sapphire Rapids (SPR) as on previous generations of Intel architectures (see relevant code here). As a result, the reported DRAM energy values (and the power values derived from them) are too low: on SPR, the DRAM domain should use the same 61 uJ energy unit as all other RAPL domains. A patch fixing this issue in the associated Linux drivers and tools was submitted to the kernel mailing list a while back. The relevant merged commit can be found here.

To Reproduce

  • Run likwid-perfctr -g ENERGY sleep 1 on an Intel Sapphire Rapids system
  • LIKWID version: commit d8fea29 (the affected code is still present in current master)
  • OS: Rocky Linux release 8.10 (Green Obsidian)

Additional context

  • The difference between LIKWID and other tools can be seen on the same system when running a more recent kernel version (see the links above).
  • Existing measurement values can be fixed by scaling the reported energy (or power) by 0.5 ** 14 / 15.3e-6 (≈ 3.9892258986928106).
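
For anyone who wants to correct already recorded numbers, the scaling above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not LIKWID code: the constant and function names are made up for this example, and it assumes the affected values were computed with exactly 15.3e-6 J per counter increment, as described above.

```python
# Energy represented by one RAPL counter increment, in joules.
# Both names are hypothetical and exist only for this sketch.
OLD_DRAM_UNIT_J = 15.3e-6    # unit applied to the DRAM domain before the fix
SPR_DRAM_UNIT_J = 0.5 ** 14  # about 61 uJ, same unit as the other RAPL domains

# Ratio between the correct and the previously applied unit.
SCALE = SPR_DRAM_UNIT_J / OLD_DRAM_UNIT_J


def correct_dram_value(reported: float) -> float:
    """Rescale a DRAM energy (J) or power (W) value reported with the old unit."""
    return reported * SCALE


if __name__ == "__main__":
    # Example input: a DRAM energy of 10 J as reported with the old unit.
    print(f"scale factor: {SCALE}")
    print(f"corrected energy: {correct_dram_value(10.0)} J")
```

The same factor applies to power, since power is just the energy divided by the unchanged runtime. Values from other RAPL domains must not be rescaled.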