open-mpi/hwloc

Incorrect CPU Kinds on NVIDIA Grace CPU

jlinford opened this issue · 14 comments

What version of hwloc are you using?

lstopo 3.0.0a1-git
git commit 96e1889f3a1d28fda7f15e3901519e517c2d3b16

Which operating system and hardware are you running on?

NVIDIA Grace CPU
Ubuntu 22.04.2 with kernel 6.2.0-1009-nvidia-64k

jlinford@ss02-gh01:~/src/hwloc$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.2 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
jlinford@ss02-gh01:~/src/hwloc$ uname -a
Linux ss02-gh01 6.2.0-1009-nvidia-64k #9-Ubuntu SMP PREEMPT_DYNAMIC Wed Aug 16 04:17:37 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
jlinford@ss02-gh01:~/src/hwloc$

Details of the problem

hwloc-ls shows multiple CPU kinds on the NVIDIA Grace CPU because of very minor variations in base/max frequency per core (a few MHz). NVIDIA Grace is a homogeneous CPU, so all of its cores should be reported as the same kind.

jlinford@ss02-gh01:~/src/hwloc$ hwloc-ls --cpukinds
CPU kind #0 efficiency 0 cpuset 0x00000200,0x00008000
  FrequencyMaxMHz = 3366
  LinuxCapacity = 1007
CPU kind #1 efficiency 1 cpuset 0x00000034,0xc8801c02,0x00170020
  FrequencyMaxMHz = 3375
  LinuxCapacity = 1010
CPU kind #2 efficiency 2 cpuset 0x000000c1,0x07606000,0xf20800c0
  FrequencyMaxMHz = 3384
  LinuxCapacity = 1013
CPU kind #3 efficiency 3 cpuset 0x00000008,0x30168035,0x04807110
  FrequencyMaxMHz = 3393
  LinuxCapacity = 1015
CPU kind #4 efficiency 4 cpuset 0x000101c0,0x09000e0c
  FrequencyMaxMHz = 3402
  LinuxCapacity = 1018
CPU kind #5 efficiency 5 cpuset 0x00000002,0x00080008,0x00000003
  FrequencyMaxMHz = 3411
  LinuxCapacity = 1021
CPU kind #6 efficiency 6 cpuset 0x00600000
  FrequencyMaxMHz = 3420
  LinuxCapacity = 1024
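
A note on reading the cpuset strings above: each one is a comma-separated list of 32-bit hexadecimal words, most significant word first, and an empty field stands for a zero word. The following is a minimal sketch for counting how many CPUs a given kind covers. The helper name count_cpus_in_mask is hypothetical and not part of hwloc; inside a real hwloc program, hwloc_bitmap_sscanf() followed by hwloc_bitmap_weight() would be the natural way to do this.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of hwloc): count the bits set in a cpuset
 * string such as "0x00000200,0x00008000". Each comma-separated field is
 * treated as one 32-bit hexadecimal word; an empty field counts as zero. */
unsigned count_cpus_in_mask(const char *mask)
{
    unsigned total = 0;
    const char *field = mask;

    while (*field != '\0') {
        /* With base 16, strtoul accepts the optional "0x" prefix. For an
         * empty field it returns 0 and leaves `end` on the comma. */
        char *end;
        unsigned long word = strtoul(field, &end, 16);

        for (; word != 0; word &= word - 1)   /* clear the lowest set bit */
            total++;

        /* Move to the next field, or stop at the end of the string. */
        field = strchr(end, ',');
        if (field == NULL)
            break;
        field++;
    }
    return total;
}
```

For example, the mask of CPU kind #0 above (0x00000200,0x00008000) has two bits set, so that kind contains only two cores.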

Hello. We could add an NVIDIA-specific quirk so that minor frequency and capacity variations like these are ignored. But we'd need a way to precisely identify Grace CPUs (anything to look for in /proc/cpuinfo or so?), or I may end up applying it to all NVIDIA ARM CPUs. Maybe some sort of threshold? Looks like 5% would be enough here (but we've seen ARM CPUs where frequencies were within 3% of each other but the cores were really different).
By the way, do you know if all chips will show the same values? Same values for same cores after each reboot?
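
To make the threshold idea concrete, here is a rough sketch of such a check. It is not hwloc code: the function name and the choice to measure the spread relative to the highest frequency are assumptions made for illustration. With the values reported above (3366 to 3420 MHz) the spread is 54 MHz, about 1.6% of the maximum, so a 5% threshold would merge all seven kinds.

```c
#include <stddef.h>

/* Hypothetical check, not hwloc code: return 1 when every frequency lies
 * within `threshold` (e.g. 0.05 for 5%) of the highest one, 0 otherwise. */
int freqs_within_threshold(const unsigned *mhz, size_t n, double threshold)
{
    unsigned lo, hi;
    size_t i;

    if (n == 0)
        return 1;               /* nothing to compare */

    lo = hi = mhz[0];
    for (i = 1; i < n; i++) {
        if (mhz[i] < lo) lo = mhz[i];
        if (mhz[i] > hi) hi = mhz[i];
    }
    /* Spread between the slowest and the fastest kind, relative to the
     * fastest one. */
    return (double)(hi - lo) <= threshold * (double)hi;
}
```

As noted above, a threshold alone cannot tell Grace apart from ARM CPUs whose cores really differ while their frequencies stay within a few percent, which is why a precise platform identification is still wanted.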

Adding something Grace-specific sounds great, thanks!

Checking /sys/devices/soc0/soc_id is the best way to identify Grace. This file will contain jep106:036b:0241 (NVIDIA JEP106 code and Grace chip ID). I suggest that when Grace is detected, force all cores to be of the same CPU kind. The kind's base frequency could be the minimum frequency of all cores, and the kind's maximum frequency could be the maximum frequency of all cores.
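
The detection described above could be sketched as follows. This is an illustration only, not the code that ended up in hwloc; the function names are hypothetical, and the sketch assumes the file holds a single line that may end with a newline.

```c
#include <stdio.h>
#include <string.h>

/* Identifier quoted in this thread: NVIDIA JEP106 code plus Grace chip ID. */
#define GRACE_SOC_ID "jep106:036b:0241"

/* Return 1 if `soc_id` (as read from sysfs, possibly with a trailing
 * newline) is exactly the Grace identifier, 0 otherwise. */
int is_grace_soc_id(const char *soc_id)
{
    size_t len = strcspn(soc_id, "\r\n");   /* ignore the trailing newline */
    return len == strlen(GRACE_SOC_ID)
        && strncmp(soc_id, GRACE_SOC_ID, len) == 0;
}

/* Read the first line of `path` and test it. Returns 0 when the file
 * cannot be opened or read. */
int file_has_grace_soc_id(const char *path)
{
    char buf[64];
    FILE *f = fopen(path, "r");
    int found = 0;

    if (f != NULL) {
        if (fgets(buf, sizeof(buf), f) != NULL)
            found = is_grace_soc_id(buf);
        fclose(f);
    }
    return found;
}
```

A caller would pass "/sys/devices/soc0/soc_id" to file_has_grace_soc_id(). Only soc0 is mentioned in this thread; whether other socN entries need to be checked is not covered here.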

Different Grace chips will have different per-core base/max frequencies. You can see very different outputs from hwloc-ls --cpukinds from chip to chip. But on a given Grace SoC, the per-core base/max frequencies don't change, so the same Grace chip will show the same base/max frequencies, even after reboot.

Ok thanks. What about multisocket systems? (not sure they exist). Could we have a single system with 2 Grace chips with very different frequencies that we want to expose as different kinds?

I am not sure about using the min freq for hwloc's FrequencyBase. This thing was designed with cpufreq and turboboost in mind (normal freq vs single core max freq when others are idle). These tiny variations on Grace look different and negligible. We might rather want to just define FrequencyMax since that's what cpufreq reports and ignore FrequencyBase?

By the way, is Grace the same thing as Tegra 241? I found the ID 036b:0241 in some Linux commits talking about Tegra 241, I want to make sure this ID is correct and unique to Grace.

What about multisocket systems? Could we have a single system with 2 Grace chips with very different frequencies that we want to expose as different kinds?

I expect to see similar base/max frequencies across all Grace chips, even in systems with multiple superchips.

We might rather want to just define FrequencyMax since that's what cpufreq reports and ignore FrequencyBase?

Sounds good to me!
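
The agreed approach (one kind for all cores, FrequencyMax taken as the highest per-core maximum, no FrequencyBase) can be sketched like this. The struct and function are hypothetical stand-ins and not hwloc's internal data structures; the fixed-size word array with the least significant word first is an assumption made to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>

#define CPUSET_WORDS 8   /* assumption: enough 32-bit words for this sketch */

/* Hypothetical type, not an hwloc structure.
 * cpuset[0] holds CPUs 0-31, cpuset[1] holds CPUs 32-63, and so on. */
struct cpu_kind {
    uint32_t cpuset[CPUSET_WORDS];
    unsigned freq_max_mhz;
};

/* Merge `n` kinds into one: OR the cpusets together and keep the highest
 * maximum frequency. No base frequency is produced. */
struct cpu_kind merge_cpu_kinds(const struct cpu_kind *kinds, size_t n)
{
    struct cpu_kind merged = { {0}, 0 };
    size_t i, w;

    for (i = 0; i < n; i++) {
        for (w = 0; w < CPUSET_WORDS; w++)
            merged.cpuset[w] |= kinds[i].cpuset[w];
        if (kinds[i].freq_max_mhz > merged.freq_max_mhz)
            merged.freq_max_mhz = kinds[i].freq_max_mhz;
    }
    return merged;
}
```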

is Grace the same thing as Tegra 241?

The short answer is "yes" but Grace is known publicly by its product name "Grace". To avoid confusion, please don't use Tegra 241 to refer to Grace.

Just for fun and with no expectations I tested bgoglin@801fae5. It works if I set HWLOC_CPUKINDS_HOMOGENEOUS but it doesn't seem to autodetect Grace yet. Looks like WIP so I'll step back and wait for a ping to test again. Thanks for jumping on this so quickly!

jlinford@fc01-gg01:~/src/bgoglin_hwloc$ cat /sys/devices/soc0/soc_id
jep106:036b:0241
jlinford@fc01-gg01:~/src/bgoglin_hwloc$ /tmp/bgoglin_hwloc/bin/hwloc-ls --cpukinds
CPU kind #0 efficiency 0 cpuset 0x00000050,0x08000000,,0x0
  FrequencyMaxMHz = 3321
  LinuxCapacity = 996
CPU kind #1 efficiency 1 cpuset 0x00000020,,,0x0
  FrequencyMaxMHz = 3330
  LinuxCapacity = 999
CPU kind #2 efficiency 2 cpuset 0x00002000,0x00020408,0x02a00000,,0x0
  FrequencyMaxMHz = 3339
  LinuxCapacity = 1002
CPU kind #3 efficiency 3 cpuset 0x00001008,0x006c8100,0x00500000,,0x0
  FrequencyMaxMHz = 3348
  LinuxCapacity = 1005
CPU kind #4 efficiency 4 cpuset 0x00000086,0xb5912084,0xc1000000,,0x0
  FrequencyMaxMHz = 3357
  LinuxCapacity = 1007
CPU kind #5 efficiency 5 cpuset 0x0000c951,0x48004803,0x100ce800,,0x0
  FrequencyMaxMHz = 3366
  LinuxCapacity = 1010
CPU kind #6 efficiency 6 cpuset 0x00000420,0x02001200,0x04010000,0x00000100,0x00000044
  FrequencyMaxMHz = 3375
  LinuxCapacity = 1013
CPU kind #7 efficiency 7 cpuset 0x00000200,,0x20001600,0x0008c44a,0x2847c00a
  FrequencyMaxMHz = 3384
  LinuxCapacity = 1015
CPU kind #8 efficiency 8 cpuset 0x00020174,0x061028a5,0xd6b037b0
  FrequencyMaxMHz = 3393
  LinuxCapacity = 1018
CPU kind #9 efficiency 9 cpuset 0x0000000b,0xd9e61210,0x01080801
  FrequencyMaxMHz = 3402
  LinuxCapacity = 1021
CPU kind #10 efficiency 10 cpuset 0x00000080,0x20010000,0x0
  FrequencyMaxMHz = 3411
  LinuxCapacity = 1024
jlinford@fc01-gg01:~/src/bgoglin_hwloc$ export HWLOC_CPUKINDS_HOMOGENEOUS=1
jlinford@fc01-gg01:~/src/bgoglin_hwloc$ /tmp/bgoglin_hwloc/bin/hwloc-ls --cpukinds
CPU kind #0 efficiency 0 cpuset 0x0000ffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff
  FrequencyMaxMHz = 3411
  LinuxCapacity = 1024

Ah, thanks, I was going to ask to try that release. Can you run "/tmp/bgoglin_hwloc/bin/hwloc-info root | grep SoC"?

jlinford@fc01-gg01:~$ /tmp/bgoglin_hwloc/bin/hwloc-info root | grep SoC
 info SoC0ID = jep106:036b:0241
 info SoC0Family = jep106:036b
 info SoC0Revision = 0x00000101

Try this on top of the previous git snapshot. I was reading soc_id too late.

--- a/hwloc/topology-linux.c
+++ b/hwloc/topology-linux.c
@@ -7603,6 +7603,7 @@ hwloc_look_linuxfs(struct hwloc_backend *backend, struct hwloc_disc_status *dsta
   if (data->need_global_infos) {
     hwloc_gather_system_info(topology, data);
     hwloc_linuxfs_check_kernel_cmdline(data);
+    hwloc__get_soc_info(data, topology->levels[0][0]);
   }
 
   if (dstatus->phase == HWLOC_DISC_PHASE_CPU) {
@@ -7664,7 +7665,6 @@ hwloc_look_linuxfs(struct hwloc_backend *backend, struct hwloc_disc_status *dsta
  out:
   if (data->need_global_infos) {
     hwloc__get_dmi_id_info(data, topology->levels[0][0]);
-    hwloc__get_soc_info(data, topology->levels[0][0]);
     hwloc__add_info(&topology->infos, "Backend", "Linux");
     /* data->utsname was filled with real uname or \0, we can safely pass it */
     hwloc_add_uname_info(topology, &data->utsname);

I pushed the fix in the PR. Can you verify that the tarball at https://ci.inria.fr/hwloc/job/basic/view/change-requests/job/PR-635/ works fine?

Then can you also check the tarball at https://ci.inria.fr/hwloc/job/bgoglin/588/? It's a backport to hwloc 2.x, which will become the new hwloc 2.10 very soon.

Is it possible to get a dump of /sys of a Grace node for future regression testing? You'd have to run "hwloc-gather-topology grace" from one of these tarballs and send the generated "grace.tar.bz2". If it's too sensitive to go in the public repository, I'd still like to get one for myself.

I am going to release 2.10rc1 with this change since I don't want to delay it any further. Hopefully you'll have time to test it before the final 2.10 release next week.

It works great, thanks!

jlinford@fc01-gg01:~$ /cm/shared/apps/hwloc/2.10rc1/bin/hwloc-ls --cpukinds
CPU kind #0 efficiency 0 cpuset 0x0000ffff,0xffffffff,0xffffffff,0xffffffff,0xffffffff
  FrequencyMaxMHz = 3411
  LinuxCapacity = 1024

And can confirm that HWLOC_CPUKINDS_HOMOGENEOUS=0 restores the original behavior. Nice feature, thanks for that.
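
The behavior observed in this thread (unset means autodetect, 1 forces a single kind, 0 restores the per-core kinds) amounts to an override of the detection result. A minimal sketch of that logic follows; the function name is hypothetical, and parsing the value with atoi() is an assumption, since the thread does not show how hwloc actually reads the variable.

```c
#include <stdlib.h>

/* Decide whether to merge all cores into one kind.
 * `env_value` is the raw variable content, or NULL when it is unset;
 * `autodetected` is the result of the platform detection. */
int want_homogeneous_kinds(const char *env_value, int autodetected)
{
    if (env_value == NULL)
        return autodetected;        /* no override: trust the detection */
    return atoi(env_value) != 0;    /* "1" forces on, "0" forces off */
}
```

A caller would pass getenv("HWLOC_CPUKINDS_HOMOGENEOUS") as the first argument and the detection result as the second.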

I generated grace.tar.bz2 but it contains a lot of information. I suggest we wait until production Grace systems are deployed (very soon) and then I'll generate a file we can share publicly.

Ok, thanks!