softdevteam/krun

Add checks for Intel hardware cpu throttle counters

fsfod opened this issue · 10 comments

fsfod commented

The kernel has counters that record when hardware thermal throttling is triggered; they can be read from /sys/devices/system/cpu/cpu0/thermal_throttle/. Older CPUs only expose core_throttle_count and package_throttle_count. Newer CPUs also expose core_power_limit_count and package_power_limit_count. There is also a counter for all thermal events in /proc/interrupts, on the "TRM:" line, which seems to be zero on normally functioning systems.

The interrupt handling for them is set up in therm_throt.c#L520.
Most of these events come from hardware interrupts controlled with IA32_THERM_INTERRUPT. There is also the IA32_THERM_STATUS MSR that contains the state of various throttling behaviors.
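A minimal sketch of how these counters could be polled from Python, assuming the sysfs layout and the "TRM:" line described above. The function names (read_throttle_counters, parse_trm_line, increased) are hypothetical and are not part of Krun:

```python
import glob
import os


def read_throttle_counters(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu: {counter_name: value}} for every *_count file found."""
    counters = {}
    # "cpu[0-9]*" matches cpu0, cpu12, ... but not cpufreq or cpuidle.
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "thermal_throttle", "*_count")
    for path in sorted(glob.glob(pattern)):
        cpu = path.split(os.sep)[-3]
        name = os.path.basename(path)
        with open(path) as fh:
            counters.setdefault(cpu, {})[name] = int(fh.read().strip())
    return counters


def parse_trm_line(interrupts_text):
    """Return per-CPU thermal event counts from the text of /proc/interrupts.

    Returns None if there is no "TRM:" line.
    """
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "TRM:":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break  # the trailing description starts here
                counts.append(int(field))
            return counts
    return None


def increased(before, after):
    """List the (cpu, counter) pairs whose value grew between two snapshots."""
    changed = []
    for cpu, names in after.items():
        for name, value in names.items():
            if value > before.get(cpu, {}).get(name, 0):
                changed.append((cpu, name))
    return sorted(changed)
```

One possible use would be to take a snapshot with read_throttle_counters() before and after each benchmark execution and flag the run if increased(before, after) is non-empty.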

OK, so we could add checks for these, but somehow I suspect our APERF/MPERF code is more robust.

Looking at bencher8:

vext01@bencher8:/sys/devices/system/cpu/cpu0/thermal_throttle$ ls
core_throttle_count  package_throttle_count
vext01@bencher8:/sys/devices/system/cpu/cpu0/thermal_throttle$ cat *
0
0

And as for interrupts:

TRM:          0          0          0          0   Thermal event interrupts

So this just means that the CPU never throttled due to heat, but not that the CPU never clocked down.

For example, this machine idles at 800MHz.

vext01@bencher8:/sys/devices/system/cpu/cpu0/thermal_throttle$ cpufreq-info
cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
Report errors and bugs to cpufreq@vger.kernel.org, please.
analyzing CPU 0:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 10.0 us.
  hardware limits: 800 MHz - 3.60 GHz
  available frequency steps: 3.60 GHz, 3.40 GHz, 3.20 GHz, 3.00 GHz, 2.90 GHz, 2.70 GHz, 2.50 GHz, 2.30 GHz, 2.10 GHz, 1.90 GHz, 1.70 GHz, 1.50 GHz, 1.40 GHz, 1.20 GHz, 1000 MHz, 800 MHz
  available cpufreq governors: conservative, powersave, userspace, ondemand, performance, schedutil
  current policy: frequency should be within 800 MHz and 3.60 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 800 MHz.
  cpufreq stats: 3.60 GHz:58.93%, 3.40 GHz:0.00%, 3.20 GHz:0.00%, 3.00 GHz:0.00%, 2.90 GHz:0.00%, 2.70 GHz:0.00%, 2.50 GHz:0.00%, 2.30 GHz:0.00%, 2.10 GHz:0.00%, 1.90 GHz:0.00%, 1.70 GHz:0.00%, 1.50 GHz:0.00%, 1.40 GHz:0.00%, 1.20 GHz:0.01%, 1000 MHz:0.02%, 800 MHz:41.03%  (20845)
...

But I guess what you are really suggesting is that you could replace the temperature checks in krun with checks to those counters you found?

I'm not sure without doing more research. Isn't it possible there's a tier of throttling that doesn't raise an interrupt? Not sure.

I don't think this is an either/or thing. With APERF/MPERF checks we insulate ourselves against missed interrupts. However, because we have to use a tolerance, we might also miss small blips. If interrupts are generated for those, catching them would help us be even more accurate.
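To illustrate the point about tolerance, here is a rough sketch of a ratio check. It is not Krun's actual implementation: the function names and the 1% tolerance are assumptions chosen for illustration. APERF counts at the actual clock rate and MPERF at a fixed reference rate, so a delta ratio below 1 suggests the core ran slower than the reference frequency over the interval:

```python
def aperf_mperf_ratio(aperf_delta, mperf_delta):
    """Ratio of the APERF and MPERF deltas measured over the same interval."""
    if mperf_delta <= 0:
        raise ValueError("MPERF delta must be positive")
    return float(aperf_delta) / float(mperf_delta)


def clocked_down(aperf_delta, mperf_delta, tolerance=0.01):
    """True if the ratio falls below 1.0 by more than the tolerance.

    The default tolerance is a hypothetical value, not Krun's.
    """
    return aperf_mperf_ratio(aperf_delta, mperf_delta) < 1.0 - tolerance
```

With a 1% tolerance, clocked_down(980, 1000) is flagged but clocked_down(995, 1000) is not, which is the kind of small blip a throttle interrupt counter could still catch.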

fsfod commented

You're using the powersave governor in that output, so it's going to downclock, unlike the performance one.

@ltratt so you propose we add this on top of the existing mechanism? That won't help Tom with his laptop situation, but it will make Krun more robust.

@fsfod Correct, but that's irrelevant.

FWIW, I just turned on powersave as I saw the system was running in performance mode.

@vext01 Correct.

fsfod commented

It looks like you can at least disable the C-states with /dev/cpu_dma_latency, and when the BIOS is controlling CPU frequency, that seems to be exposed in ACPI.
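For reference, a sketch of how /dev/cpu_dma_latency is typically used: a process writes a 32-bit latency limit in microseconds and the request stays in force only while the file descriptor is held open. The context manager name and the device parameter are hypothetical (the parameter exists so the sketch can be tried against an ordinary file), and writing to the real device normally needs root:

```python
import contextlib
import struct


@contextlib.contextmanager
def cpu_latency_limit(max_latency_us=0, device="/dev/cpu_dma_latency"):
    """Hold a latency request open for the duration of the with-block.

    A limit of 0 asks the kernel to avoid deeper C-states entirely.
    """
    fh = open(device, "wb", buffering=0)
    try:
        # The device expects a native-endian signed 32-bit integer.
        fh.write(struct.pack("=i", max_latency_us))
        yield
    finally:
        # Closing the descriptor withdraws the request.
        fh.close()
```

A benchmark could then run inside "with cpu_latency_limit(0):". Whether the hardware honours the request is a separate question.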

IIRC only certain hardware (mostly server hardware) respects the disabling of C-states.

Some of our servers do allow disabling C-states, but we decided it was a waste of power to leave machines idling like that.

When the bios is controlling CPU frequency

Even the BIOS doesn't have ultimate control. Let me show you a quote from the Linux docs:

The idea that frequency can be set to a single frequency is fictional for Intel Core processors.

https://www.kernel.org/doc/Documentation/cpu-freq/intel-pstate.txt

I'm running an experiment on bencher8 to see if I can get an interrupt to occur.

In the meantime, I've looked at the code responsible for servicing the interrupt. It looks like it emits a line to dmesg if the kernel decides to throttle:
http://elixir.free-electrons.com/linux/v4.9/source/arch/x86/kernel/cpu/mcheck/therm_throt.c#L190

(pr_crit() ultimately invokes printk(), which prints to dmesg)

In light of that, we would have detected such interrupts via the dmesg checker.
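As a rough illustration of why a dmesg comparison would pick this up, the sketch below diffs two dmesg snapshots and keeps the new lines that mention throttling. It is not Krun's dmesg checker; the function name and the match pattern are assumptions, and the exact kernel message wording should be checked against the linked source:

```python
import re

# Assumed wording: any new line mentioning "throttled" is treated as suspect.
THROTTLE_RE = re.compile(r"throttled", re.IGNORECASE)


def new_throttle_lines(before, after):
    """Return lines present in `after` but not in `before` that mention throttling.

    Both arguments are lists of dmesg lines taken at different times.
    """
    seen = set(before)
    return [line for line in after
            if line not in seen and THROTTLE_RE.search(line)]
```

Taking one snapshot before a benchmark execution and one after, a non-empty result would indicate that the thermal interrupt handler logged a throttle event during the run.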

We've spoken about this in person at a thrash.

The interrupts mentioned above amount to a subset of reasons for which the CPU might clock down, so they cannot be used as gospel.

I was unable to increment the counter by fully loading all cores on bencher8 for 2 hours.

Since we should get a dmesg line for any throttle count increments, we have decided to close this.