powa-team/pg_stat_kcache

about the difference between value of auto-detected linux_hz and CONFIG_HZ

atorik opened this issue · 6 comments

Hello,

I don't know much about kernel timekeeping and may be misunderstanding something basic, but I noticed that the auto-detected linux_hz differs from CONFIG_HZ.

$ cat /etc/redhat-release 
CentOS Linux release 8.1.1911 (Core) 

# grep CONFIG_HZ /boot/config-4.18.0-147.el8.x86_64 
...
CONFIG_HZ_1000=y
CONFIG_HZ=1000

$ pg_ctl start
...
2020-07-14 02:17:26.001 EDT [13688] LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2020-07-14 02:17:26.001 EDT [13688] LOG:  pg_stat_kcache.linux_hz is set to 500000
...

I first tried this on a virtual machine and suspected the VM might be the cause.
So I tried the same thing on a physical server, but the guessed linux_hz was still far higher than CONFIG_HZ.

Is this intentional behavior?
Also, as far as I can tell from the description of linux_hz and from man time(7), it seems better to set linux_hz to the value of CONFIG_HZ. Is that the right way to go?

    The software clock, HZ, and jiffies
       The  accuracy  of  various system calls that set timeouts, (e.g., select(2), sigtimedwait(2)) and measure CPU time (e.g.,
       getrusage(2)) is limited by the resolution of the software clock, a clock maintained by the kernel which measures time in
       jiffies.  The size of a jiffy is determined by the value of the kernel constant HZ.

rjuju commented

Hello,

It's probably expected to have a value different from CONFIG_HZ, however I wouldn't have expected such a high tick frequency. I'm not sure how to reliably get this constant from C code across all Linux kernel versions, and we also need an alternative for non-Linux kernels. It is, however, possible to set this parameter manually if needed.

That being said, this parameter is only used as a way to ignore incorrect getrusage() result that we previously saw when graphing the results using powa UI. I'm fine with improving this part, but I'm afraid that whatever approach we use, we'll never be able to really trust the numbers for queries whose runtime is close to the system clock resolution.

rjuju commented

After a quick look at pgsk_assign_linux_hz_check_hook(), the approach looks quite incorrect: the function assumes that the clock resolution can be computed from a single new value, which can lead to entirely wrong results, as in your case.

A simple way to fix that would be to add a new loop that calls getrusage until a new timeval is found, and then use this as a reference for the existing loop to compute a more reliable hz. What do you think?
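
For illustration, the two-loop idea could be sketched roughly as follows. This is not the code in pg_stat_kcache.c; the helper names (user_cpu_usec, wait_for_next_update, estimate_rusage_hz) are made up for this sketch, and it assumes that ru_utime advances in fixed steps. On kernels with finer-grained CPU accounting the measured step may have little to do with CONFIG_HZ.

```c
#include <assert.h>
#include <sys/resource.h>
#include <sys/time.h>

/* User CPU time consumed so far by this process, in microseconds. */
static long long user_cpu_usec(void)
{
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);

    assert(rc == 0);
    (void) rc;
    return (long long) ru.ru_utime.tv_sec * 1000000LL + ru.ru_utime.tv_usec;
}

/* Poll getrusage() until ru_utime differs from 'since'; return the new value. */
static long long wait_for_next_update(long long since)
{
    volatile unsigned long spin = 0;
    long long now;

    do {
        /* Burn a little user CPU so there is user time to account for. */
        for (int i = 0; i < 1000; i++)
            spin++;
        now = user_cpu_usec();
    } while (now == since);
    return now;
}

/* Hypothetical helper: estimate how often ru_utime is updated, in Hz. */
static int estimate_rusage_hz(void)
{
    /* First loop: align on an update boundary to get a clean reference. */
    long long reference = wait_for_next_update(user_cpu_usec());
    /* Second loop: measure one full step starting from that reference. */
    long long next = wait_for_next_update(reference);

    return (int) (1000000LL / (next - reference));
}
```

The first loop only exists to throw away the partial step that is in progress when detection starts, so the second loop measures the distance between two consecutive updates rather than the distance from an arbitrary starting point.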

atorik commented

Thanks for your reply!

It's probably expected to have a value different from CONFIG_HZ,

Oh, I thought it should be CONFIG_HZ. Am I misunderstanding something?

--- pg_stat_kcache.c
        DefineCustomIntVariable("pg_stat_kcache.linux_hz",
                                "Inform pg_stat_kcache of the linux CONFIG_HZ config option",

That being said, this parameter is only used as a way to ignore incorrect getrusage() result that we previously saw when graphing the results using powa UI.

Just out of interest, how incorrect is the 'incorrect getrusage() result'?
Too low, too high, or something else?

A simple way to fix that would be to add a new loop that calls getrusage until a new timeval is found, and then use this as a reference for the existing loop to compute a more reliable hz. What do you think?

Uhh, sorry, but I can't judge whether this idea would work well or not.
I wish there were a good way to get or calculate it, but apparently there's no uniform way:
https://stackoverflow.com/questions/12480486/how-to-check-hz-in-the-terminal

rjuju commented

Oh, I thought it should be CONFIG_HZ. Am I misunderstanding something?

The parameter should have this value. If the user doesn't configure it manually (since I don't know any way to conveniently retrieve it from a C program), there's a heuristic trying to find the value.

Just out of interest, how incorrect is the 'incorrect getrusage() result'?
Too low, too high, or something else?

Both. Let's say the clock resolution is 1 ms. If you have 10 queries each running for 0.1 ms, pg_stat_kcache could report 9 of them as instant as far as CPU is concerned, and one of them as lasting 1 ms. If those are 10 different queryids, then 9 of the queries will appear to consume no CPU at all, while the last one will appear to consume more CPU time than its execution time, so its CPU usage would be more than 100%.
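
To make the 10-query example concrete, here is a small simulation. It is only a model: it assumes a purely tick-based clock and queries running back to back on one CPU, and the function names (quantize, simulate_reported_cpu) are invented for the sketch.

```c
#include <assert.h>

/* Value a tick-based clock would show at true time t_usec. */
static long quantize(long t_usec, long tick_usec)
{
    assert(tick_usec > 0);
    return (t_usec / tick_usec) * tick_usec;
}

/*
 * Run nqueries back to back, each really using true_usec of CPU, and
 * store in reported[i] the CPU time that the difference between two
 * readings of the tick-based clock would attribute to query i.
 */
static void simulate_reported_cpu(long true_usec, long tick_usec,
                                  int nqueries, long *reported)
{
    long now = 0;

    for (int i = 0; i < nqueries; i++) {
        long before = quantize(now, tick_usec);   /* reading at query start */

        now += true_usec;                         /* query really runs */
        reported[i] = quantize(now, tick_usec) - before;
    }
}
```

Calling simulate_reported_cpu(100, 1000, 10, reported) fills the first nine entries with 0 and the last one with 1000: the total is right, but the whole millisecond is charged to whichever query happens to be running when the clock ticks.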

rjuju commented

Oh, it also seems that the computation is done using the wrong unit:

		*newval = (int) (1 / ((myrusage.ru_utime.tv_sec - previous_value.tv_sec) +
		   (myrusage.ru_utime.tv_usec - previous_value.tv_usec) / 1000000.));

This computes a frequency based on microseconds rather than milliseconds, which explains why you got 500000 rather than 500. Both values are wrong (I'm assuming that your LINUX_HZ is 1000), but at least 500 is the right order of magnitude.
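
For a sense of the numbers, here is a sketch of the same kind of conversion done in integer microseconds. It is not the code from pg_stat_kcache.c, and timeval_gap_usec and hz_from_timevals are hypothetical names. With this arithmetic, two samples 2 microseconds apart give 500000, while two samples 2 milliseconds apart give 500.

```c
#include <assert.h>
#include <sys/time.h>

/* Gap between two samples in whole microseconds (handles tv_usec wrap). */
static long long timeval_gap_usec(struct timeval prev, struct timeval cur)
{
    return (long long) (cur.tv_sec - prev.tv_sec) * 1000000LL
         + (long long) (cur.tv_usec - prev.tv_usec);
}

/* Frequency, in Hz, implied by the gap between two consecutive samples. */
static int hz_from_timevals(struct timeval prev, struct timeval cur)
{
    long long gap = timeval_gap_usec(prev, cur);

    assert(gap > 0);                /* samples must be distinct and ordered */
    return (int) (1000000LL / gap);
}
```

Keeping everything in integer microseconds avoids mixing seconds and fractions of a second in one floating-point expression, which makes the unit of the result easier to check by eye.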

atorik commented

Thanks for your investigation, I hadn't noticed that!