Notes on using the rdtsc(p) instructions: Zen 4 Ryzen mobile processors (R7 7840H) and Intel 12th gen Alder Lake mobile processors (i7 12700H) as examples
On x86 platforms the rdtsc and rdtscp instructions give us access to an internal 64-bit hardware Time Stamp Counter (TSC) that is reset to zero at boot time and from then on increments at a fixed frequency (ideally the same rate as the processor clock), which can be very useful for conducting microbenchmarks.
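Just to show what a raw counter read looks like, here is a minimal sketch assuming GCC or Clang on x86-64, where the __rdtsc() and __rdtscp() compiler intrinsics are available:

#include <cstdio>
#include <x86intrin.h>   // __rdtsc, __rdtscp

int main(){
    unsigned int aux;    // rdtscp also returns the contents of IA32_TSC_AUX here
    unsigned long long t0 = __rdtsc();
    unsigned long long t1 = __rdtscp(&aux);
    printf("rdtsc: %llu rdtscp: %llu (aux=%u)\n", t0, t1, aux);
    return 0;
}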
Here are some good references for learning about the rdtsc and rdtscp instructions:
- Stackoverflow thread
- Intel's guide on benchmarking code with rdtsc(p)
- rdtscp reference
- rdtsc reference
For measuring the execution time of some piece of work, the ideal case would look like the following:
leading_code()
startcycle = rdtsc()
measured_code()
endcycle = rdtsc()
following_code()
However, things won't work out this easily. The main problem is out-of-order and multi-issue execution. The plain rdtsc instruction does NOT give any serializing guarantees. This means that:
- At the time the startcycle = rdtsc instruction executes, the leading_code part could have unfinished instructions that contaminate the measurement of measured_code.
- At the time startcycle = rdtsc actually reads the hardware counter, measured_code could have already started executing.
- At the time the endcycle = rdtsc instruction executes, measured_code could be unfinished.
- At the time endcycle = rdtsc actually reads the hardware counter, following_code could have started executing, taking up hardware resources, causing structural hazards and delaying the reading of endcycle = rdtsc.
So how do we solve these problems? It turns out that some more or less serializing instructions have to be used. We have a few candidates:
- The rdtscp instruction, which is available on all recent x86 platforms. It can be used to read the hardware counter just like the rdtsc instruction, and while it is not a serializing instruction, it does wait until all previous instructions have executed and all previous loads are globally visible. However, it does not wait for previous stores to become globally visible, and subsequent instructions may begin execution before the read operation is performed. This means we can modify the above code to the following version:

leading_code()
startcycle = rdtscp()
measured_code()
endcycle = rdtscp()
following_code()
Then two of the above problems are solved:
- leading_code must have finished when startcycle = rdtscp() begins to execute.
- measured_code must have finished when endcycle = rdtscp() begins to execute.
While two other problems remain:
- At the time startcycle = rdtscp actually reads the hardware counter, measured_code could still have already started executing.
- At the time endcycle = rdtscp actually reads the hardware counter, following_code could still have started executing, taking up hardware resources, causing structural hazards and delaying the reading of endcycle = rdtscp.
Now we need an instruction that really does some serializing. The first option is to use cpuid.
- The cpuid instruction, which is available on all x86 platforms, is a true serializing instruction: it guarantees that any modifications to flags, registers, and memory made by previous instructions are completed before the next instruction is fetched and executed. This means we can modify the above code to the following version:

leading_code()
startcycle = rdtscp()
cpuid()
measured_code()
endcycle = rdtscp()
cpuid()
following_code()
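For reference, a rough sketch of one such rdtscp-then-cpuid timestamp written in GCC/Clang inline assembly (my own sketch in the general style of Intel's guide, not code taken from it):

#include <cstdint>

// One timestamp in the "rdtscp; cpuid" pattern above: rdtscp reads the counter
// into EDX:EAX, the movs save the value, and cpuid then serializes so that no
// later instruction starts before the stamp has been taken.
static inline uint64_t rdtscp_then_cpuid(){
    uint32_t hi, lo;
    asm volatile(
        "rdtscp\n\t"
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t"
        "cpuid\n\t"
        : "=r"(hi), "=r"(lo)
        :
        : "%rax", "%rbx", "%rcx", "%rdx", "memory"
    );
    return ((uint64_t)hi << 32) | lo;
}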
This way all the problems are solved. However, we must notice that the execution time of the first cpuid is included in the final timing result. This systematic error is unavoidable and should be subtracted from the final result. Besides, as this suggests, the use of the cpuid instruction might severely hinder performance. It turns out that the lfence instruction is a better choice.
- The lfence instruction performs a serializing operation on all load-from-memory instructions that were issued prior to the LFENCE instruction. Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes. The lfence instruction is not a true serializing instruction in that it only guarantees that LFENCE does not execute until all prior instructions have completed locally. The key point is the adverb "locally", which is explained here. Basically speaking, lfence fits our needs here. However, beware that this local serialization behaviour is only guaranteed on Intel processors. On earlier AMD processors lfence only acts as a plain load fence, but on recent enough Ryzen processors lfence behaves the same as on Intel ones; see this for more details. So now our final code looks like the following:
leading_code()
startcycle = rdtscp()
lfence()
measured_code()
endcycle = rdtscp()
lfence()
following_code()
Again, lfence introduces some unavoidable systematic error that has to be subtracted. Besides, it's worth noting that an lfence;rdtsc sequence is roughly equivalent to rdtscp, according to this.
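For illustration, a minimal sketch of that lfence;rdtsc combination in GCC/Clang inline assembly (a sketch of one possible way to write it, not code from the linked answer):

#include <cstdint>

// Rough stand-in for rdtscp: lfence waits until all prior instructions have
// completed locally, then rdtsc reads the counter into EDX:EAX.
static inline uint64_t lfence_rdtsc(){
    uint32_t hi, lo;
    asm volatile(
        "lfence\n\t"
        "rdtsc\n\t"
        : "=d"(hi), "=a"(lo)
        :
        : "memory"
    );
    return ((uint64_t)hi << 32) | lo;
}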
Implementation, more details, and a note that rdtscp's behaviour differs between the AMD (7840H) and Intel (12700H) platforms
With the above knowledge, we can implement a clock reading routine using inline assembly as follows:
inline void clock(uint32_t& clk_hi, uint32_t& clk_lo){
    asm volatile(
        "rdtscp\n\t"     // read TSC into EDX:EAX (and IA32_TSC_AUX into ECX)
        "lfence\n\t"     // keep later instructions from starting before the read completes
        : "=d"(clk_hi), "=a"(clk_lo)
        :
        : "%rcx"         // rdtscp clobbers ECX
    );
}
As Intel's guide suggests, before measuring any code we would like to determine the inherent overhead caused by the clock reading routine itself, which is to be subtracted from real measurement results. This overhead can be measured with the following routine:
inline uint64_t concat_clk(const uint32_t& clk_hi, const uint32_t& clk_lo){
    return (uint64_t)(clk_hi)<<32 | clk_lo;   // combine EDX:EAX into one 64-bit count
}
inline uint64_t measure_overhead(){
    uint32_t ch1,cl1,ch2,cl2;
    clock(ch1,cl1);                           // two back-to-back readings with nothing in between
    clock(ch2,cl2);
    uint64_t clk1 = concat_clk(ch1,cl1);
    uint64_t clk2 = concat_clk(ch2, cl2);
    uint64_t delta = clk2 - clk1;             // clock is monotonic
    return delta;
}
And the compiled version of the above routine would look something like the following:
16f1: 0f 01 f9 rdtscp
16f4: 0f ae e8 lfence
16f7: 89 c6 mov %eax,%esi
16f9: 89 d7 mov %edx,%edi
16fb: 0f 01 f9 rdtscp
16fe: 0f ae e8 lfence
This is exactly what we want. We can see that the overhead here includes:
- Cycles taken to complete the first lfence
- Cycles taken to complete the two movs, which are also unavoidable since consecutive rdtscps have the same destination registers
- Cycles taken to complete the latter rdtscp
Again, as Intel's guide suggests, we are mainly interested in the variance of consecutive overhead measurements. If the variance is small, we can be confident that the overhead is consistent, so we can reliably subtract it from real measurements.
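A minimal sketch of how such a variance check could be done, assuming the measure_overhead() routine above (this is not the exact program that produced the numbers below):

#include <cstdio>
#include <cstdint>
#include <cstddef>

// Collect overhead samples and report their mean and variance.
int main(){
    constexpr size_t N = 1000;
    static uint64_t samples[N];
    for (size_t i = 0; i < N; i++) samples[i] = measure_overhead();

    double mean = 0.0, var = 0.0;
    for (size_t i = 0; i < N; i++) mean += (double)samples[i];
    mean /= N;
    for (size_t i = 0; i < N; i++) var += ((double)samples[i] - mean) * ((double)samples[i] - mean);
    var /= N;
    printf("mean overhead: %f cycles, variance: %f\n", mean, var);
    return 0;
}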
Now we lock the CPU clock speed to 3.8GHz and run the above overhead measuring routine 1000 times to observe the overheads and their variance. The results are as follows:
overheads: 76 76 76 76 76 76 76 76 76 ... 114 ... 76
variance: 17.120064
It can be seen that on the Ryzen 7840H the measuring overhead is quite consistent, while being considerably large. The real fun part is that before long I realized that, for a Ryzen 7840H running at 3.8GHz, the clock reading that rdtscp gives is consistently a multiple of 38! Here I wrote another program demonstrating this:
#pragma unroll(32)
for(size_t i=0; i<N; i++){
clock(clk_hi,clk_lo);
clocks[i] = concat_clk(clk_hi, clk_lo);
}
uint64_t count=0;
for (size_t i=0; i<N; i++){
if (clocks[i]%38==0){
count++;
}
}
printf("Out of %lu samples %lu was multiple of 38\n",N,count);
And the results are pretty interesting:
Out of 16384 samples 16384 was multiple of 38
Even more interestingly, it seems that no matter what frequency the processor is running at, the rdtscp clock reading remains a multiple of 38. (rdtsc is even a little more interesting, but that's not our main concern here.)
Considering the fact that this particular Ryzen 7840H has the constant_tsc flag set and a base clock speed of 3.8GHz, I think it is clear that, at least on the Ryzen 7840H, the TSC has a time resolution of 10ns. This means that repeated measurements have to be conducted for higher time resolution. The theory is:
t ≈ (T_total - T_overhead) / N

where N is the number of back-to-back repetitions of the measured code timed in one interval, T_total is the measured clock difference over that interval, T_overhead is the clock reading overhead measured above, and t is the estimated cost of a single repetition.

The measurement error is then scaled down to roughly ±10ns / N per repetition.
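A rough sketch of such a repeated measurement, reusing the clock() and concat_clk() helpers from above (the function name and template parameter here are illustrative, not from a finished implementation):

#include <cstdint>
#include <cstddef>

// Time N back-to-back repetitions of measured_code and return the average
// cycle count per repetition, with the clock-reading overhead subtracted.
template <typename F>
double cycles_per_call(F measured_code, size_t N, uint64_t overhead){
    uint32_t ch1, cl1, ch2, cl2;
    clock(ch1, cl1);
    for (size_t i = 0; i < N; i++) measured_code();
    clock(ch2, cl2);
    uint64_t total = concat_clk(ch2, cl2) - concat_clk(ch1, cl1);
    return (double)(total - overhead) / (double)N;
}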
Ryzen 5800X/4750G remains to be tested
TODO
It is often desirable to interpret a clock cycle count as real world time. To do so, given a clock driven counter, we would like to know the frequency of the clock. The increment frequency is usually the same as the base clock of the processor; on the Ryzen R7 7840H this is about 3.8GHz (~100MHz * 38). However, it's always good to confirm this. AFAIK there are two ways to get this frequency.
Very luckily, Linux uses the TSC as a major clock source and performs a calibration at system boot time. The dmesg log can help us confirm that this calibration did happen:
$ sudo dmesg | grep -Ii tsc
[ 0.000000] tsc: Fast TSC calibration using PIT
[ 0.000000] tsc: Detected 3793.051 MHz processor
[ 0.058695] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x6d59656ea52, max_idle_ns: 881590428463 ns
[ 0.562150] clocksource: Switched to clocksource tsc-early
[ 1.584321] tsc: Refined TSC clocksource calibration: 3792.875 MHz
[ 1.584339] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x6d581b92771, max_idle_ns: 881590605997 ns
[ 1.584392] clocksource: Switched to clocksource tsc
...
Then our problem is how to read this calibration result. In fact, the Linux kernel has an exported symbol tsc_khz defined in tsc.c, but it is not exposed to userspace. Stackoverflow user maxschlepzig summarized many ways to access this symbol in this stackoverflow thread. I personally feel it is most reasonable and convenient to use a kernel module.
This repository has a quite satisfying implementation.
It is also possible to get the TSC frequency by conducting a calibration ourselves. The idea is simple: using some wall clock provided by the system, we record the start and end of an interval with both the wall clock and the TSC, then divide the wall clock time difference by the TSC count difference to get the TSC cycle time.
TODO
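A minimal sketch of what such a calibration could look like, assuming the clock() and concat_clk() helpers from above, CLOCK_MONOTONIC_RAW as the reference wall clock, and a ~100ms interval (these are the sketch's own choices, not a finished implementation):

#include <cstdint>
#include <cstdio>
#include <time.h>

// Estimate the TSC frequency against CLOCK_MONOTONIC_RAW over a ~100ms interval.
int main(){
    uint32_t ch, cl;
    timespec ts0, ts1;

    clock_gettime(CLOCK_MONOTONIC_RAW, &ts0);
    clock(ch, cl);
    uint64_t tsc0 = concat_clk(ch, cl);

    timespec req{0, 100 * 1000 * 1000};   // sleep ~100ms so the interval dominates the reading overhead
    nanosleep(&req, nullptr);

    clock_gettime(CLOCK_MONOTONIC_RAW, &ts1);
    clock(ch, cl);
    uint64_t tsc1 = concat_clk(ch, cl);

    double elapsed_ns = (double)(ts1.tv_sec - ts0.tv_sec) * 1e9 + (double)(ts1.tv_nsec - ts0.tv_nsec);
    double tsc_hz = (double)(tsc1 - tsc0) / (elapsed_ns * 1e-9);
    printf("estimated TSC frequency: %.3f MHz (cycle time %.4f ns)\n", tsc_hz / 1e6, 1e9 / tsc_hz);
    return 0;
}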
It's natural to notice that to correctly interpret a clock cycle count as real world time, the frequency of the TSC increments should be constant, or at least it should behave as if it is constant. On modern processors with the constant_tsc and nonstop_tsc flags, the TSC always increments at a constant frequency regardless of the state of the processor, such as its frequency scaling state. These flags can be checked with cat /proc/cpuinfo | grep tsc.
With a constant TSC, a TSC clock difference can always be safely interpreted as a real world time difference. For instance, with a constant TSC calibrated to 3.8GHz, if two consecutive readings of the TSC give a clock difference of 38, it is always true that the time interval between the two measurements was 10ns. However, it is worth noting that even with a constant TSC, for a clock-cycle-accurate microbenchmark result, it is still necessary to fix the clock speed of the CPU to its base clock. If we don't, then with a constant TSC the processor clock and the TSC clock are essentially running asynchronously. For instance, if the processor is running at 1900MHz while the TSC has a constant frequency of 3800MHz, then for some operation that takes 5 cycles to finish, a TSC-timed benchmark would report that it takes 10 clock cycles to complete.
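To make that arithmetic concrete, a tiny sketch of the conversion, using the example numbers from above:

#include <cstdio>
#include <cstdint>

int main(){
    const double tsc_hz  = 3.8e9;   // constant TSC frequency (from calibration)
    const double core_hz = 1.9e9;   // example: core currently running at half the base clock
    uint64_t tsc_delta   = 10;      // measured TSC tick difference

    double seconds     = tsc_delta / tsc_hz;    // always valid with a constant TSC
    double core_cycles = seconds * core_hz;     // only meaningful if the core clock is known and fixed
    printf("%lu TSC ticks = %.2f ns = %.1f core cycles at %.1f GHz\n",
           (unsigned long)tsc_delta, seconds * 1e9, core_cycles, core_hz / 1e9);
    return 0;
}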