Results for other systems
travisdowns opened this issue · 29 comments
As I mentioned on HN, I can run this on SKL, SKX and CNL (CannonLake) for you.
Are there any specific arguments or format you want the results in, or should I just capture the output of cult and include it in this issue?
Thanks! No arguments are necessary, but `--output=somefile.json` is a handy option.
BTW I have never tested this with AVX-512, I have no idea whether it would all work flawlessly, so fingers crossed :)
OK, I will run 2x, with `--output=rounded.json` and with `--output=raw.json --no-rounding`, and provide you those files.
FWIW on my SKL system I get some wrong results, like:

```
bt r16, r16 : Lat: 0.50 Rcp: 0.50
bt r16, i8  : Lat: 0.50 Rcp: 0.50
bt r32, r32 : Lat: 0.50 Rcp: 0.50
bt r32, i8  : Lat: 0.50 Rcp: 0.50
bt r64, r64 : Lat: 0.50 Rcp: 0.50
bt r64, i8  : Lat: 0.50 Rcp: 0.50
```
Where 0.5 latency is ... unlikely. I guess the problem may be that CULT doesn't know that the first argument to `bt` is write-only? That is, if you do `bt eax, ecx` you aren't testing latency (I don't know what asm is actually generated, it's just a guess).
Yeah, I think it's the opposite - `bt reg, reg` is read-only for registers; it only modifies the carry flag, so it's hard to generate asm that has dependencies without introducing other instructions in there. This is something I would like to fix in a future version.
BTW: you don't have to use `--no-rounding`, there is always a small error that gets corrected by the rounding.
@kobalicek - oops, good point, I forgot that `bt` is totally read-only.
Here's another one I noticed:

```
blendvpd xmm, xmm, xmm0 : Lat: 0.50 Rcp: 0.50
```
I also got a lot of 0.2 recip throughput results which should be wrong (max 4 ops/cycle), but it seemed to go away after I turned off turbo. Do I need to turn off turbo to get good results?
Yeah, I'm also getting 0.2 reciprocal throughput on some instructions on Ryzen, but apparently Ryzen is capable of executing 5 instructions per cycle if they come from the uop cache. However, if it says 0.2 it's probably true even on Intel, although it's possible that I miscalculate the cycles wasted for each loop iteration (currently set to 1 cycle) - hard to say whether that could cause reporting 0.2 instead of 0.25 in such cases.
Yes, on Ryzen that is expected.
It's definitely not 0.2 on Intel though, I've tested this stuff exhaustively down to the cycle using lots of different calibration and cycle measurement techniques and I have never seen any case where you can do 5 ops/cycle.
As I mentioned it could be turbo effects - how are you doing the timing? Do you use a clock-based timing and then convert to cycles using a calibration based on a well-known timing, say a loop of dependent instructions?
My first CNL results look all wrong:
```
add r8, r8   : Lat: 0.66 Rcp: 0.20
add r8, i8   : Lat: 0.66 Rcp: 0.20
add r16, r16 : Lat: 0.66 Rcp: 0.20
add r16, i16 : Lat: 2.25 Rcp: 2.25
add r32, r32 : Lat: 0.66 Rcp: 0.20
add r32, i32 : Lat: 0.66 Rcp: 0.20
add r64, r64 : Lat: 0.66 Rcp: 0.20
```
I will try to turn off turbo.
Update: Looks OK with turbo off.
Hmm, I don't know how to fix this though. It seems the readings are incorrect in that case. It uses `rdtscp` when available; I followed the Intel manual here.
Yes, but `rdtscp` measures wall-clock time, not cycles, so it will always be wrong (in cycles) if the chip has turbo. For example, if the core runs at ~1.5x the TSC (base) frequency under turbo, a 1-cycle instruction shows up as ~0.66 "cycles" - which lines up with the add readings above.
I think the manual I followed was written when turbo didn't exist :) Do you have any suggestions for improving it? The logic is in `basebench.cpp` if you wanna see the current code.
The "fix" is either to force the user to turbo off turbo, you can see how I do this programatically here:
https://github.com/travisdowns/uarch-bench/blob/master/uarch-bench.sh#L66
Or to do a calibration that allows you to convert from "nominal cycles" as read by rdtsc
into CPU cycles, one way is shown here:
https://github.com/travisdowns/avx-turbo/blob/master/tsc-support.cpp
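To illustrate the calibration idea (this is just a minimal sketch, not what tsc-support.cpp actually does; the `add_chain`/`ticks_per_cycle` names are made up): time a serial chain of adds, which should retire at roughly one add per cycle on any recent x86 core, and derive a ticks-per-cycle ratio at the current frequency:

```cpp
// Minimal sketch (not tsc-support.cpp): calibrate rdtsc ticks against a
// dependency chain of adds, assumed to take ~1 cycle per add on modern
// x86 cores, then use the ratio to convert measured ticks to cycles.
#include <cstdint>
#include <x86intrin.h> // __rdtsc (GCC/Clang)

static uint64_t add_chain(uint64_t n) {
    uint64_t x = 0;
    for (uint64_t i = 0; i < n; i++)
        asm volatile("add $1, %0" : "+r"(x)); // serial chain, ~1 cycle/iter
    return x;                                 // loop overhead assumed hidden
}

// TSC ticks per core cycle at the *current* (possibly turbo) frequency.
static double ticks_per_cycle(uint64_t n = 100000000) {
    uint64_t t0 = __rdtsc();
    add_chain(n);
    uint64_t t1 = __rdtsc();
    return double(t1 - t0) / double(n); // the chain took ~n core cycles
}

// Usage: cycles = measured_ticks / ticks_per_cycle(), valid only while the
// frequency stays the same as during calibration.
```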
Yeah, many moons ago there was no frequency scaling (neither turbo nor anti-turbo, i.e., scaling below the nominal freq), so `rdtsc` and real cycles were always the same.
Then there was a brief period, after Intel added frequency scaling, where `rdtsc` still measured true CPU cycles and thus no longer wall-clock time (that's the easy way to implement this counter in hardware, after all). But everyone hated that, because `rdtsc` is mostly used for efficient `gettimeofday` or `QueryPerformanceCounter` and other calls which want real time, not some non-constant "cycles", so it was quickly changed to run in wall-clock time and that's where we are today (that was like a decade ago though).
Turning off turbo is good because you get much more stable results: you avoid the forced frequency switch when another core spins up (the current core has to slow down because modern chips have turbo multipliers that depend on how many cores are running). But there are also a lot of problems, like figuring out how to turn off turbo on all systems, the user has to be root, etc.
It's kind of a pity I don't have Intel hardware anymore at the moment; I would experiment with this a bit, but it's impossible to get it right the first time. Still, I will research this a bit.
Thinking about it, I don't think this will ever be a 100% reliable tool, but if I can make it close enough I would be happy.
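For the record, on Linux the turbo switch is just a sysfs file, so it can also be flipped from code; a rough sketch, assuming root and one of the two common cpufreq drivers (the `write_sysfs`/`disable_turbo` names are made up here - the uarch-bench.sh script linked above does roughly this in shell):

```cpp
// Rough sketch: disable turbo on Linux by writing the cpufreq sysfs knobs.
// Requires root; which knob exists depends on the driver in use
// (intel_pstate exposes no_turbo, acpi-cpufreq exposes boost).
#include <fstream>

static bool write_sysfs(const char* path, const char* value) {
    std::ofstream f(path);
    return static_cast<bool>(f << value); // false if missing or not writable
}

bool disable_turbo() {
    return write_sysfs("/sys/devices/system/cpu/intel_pstate/no_turbo", "1")
        || write_sysfs("/sys/devices/system/cpu/cpufreq/boost", "0");
}
```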
@kobalicek - my experience with uarch-bench indicates that the calibration approach is fairly robust. At most you sometimes get a wrong calibration due to a wrong assumption: e.g., when I ran on POWER9 I found out that dependent instructions always have a latency of at least 2, so the calculated frequency was half of the real frequency, but at least the error was obvious and you can correct it once you notice it.
Do you have AMD hardware, or something non-x86? I may be interested in some AMD numbers for some random microbenchmarks since I don't have easy access to AMD to test.
BTW, running now in parallel on SKX, SKL and CNL, results should be available in a few more minutes. FWIW here's the script I used which might be useful for anyone else who wants to automate this (heavily based on your README):
```bash
#!/bin/bash
set -e

ROOT_DIR=$HOME/dev/cult
mkdir -p "$ROOT_DIR"
cd "$ROOT_DIR"

# Install a recent CMake version privately
CMAKE_INSTALLER=cmake-3.14.5-Linux-x86_64.sh
wget -N https://github.com/Kitware/CMake/releases/download/v3.14.5/$CMAKE_INSTALLER
chmod +x $CMAKE_INSTALLER
mkdir -p cmake
./$CMAKE_INSTALLER --exclude-subdir --skip-license --prefix=./cmake
CMAKE=$(readlink -e cmake/bin/cmake)

# Clone CULT and AsmJit (next-wip branch)
git clone --depth=1 https://github.com/asmjit/asmjit --branch next-wip
git clone --depth=1 https://github.com/asmjit/cult

# Create build directory
mkdir -p cult/build
cd cult/build

# Configure and make
$CMAKE .. -DCMAKE_BUILD_TYPE=Release
make -j4

# Run CULT!
./cult --output=$1_rounded.json
./cult --output=$1_raw.json --no-rounding
echo "DONE"
```
You run it like `./do-cult.sh SKX` and the output is `SKX_rounded.json` and `SKX_raw.json`.
Nice thanks!
I have reduced all my machines to only one, which is a Ryzen 1700 atm (but I'm planning to upgrade to 16c/32t at the end of the year). Other than that, only ARM devices like a Raspberry Pi for testing; I'm interested in RISC-V though.
BTW I don't wanna waste more of your time on this. I would have to fix the timing issues if I want better numbers; I really didn't know it could be that far off initially.
Don't worry, I turned off turbo and the numbers seem good.
Thanks a lot! I have updated the web-app with the new data here: https://asmjit.com/asmgrid/ - The architectures look pretty similar to me. Selecting a few architectures and enabling "Hide equal cols" will only show rows that differ, which is useful when looking at differences between microarchitectures.
I think I have some work to do here, as I can see that AVX-512 instructions that use `k` and `zmm` registers are not executed, but that would take me some time and it's not that high priority for me at the moment. In addition, I would really want to have the timings calibrated so they are precise, so there is a lot to do now :)
No need, I have to iterate over instruction signatures instead of doing what I do at the moment; asmjit now has all the information I need to do this properly in cult.
- I have fixed some issues regarding AVX-512 (now it properly tests all supported instructions with ZMM and K registers)
- I have fixed incorrect latency in some instructions that have different kinds of destination and source registers (like cvtsi2ss and friends)
- Also other issues I guess
There are still some things that are not quite right (for example, it's hard to test the latency of cmp, test, bt, and similar instructions, as the result is just flags. I will think of something; however, it's a minority of instructions so it's not that severe I think).
I have also added `get_tsc_freq()`, heavily inspired by your implementation, but I still don't know how to properly use the value to calculate correct clock cycles when turbo is active.
> I have fixed some issues regarding AVX-512 (now it properly tests all supported instructions with ZMM and K registers)
Cool! Would you like me to run it on any systems? In addition to the ones above I now have access to Zen 2 and Ice Lake.
> (for example, it's hard to test the latency of cmp, test, bt, and similar instructions, as the result is just flags.
Right. Have you seen what uops.info does? They consider each instruction to have a matrix of latencies, one for each combination of input and output. For a typical instruction like `add reg, reg` there are 2 inputs and 2 outputs (the destination register and the flag output), so there are actually 2x2 = 4 different possible latencies.
Here's cmp, and they show the latency to the flag output (which is 1 from either input in this case, but other cases are more interesting).
This is how I think of instruction latency now, although admittedly it often does simplify to a "single figure" for many instructions with N register inputs and 1 register output where the latency is the same for each input. Not all instructions fit that pattern though, particularly instructions with more than 1 uop.
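For what it's worth, one way to build such a chain for the flag output (not necessarily what uops.info or cult does; `bt_sbb_chain` is a made-up name) is to route the dependency through an instruction that reads the flag and writes a register, e.g. sbb, and then subtract sbb's known 1-cycle latency:

```cpp
// Sketch of measuring bt's register->flags latency (not cult's actual code):
// chain bt with sbb, which reads CF and writes the register bt reads next,
// so the loop-carried chain is lat(bt: reg->CF) + lat(sbb: CF->reg).
// Subtract sbb's known latency (1 cycle) to isolate bt. GCC/Clang, x86-64.
#include <cstdint>

static uint64_t bt_sbb_chain(uint64_t iters) {
    uint64_t x = 0, bit = 1;
    for (uint64_t i = 0; i < iters; i++) {
        asm volatile(
            "bt  %[bit], %[x]\n\t" // CF = bit 'bit' of x (reads x, writes CF)
            "sbb %[x], %[x]\n\t"   // x = x - x - CF (reads CF, writes x)
            : [x] "+r"(x)
            : [bit] "r"(bit)
            : "cc");
    }
    return x; // time this loop (e.g. with a cycle counter) and divide by
              // iters to get the round-trip latency per iteration
}
```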
> I have also added `get_tsc_freq()`, heavily inspired by your implementation, but I still don't know how to properly use the value to calculate correct clock cycles when turbo is active.
The TSC frequency alone doesn't do that; it just lets you convert `rdtsc` values into time units. To measure true CPU cycles there are several approaches. A reasonable one is a calibration like this one, which measures how long a loop taking a known number of cycles runs in real time (actually, that particular loop breaks on Ice Lake and Zen 2/3 because they can do 2 stores per cycle: an addition chain would be better), to allow conversion between real time and CPU cycles.
Then you run your benchmark, measure real time, and use the conversion factor to get cycles. Of course, this only works if the CPU frequency is the same during the calibration and the benchmark, which is not always the case. Approaches that are robust against that problem include (see the sketch after this list):
- Use a cycles performance counter (I have some examples)
- Use the APERF and MPERF MSRs (I think you can only read these directly as root, but they are also available via the perf subsystem)
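For the first option, here is a minimal Linux-only sketch using perf_event_open (the `open_cycles_counter`/`measure_cycles` helpers are made up; error handling omitted). Because it counts actual core cycles rather than wall-clock ticks, it stays correct under turbo:

```cpp
// Minimal Linux sketch: count actual core cycles with perf_event_open.
#include <cstdint>
#include <cstring>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_cycles_counter() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    // current process, any CPU, no group, no flags
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

// Usage sketch: returns core cycles spent in fn() (error handling omitted).
template <typename Fn>
uint64_t measure_cycles(Fn&& fn) {
    int fd = open_cycles_counter();
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t cycles = 0;
    read(fd, &cycles, sizeof(cycles));
    close(fd);
    return cycles;
}
```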
@travisdowns If you have time to run cult on any Intel hardware I would be interested in the results. I have updated cult to test more stuff, also memory ops, etc... There are still instructions where the latency is wrong (write-only memory ops don't create a dependency, for example), but these are things I will fix in the future and they don't bother me much, as you can clearly see in the results that those timings are impossible.
At the moment I only have a Zen 4 desktop and a Tiger Lake laptop, so any other arch would help me improve asmgrid, as I would have to delete all the previous tables.