lukego/blog

Tracing JITs and modern CPUs part 3: A bad case

lukego opened this issue · 1 comment

Let's look at a bad case for tracing JITs. The simple function we looked at in #6 worked really well but turns out to be quite fragile. Let us look at how to break it and see if we can learn something in the process.

I write this from my perspective as a humble application programmer who only needs to understand the fundamentals well enough to write efficient programs. I will not be offering any grand insights into compilers on this blog (though perhaps commenters will).

Background

Back in #6 I wrote a simple Lua module for Snabb Switch and found that the JIT compiled it surprisingly well and that the CPU executed the code surprisingly well too.

The code was a simple loop that bumps a counter one billion times. It is high-level Lua that even includes a hashtable lookup in the inner loop (looking up the add function from the counter table):

local counter = require("core.counter")

local n = 1e9
local c = counter.open("test")
for i = 1,n do
   counter.add(c, 1)
end

LuaJIT compiled this down to a trace: a linear block of machine code that contains at most one loop and otherwise no internal branches. The loop compiled down to five instructions:

->LOOP:
0bcafab0  add rbp, +0x01          ; increment the counter value
0bcafab4  mov [rcx+0x8], rbp      ; store the counter value back
0bcafab8  add ebx, +0x01          ; increment the loop index
0bcafabb  cmp ebx, eax            ; loop index <= n?
0bcafabd  jle 0x0bcafab0    ->LOOP ; yes: loop again

and I was pleasantly surprised when perf stat told me that my CPU executes all five of those instructions in one cycle:

     5,021,702,476 instructions              #    4.58  insns per cycle        
     1,097,518,679 cycles                   

That paints a very pretty picture. But it is easy to shake this up.

Just a little innocent change

Here is an innocent looking new version of the program that contains an if to count 1 when the loop index is even and 10 otherwise:

local counter = require("core.counter")

local n = 1e9
local c = counter.open("test")
for i = 1,n do
   if i % 2 == 0 then
      counter.add(c, 1)
   else
      counter.add(c, 10)
   end
end

How does this version run?

    36,029,310,879 instructions              #    2.38  insns per cycle        
    15,109,330,649 cycles                   

Oops! Now each iteration takes 15 cycles and executes 36 instructions. That is a 15x slowdown.

What happened?

The high-level explanation is actually straightforward. The first version runs fast because the loop executes very few instructions: most of the work like table lookups has been "hoisted" to run before entering the loop. The second version runs slowly because it is frequently repeating this setup work.
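To make the idea of "hoisting" concrete, here is a plain-Lua sketch of the transformation the JIT effectively performs. No Snabb is required to run it: the counter table below is a hypothetical stand-in for core.counter, not the real module.

```lua
-- Hypothetical stand-in for a Snabb counter: a table holding a value,
-- and a module table with an `add` function looked up by name.
local c = { value = 0 }
local counter = { add = function (c, n) c.value = c.value + n end }

-- What we write: the hashtable lookup of `counter.add` appears inside
-- the loop body.
for i = 1, 1000 do
   counter.add(c, 1)
end

-- What the compiled loop effectively does: the lookup (and its associated
-- guards) is loop-invariant, so it is hoisted out before the loop, leaving
-- only the increment in the hot path.
local add = counter.add   -- hoisted invariant work
for i = 1, 1000 do
   add(c, 1)
end

print(c.value)  -- both loops ran: 2000
```

The second loop is a rough source-level analogue of the five-instruction machine code loop: all the per-iteration work that can be proven invariant has moved above the loop.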

To understand what happens we can read the JIT dumps for the first version and second version and draw pictures of the flow of control. In these diagrams each box is a trace, i.e. a series of machine code instructions that execute from top to bottom. Branches are drawn as arrows and come in two kinds: a loop back into an earlier part of the same trace (at most one is allowed) or an exit to a different trace. The "hot" code that consumes the CPU is highlighted.

Here is a picture of the first version that runs fast:

[diagram: first version]

and below is the full machine code. I won't bother to read every instruction here. My observation is that the proportions match the diagram: quite a lot of instructions on entry and then a small number in the loop.

---- TRACE 3 mcode 212
0bcaf9f0  mov dword [0x41693410], 0x3
0bcaf9fb  movsd xmm7, [rdx+0x20]
0bcafa00  cvttsd2si eax, xmm7
0bcafa04  xorps xmm6, xmm6
0bcafa07  cvtsi2sd xmm6, eax
0bcafa0b  ucomisd xmm7, xmm6
0bcafa0f  jnz 0x0bca0010    ->0
0bcafa15  jpe 0x0bca0010    ->0
0bcafa1b  cmp eax, 0x7ffffffe
0bcafa21  jg 0x0bca0010 ->0
0bcafa27  cvttsd2si ebx, [rdx+0x18]
0bcafa2c  cmp dword [rdx+0x4], -0x0c
0bcafa30  jnz 0x0bca0010    ->0
0bcafa36  mov r8d, [rdx]
0bcafa39  cmp dword [r8+0x1c], +0x0f
0bcafa3e  jnz 0x0bca0010    ->0
0bcafa44  mov esi, [r8+0x14]
0bcafa48  mov rdi, 0xfffffffb4169d9e8
0bcafa52  cmp rdi, [rsi+0x140]
0bcafa59  jnz 0x0bca0010    ->0
0bcafa5f  cmp dword [rsi+0x13c], -0x09
0bcafa66  jnz 0x0bca0010    ->0
0bcafa6c  cmp dword [rdx+0x14], -0x0b
0bcafa70  jnz 0x0bca0010    ->0
0bcafa76  mov ecx, [rdx+0x10]
0bcafa79  cmp dword [rsi+0x138], 0x4172d7b0
0bcafa83  jnz 0x0bca0010    ->0
0bcafa89  movzx edx, word [rcx+0x6]
0bcafa8d  cmp edx, 0x4f9
0bcafa93  jnz 0x0bca0010    ->0
0bcafa99  mov rbp, [rcx+0x8]
0bcafa9d  add rbp, +0x01
0bcafaa1  mov [rcx+0x8], rbp
0bcafaa5  add ebx, +0x01
0bcafaa8  cmp ebx, eax
0bcafaaa  jg 0x0bca0014 ->1
->LOOP:
0bcafab0  add rbp, +0x01
0bcafab4  mov [rcx+0x8], rbp
0bcafab8  add ebx, +0x01
0bcafabb  cmp ebx, eax
0bcafabd  jle 0x0bcafab0    ->LOOP
0bcafabf  jmp 0x0bca001c    ->3
---- TRACE 3 stop -> loop

So what changes in the second version that causes the inner loop to expand from 5 instructions up to 36? Here is the picture:

[diagram: second version]

Now we have two traces: the original root trace and a new side trace. This is necessary because there is a branch (if) in our code and traces are not allowed to have internal branches. The root trace will internally handle the case when the loop index is even but it will exit to the side trace when the index is odd. The side trace then rejoins the root trace. The effect is that the loop alternates between staying inside the root trace and exiting to the side trace.

The picture also illustrates the two reasons why we execute so many instructions now. First, the side trace is bigger than the loop in the root trace (i.e. it contains more instructions). Second, when the side trace branches back to the root trace it re-enters at the top instead of taking a short-cut into the inner loop. This means that overall we execute more instructions.

Let us zoom in to a bit more detail: first to look at the inner loop in the root trace, then to look at the side trace, and finally to look at the complete root trace that is running every time the side trace branches back.

Here is the new loop in the root trace (with added comments):

->LOOP:
0bcafaa0  test ebx, 0x1               ; loop index is odd?
0bcafaa6  jnz 0x0bca0024    ->5   ; yes: exit this trace
0bcafaac  add rbp, +0x01
0bcafab0  mov [rcx+0x8], rbp
0bcafab4  add ebx, +0x01
0bcafab7  cmp ebx, eax
0bcafab9  jle 0x0bcafaa0    ->LOOP

The difference from the original trace is the two new instructions at the start. These test a guard for the trace (that the loop index must be even) and branch to an exit if this condition does not hold. So when the loop index happens to be even the execution will be very similar to the original version, but when the loop index is odd we will exit to the side trace.

Here is the code for the side trace.

---- TRACE 4 mcode 171
0bcaf922  mov dword [0x416ac410], 0x4
0bcaf92d  mov edx, [0x416ac4b4]
0bcaf934  cmp dword [rdx+0x4], -0x0c
0bcaf938  jnz 0x0bca0010    ->0
0bcaf93e  mov ebp, [rdx]
0bcaf940  cmp dword [rbp+0x1c], +0x0f
0bcaf944  jnz 0x0bca0010    ->0
0bcaf94a  mov ebp, [rbp+0x14]
0bcaf94d  mov rdi, 0xfffffffb416b69e8
0bcaf957  cmp rdi, [rbp+0x140]
0bcaf95e  jnz 0x0bca0010    ->0
0bcaf964  cmp dword [rbp+0x13c], -0x09
0bcaf96b  jnz 0x0bca0010    ->0
0bcaf971  cmp dword [rdx+0x14], -0x0b
0bcaf975  jnz 0x0bca0010    ->0
0bcaf97b  mov r15d, [rdx+0x10]
0bcaf97f  cmp dword [rbp+0x138], 0x413fe7b0
0bcaf989  jnz 0x0bca0010    ->0
0bcaf98f  movzx ebp, word [r15+0x6]
0bcaf994  cmp ebp, 0x4f9
0bcaf99a  jnz 0x0bca0010    ->0
0bcaf9a0  mov rbp, [r15+0x8]
0bcaf9a4  add rbp, +0x0a
0bcaf9a8  mov [r15+0x8], rbp
0bcaf9ac  add ebx, +0x01
0bcaf9af  cmp ebx, eax
0bcaf9b1  jg 0x0bca0014 ->1
0bcaf9b7  xorps xmm7, xmm7
0bcaf9ba  cvtsi2sd xmm7, ebx
0bcaf9be  movsd [rdx+0x30], xmm7
0bcaf9c3  movsd [rdx+0x18], xmm7
0bcaf9c8  jmp 0x0bcaf9d4
---- TRACE 4 stop -> 3

I have not read this code in detail but here are a couple of observations:

  1. It contains 32 instructions. This side trace will run on every second iteration of the loop and so it seems to account for about half of the 36 instructions per iteration reported by perf.
  2. The trace ends by branching to address 0bcaf9d4 which turns out to be the beginning of the root trace (not the inner loop).

Here finally is the entire root trace, this time including the initial code before the loop that executes when the side trace branches back:

---- TRACE 3 mcode 236
0bcaf9d4  mov dword [0x416ac410], 0x3
0bcaf9df  movsd xmm7, [rdx+0x20]
0bcaf9e4  cvttsd2si eax, xmm7
0bcaf9e8  xorps xmm6, xmm6
0bcaf9eb  cvtsi2sd xmm6, eax
0bcaf9ef  ucomisd xmm7, xmm6
0bcaf9f3  jnz 0x0bca0010    ->0
0bcaf9f9  jpe 0x0bca0010    ->0
0bcaf9ff  cmp eax, 0x7ffffffe
0bcafa05  jg 0x0bca0010 ->0
0bcafa0b  cvttsd2si ebx, [rdx+0x18]
0bcafa10  test ebx, 0x1
0bcafa16  jnz 0x0bca0014    ->1
0bcafa1c  cmp dword [rdx+0x4], -0x0c
0bcafa20  jnz 0x0bca0018    ->2
0bcafa26  mov r8d, [rdx]
0bcafa29  cmp dword [r8+0x1c], +0x0f
0bcafa2e  jnz 0x0bca0018    ->2
0bcafa34  mov esi, [r8+0x14]
0bcafa38  mov rdi, 0xfffffffb416b69e8
0bcafa42  cmp rdi, [rsi+0x140]
0bcafa49  jnz 0x0bca0018    ->2
0bcafa4f  cmp dword [rsi+0x13c], -0x09
0bcafa56  jnz 0x0bca0018    ->2
0bcafa5c  cmp dword [rdx+0x14], -0x0b
0bcafa60  jnz 0x0bca0018    ->2
0bcafa66  mov ecx, [rdx+0x10]
0bcafa69  cmp dword [rsi+0x138], 0x413fe7b0
0bcafa73  jnz 0x0bca0018    ->2
0bcafa79  movzx edx, word [rcx+0x6]
0bcafa7d  cmp edx, 0x4f9
0bcafa83  jnz 0x0bca0018    ->2
0bcafa89  mov rbp, [rcx+0x8]
0bcafa8d  add rbp, +0x01
0bcafa91  mov [rcx+0x8], rbp
0bcafa95  add ebx, +0x01
0bcafa98  cmp ebx, eax
0bcafa9a  jg 0x0bca001c ->3
->LOOP:
0bcafaa0  test ebx, 0x1
0bcafaa6  jnz 0x0bca0024    ->5
0bcafaac  add rbp, +0x01
0bcafab0  mov [rcx+0x8], rbp
0bcafab4  add ebx, +0x01
0bcafab7  cmp ebx, eax
0bcafab9  jle 0x0bcafaa0    ->LOOP
0bcafabb  jmp 0x0bca0028    ->6

The trace executes an additional 38 instructions before entering the loop. This path will be taken on every second loop iteration, when the exit to the side trace is taken and it branches back to the top. That would seem to account for the rest of the instructions reported by perf.

So...?

If I were a compiler expert then this is where I would explain why the code compiles in this way and provide interesting links to all the relevant research. But I am not. So all I can really state are my own personal observations.

  1. The trace structure seems to be significant. The first version was awesomely fast because it kept the CPU on the same trace for an extended period of time.
  2. Microbenchmarks can be very hard to interpret. How do you decide whether a particular programming construct -- for example the counter.add(...) module lookup -- is cheap or expensive? The answer seems very context dependent.
  3. Good profiling tools and a strong mental model are probably critical to achieving predictably high performance.
  4. Now I understand the LuaJIT numerical computing optimization tip "Reduce unbiased branches. Heavily biased branches (>95% in one direction) are fine." If we were only executing the side trace on 1/20th of the iterations then the cost would not be a problem.
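To illustrate that last point, here is a sketch of a heavily biased variant of the benchmark. It is plain Lua with an inline accumulator standing in for counter.add, so it runs without Snabb; the point is only the branch shape. The rare case is now taken on one iteration in twenty, so the expensive side-trace path would run on 5% of iterations instead of 50%.

```lua
local n = 1000
local total = 0
for i = 1, n do
   if i % 20 == 0 then
      total = total + 10   -- rare case: 5% of iterations would exit to the side trace
   else
      total = total + 1    -- common case: stays inside the root trace's inner loop
   end
end
print(total)  -- 50 iterations add 10 and 950 add 1: 1450
```

With the real Snabb counters, the equivalent change would be testing `i % 20 == 0` instead of `i % 2 == 0`; the guard in the root trace would then fail rarely, keeping the CPU on the fast five-plus-two instruction loop most of the time.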

I find this all very interesting. The issues are subtle and complex, but most languages and platforms are subtle and complex to optimize when you first start taking them seriously. I am happy to keep geeking out on these ideas.

Thanks

Thanks to Alex Gall and Andy Wingo for mentioning the issue of side-traces re-entering their parent traces at the root, so that I could recognise it when I saw it too.

Shouldn't the jump from the root trace to the side trace be here?

0bcafa10  test ebx, 0x1
0bcafa16  jnz 0x0bca0014    ->1

It should not be in the LOOP part.

The root trace should be patched (changing "jnz 0x0bca0014" to "jnz 0x0bcaf922", the entry address of the side trace) after the side trace is created.