faster-cpython/ideas

Investigation: Optimizer/executor/traces are behaving strangely

Fidget-Spinner opened this issue · 23 comments

I decided to benchmark nbody using pyperf on my computer.

When I turn on the optimizer (optimize_uops) in CPython main + JIT + Guido's exponential backoff and GHCCC, it is 3-10% slower than with it off.

However, with the above settings and the trace length reduced from 800 to 200, optimizer on is 2% faster than off.

My hunch is that the trace lengths are too long to be worth optimizing right now. What I think is happening is that we are falling off trace way too early. Say for a trace of 800, the optimizer abstract interprets all 800 bytecode instructions. Whereas in reality, maybe only 40 of the instructions are executed before we fall off in most cases. In the above example, the instructions that are worth optimizing are only 40 out of the 800 instructions.

We should seriously reconsider reducing trace length.

In other news, I tried encouraging inlining of all optimizer functions by placing them in the same compilation unit. No speedup there with LTO. So that means LTO is already doing a good job.

Hmm, more benchmarking reveals it's still slower with the reduced trace length... the implication is that the optimizer is doing nothing. (This is on a JIT build.)

Edit: I meant to say that the build with the JIT is slower or the same with the smaller trace length.

I decided to build without JIT because it was confounding the information. Here is just Tier 2 alone:
Long traces (800, current on main):

Optimizer off vs Optimizer on:
pyperf compare_to ../opt_off_long_t2.json ../opt_on_long_t2.json 
Mean +- std dev: [opt_off_long_t2] 148 ms +- 2 ms -> [opt_on_long_t2] 143 ms +- 2 ms: 1.03x faster

Medium traces (400, set by me):

pyperf compare_to ../opt_off_medium_t2.json ../opt_on_medium_t2.json 
Mean +- std dev: [opt_off_medium_t2] 151 ms +- 2 ms -> [opt_on_medium_t2] 138 ms +- 2 ms: 1.09x faster

It seems that medium traces are indeed faster for the optimizer than long traces. I thus propose we reduce trace length to 400.

Benchmarked with

./python ../bm_nbody.py -w 20 -o

(i.e. 20 warmup iterations)

Sounds good to me. Which constant is currently 800? Last time I looked it was a power of two?

EDIT: I'm not sure I understand why shorter trace lengths make things faster. See #669 (comment) and ff.

Sounds good to me. Which constant is currently 800? Last time I looked it was a power of two?

The trace length. It got changed to 800 after Mark introduced hot-cold splitting.

I wonder if we could prove that the optimizer is the bottleneck and what part of it. Maybe we could artificially increase the number of times we run the optimizer, and then apply perf to the process to get maybe a flame graph or something?

I don't think there's anything in the optimizer that's quadratic, but we could very well have long traces where execution quickly deoptimizes.

Another thought: Until gh-117144 is merged, we might be wasting a lot of time unsuccessfully optimizing the same thing over and over, for lack of exponential back-off for side exits.

Another thought: Until gh-117144 is merged, we might be wasting a lot of time unsuccessfully optimizing the same thing over and over, for lack of exponential back-off for side exits.

These results are with your PR applied :). Without it, I suspect it's much worse.

These results are with your PR applied :). Without it, I suspect it's much worse.

It was worse before, when there was an issue that would cause the abstract interpreter to fail when it encountered a generator expression. But that was fixed by gh-117028. So it might not actually be so bad (now it's mostly short traces that get done over and over).

Another approach would be to have a more realistic way to calculate the on-trace confidence while we are projecting. But this is tricky. E.g., what probability do you assign to _GUARD_BOTH_INT passing? Is that something we have in the pystats data?
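
For what it's worth, here is a minimal sketch of what such a confidence heuristic could look like, as a standalone toy rather than the real translate_bytecode_to_trace(). The cutoff, the per-guard probabilities, and the names (ProjectedOp, project_trace, CONFIDENCE_CUTOFF) are all invented, and where the probabilities would really come from (pystats?) is exactly the open question:

/* Toy sketch of confidence-based projection, not the real
 * translate_bytecode_to_trace(): each projected uop carries an estimated
 * probability that its guard passes (pure bookkeeping ops get 1.0), and we
 * stop projecting once the cumulative probability of still being on-trace
 * drops below a cutoff.  All names and numbers here are made up. */
#include <stdio.h>

#define CONFIDENCE_CUTOFF 0.33  /* hypothetical threshold */

typedef struct {
    const char *name;
    double pass_probability;    /* could come from pystats, if we collected it */
} ProjectedOp;

static int
project_trace(const ProjectedOp *ops, int n)
{
    double confidence = 1.0;
    int length = 0;
    for (int i = 0; i < n; i++) {
        confidence *= ops[i].pass_probability;
        if (confidence < CONFIDENCE_CUTOFF) {
            printf("stop projecting before %s (confidence %.2f)\n",
                   ops[i].name, confidence);
            break;
        }
        length++;
    }
    return length;
}

int
main(void)
{
    ProjectedOp ops[] = {
        {"_GUARD_BOTH_FLOAT", 0.9},
        {"_BINARY_OP_MULTIPLY_FLOAT", 1.0},
        {"_GUARD_IS_TRUE_POP", 0.6},
        {"_GUARD_BOTH_FLOAT", 0.9},
        {"_GUARD_IS_TRUE_POP", 0.6},
    };
    printf("projected length: %d\n",
           project_trace(ops, (int)(sizeof(ops) / sizeof(ops[0]))));
    return 0;
}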

We talked about this during our team meeting (sans Mark). I'm now very curious what's happening in nbody that is made faster by limiting the trace size. I'm going to spike on trying to find exactly which traces are being created and where they deoptimize in the stable state of the algorithm (which looks to be extremely stable -- not an if in sight. :-)

Lab notes from my M3 Mac.

I copied the nbody benchmark into a file in my repo root, nbody.py (this requires installing pyperf using the local python.exe), and ran variations of command lines like this:

PYTHON_UOPS=1 PYTHON_LLTRACE=1 ./python.exe nbody.py --inherit-environ PYTHON_UOPS,PYTHON_LLTRACE

First observations:

  • A lot of traces are created for loops in the pyperf infrastructure. Not sure if this matters -- presumably these are all outside the timed region. (Slight worry: possibly one of these could trace into the nbody code and cause a distraction there?)

  • It looks like each of 6 traces is created 21 times (presumably one warmup and 20 timed runs). Apparently the pyperf infrastructure actually creates 20 processes to run the benchmark. If I add -p 1 I only see 12 traces created for nbody.py (each one twice -- I guess one for warmup, one timed run).

  • There are two traces that are (initially) longer than 200 instructions: the first and third one created for advance(). One is 269, one is 279. These would be truncated (and then presumably split) by setting the max trace length to 200. None would be affected by setting it to 400 (and there are no traces longer than 400 created for pyperf either).

  • With PYTHON_LLTRACE=2 I see DEOPT log messages pointing to STORE_SUBSCR_LIST_INT.

DEOPT: [UOp _DEOPT (0, target=181, operand=0) -> STORE_SUBSCR_LIST_INT]
DEOPT: [UOp _DEOPT (0, target=17, operand=0) -> STORE_SUBSCR_LIST_INT]
(these two repeated many times)

This is suspicious! There are quite a few STORE_SUBSCR_LIST_INT opcodes in this trace:

Optimizing advance (/Users/guido/cpython/nbody.py:78) at byte offset 350

The first one is

93: STORE_SUBSCR_LIST_INT(0)
 137 ADD_TO_TRACE: _STORE_SUBSCR_LIST_INT (0, target=93, operand=0)
 138 ADD_TO_TRACE: _CHECK_VALIDITY_AND_SET_IP (0, target=95, operand=0x12f8181a6)

Looking at the surrounding code in the disassembly:

 88        158       LOAD_FAST                8 (v1)
           160       LOAD_CONST               2 (0)
           162       COPY                     2
           164       COPY                     2
           166       BINARY_SUBSCR
           170       LOAD_FAST               15 (dx)
           172       LOAD_FAST               20 (b2m)
           174       BINARY_OP                5 (*)
           178       BINARY_OP               23 (-=)
           182       SWAP                     3
           184       SWAP                     2
           186       STORE_SUBSCR

Here's line 88 in nbody.py, in advance():

            v1[0] -= dx * b2m

Honestly that doesn't make a lot of sense to me, maybe I've got my offsets mixed up. More later.

Continued lab notes:

  • DEOPT: [UOp _DEOPT (0, target=17, operand=0) -> STORE_SUBSCR_LIST_INT] comes from _ITER_NEXT_LIST (expanded from FOR_ITER_LIST) at the top of the loop at line 80 (the first inner loop in advance(), for ... in pairs).
  • DEOPT: [UOp _DEOPT (0, target=181, operand=0) -> STORE_SUBSCR_LIST_INT] comes from the same sequence at a different bytecode offset -- the second inner loop, at line 94, for ... in bodies.
  • So there is actually nothing to worry about here: these are expected. The only mystery is why they claim to point to STORE_SUBSCR_LIST_INT -- this is probably a mistake in the debugging output, showing the last instruction that didn't hit a DEOPT_IF or EXIT_IF.

Going back to @Fidget-Spinner's original investigation, I don't see how reducing UOP_MAX_TRACE_LENGTH from 800 to 400 could make a difference, since the longest trace in nbody has an unoptimized length of 279. I do see how reducing it to 200 makes a difference, since that would truncate the two longest traces, and then additional traces would eventually be created -- but I don't see how that would make things slower unless the JIT is still slower than Tier 1. (I have to try this and check the debug output.)

I certainly now disagree with @Fidget-Spinner's hypothesis:

My hunch is that the trace lengths are too long to be worth optimizing right now. What I think is happening is that we are falling off trace way too early. Say for a trace of 800, the optimizer abstract interprets all 800 bytecode instructions. Whereas in reality, maybe only 40 of the instructions are executed before we fall off in most cases. In the above example, the instructions that are worth optimizing are only 40 out of the 800 instructions.

Because looking at the traces and the DEOPT logs printed by PYTHON_LLTRACE=2, there's nothing that exits the trace prematurely (other than, eventually, falling off the end of a for loop over either pairs or bodies). There are no if statements, just float and list operations.

Next I'm going to change my nbody.py to turn the optimizer on only around bench_nbody(), since I am getting annoyed at the traces for the pyperf infra (which isn't so easy to filter out in the PYTHON_LLTRACE debug output).

However, the experiment shows otherwise: reducing UOP_MAX_TRACE_LENGTH from 800 to 400 does change the created traces. Here's the original set of traces, with 800:

Created a proto-trace for advance (/Users/guido/cpython/nbody.py:94) at byte offset 34 -- length 269
Created a proto-trace for advance (/Users/guido/cpython/nbody.py:94) at byte offset 362 -- length 100
Created a proto-trace for advance (/Users/guido/cpython/nbody.py:94) at byte offset 24 -- length 279
Created a proto-trace for report_energy (/Users/guido/cpython/nbody.py:116) at byte offset 6 -- length 101
Created a proto-trace for report_energy (/Users/guido/cpython/nbody.py:116) at byte offset 128 -- length 65
Created a proto-trace for offset_momentum (/Users/guido/cpython/nbody.py:128) at byte offset 6 -- length 59

And here's the new set, with 400:

Created a proto-trace for advance (/Users/guido/cpython/nbody.py:94) at byte offset 34 -- length 260
Created a proto-trace for advance (/Users/guido/cpython/nbody.py:94) at byte offset 338 -- length 11
Created a proto-trace for advance (/Users/guido/cpython/nbody.py:94) at byte offset 362 -- length 100
Created a proto-trace for advance (/Users/guido/cpython/nbody.py:94) at byte offset 24 -- length 259
Created a proto-trace for advance (/Users/guido/cpython/nbody.py:94) at byte offset 324 -- length 23
Created a proto-trace for report_energy (/Users/guido/cpython/nbody.py:116) at byte offset 6 -- length 101
Created a proto-trace for report_energy (/Users/guido/cpython/nbody.py:116) at byte offset 128 -- length 65
Created a proto-trace for offset_momentum (/Users/guido/cpython/nbody.py:128) at byte offset 6 -- length 59

Note the two runts of length 11 and 23, which are undoubtedly created by the side exits.

So with a claimed max length of 400, the actual max length is roughly 260. This is surprising.

The explanation is that translate_bytecode_to_trace() deducts one from its max_trace_length for each input bytecode:

        // Need space for _DEOPT
        max_length--;

In addition, it deducts one for each opcode that has an exit or an error. See also this comment in optimizer.c:

/* It is impossible for the number of exits to reach 1/4 of the total length,
 * as the number of exits cannot reach 1/3 of the number of non-exits, due to
 * the presence of CHECK_VALIDITY checks and instructions to produce the values
 * being checked in exits. */
#define COLD_EXIT_COUNT (UOP_MAX_TRACE_LENGTH/4)
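
To make that accounting concrete, here is a rough standalone model of the budget (the effective_length() helper and the per-bytecode costs are simplified assumptions, not the actual optimizer.c logic):

/* Rough model of the budget accounting described above: each input bytecode
 * reserves one slot for a possible _DEOPT stub, each exit-bearing uop
 * reserves one more, and only the remaining budget holds uops that end up
 * in the trace body.  The per-bytecode costs are simplified assumptions. */
#include <stdio.h>

static int
effective_length(int budget, int uops_per_bytecode, int exits_per_bytecode)
{
    int emitted = 0;
    while (budget >= 1 + exits_per_bytecode + uops_per_bytecode) {
        budget -= 1;                   /* reserve space for _DEOPT            */
        budget -= exits_per_bytecode;  /* reserve space for exit/error stubs  */
        budget -= uops_per_bytecode;   /* uops actually written to the buffer */
        emitted += uops_per_bytecode;
    }
    return emitted;
}

int
main(void)
{
    /* Assuming ~2-3 uops per bytecode (_CHECK_VALIDITY_AND_SET_IP + body)
     * and roughly one exit-bearing uop per bytecode: */
    printf("budget 800 -> ~%d uops\n", effective_length(800, 2, 1));
    printf("budget 400 -> ~%d uops\n", effective_length(400, 2, 1));
    printf("budget 400 -> ~%d uops\n", effective_length(400, 3, 1));
    return 0;
}

The exact ratio depends on how many uops each bytecode expands to and how many of them carry exits, but it is in the same ballpark as the ~260 observed above for a nominal 400.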

Now let's repeat the experiment with UOP_MAX_TRACE_LENGTH set to 200.

Surprise! It crashes with

Assertion failed: (trace_length < max_length), function translate_bytecode_to_trace, file optimizer.c, line 881.

This looks like a failure in the space reservation algorithm, which could hit us in other ways too, so I should figure out the root cause (probably an insufficient RESERVE() call). But it feels like a possible explanation of @Fidget-Spinner's comment:

Hmm, more benchmarking reveals it's still slower with the reduced trace length... the implication is that the optimizer is doing nothing. (This is on a JIT build.)

Could it have silently overflowed the buffer (assuming his JIT build did not have --with-pydebug)? Or silently bailed completely?

I notice that in his later experiments he used 400, not 200.
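
Back to that assertion for a moment: here is a toy standalone model of how an insufficient reservation can slip past the graceful out-of-space check and only get caught by the final sanity assertion. The reserve()/add_to_trace()/emit_guarded_op() helpers and the tiny budget are simplified stand-ins, not the real optimizer.c macros, and the under-reservation is planted deliberately:

/* Toy model of the reservation pattern and the failing assertion.  The
 * deliberate bug: one code path reserves space for 2 uops but emits 3, so
 * with a small enough budget the end-of-translation check is violated. */
#include <stdio.h>

#define MAX_LENGTH 14   /* deliberately tiny budget */

static int trace_length = 0;
static int max_length = MAX_LENGTH;
static int out_of_space = 0;

static int
reserve(int n)
{
    if (trace_length + n > max_length) {
        out_of_space = 1;  /* normally we'd end the trace gracefully here */
        return 0;
    }
    return 1;
}

static void
add_to_trace(const char *name)
{
    (void)name;
    trace_length++;
}

static void
emit_guarded_op(void)
{
    if (!reserve(2)) {   /* BUG: we actually emit 3 uops below */
        return;
    }
    add_to_trace("_GUARD_BOTH_FLOAT");
    add_to_trace("_BINARY_OP_ADD_FLOAT");
    add_to_trace("_CHECK_VALIDITY");
}

int
main(void)
{
    for (int i = 0; i < 10 && !out_of_space; i++) {
        emit_guarded_op();
    }
    printf("trace_length=%d, max_length=%d\n", trace_length, max_length);
    if (!(trace_length < max_length)) {
        printf("bookkeeping violated -- this is the condition the real "
               "assert checks\n");
    }
    return 0;
}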

Next, let me go back to 400 and try to understand how effective the side exit warming is. (Note: I'm doing everything in my exp-backoff branch, i.e. gh-117144.)

Okay, with 800, no message starting with SIDE EXIT: is ever printed, so we're never taking side exits. Instead, we're deoptimizing whenever a for-loop terminates.

Aside: what happens on DEOPT_IF(x)?

  • This is only used for guards that we expect to fail very rarely (?)
  • The code generator changes this to if (x) JUMP_TO_JUMP_TARGET()
  • That macro expands to goto jump_to_jump_target
  • That label extracts the jump_target field from the last uop, sets next_uop to that target in the current trace, and goes to tier2_dispatch, i.e. executes the stub instruction.
  • The stub instruction is _DEOPT, whose body is EXIT_TO_TIER1().
  • That macro expands to goto exit_to_tier1
  • There we find a debug message printing DEOPT: (etc.) and then we go back to Tier 1

And what happens on EXIT_IF(x)?

  • This is used for guards we expect might fail, including branches (e.g. _GUARD_IS_TRUE_POP), type guards (e.g. _GUARD_BOTH_INT), and _EXIT_TRACE
  • The code generator turns EXIT_IF(x) into if (x) JUMP_TO_JUMP_TARGET()
  • That expands to goto jump_to_jump_target
  • The dispatch goes the same way, but it ends up with a different stub instruction, _SIDE_EXIT
  • This calls EXIT_TO_TRACE()
  • That expands to goto exit_to_trace
  • There we find a debug message printing SIDE EXIT: (etc.) and then we continue in Tier 2
  • Presumably this then goes to _COLD_EXIT where we decrement the temperature and either go to Tier 1, or, if the temperature hits zero (in my PR), look for an ENTER_EXECUTOR bytecode, or try to optimize
  • Either way, if successful, we change exit->executor to the new executor, and start executing it
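
To summarize the two paths, here is a standalone toy model of the dispatch described above. The real interpreter does this with generated code and goto labels (jump_to_jump_target, exit_to_tier1, exit_to_trace); the struct layout, run_trace(), and the numbers below are made up for illustration, and the temperature handling that really lives in _COLD_EXIT is folded into the _SIDE_EXIT stub here:

/* Standalone toy model of the DEOPT_IF and EXIT_IF exit paths.  The same
 * control flow as the generated goto-based code, modelled with a small
 * dispatch loop.  All names and numbers are illustrative. */
#include <stdio.h>

typedef enum { OP_WORK, OP_GUARD, OP_DEOPT, OP_SIDE_EXIT, OP_JUMP_TO_TOP } OpKind;

typedef struct {
    OpKind kind;
    int fails_after;   /* a guard starts failing after this many passes */
    int passes;
    int jump_target;   /* index of the stub reached via JUMP_TO_JUMP_TARGET() */
} UOp;

typedef struct {
    int temperature;   /* side exits warm up before being (re)optimized */
} Exit;

static void
run_trace(UOp *trace, Exit *side_exit)
{
    int pc = 0;
    int iterations = 0;
    for (;;) {
        UOp *op = &trace[pc];
        switch (op->kind) {
        case OP_WORK:
            pc++;
            break;
        case OP_GUARD:
            if (op->passes >= op->fails_after) {
                pc = op->jump_target;      /* JUMP_TO_JUMP_TARGET() */
            }
            else {
                op->passes++;
                pc++;
            }
            break;
        case OP_DEOPT:                     /* stub for DEOPT_IF guards */
            printf("DEOPT after %d iterations: back to Tier 1\n", iterations);
            return;
        case OP_SIDE_EXIT:                 /* stub for EXIT_IF guards */
            side_exit->temperature--;
            printf("SIDE EXIT after %d iterations: stay in Tier 2, "
                   "temperature now %d\n", iterations, side_exit->temperature);
            return;
        case OP_JUMP_TO_TOP:
            iterations++;
            pc = 0;
            break;
        }
    }
}

int
main(void)
{
    Exit side_exit = { .temperature = 2 };
    UOp trace[] = {
        { .kind = OP_WORK },
        { .kind = OP_GUARD, .fails_after = 1000, .jump_target = 4 }, /* rarely fails */
        { .kind = OP_GUARD, .fails_after = 10,   .jump_target = 5 }, /* loop exhaustion */
        { .kind = OP_JUMP_TO_TOP },
        { .kind = OP_DEOPT },
        { .kind = OP_SIDE_EXIT },
    };
    run_trace(trace, &side_exit);
    return 0;
}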

More later.

I hesitate to change the issue title, but I think the situation is quite different.

I just spent an hour and a half with @brandtbucher going over this. We found a few things.

  • We should probably change _GUARD_NOT_EXHAUSTED_LIST (and the two other similar uops for range and tuple) to use EXIT_IF instead of DEOPT_IF. This requires changing something in prepare_for_execution() because that currently assumes that there is only one stub (exit OR deopt) per Tier 1 bytecode -- there are other guards in expansions of FOR_ITER that use DEOPT_IF for a legitimate reason (e.g. DEOPT_IF(Py_TYPE(iter) != &PyListIter_Type); in _ITER_CHECK_LIST). (A quick hack replaces if (target != current_jump_target) with if (1).)

  • At least when I run the nbody benchmark, I find that loops = 1, reference = 'sun', and iterations = 20000. Moreover, the length of pairs is 10 and the length of bodies is 5.

  • Let's name the loops in advance() (A), (B) and (C): outer loop (A), first inner loop (B), second inner loop (C). There's an executor for each, starting at the JUMP_BACKWARD instruction.

  • The first loop to be optimized is (B), which iterates over the 10 pairs, doing a lot of float ops (no branches or calls). Next is (C), looping over the 5 bodies, doing less work; it easily fits in even the smallest trace size contemplated. Finally, (A) has many more iterations (20000), and soon it is optimized into something that also covers the first iteration.

  • With the larger UOP_MAX_TRACE_LENGTH value (800), all traces fit, and a single trace ending in _JUMP_TO_TOP is created for each.

  • With the smaller value (400), the traces for (A) and (B) don't fit, and each ends somewhere in the middle of the iteration; (A) a bit earlier than (B).

  • They no longer contain _JUMP_TO_TOP, but instead end in _EXIT_TRACE, which eventually is stitched to "remainder" traces (A1) and (B1). The remainder traces are then stitched to the top of the corresponding main trace. (We think.)

  • We explored a few ways in which the version using a single perfect large trace ending in _JUMP_TO_TOP could somehow be slower than the version that stitches (A) to (A1) and back to (A).

  • Eventually @brandtbucher hypothesized that the version with a single perfect trace calls CHECK_EVAL_BREAKER() for every iteration of every loop (i.e., 20000 * (10 + 5) == 300000 times), whereas in the stitching version it is called basically never (only a few times while the first few outer iterations warm up).

  • This is just a theory so far. I will try to confirm by building with JIT, running the nbody benchmark using pyperformance, and then rebuilding with the CHECK_EVAL_BREAKER() simply removed from _JUMP_TO_TOP.

Alas, that theory (that the JIT is faster with stitched traces than with a single trace per loop because of CHECK_EVAL_BREAKER()) isn't borne out. I tried a max trace length of 800 with two variants of the code:

  • Regular: as in my exp-backoff branch
  • Modified: with the CHECK_EVAL_BREAKER() commented out in _JUMP_TO_TOP, and running make regen-cases

I then ran the nbody benchmark, like this, on my Mac M3 (EDIT: see later, I wasn't testing right):

pyperformance run -p ~/cpython/python.exe -b nbody

Alas, I get about the same speed (within variance):

  • Regular: Mean +- std dev: 54.0 ms +- 0.4 ms
  • Modified: Mean +- std dev: 53.9 ms +- 0.2 ms

Next I tried the "regular" version with UOP_MAX_TRACE_LENGTH set to 400 and 200, and it's slower:

  • Max 400: Mean +- std dev: 55.9 ms +- 0.2 ms
  • Max 200: Mean +- std dev: 56.8 ms +- 0.3 ms

Maybe @Fidget-Spinner can repeat these experiments? Perhaps things are different on Linux or on x86.

The experiments are on Tier 2 without the JIT; Tier 2 with the JIT shows no difference in the numbers.

Note: you can run nbody without full pyperformance installed to save yourself some time as long as you have pyperf. Just run ./python.exe ./bm_nbody.py -w 20 -o foo.json after running pyperf system tune.

I started this thread because I was curious why JIT with abstract interpreter off vs on had almost no difference. That whole issue seems to be fixed by Guido's current reimplemented tier 2 counters branch:

Main with abstract interpreter on vs off.

# warmups = 20
Mean +- std dev: [main_w_20_opt] 106 ms +- 1 ms -> [main_w_20_no_opt] 109 ms +- 2 ms: 1.03x slower

Which is basically no change (I take anything below 4-5% on benchmarks to be attributable to noise). So I was puzzled whether the abstract interpreter even does anything.

Guido's branch with reimplemented tier 2 counters. Abstract interpreter on vs off:

warmups = 20
Mean +- std dev: [backoff_w_20_opt] 103 ms +- 2 ms -> [backoff_w_20_no_opt] 109 ms +- 3 ms: 1.06x slower

So I guess we can close this issue with Guido's PR merging soon. I am really, really sorry if I miscommunicated anything or led you down a rabbit hole, @gvanrossum. Sorry that I wasn't clear either, by vaguely saying "optimizer", which could mean Tier 2, uops, the abstract interpreter, or all optimization passes. It seems your PR does fix my issue.

I have tried to repro @Fidget-Spinner's results. In particular, all combinations of these two diffs:

diff --git a/Include/internal/pycore_optimizer.h b/Include/internal/pycore_optimizer.h
index 44cafe61b75..cde1577d20c 100644
--- a/Include/internal/pycore_optimizer.h
+++ b/Include/internal/pycore_optimizer.h
@@ -12,7 +12,7 @@ extern "C" {
 #include <stdbool.h>
 
 // This is the length of the trace we project initially.
-#define UOP_MAX_TRACE_LENGTH 800
+#define UOP_MAX_TRACE_LENGTH 400
 
 #define TRACE_STACK_SIZE 5
 
diff --git a/Python/optimizer_analysis.c b/Python/optimizer_analysis.c
index 6f553f8ab8a..081b36fc04f 100644
--- a/Python/optimizer_analysis.c
+++ b/Python/optimizer_analysis.c
@@ -602,9 +602,9 @@ _Py_uop_analyze_and_optimize(
 
     peephole_opt(frame, buffer, length);
 
-    length = optimize_uops(
-        _PyFrame_GetCode(frame), buffer,
-        length, curr_stacklen, dependencies);
+    // length = optimize_uops(
+    //     _PyFrame_GetCode(frame), buffer,
+    //     length, curr_stacklen, dependencies);
 
     if (length <= 0) {
         return length;

I used ./configure -C.

I was on my exp-backoff branch, at revision 8eee1b40bc7.

A typical command-line was:

(.venv) ~/cpython$ PYTHON_UOPS=0 pyperformance run -p ~/cpython/python.exe -b nbody --inherit-environ PYTHON_UOPS

For each combination, I ran with PYTHON_UOPS=0 (Tier 1 only) and PYTHON_UOPS=1 (Tier 2 interpreter).

The Tier 1 speed is unaffected by the various choices; here's a typical measurement:

  • Tier 1: Mean +- std dev: 57.2 ms +- 1.4 ms

The rest of my measurements are all for PYTHON_UOPS=1 (Tier 2 interpreter).

With max trace 800:

  • Optimizer on: Mean +- std dev: 90.3 ms +- 0.5 ms
  • Optimizer off: Mean +- std dev: 95.4 ms +- 0.5 ms

With max trace 400:

  • Optimizer on: Mean +- std dev: 93.0 ms +- 0.3 ms
  • Optimizer off: Mean +- std dev: 98.7 ms +- 0.5 ms

My conclusion, limited to nbody:

  • The Tier 2 interpreter is way slower than Tier 1
  • Lowering max trace from 800 to 400 makes the interpreter slower
  • Disabling the optimizer (i.e., the abstract interpreter) makes it slower
  • The effects are additive

I don't see any evidence that reducing the max trace length is beneficial, nor that the optimizer makes things worse.

PS. The JIT numbers I reported above show that the JIT has a slight positive effect for nbody:

  • JIT: Mean +- std dev: 54.0 ms +- 0.4 ms

Before I let it rest I'll have a look at whether the optimizer does much for the nbody benchmark.

A quick analysis shows that the optimizer removed a bunch of _GUARD_BOTH_FLOAT uops and their corresponding side exits. Where the optimized list has 193 entries, the unoptimized (but still peepholed) list has 215, including stubs.

So the optimizer does its job!
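
For the record, here is a toy model of the kind of guard the abstract interpreter can drop here: a _GUARD_BOTH_FLOAT whose operands are already known to be floats, either because an earlier guard checked them or because they are results of float operations. This is a greatly simplified standalone stand-in for optimize_uops(), and the uop fragment (with made-up value names like t1 and t2) is only loosely modeled on the v1[0] -= dx * b2m line quoted earlier:

/* Toy redundancy-elimination pass: track which values are known to be
 * floats and drop _GUARD_BOTH_FLOAT uops whose operands are already known.
 * Not the real abstract interpreter; all names are illustrative. */
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

typedef struct {
    const char *op;
    const char *a, *b;   /* operands */
    const char *res;     /* result value, if any */
} ToyUOp;

#define MAX_KNOWN 16
static const char *known_float[MAX_KNOWN];
static int n_known = 0;

static bool
is_known_float(const char *name)
{
    for (int i = 0; i < n_known; i++) {
        if (strcmp(known_float[i], name) == 0) {
            return true;
        }
    }
    return false;
}

static void
mark_float(const char *name)
{
    if (name != NULL && !is_known_float(name) && n_known < MAX_KNOWN) {
        known_float[n_known++] = name;
    }
}

int
main(void)
{
    /* A fragment loosely shaped like `v1[0] -= dx * b2m` after lowering. */
    ToyUOp trace[] = {
        { "_GUARD_BOTH_FLOAT",          "dx",    "b2m", NULL },
        { "_BINARY_OP_MULTIPLY_FLOAT",  "dx",    "b2m", "t1" },
        { "_GUARD_BOTH_FLOAT",          "v1[0]", "t1",  NULL },
        { "_BINARY_OP_SUBTRACT_FLOAT",  "v1[0]", "t1",  "t2" },
        { "_GUARD_BOTH_FLOAT",          "t2",    "dx",  NULL }, /* redundant */
    };
    int n = (int)(sizeof(trace) / sizeof(trace[0]));

    for (int i = 0; i < n; i++) {
        ToyUOp *u = &trace[i];
        bool is_guard = strcmp(u->op, "_GUARD_BOTH_FLOAT") == 0;
        if (is_guard && is_known_float(u->a) && is_known_float(u->b)) {
            printf("drop  %s (%s, %s)\n", u->op, u->a, u->b);
            continue;
        }
        if (is_guard) {
            /* The guard establishes that both operands are floats. */
            mark_float(u->a);
            mark_float(u->b);
        }
        else if (strstr(u->op, "_FLOAT") != NULL) {
            /* A float operation produces a float result. */
            mark_float(u->res);
        }
        printf("keep  %s (%s, %s)\n", u->op, u->a, u->b);
    }
    return 0;
}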

  • Modified: with the CHECK_EVAL_BREAKER() commented out in _JUMP_TO_TOP, and running make regen-cases

Oof, sorry. _JUMP_TO_TOP is handled as a special case in the JIT. So you'd actually need to comment out the eval breaker check in template.c and rebuild the JIT to skip it.

Okay, will try that.

JIT with this diff (and everything else unchanged):

diff --git a/Tools/jit/template.c b/Tools/jit/template.c
index 351bc2f3dd4..1b5ddf98565 100644
--- a/Tools/jit/template.c
+++ b/Tools/jit/template.c
@@ -97,7 +97,7 @@ _JIT_ENTRY(_PyInterpreterFrame *frame, PyObject **stack_pointer, PyThreadState *
 
     // The actual instruction definitions (only one will be used):
     if (opcode == _JUMP_TO_TOP) {
-        CHECK_EVAL_BREAKER();
+        // CHECK_EVAL_BREAKER();
         PATCH_JUMP(_JIT_TOP);
     }
     switch (opcode) {

First and second run (same setup):

  • Mean +- std dev: 52.9 ms +- 0.9 ms
  • Mean +- std dev: 52.3 ms +- 1.0 ms

With the eval breaker enabled (the patch removed, expected to be slightly slower), two runs:

  • Mean +- std dev: 53.0 ms +- 0.7 ms
  • Mean +- std dev: 52.7 ms +- 0.7 ms

That doesn't look like it made a difference (for nbody).

I'm guessing that there's enough "meat" in even the innermost loop in nbody that checking a few bits in tstate->eval_breaker doesn't matter. I will stop losing sleep over this one. (But we should still fix the two bugs linked to this issue.)