faster-cpython/ideas

Constructing Executors using stack contents

gvanrossum opened this issue · 26 comments

This started at #628 (comment).

We can do away with the function version cache using this approach. The idea is that when the Tier 1 to Tier 2 translator encounters an instruction that requires tracing through another code object, instead of using the function cache to find that code object we stop translation, ending the trace with a special uop (let's say UPDATE_TRACE) rather than EXIT_TRACE. When that uop is encountered during Tier 2 interpretation, we have to do the following (a rough code sketch follows the list):

  1. Use the eval stack of the current frame to find the correct code object.
  2. Continue projecting through that call (assuming we're doing this for CALL and its specializations).
  3. Possibly also continue through the corresponding return.
  4. Once we are done projecting (using the usual criteria), combine the old executor and the new projected trace in a new trace.
  5. Re-optimize the combined trace.
  6. Construct a new executor.
  7. Replace the old executor with the new one.
  8. Continue executing the new Executor, starting where we left off. (However, finding the exact spot may be tricky due to optimization -- we may have to fall back to Tier 1 and pick up the new Executor on the next iteration.)
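
Hand-wavy sketch of what the _UPDATE_TRACE handler could look like in the Tier 2 interpreter. All the helper names here (project_from_frame(), splice_and_optimize()) and the buffer size constant are made up for illustration; only PyUnstable_Replace_Executor() is an existing entry point, and I may have its exact signature wrong:

case _UPDATE_TRACE: {
    // 1. The callable is still on the eval stack, so we can find the code object.
    PyObject *callable = stack_pointer[-2 - oparg];
    // 2./3. Project a new trace through the call (and maybe through the return).
    _PyUOpInstruction buffer[MAX_TRACE_LENGTH];
    int length = project_from_frame(frame, callable, buffer);
    // 4.-6. Splice the old executor's uops (up to this point) with the new
    // projection, re-optimize the combined buffer, and build a new executor.
    _PyExecutorObject *new_exec = splice_and_optimize(current_executor, buffer, length);
    // 7. Install it in the code object, replacing the old executor.
    PyUnstable_Replace_Executor(_PyFrame_GetCode(frame), frame->instr_ptr, new_exec);
    // 8. Finding the matching spot in the re-optimized trace is tricky, so
    // bail to Tier 1 and pick up the new executor on the next iteration.
    goto deoptimize;
}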

Let's try to go over those steps in a bit more detail for CALL_PY_EXACT_ARGS.

This is currently handled in optimizer.c by special-casing the _PUSH_FRAME uop after we've generated it as part of the macro expansion of CALL_PY_EXACT_ARGS. Instead, we'd probably have to special-case CALL_PY_EXACT_ARGS itself, as follows (a rough sketch appears after the list):

  • If this is not the first opcode we're translating, emit UPDATE_TRACE and stop.
  • If it is the first opcode, we know that the function is at position oparg+1 below the top of the stack. We can retrieve it assuming we've got access to the frame (this requires either a signature change to the optimizer to pass it from _PyOptimizer_BackEdge(), or we can use tstate->current_frame).
  • Once we've got the function, we can follow roughly the same path we currently follow for _PUSH_FRAME.
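
In the translation loop it could look something like this (very rough; the trace bookkeeping names, the macro arguments, and the exact stack offset for the callable are from memory, so treat them as placeholders):

case CALL_PY_EXACT_ARGS: {
    if (trace_length > 0) {
        // Not the first instruction: give up here and let _UPDATE_TRACE
        // rebuild the executor later, when the stack contents are live.
        ADD_TO_TRACE(_UPDATE_TRACE, oparg, 0);
        goto done;
    }
    // First instruction: the callable is on the eval stack of the frame
    // that triggered optimization (offset per the CALL stack layout).
    _PyInterpreterFrame *frame = tstate->current_frame;
    PyObject *callable = frame->stack_pointer[-2 - oparg];
    if (PyFunction_Check(callable)) {
        PyCodeObject *new_code = (PyCodeObject *)PyFunction_GET_CODE(callable);
        // ... continue projecting into new_code, following roughly the same
        // path we currently follow for _PUSH_FRAME, with no need for the
        // function version cache.
    }
    break;
}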

So now at some point we have a new buffer containing Tier 2 IR produced by projecting a trace from CALL_PY_EXACT_ARGS to some endpoint (we can stop at an unsupported Tier 1 opcode, or when the buffer is full, or when we encounter another CALL specialization).

A simple thing we could do at this point would be to replace the CALL_PY_EXACT_ARGS Tier 1 instruction with ENTER_EXECUTOR, passing it a new Executor constructed from the buffer (after optimization, etc.).

But it would be more effective if we could extend the trace that was "interrupted" by the UPDATE_TRACE uop with the new buffer.

It's probably not a good idea to update the original Executor object in place (for one, Executor objects are variable-size objects, so we can't make them longer). Instead, we can construct a new Executor and use it to replace the original in the co_executors array of the code object. The insert_executor() function can do this (or PyUnstable_Replace_Executor(), a public wrapper). So how do we combine the original Executor object with the new trace buffer?

First we have to allocate another buffer. Into this buffer we first copy the IR from the original Executor, but stopping at the UPDATE_TRACE uop. Then we append the IR from the trace buffer we just collected. Then we optimize the combined buffer (this requires the optimizer to be idempotent, which seems fine -- its input format is the same as its output format). We can then construct a new Executor from the optimized buffer.
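
The splice itself is pretty simple, something like this (sketch; splice_traces() and the exact trace field layout are placeholders):

static int
splice_traces(_PyUOpExecutorObject *old, _PyUOpInstruction *new_trace,
              int new_length, _PyUOpInstruction *out)
{
    int n = 0;
    // Copy the original executor's IR, stopping at (and dropping) _UPDATE_TRACE.
    for (int i = 0; old->trace[i].opcode != _UPDATE_TRACE; i++) {
        out[n++] = old->trace[i];
    }
    // Append the freshly projected IR, including its terminating uop
    // (_EXIT_TRACE, or another _UPDATE_TRACE, in which case we repeat later).
    for (int i = 0; i < new_length; i++) {
        out[n++] = new_trace[i];
    }
    // The caller then runs the optimizer over the combined buffer; that's
    // fine because the optimizer's input and output formats are the same.
    return n;
}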

There's a wrinkle though: stubs. The original Executor may contain one or more stub sequences at the end. The new trace buffer may contain another set of stub sequences (but at the very end of the buffer, with a gap in between). For optimization purposes, IIUC we don't optimize stubs, but they may become dead code. We should probably somehow collect these stubs in a separate buffer and add them back later, patching up the jumps. (This requires us to be somewhat clever, because there's nothing to tell us where the stubs are except the conditional branch uops. We can do something fairly stupid initially and make it faster once everything's working.)

Replacing Executors in-flight looks easy enough -- there are reference counts that will cause the original Executor to be deallocated once the last thread or frame executing it releases it.

Continuing in the new Executor where the original terminated may be tricky. The problem is that if we have an existing trace

UOP1
UOP2
UOP3
UPDATE_TRACE

and we combine it with a new trace

UOP4
UOP5
UOP6
EXIT_TRACE

the optimizer may replace (for example)

UOP3
UOP4

with a single entirely different uop, say UOP7:

UOP1
UOP2
UOP7
UOP5
UOP6
EXIT_TRACE

Now where do we continue execution? We can't start before UOP7, because that would repeat UOP3, whose stack effect we've already incurred. But we also can't start after UOP7, since that would skip the effect of UOP4. The only safe thing to do would be to fall back to Tier 1.

An alternative would be to optimize the new trace buffer before merging it with the original Executor's trace buffer. But that would reduce the optimizer's effectiveness, as it would not consider optimizations spanning both buffers (e.g. dropping a guard in the new trace that's already checked in the original trace). So it's probably better to just bail to Tier 1 for one iteration.

Note that it is possible that the new Executor also ends in a UPDATE_TRACE uop. In that case the process will repeat itself once that uop is reached. This should not cause any new problems.

Note that this is really just a form of trace stitching. I think we should design a mechanism that works for both.

For example, we can just let the exit at the end of the trace "heat up" like normal side exits, and then project a new trace from there and stitch them together. I think that, if we do it right, things can grow organically and we don't need to throw away and rebuild the "first" executor every time we extend the trace (and we avoid the nasty quadratic time complexity that rebuilding introduces for a trace with a bunch of calls or whatever).

If we wanted to, we could even use this for branches (and get rid of the existing counters).

Note that this is really just a form of trace stitching. I think we should design a mechanism that works for both.

Agreed that the quadratic effect is undesirable. OTOH with trace stitching, we don't get the benefits of running the optimizer pass across all "segments". Maybe we could do something that only runs the optimizer once the last segment doesn't end in _UPDATE_TRACE? (Perhaps split the optimizer into "peephole" and "global" passes -- the peephole pass is run over each segment, once, while the global pass is only run once the trace is "complete".)
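
Very roughly (placeholder names; peephole_pass() and global_pass() don't exist, they stand in for the proposed split):

static void
optimize_after_projection(_PyUOpInstruction *whole_trace, int total_length,
                          _PyUOpInstruction *segment, int segment_length)
{
    // Cheap, local pass over each newly projected segment, run once.
    peephole_pass(segment, segment_length);
    // Expensive, cross-segment pass over the whole combined trace, only
    // once the trace is "complete", i.e. the last segment doesn't end in
    // _UPDATE_TRACE.
    if (segment[segment_length - 1].opcode != _UPDATE_TRACE) {
        global_pass(whole_trace, total_length);
    }
}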

So that brings up another question: do we want to have all of the segments be exclusive to the first executor, or have each segment be its own executor that can be stitched into arbitrarily many traces?

In the first case, we'd want to optimize each branch of the tree, but those branches can't easily be stitched into other trees. On the other hand, the second case would cut down on code duplication and make stitching all the hot paths of the program together easy (so we don't leave the trace graph once the program warms up), but we can't optimize across segments.

I sort of like the latter, assuming it doesn't prevent too many useful optimizations. I think if we wanted to optimize the segments more heavily in that case, something like lazy basic block versioning would really shine.

In terms of what's easiest to build right now, as a start for incremental improvements, turning each segment into its own Executor would be simplest. We'd basically just create a new Executor starting at the CALL_PY_EXACT_ARGS instruction (but using the function object from the evaluation stack), and replace that CALL... with ENTER_EXECUTOR. Then resume at that instruction, and bingo. You can figure out how to stitch the two traces together later.
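
Roughly (a sketch; project_and_optimize() is made up, and I'm assuming this_instr and a free co_executors slot index are available in context -- insert_executor()'s real signature may differ):

PyCodeObject *code = _PyFrame_GetCode(frame);
// Take the callable from the live eval stack instead of the function cache.
PyObject *callable = frame->stack_pointer[-2 - oparg];
_PyExecutorObject *executor = project_and_optimize(frame, this_instr, callable);
if (executor != NULL) {
    // insert_executor() stashes the original opcode/oparg in vm_data and
    // rewrites the Tier 1 instruction to ENTER_EXECUTOR.
    insert_executor(code, this_instr, index, executor);
    next_instr = this_instr;  // resume at the freshly installed ENTER_EXECUTOR
}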

Oh, we would need to overwrite the _UPDATE_TRACE uop in the original executor with _EXIT_TRACE, so it doesn't keep trying to create a new Executor at that spot. That's a little icky (we tend to think of Executors as immutable -- like code objects :-), but the only difference between the two is that _UPDATE_TRACE tries to create a new Executor. Once that's done (or we've proved it can't be done) we can safely replace it (in the future that overwrite should be made atomic, but for now we're protected by the GIL).
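
I.e. something like this, with update_offset being wherever the _UPDATE_TRACE sits in the old trace (sketch, not real code):

// Once the successor executor exists (or we've proved it can't be built),
// neuter the uop so we don't keep trying to re-project at this spot.
// Protected by the GIL for now; eventually this store should be atomic.
current_executor->trace[update_offset].opcode = _EXIT_TRACE;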

Could we just have it check if there's an executor installed at the target instruction, and load it if so? Otherwise, it projects a new trace and installs it itself?

If we do that, then we get cross-trace stitching for free. (We just have to be careful that a DEOPT on the first instruction of a trace doesn't fall into an infinite loop.)
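
Something along these lines at the end of the trace (sketch; project_and_install() is a placeholder, and I'm assuming the target Tier 1 instruction is available as target_instr):

_PyExecutorObject *next_exec;
if (target_instr->op.code == ENTER_EXECUTOR) {
    // An executor is already installed at the target: stitch straight into it.
    PyCodeObject *code = _PyFrame_GetCode(frame);
    next_exec = code->co_executors->executors[target_instr->op.arg & 255];
    Py_INCREF(next_exec);
}
else {
    // Nothing installed yet: project a new trace, optimize and install it.
    next_exec = project_and_install(frame, target_instr);
}
// Careful: if next_exec deopts on its very first uop and lands back here,
// we must not re-enter it immediately, or we'd loop without making progress.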

I’d have to think about how tricky this is for the JIT… but it shouldn’t be too bad, especially considering the benefits it brings. It does seem like the simplest of the ideas we’ve considered here, though. The tricky part is tail-calling into the new executor.

Also, modifying traces after the fact is, uh, pretty painful for jitted code. So let’s continue to think of traces as immutable. ;)

Right now, loading an executor is a fair amount of code (see inst(ENTER_EXECUTOR, (--)) in bytecodes.c), and it looks like almost all of it is needed. We could duplicate all that in _UPDATE_TRACE, but maybe we could stuff a pointer to the successor executor in the operand field (of _UPDATE_TRACE as well as _EXIT_TRACE) so those would do something like

if (operand) {
    // Hand control straight to the successor executor burned into operand.
    Py_DECREF(current_executor);
    current_executor = (_PyUOpExecutorObject *)Py_NewRef((PyObject *)operand);
    GOTO_TIER_TWO();
}

(Though I've probably missed a step or two.)

Also, modifying traces after the fact is, uh, pretty painful for jitted code. So let’s continue to think of traces as immutable. ;)

Oh, never mind then. Instead of stuffing the executor in operand, we'd have to find it using this code (copied from ENTER_EXECUTOR):

        if (next_instr->op.code == ENTER_EXECUTOR) {
            PyCodeObject *code = _PyFrame_GetCode(frame);
            _PyExecutorObject *executor = (_PyExecutorObject *)code->co_executors->executors[oparg&255];
            int original_oparg = executor->vm_data.oparg | (oparg & 0xfffff00);
            JUMPBY(1-original_oparg);  // <-- This is problematic since we're not at a JUMP_BACKWARD
            frame->instr_ptr = next_instr;
            Py_INCREF(executor);
            if (executor->execute == _PyUopExecute) {
                current_executor = (_PyUOpExecutorObject *)executor;
                GOTO_TIER_TWO();
            }
        }

(There are definitely a few more nits that have to change here.)

This will also force us to deal with the question of "what if an executor makes no progress?" (I'm sure there's a discussion about that in this tracker but I can't find it quickly.) An executor that replaces CALL_PY_EXACT_ARGS can definitely deopt without having done anything. So the ENTER_EXECUTOR code needs to be updated to deal with that. (That's the JUMPBY I called out just now.) Fortunately ENTER_EXECUTOR is Tier 1 only, so no worry about how to JIT it.

We could also create an “unfinished” executor that projects a trace the first (or n-th) time its execute function is called. If we create and install all of those for side exits when projecting (or find them for existing executors), then we can burn them into operand.

I guess the main change would be that executors wouldn’t store the trace in their tails, it would be a pointer to a buffer. But that’s probably fine, since the JIT likely wants to throw the trace buffer away anyways.

My mind is boggling. I am going to concentrate on creating a second executor using stack_pointer[-2-oparg] for the function object instead of using the function cache (which can be deleted), and patching the CALL_PY_EXACT_ARGS with ENTER_EXECUTOR, and addressing the "instant deopt" issue. Let's iterate on the rest later (or when I get stuck with this plan).

I think the question of forward progress can probably be answered by sticking the uops for the “base” opcode in any side exit stubs for the first instruction in the trace. So it’s as-if the base opcode already ran when we exit.

I think the question of forward progress can probably be answered by sticking the uops for the “base” opcode in any side exit stubs for the first instruction in the trace. So it’s as-if the base opcode already ran when we exit.

Not sure what you're proposing here. We don't yet have DEOPT_IF() jump to a stub containing more uops (IIRC we're planning to, but we haven't done it yet -- stubs are only used for explicit conditional branches, _POP_JUMP_IF{TRUE,FALSE}).

And by "base opcode" I presume you're referring to the CALL_PY_EXACT_ARGS opcode. That's the very one that could deopt (it starts with three guard uops). The only solution I can think of is for ENTER_EXECUTOR to check whether the new next_instr equals to its this_instr and in that case dig up the original opcode+oparg and then dispatch.

Not sure what you're proposing here. We don't yet have DEOPT_IF() jump to a stub containing more uops (IIRC we're planning to, but we haven't done it yet -- stubs are only used for explicit conditional branches, _POP_JUMP_IF{TRUE,FALSE}).

Yeah, I'm imagining this building upon python/cpython#111611.

And by "base opcode" I presume you're referring to the CALL_PY_EXACT_ARGS opcode. That's the very one that could deopt (it starts with three guard uops). The only solution I can think of is for ENTER_EXECUTOR to check whether the new next_instr equals to its this_instr and in that case dig up the original opcode+oparg and then dispatch.

No, I mean CALL in this case.

We can discuss more on Monday, I think.

Oh, got it. That would make the jitted code for the side exit bulky though -- CALL does a lot of work. It would be more compact if you could somehow burn in a jump to the target label for CALL (i.e. where DEOPT_IF() would jump :-).

Plus, currently _CALL isn't a viable uop.

Bulky side exits are the future. ;)

In all seriousness, I've been thinking of only jitting the mainline and interpreting the side exits. Trickier, but could be something to consider down the road if we're jitting too much code.

I have a working prototype (need to clean it up before I push it to a branch). I realized there's another issue with not fusing the executors/segments: if a loop contains a call, it is broken up into two or more separate executors, and we can no longer use _JUMP_TO_TOP to stay within the executor (because we'd have to jump from the end of the second executor to the top of the first one). I'll leave that as a challenge for the trace stitching project.

I've also discovered a problem with checking in ENTER_EXECUTOR whether an executor made no progress: we can't distinguish the case where it truly did nothing from the case where it did a bunch of stuff ending in a jump to the top and then an exit or deoptimization. Fortunately I think it doesn't matter -- either way we've reached the current instruction and it's not wrong to re-execute the base opcode. (Actually, the easiest thing to do is to re-execute the specialized instruction that was overwritten by ENTER_EXECUTOR -- if it deoptimizes it will jump to the base instruction anyway.)

Actually, checking whether an executor made progress is hard, period. I'll probably need a new variable to hold some state. I have a feeling Mark knew this but didn't explain it well enough for me to understand. :-(

A tentative solution compares the current instruction to the target of the _SET_IP at the start of the trace. If the trace starts with something other than _SET_IP, the first uop can't fail, and we assume it is guaranteed to make progress. (This is not a hard fact, though. In theory we could have a macro instruction whose uop expansion starts with an infallible uop, and the _SET_IP might have moved beyond that. I trust, however, that whenever a macro can deopt, the very first uop in its expansion can deopt.)
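
In code, the tentative check looks roughly like this (made_progress() and set_ip_target() are placeholders; exactly where _SET_IP stores its target instruction has shifted between oparg and operand over time, so read set_ip_target() as "however we get at it"):

static bool
made_progress(_PyUOpExecutorObject *exec, _Py_CODEUNIT *next_instr)
{
    if (exec->trace[0].opcode != _SET_IP) {
        // The first uop can't fail, so by the time we exit the trace we
        // must have executed at least that one instruction.
        return true;
    }
    // Still sitting on the instruction the trace started at: assume no
    // progress was made, and re-execute it (or its base opcode) in Tier 1.
    return set_ip_target(&exec->trace[0]) != next_instr;
}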

Oof, this is much harder than it seemed. ENTER_EXECUTOR expects that it overwrites a JUMP_BACKWARD, and there are various other subtleties as well that I didn't anticipate in my original plan; I can't really explain them other than saying that the code I wrote still crashes in various places.

One thing I haven't tried yet but should (although I doubt it'll fix any of the current blockers): instead of changing the signature of opt->optimize() to add an optional argument to pass the function object dug out of the eval stack, it could just dig whatever out of the stack (we'd pass frame instead) and compare the function version. This would be more general because it could work even if the CALL_PY_EXACT_ARGS instruction is not the first one to be translated.

This all seems rather complicated.
We have live information about the VM state at the start of a trace, as that is the state of the VM at that point.
We can do a reasonable amount of optimizations with that, for now.

It might be worth dropping tracing across calls altogether, and putting our efforts into making sure that we can stitch together the caller trace and the callee trace.

Let's not attempt to start executing traces in the middle.
We can optimize larger regions in tier 3.

As for ENTER_EXECUTOR, it should not expect that it overwrites any particular instruction.
python/cpython#108866

It might be worth dropping tracing across calls altogether, and putting our efforts into making sure that we can stitch together the caller trace and the callee trace.

For now we have the function (version) cache that enables us to trace across at least some calls. (We should try to figure out how effective it is.)

Stitching will require re-optimizing, e.g. if we have a function that does something trivial (e.g. return self._attr) the optimizer should be given a chance to inline that without even creating a frame. (Or is that deemed Tier 3?)

FWIW, @brandtbucher just showed me his version of the part (really just one yak I shaved) that deals with the requirement that ENTER_EXECUTOR should make progress. It's possible that I missed something that he figured out. It's also possible that the remaining crashes in my version are due to something in the rest of my code -- it might be easier to debug once that shaved yak is committed. (If we're willing to commit it. Offline I previously said I was 80% convinced by @markshannon's arguments for requiring progress, but after carefully going over Brandt's version with him I am actually 80% convinced that it can work, and it seems simpler than the alternative once we want to use executors everywhere. I have asked Brandt to write up carefully how his version works in the various cases -- it seems to be mostly a matter of not entering an executor that we just left, no matter why.)