faster-cpython/ideas

True function inlining: design discussion

Fidget-Spinner opened this issue · 21 comments

True function inlining

This is my proposal for function inlining in 3.13. Feedback and alternative proposals welcome.

CPython 3.11 gained frame inlining, which eliminated most of the overhead of function calls. However, "_PyInterpreterFrame" structures still had to be created and torn down. This takes the form of the instructions _INIT_CALL_PY_EXACT_ARGS and _PUSH_FRAME.

In CPython 3.13, I plan for true function inlining -- that is, no frame creation at all in the happy case, not even _PyInterpreterFrame frames.

The uops tracer already projects traces through Python function boundaries. The goal is to eliminate _INIT_CALL_PY_EXACT_ARGS, _PUSH_FRAME and _POP_FRAME. Doing this is easy, but doing this while maintaining CPython semantics is very hard.

Before

ADD_TO_TRACE(_CHECK_PEP_523, 1, 0)
ADD_TO_TRACE(_CHECK_FUNCTION_EXACT_ARGS, 1, 4828)
ADD_TO_TRACE(_CHECK_STACK_SPACE, 1, 0)
# Frame creation
ADD_TO_TRACE(_INIT_CALL_PY_EXACT_ARGS, 1, 0)
ADD_TO_TRACE(_SAVE_RETURN_OFFSET, 4, 0)
ADD_TO_TRACE(_PUSH_FRAME, 1, 0)
ADD_TO_TRACE(_SET_IP, 0, 0)
ADD_TO_TRACE(_CHECK_VALIDITY, 0, 0)
ADD_TO_TRACE(_RESUME_CHECK, 0, 0)
...
# Frame removal
ADD_TO_TRACE(_POP_FRAME, 0, 0)

After

_CHECK_PEP_523(1, 12, 0)
_CHECK_FUNCTION_EXACT_ARGS(1, 12, 4828)
_CHECK_STACK_SPACE(1, 12, 0)
_SETUP_TIER2_FRAME(128, 0, 0)
 # Minimal bookkeeping, no frame creation
_PRE_INLINE(7, 0, 39)
_SET_FRAME_NAMES(0, 0, 93831592813424)
...
_POST_INLINE(4, 0, 18446744073709551615)
_SET_FRAME_NAMES(0, 0, 93831592813424)

The cost of a call went from:

Before:

  1. Bump the frame chunk pointer (or allocate a new chunk if the current one is full).
  2. Copy over args to new frame locals.
  3. Frame bookkeeping.

After:

  1. Update a few pointers.
  2. Frame bookkeeping (only in the case of sys._getframe, see below).

Locals interleaving

By interleaving the locals of the new call with the stack of the old function frame, we can achieve zero argument copying.

Before a function call, the stack looks like this:

callable self_or_null arg1 arg2
^ (stack pointer)

Notice that the arguments are already laid out exactly as the locals of the new "frame" will be. We just need to expand the stack to accommodate the remaining locals. Finally, since our locals now point into the current stack, we offset all LOAD_FAST and STORE_FAST instructions into the current stack.

callable self_or_null arg1 arg2
                      ^        ^ (stack pointer)
                      LOAD_FAST 0 (+ offset of inlined frame)

Finally, we must update the attribute names' tuple that the current frame points to, so that the inlined call's attribute loads load the correct attribute names. This is just a simple assignment.
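The interleaving can be modeled in a few lines of Python. This is purely an illustrative sketch of the idea (the names call_inlined and load_fast are made up, and CPython's real stack is a C array of object pointers, not a Python list):

```python
def call_inlined(stack, nargs, n_extra_locals):
    """Simulate entering an inlined call.

    Layout on entry: [... callable, self_or_null, arg1, ..., argN].
    The args double as locals 0..N-1 of the inlined frame, so "frame
    creation" is just recording a base offset -- zero argument copying.
    """
    locals_base = len(stack) - nargs       # args already sit where locals go
    stack.extend([None] * n_extra_locals)  # grow stack for remaining locals
    return locals_base

def load_fast(stack, locals_base, i):
    # LOAD_FAST i inside the inlined frame becomes a load at
    # locals_base + i in the caller's stack.
    return stack[locals_base + i]

# Example: the caller has pushed callable, NULL, and two args.
stack = ["callable", None, "arg1", "arg2"]
base = call_inlined(stack, nargs=2, n_extra_locals=1)
assert load_fast(stack, base, 0) == "arg1"  # LOAD_FAST 0 -> first arg
assert load_fast(stack, base, 1) == "arg2"
```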

Frame reconstruction

We need to reconstruct frames on every deopt, traceback, sys._getframe, and possibly locals() too. To reconstruct frames, we need to store the following information in the trace for each inlined frame in the current call stack:

  • return offset
  • instruction pointer
  • function object seen at trace creation time.
  • original code object seen at trace creation time (important! because this may differ from the code object seen in the current function object!)
  • current stack entries after the call

For the following call stack:

(Not inlined) Root frame [Inlined Frame 1, Inlined Frame 2, Inlined Frame 3]

We reconstruct from the root frame upwards -- in the order Inlined Frame 1 -> Inlined Frame 2 -> Inlined Frame 3. Note that we can statically determine the current stack entries of all inlined frames except the topmost one (Inlined Frame 3). For that, we simply calculate it at runtime from the current stack pointer.

To give a sense of how complex this all is: in a previous iteration of my optimizer I implemented reconstruction in roughly 120 lines of code (including lots of debugging code). It already works with deopts and tracebacks.

sys._getframe and locals()

There is one minor inelegance in how we handle sys._getframe, but it is also a strength. In the case of deopts and tracebacks, the reconstructed frames become the ones being executed after reconstruction. In the case of sys._getframe, the frames are reconstructed purely for introspection -- we never actually execute them. The reason is that we have no clue where stack_pointer is for the topmost frame, as it is local to ceval.c, so we cannot accurately recreate the topmost frame's stack. Instead, we link up the frames as usual, mark them as inlined virtual, and clear them in _POST_INLINE as normal. This means that even in the sys._getframe case, the only cost we incur is the original function-call overhead plus a little extra; sys._getframe should see no noticeable slowdown.
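For context, this is the kind of user code that makes the special case necessary: sys._getframe lets a callee observe its caller's frame at any point, so inlined frames must be materializable on demand even though they were never really created.

```python
import sys

def inner():
    # Observe the caller's frame: depth 1 is the frame that called us.
    return sys._getframe(1).f_code.co_name

def outer():
    return inner()

# If outer -> inner were inlined with no reconstruction at all,
# this observable behaviour would break.
assert outer() == "outer"
```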

How will this play with the JIT?

I assume the JIT will get some form of register allocation. If Brandt implements some sort of scheme that caches stack operands in registers, the locals interleaving will mean Python calls pass their arguments in registers automatically, with no additional infrastructure changes required to make the JIT aware of inlining. The only change the JIT would need is to cache stack operands relative to the new virtual inlined stack base rather than the original stack base.

This is basically the same idea as "deferred object creation" in Mark's PhD thesis (https://theses.gla.ac.uk/2975/1/2011shannonphd.pdf, see section 5.7.1).

The only new addition is the sinking of frame reconstruction data into the trace, and the locals interleaving.

I am super excited about this topic, since my intuition tells me that call inlining is the key to solid perf improvements. I am a little confused about the two alternatives -- is what you describe in your first comment called "sinking reconstruction into side exits"?

The last time I chatted with Mark about this, he suggested that the simplest useful example would be to take a call like cast(int, x) and inline it so that everything melts away and all you've got left is LOAD_FAST x. This is a special case that doesn't require frame reconstruction, since we know that LOAD_FAST x cannot exit (it can't fail and it can't deoptimize, and it doesn't call anything that could possibly call sys._getframe()).
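For reference, typing.cast really is an identity function at runtime -- it performs no check and simply returns its second argument -- which is why the whole call can melt away into a single LOAD_FAST:

```python
from typing import cast

def f(x):
    # Semantically identical to `return x`: cast exists only for type
    # checkers and does nothing at runtime.
    return cast(int, x)

# cast never verifies the type, so even a non-int passes through untouched.
assert f("not even an int") == "not even an int"
```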

I think even that is somewhat challenging because you need to detect that there's only a certain kind of thing between _PUSH_FRAME and _POP_FRAME (bookkeeping and pure ops), you need to get rid of the bookkeeping ops, transform the pure ops, and you want to have a peephole pass that gets rid of the unused LOAD_GLOBAL for cast and int (fortunately those will have been transformed to _LOAD_CONST_INLINE by then). But this limitation might mean that we have even less overhead (e.g. no _SET_FRAME_NAMES).

Regarding frame reconstruction, it seems clear that in theory some kind of static analysis of the trace should be sufficient to record everything we need so that upon an early exit we can reconstruct the frame, and have some kind of slow code that does the actual reconstruction, either in the form of custom generated code (different for each exit), or in the form of a mini-interpreter for some very simple custom code (think of the way the line number table is used to reconstruct the current line number).

It also looks as if we might be able to avoid the need for a fully general frame reconstruction mechanism by only inlining sequences of uops that satisfy some predicate (e.g. "only pure ops"). We might be able to choose a predicate that is less restrictive than "only pure" but that still allows the reconstruction data/code to be finite (and small) in size. In particular, it would be nice if we started by limiting ourselves to ops that cannot lead to a call to sys._getframe() (possibly this means checking that HAS_ESCAPES_FLAG is not set).
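A sketch of what such a predicate check could look like. The flag table and uop names below are illustrative stand-ins, not CPython's actual uop metadata API:

```python
# Hypothetical escape flag, mirroring the role of HAS_ESCAPES_FLAG.
HAS_ESCAPES_FLAG = 1 << 0

# Hypothetical per-uop flag table: an escaping uop is one that could call
# back into arbitrary Python code (and hence reach sys._getframe()).
UOP_FLAGS = {
    "_LOAD_FAST": 0,
    "_LOAD_CONST_INLINE": 0,
    "_BINARY_OP": HAS_ESCAPES_FLAG,  # e.g. unspecialised a // b may call __floordiv__
}

def can_inline(callee_uops):
    """Inline only if no uop in the callee body can escape."""
    return all(not (UOP_FLAGS[u] & HAS_ESCAPES_FLAG) for u in callee_uops)

assert can_inline(["_LOAD_FAST"])                     # cast(int, x) body
assert not can_inline(["_LOAD_FAST", "_BINARY_OP"])   # the a // b case
```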

Thoughts? What would really get me excited would be a prototype that got rid of the _PUSH_FRAME/_POP_FRAME for the cast(int, x) example, with zero reconstruction effort. Is that possible?

I am a little confused about the two alternatives -- is what you describe in your first comment called "sinking reconstruction into side exits"?

Yes. I realised it's the same in Mark's paper as deferred object creation, so I deleted my comment. Sorry for the confusion.

or in the form of a mini-interpreter for some very simple custom code (think of the way the line number table is used to reconstruct the current line number).

Yup that's my current approach!

It also looks as if we might be able to avoid the need for a fully general frame reconstruction mechanism by only inlining sequences of uops that satisfy some predicate (e.g. "only pure ops"). We might be able to choose a predicate that is less restrictive than "only pure" but that still allows the reconstruction data/code to be finite (and small) in size. In particular, it would be nice if we started by limiting ourselves to ops that cannot lead to a call to sys._getframe() (possibly this means checking that HAS_ESCAPES_FLAG is not set).

That was my initial plan way back, but it's too restrictive. Even something like a // b which isn't specialised would mean we cannot inline in that case. So we need the frame reconstruction to support more cases.

Thoughts? What would really get me excited would be a prototype that got rid of the _PUSH_FRAME/_POP_FRAME for the cast(int, x) example, with zero reconstruction effort. Is that possible?

My current prototype still has reconstruction for that. It's not very easy at the moment -- you need a few passes to remove everything perfectly. Here's how it could look:

1. # After globals -> constant promotion

LOAD_CONST cast
PUSH_NULL
LOAD_CONST int
LOAD_FAST x
_CHECK_FUNCTION_EXACT_ARGS
_INIT_CALL_PY_EXACT_ARGS
_SAVE_RETURN_OFFSET
_PUSH_FRAME
LOAD_FAST 1  # x
RETURN_VALUE

2. # Constant evaluation
LOAD_CONST cast
PUSH_NULL
LOAD_CONST int
LOAD_FAST x
_INIT_CALL_PY_EXACT_ARGS
_SAVE_RETURN_OFFSET
_PUSH_FRAME
LOAD_FAST 1  # x
RETURN_VALUE

3. # Inline (no escaping uops detected, so the no-reconstruction version) + basic value numbering

LOAD_CONST cast
PUSH_NULL
LOAD_CONST int
LOAD_FAST x
LOAD_FAST x
_SHRINK_STACK_AND_SWAP_TOS 3 # removes call args from stack, same effect as RETURN_VALUE

4. # _SHRINK_STACK peephole: notices the value on TOS does not depend on the
   # values beneath it, so those values can be safely removed

LOAD_FAST x
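The net stack effect of the _SHRINK_STACK_AND_SWAP_TOS uop can be modeled like this. The operand encoding here is illustrative (it counts the entries dropped beneath TOS) and may differ from the actual prototype:

```python
def shrink_stack_and_swap_tos(stack, n):
    """Pop TOS, drop the n entries beneath it, push TOS back.

    Same net stack effect as RETURN_VALUE for an inlined call: the
    callable, self_or_null, and arguments vanish, leaving only the result.
    """
    tos = stack.pop()
    del stack[-n:]
    stack.append(tos)

# Stage 3 stack just before the uop: cast, NULL, int, x, x (result on top).
stack = ["cast", None, "int", "x_value", "x_value"]
shrink_stack_and_swap_tos(stack, 4)  # drop callable, NULL, and both args
assert stack == ["x_value"]
```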

Even something like a // b which isn't specialised would mean we cannot inline in that case.

Honestly that's fine with me, at least for the first phase.

It looks like we may just be disagreeing about what should be the first step. It seems my first step would be to initially limit call inlining to things that need zero reconstruction, and then gradually add more complex cases, whereas your proposed first step is to implement a general algorithm that includes reconstruction but isn't necessarily optimal in all cases, and then iterate on special cases. Do you think that's a fair summary? We should probably put this on the agenda for Wednesday, so we can get other opinions.

Separately, I don't fully understand how you plan to move from stage 1 to stage 4 in your last comment's examples.

  • From 1 to 2, the only difference seems to be that _CHECK_FUNCTION_EXACT_ARGS is gone. What is your algorithm for that?
  • Similar for 2 to 3. This seems to do a lot of work, I'd like to understand what the stack layout is at various points during that execution. (Maybe we need a convention for showing what's on the stack after each operation.)
  • 3 to 4 is pretty straightforward -- the magic really is in how that _SHRINK_STACK_AND_SWAP_TOS op was obtained.

There also seem to be a few things missing after _PUSH_FRAME in stage 1, notably RESUME_CHECK.

I can work on function inlining for the simple case first. Most of the optimizer infrastructure will be shared anyways, so starting with either works.

From 1 to 2, the only difference seems to be that _CHECK_FUNCTION_EXACT_ARGS is gone. What is your algorithm for that?

Add constant evaluation for _CHECK_FUNCTION_EXACT_ARGS in the redundancy eliminator.
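A toy model of that constant evaluation, assuming the callable's function version is already known from the earlier globals-to-constant promotion. The names and trace representation below are illustrative, not the real redundancy eliminator:

```python
def eliminate_redundant_guards(trace, known_func_version):
    """Drop _CHECK_FUNCTION_EXACT_ARGS guards that are statically true.

    If the abstract interpreter has proven the callable's function version
    matches the guard's operand, the check can never fail and is removed.
    """
    out = []
    for uop, operand in trace:
        if uop == "_CHECK_FUNCTION_EXACT_ARGS" and operand == known_func_version:
            continue  # guard proven by constant evaluation
        out.append((uop, operand))
    return out

trace = [("LOAD_CONST", "cast"),
         ("_CHECK_FUNCTION_EXACT_ARGS", 4828),
         ("_PUSH_FRAME", 0)]
assert eliminate_redundant_guards(trace, 4828) == [("LOAD_CONST", "cast"),
                                                   ("_PUSH_FRAME", 0)]
```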

Similar for 2 to 3

That's replacing _PUSH_FRAME and _POP_FRAME with their real stack effects, now that the frames are gone. That is what _SHRINK_STACK_AND_SWAP_TOS is for as well.

There also seem to be a few things missing after _PUSH_FRAME in stage 1, notably RESUME_CHECK.

Yup I was writing this manually, so I forgot a few things. Hope this helps illustrate the point though!

Got it, const evaluation is a pretty powerful tool! Also congrats on getting the foundations merged.

@Fidget-Spinner Please write down the details of the reconstruction design.

Frame reconstruction details:

At the point of a deopt, the operand stack will look like this:

| S0 | S1 | S2 | S3 | L0 | L1 | L2 | S0 | S1 | S2 | L0 | L1 | S0 |
|<--- root frame -->|<--- first inlined frame --->|<-- second -->|
                                                    inlined frame

L-prefixes are locals.
S-prefixes are stack values.

To reconstruct the inlined frames, we need to convert the stack above
to what tier 1 uses.

Note that the stack is already laid out exactly how the localsplus of
the "real" frames will look like. The reconstruction is thus the following
steps:

  1. Allocate frames in the order: first, second.
    a. For each frame, memcpy the values into their localsplus.
    b. Set frame instr ptr, return offset, stack pointer, based on
    information we've already gleaned at static analysis time. Note that
    all information except stack pointer is automatic from instructions
    like _SAVE_RETURN_OFFSET and _SET_IP.
    c. Link up the frame's previous frame.
  2. Finally, the current stack now belongs to the root frame. So shrink
    the stack pointer, so we don't accidentally double decref.
  3. The topmost frame (the second one) needs the stack pointer to be set
    based on what is currently the stack pointer at runtime. This can
    be calculated by deducting the current stack_pointer from the base.
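The steps above can be sketched in Python. FrameInfo and its fields are hypothetical names for the per-frame data recorded at trace-creation time, and the flat list stands in for the interleaved operand stack from the diagram:

```python
from dataclasses import dataclass

@dataclass
class FrameInfo:
    base: int           # offset of the frame's localsplus in the flat stack
    size: int           # entries (locals + stack), statically known
    instr_ptr: int      # from _SET_IP
    return_offset: int  # from _SAVE_RETURN_OFFSET

def reconstruct(flat_stack, infos, runtime_sp):
    frames = []
    prev = "root"
    for i, info in enumerate(infos):
        topmost = (i == len(infos) - 1)
        # Step 3: only the topmost frame's depth comes from the runtime
        # stack pointer; all others are statically known.
        end = runtime_sp if topmost else info.base + info.size
        frame = {
            "localsplus": flat_stack[info.base:end],   # step 1a: memcpy
            "instr_ptr": info.instr_ptr,               # step 1b
            "return_offset": info.return_offset,       # step 1b
            "previous": prev,                          # step 1c: link
        }
        frames.append(frame)
        prev = frame
    # Step 2: the remaining stack belongs to the root frame -- shrink its
    # stack pointer so the copied values aren't double-decref'd.
    return frames, infos[0].base

# The stack layout from the diagram: root | first inlined | second inlined.
flat = ["S0", "S1", "S2", "S3",
        "L0", "L1", "L2", "S0", "S1", "S2",
        "L0", "L1", "S0"]
infos = [FrameInfo(base=4, size=6, instr_ptr=10, return_offset=2),
         FrameInfo(base=10, size=3, instr_ptr=20, return_offset=4)]
frames, root_sp = reconstruct(flat, infos, runtime_sp=13)
assert frames[0]["localsplus"] == ["L0", "L1", "L2", "S0", "S1", "S2"]
assert frames[1]["localsplus"] == ["L0", "L1", "S0"]
assert root_sp == 4
```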

The benefits of this reconstruction scheme:

  1. Low overhead. The cost of the memcpy would be incurred anyways
    because real frames need to copy the stack values into the locals.
    So we're not copying anything other than the stack values.
  2. Compatible -- we are compatible with exceptions and deopts.

Additional special case for sys._getframe and locals():

See paragraph above in original post titled sys._getframe and locals().

Paging @pablogsal. Do you think removing frames completely would break perf and other frame introspection tools?

Paging @pablogsal. Do you think removing frames completely would break perf and other frame introspection tools?

Not “break” as in they will crash, but it will certainly make them less useful: the information they report will be less accurate and more confusing, which will certainly impact users. This is the same as when compilers inline functions and getting stack traces in debuggers becomes challenging because the inlined functions don't appear.

I would recommend to think about the debugging experience here because this is the classic example when performance and debugging experience are in direct competition and at the very least we should have that in mind when making decisions.

There are two different things to think about:

  • Profilers attach to the interpreter from the outside and walk the frame list from the interpreter state. Profilers cannot execute code in the process being traced, so they need that list of frames to be consistent. If we inline, we would need to provide the information somewhere else, maybe in a reduced format, so that keeping it up to date doesn't impact performance too much.

  • Perf in particular works with trampolines, and activating those deactivates inline Python frames in 3.11 and 3.12, so a solution may be to do the same with this optimisation when perf mode is active. This has the advantage of not breaking perf, but it will make it even less useful, as it won't represent the “real program” anymore (which is already the case today, but in a much less impactful way).

Going back to the frame reconstruction details comment.

  • What happens to the callable and the NULL following it (let's assume it's NULL)? Are they always elided from the stack? (Or do we have them at S2/S3 in the root frame and S1/S2 in the first inlined frame?)
  • How do you envision the reconstruction being executed? Is the idea that the reconstruction information is stored in some data structure (probably it could be of fixed size per frame, and we have a fixed maximum number of frames), or do you envision encoding it into T2 uops that are somehow executed when we deopt/exit/error? The uops solution seems less compact but might be easier to translate into JIT code??
  • I am suddenly reminded of the issue of stack size. When we elide _PUSH_FRAME we may need to give the root frame more stack space, so that the truly inlined call's locals and stack can be fit in. Are you planning a uop for that?
  • I guess a future optimization might be for the reconstructed frame to actually overlap its locals with the stack space from the previous frame, so the memcpy() is not even needed. But this is probably too advanced (we've considered this before and never done it yet.)

Going back to the frame reconstruction details comment.

  • What happens to the callable and the NULL following it (let's assume it's NULL)? Are they always elided from the stack? (Or do we have them at S2/S3 in the root frame and S1/S2 in the first inlined frame?)

Whoops, yes -- they are at S2/S3. We leave the stack in the same state as right after a call, but before the frame is created.

  • How do you envision the reconstruction being executed? Is the idea that the reconstruction information is stored in some data structure (probably it could be of fixed size per frame, and we have a fixed maximum number of frames), or do you envision encoding it into T2 uops that are somehow executed when we deopt/exit/error? The uops solution seems less compact but might be easier to translate into JIT code??

Currently, I write a small stub of uops containing what we need at the end of the trace, after _EXIT_TRACE. Then I loop over them in a small mini-interpreter.
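A toy version of that mini-interpreter loop, with made-up uop names, just to show the shape of the idea:

```python
def run_reconstruction_stub(stub):
    """Interpret a reconstruction stub: a list of (uop, operand) pairs
    appended after _EXIT_TRACE, executed only when a frame must be
    materialized (deopt, traceback, sys._getframe)."""
    frame = {}
    for uop, operand in stub:
        if uop == "_SET_IP":
            frame["instr_ptr"] = operand
        elif uop == "_SAVE_RETURN_OFFSET":
            frame["return_offset"] = operand
        elif uop == "_EXIT_STUB":
            break
    return frame

stub = [("_SET_IP", 42), ("_SAVE_RETURN_OFFSET", 4), ("_EXIT_STUB", 0)]
assert run_reconstruction_stub(stub) == {"instr_ptr": 42, "return_offset": 4}
```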

  • I am suddenly reminded of the issue of stack size. When we elide _PUSH_FRAME we may need to give the root frame more stack space, so that the truly inlined call's locals and stack can be fit in. Are you planning a uop for that?

Yes I already implemented that on my old branch. It's a run-once uop at the top of the trace. This is why we need something like _JUMP_ABSOLUTE. Alternatively, I set a bit on the frame once it's run before and make it a no-op after that.
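The run-once behaviour could be sketched like this. The flag name and frame representation are invented for illustration:

```python
# Hypothetical flag recording that this frame's stack was already grown.
GREW_STACK_BIT = 1 << 0

def setup_tier2_frame(frame, extra_space):
    """Run-once uop: grow the frame's stack space for inlined calls on the
    first execution, then become a no-op on subsequent executions."""
    if frame["flags"] & GREW_STACK_BIT:
        return                                # already grown: no-op
    frame["stack"].extend([None] * extra_space)
    frame["flags"] |= GREW_STACK_BIT

frame = {"stack": [], "flags": 0}
setup_tier2_frame(frame, 128)
setup_tier2_frame(frame, 128)                 # second run does nothing
assert len(frame["stack"]) == 128
```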

Woops yes they are at S2/S3. We leave the stack in the same state as right after a call, but before the frame is created.

Hm, IIRC in Tier 1 (and I think also in Tier 2 so far) the callable+NULL are popped off the stack some time during the call sequence (I think _INIT_CALL_PY_EXACT_ARGS) so you may have to do something about this.

Currently, I write a small stub of various uops containing what we need at the end of the trace, after _EXIT_TRACE. Then while loop over them (so like a small mini-interpreter).

That sounds like it's relatively wasteful of space. Is this shareable by all deopts/exits in a trace?

That sounds like it's relatively wasteful of space. Is this shareable by all deopts/exits in a trace?

It's one stub per inlined frame, which is shared by all deopt/exits inside that inlined frame. Considering we only go 5 frames deep at maximum, it's not too much.

I would still love to see the design of the reconstruction data described in more detail here.

I would still love to see the design of the reconstruction data described in more detail here.

Will get to it.

One more design discussion point: for 3.13, I can't mix non-inlined and inlined frames, so you can't have:

non inlined -> inlined -> non inlined -> inlined

but rather only

non inlined -> inlined -> inlined -> inlined

The reasoning is that the former makes reconstruction more complex, and isn't feasible for the 3.13 time frame. The main thing is that the data stack chunks need to represent the frames. If we have mixed inlined and non-inlined things, then the frame reconstruction and data stack chunks in the interpreter need a lot more reconstruction juice.

That makes total sense.

I've dropped function inlining for 3.13. Instead I propose #657 to get most of the benefits without most of the cons.