faster-cpython/ideas

How to handle YIELD_VALUE in Tier 2

Opened this issue · 6 comments

YIELD_VALUE is currently the most-executed Tier 1 instruction that doesn't have a translation into Tier 2 uops.

But how should we handle it?

yield is highly asymmetric -- the generator's frame is entered (resumed) by calling __next__(), send() or throw(), but it is left through the YIELD_VALUE bytecode. We don't specialize any of those calls, so we can't really do something similar to _PUSH_FRAME on the call (yet -- if we can specialize list.append presumably we can specialize generator.__next__).

It might be possible to trace through YIELD_VALUE in a way similar to _POP_FRAME, if we know we'll always yield to the same frame, and if we know the code running in that frame. This would only make sense if we also traced through __next__() or send() though, otherwise we'd quickly run out of frames.

Concluding, I don't see what to do for YIELD_VALUE in Tier 2. Am I missing something, @markshannon or @brandtbucher?

According to the stats:

Instruction Count
YIELD_VALUE 958M
SEND_GEN 490M
FOR_ITER_GEN 113M

So 63% of YIELD_VALUES have a matching send/for.
I suspect the remaining 37% are the result of iterating over a generator from C code, which we want to fix anyway.

I expect that gen.__next__() is almost never called explicitly, throw()s are very rare, and explicit send() calls can be handled by reimplementing send in bytecode.

Given that, YIELD_VALUE can be treated in much the same way as RETURN_VALUE, with SEND_GEN and FOR_ITER_GEN being treated as calls.

It is SEND_GEN and FOR_ITER_GEN that are the challenge to handle as generators may have several entry points.

Okay, I'll educate myself about SEND_GEN and FOR_ITER_GEN. I suspect that method __next__() is almost never called directly, but the builtin next() is called plenty of times, so we might translate that into bytecode too. throw() is unimportant.

I am still stuck on this. E.g. how can we identify the code object of the generator while tracing through a for-loop starting with _FOR_ITER?

Meanwhile, it looks like _PUSH_FRAME is very similar to what happens at DISPATCH_INLINED() and the start_frame label in ceval.c, with only a few differences:

  • _PUSH_FRAME starts with frame->return_offset = 0; while this is set explicitly before calling DISPATCH_INLINED() elsewhere.
  • _PUSH_FRAME omits setting frame->prev_instr = next_instr - 1`; presumably because it is already set correctly? (And because it wouldn't work in Tier 2?)
  • DISPATCH_INLINED() calls _Py_EnterRecursivePy() where _PUSH_FRAME merely decrements tstate->py_recursion_remaining. Presumably _PUSH_FRAME does this for speed, and maybe because those functions don't exist in the Tier 2 interpreter?

E.g. how can we identify the code object of the generator while tracing through a for-loop starting with _FOR_ITER?

A few possible options exist:

  • Brandt suggested that we can get it from the stack (_PyOptimizer_BackEdge() has the frame and the stack pointer, although they are currently not passed through to uop_optimize()).
  • Mark suggested that we might switch to tracing instructions at some point (but not any time soon).
  • We could extend the FOR_ITER cache structure to include a function version field, which we can use similar to we find it for tracing through CALL_PY_EXACT_ARGS.

There are other complications though.

  • If a generator has multiple yield statements, subsequent iterations of FOR_ITER_GEN may not continue at the same point in the generator's code object -- unlike CALL, where we always enter from the top. So we can't blindly trace through it even if we know the code object and where to resume it!
    • We could special-case generators with a single yield (intuitively these are common). There is still the initial resume, which still enters from the top (after RETURN_GENERATOR), but this handled by the Tier 1 interpreter anyway, because the Tier 2 executor is only triggered by the ENTER_EXECUTOR at the bottom of the loop.

Another random idea:

  • We could start tracing in the generator (many generators are loops themselves) and upon yield (i.e., YIELD_VALUE) trace into the caller. This is quite exciting, though possibly the required frame stack manipulations are too complicated.

Mark and I are hoping to dive more deeply into this at the sprint in Brno.

This is still very tricky; we decided that we have other priorities, e.g. trace stitching, for which we first need to merge the Tier 2 interpreter into the Tier 1 interpreter (#631).

The basic problem is that we don't have the runtime information necessary to project the trace until we execute the code.
This is a solved problem. Psyco handles this by stopping compilation, running the code and start optimizing again with the new information. We can do the same.

The idea is that we exit the trace with a special micro-op that resumes optimization, passing in the actual value seen at runtime.

It is conceptually a bit complex, but not that hard to implement.

This approach also works for calls and returns, so we could throw out the function cache, which doesn't work well for closures anyway.