faster-cpython/ideas

Tail-calling interpreter

markshannon opened this issue · 4 comments

An alternative dispatch mechanism to switch or computed gotos is to use tail calls.
In a tail calling interpreter, dispatch is implemented as a tail call to the next instruction.

For example: the C code generated for NOP would look something like:

typedef RETURN_TYPE (*funcptr)(PyObject **, _PyInterpreterFrame *, PyThreadState *, _Py_CODEUNIT *, int);

funcptr INSTRUCTION_TABLE[256] = {
    [NOP] = nop_func,
    ...
};

RETURN_TYPE nop_func(PyObject **sp, _PyInterpreterFrame *fp, PyThreadState *ts, _Py_CODEUNIT *next_instr, int oparg)
{
    next_instr += 1;
    int opcode = next_instr->op.code;
    oparg = next_instr->op.arg;
    funcptr next = INSTRUCTION_TABLE[opcode];
    return next(sp, fp, ts, next_instr, oparg);
}

If we can get the C compiler to generate jumps for tail calls and avoid lots of expensive spills, this can be faster than computed gotos.

The problem is that C compilers do not always generate tail calls, and the platform ABI forces them to spill a lot of registers.

However, upcoming versions of Clang may support annotations to force tail calls, and LLVM already supports the GHC (Haskell) calling convention which should eliminate the spills.

This has the potential to speed up the tier 1 interpreter by several percent.

In the example, shouldn't these two lines be swapped?

    funcptr next = INSTRUCTION_TABLE[opcode];
    int opcode = next_instr->op.code;

ISTM that next depends on the new value of opcode, which is computed in the second line.

    However, upcoming versions of Clang may support annotations to force tail calls, and LLVM already supports the GHC (Haskell) calling convention which should eliminate the spills.

To clarify, this is backwards: Clang already has __attribute__((musttail)), and may be getting the GHC calling convention in the future.

So the tail-calling idea can be tested immediately, but may have performance issues without the calling convention change. And, of course, it only works on Clang builds.

Hello, I work on Pyodide at Cloudflare, and I did a quick and dirty implementation of this that seems to work. I'm mainly focused on Emscripten performance, so I haven't tried building with Clang 19 and the preserve_none calling convention. Feel free to take a look and hit me up if you run into any issues.

garrettgu10/cpython#1

(Coming in as a bit of an outsider to cpython)

Why keep this as just a tier 1 interpreter improvement? Wouldn't the tier 2 interpreter also benefit from this?

Also, given that there are several ways to dispatch in an interpreter (direct threading, indirect threading, token threading, call threading, and tail calls, as this proposal suggests), would it be feasible to compare the performance of all of them to see which fits CPython best? In particular, with call threading, each handler returns the address of the next handler to the main loop. That address sits nicely in the return-value register, ready to be called again, until the loop terminates when program execution ends.