faster-cpython/ideas

Tier 2 optimizer: Constant Int and Float propagation, without the refcount problems

Opened this issue · 3 comments

Currently we're not doing constant int and float propagation in the tier 2 optimizer on main, because we don't want to create new objects or hold owned references in the trace.

I propose a way to circumvent this and do constant int and float propagation. I will outline the design here. The idea is to pack the raw constant values into the operand itself.

We then introduce two new instructions:

LOAD_INT
LOAD_FLOAT

Each constructs a value from the immediate packed in its operand, by calling PyLong_FromLong or PyFloat_FromDouble respectively.
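As a sketch of the operand encoding (a Python model, not the actual C implementation — the helper names here are hypothetical), a 64-bit operand can round-trip a C double or a signed C long like this:

```python
import struct

def pack_float_operand(value: float) -> int:
    """Pack a C double's bits into an unsigned 64-bit operand."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]

def unpack_float_operand(operand: int) -> float:
    """Recover the double that LOAD_FLOAT would box via PyFloat_FromDouble."""
    return struct.unpack("<d", struct.pack("<Q", operand))[0]

def pack_int_operand(value: int) -> int:
    """Pack a signed 64-bit C long into the operand (two's complement)."""
    return value & 0xFFFF_FFFF_FFFF_FFFF

def unpack_int_operand(operand: int) -> int:
    """Recover the long that LOAD_INT would box via PyLong_FromLong."""
    return operand - (1 << 64) if operand >= (1 << 63) else operand
```

The point is that the trace only ever stores bits, never an owned `PyObject *`; boxing happens at execution time inside the new instructions.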

To keep the optimizer peepholes simple, we then introduce variants of the above instructions:

POP_TWO_LOAD_INT
POP_TWO_LOAD_FLOAT

Thus the binary operations on known constants get rewritten to the following instructions:

BINARY_OP_ADD/SUB/MUL_INT -> POP_TWO_LOAD_INT
BINARY_OP_ADD/SUB/MUL_FLOAT -> POP_TWO_LOAD_FLOAT
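The fold step might look like the following Python model of the symbolic pass (opcode names follow this proposal; `constant_propagate` and the trace representation are hypothetical, and the real pass would run over CPython's uop buffer in C):

```python
# Binary int ops we know how to fold when both inputs are known constants.
FOLD_RULES = {
    "BINARY_OP_ADD_INT": lambda a, b: a + b,
    "BINARY_OP_SUBTRACT_INT": lambda a, b: a - b,
    "BINARY_OP_MULTIPLY_INT": lambda a, b: a * b,
}

def constant_propagate(trace, known_locals):
    """One symbolic pass: the stack tracks known int constants (None = unknown)."""
    stack, out = [], []
    for op, operand in trace:
        if op == "LOAD_FAST":
            stack.append(known_locals.get(operand))  # None if value unknown
            out.append((op, operand))
        elif op in FOLD_RULES:
            rhs, lhs = stack.pop(), stack.pop()
            if lhs is not None and rhs is not None:
                result = FOLD_RULES[op](lhs, rhs)
                stack.append(result)
                # Inputs are still materialized on the stack, so emit the
                # pop-two variant; a later peephole deletes the dead loads.
                out.append(("POP_TWO_LOAD_INT", result))
            else:
                stack.append(None)
                out.append((op, operand))
        else:
            raise NotImplementedError(op)  # sketch: other uops omitted
    return out
```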

These are then cleaned up in a peephole pass as follows:
Before:

LOAD_FAST x
LOAD_FAST y
POP_TWO_LOAD_INT
LOAD_FAST z
POP_TWO_LOAD_INT

After:

LOAD_INT (x + y + z)
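The cleanup can be modeled in a few lines of Python (a sketch under this proposal's opcode names; the folded result is assumed to already be in the `POP_TWO_LOAD_*` operand, as produced by the symbolic pass):

```python
PRODUCERS = ("LOAD_FAST", "LOAD_INT", "LOAD_FLOAT")

def peephole(trace):
    """Collapse two side-effect-free loads feeding a POP_TWO_LOAD_* into one LOAD_*."""
    out = []
    for op, operand in trace:
        if (op in ("POP_TWO_LOAD_INT", "POP_TWO_LOAD_FLOAT")
                and len(out) >= 2
                and out[-1][0] in PRODUCERS
                and out[-2][0] in PRODUCERS):
            out.pop()  # both producers are dead: their values are
            out.pop()  # never observed, only replaced by the constant
            load = "LOAD_INT" if op.endswith("INT") else "LOAD_FLOAT"
            out.append((load, operand))
        else:
            out.append((op, operand))
    return out
```

Because folding works left to right on the output list, the `x + y` result immediately becomes a `LOAD_INT` that can itself feed the next `POP_TWO_LOAD_INT`, giving the single `LOAD_INT (x + y + z)` shown above.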

Here's what must happen for this to be a win:
For floats:

  • The cost of loading two values from locals/registers, writing them to the stack, adding the dereferenced doubles, and then the specialized decref must be high for this to pay off. Note that because of the Py_REFCNT(lhs) == 1 trick we use to reuse float objects, I am not very optimistic about this being a big win for small runs of constants, so it should not be prioritised for now. My own experiments with the tier 2 interpreter from the lazy block versioning work suggest that floats are not that expensive, thanks to the float reuse. In fact, we might even slow things down, because PyFloat_FromDouble always has to allocate.

For ints:

  • This is almost always a win if we can constant propagate. There are no refcounting tricks for ints, and a boxed int is considerably more expensive than a boxed float. Ints always allocate unless they come from the small-int pool, and even in that case we still save going through the _PyLong_Add subroutine, which is expensive because it operates on the digit-array representation. Note that in the specialized interpreter, float addition is a machine double addition, while int addition is an expensive subroutine call. We should prioritize this.
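The small-int pool behavior this bullet relies on can be observed from Python itself. This is a CPython implementation detail (the pool currently spans -5 to 256), so the sketch below is illustrative, not a guarantee of the language:

```python
# int() calls at runtime defeat compile-time constant sharing, so the
# identity checks below really exercise PyLong_FromLong's caching.
small_a = int("100")
small_b = int("100")
big_a = int("1000")
big_b = int("1000")

print(small_a is small_b)  # True in CPython: both are the cached small int
print(big_a is big_b)      # False in CPython: each boxed 1000 is a fresh allocation
```

So a LOAD_INT whose immediate falls in the pool would not allocate at all, while anything outside it pays a fresh allocation per execution.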

We should prioritize this.

Why? Do you have evidence that this will make a large enough difference to prioritize it over other work?
I'm not saying we shouldn't do this. I'm just a bit skeptical that it will make much of a difference.

I don't. I meant that we should prioritize ints over floats initially, because my past experiments with the tier 2 interpreter suggest float operations are quite cheap. I don't think any of this will necessarily show up on a benchmark.