Going Branchless

Question

Going Branchless

mikey-b opened this issue 6 years ago · 5 comments

I'm not on a development computer at the moment, but some thoughts.

Sign Extend is called a lot, We could turn this into a branch free routine. (removes 41 instructions on compiler explorer output -O3).

uint16_t sign_extend(const uint16_t x, const int bit_count)
{
    uint16_t mask = -((x >> (bit_count - 1)) & 1);
    return x | (mask << bit_count);
}

We could also change the update_flags to be branchless (Saving another 40 instructions). Unfortunitely it would require changing the enum numbering which would break existing programs.

enum
{
    FL_ZRO = 0, /* Z */
    FL_POS = 1, /* P */
    FL_NEG = 3, /* N */
};
void update_flags(const uint16_t r)
{
    reg[R_COND] = ((reg[r] >> 14) & 2) | (reg[r] != 0);
}

Kind regards,
Mike Brown

Answer 1 · 2020-04-20T04:18:24.000Z

I can't speak for @justinmeiners, but I tend to avoid micro-optimizations like these unless I can prove that they provide significant benefits, especially when they come at the cost of less readable code or not conforming to a specification. I think this is especially relevant for this project, since it's meant to be an educational article that even beginning developers can follow.

The sign_extend function is already pretty difficult to read, so I'm not necessarily opposed to including that change. I'm actually wondering if it would be better to replace it with something more readable though, even if it's not as performant. Additionally, it looks like Clang 8.0.0 actually produces less instructions with the current implementation than it does for the proposed one. It's also using conditional moves instead of branches, so that should be less of a performance concern.

As for the update_flags change, this codebase tries to follow the LC3 specification pretty closely, so I'm not sure if diverging from the spec just to remove a branch is worth it. The proposed change is also substantially more difficult to understand than the current implementation. For what it's worth, Clang 8.0.0 actually uses a conditional move instead of a branch for the current implementation of this function as well. It seems only GCC is using branches to implement these functions.

With all of that said, I'm still open to alternative implementations that result in no branches when compiled with GCC. It would just be ideal if the changes had the same (if not better) readability and expression of the technical concepts being discussed in the article.

Answer 2 · 2020-04-20T04:33:44.000Z

I agree with most of what @rpendleton says. My priorities are:

LC-3 compliance (for the parts I choose to implement)
Education and readability
Platform compatibility
Performance

So because of 1, changing the enums is probably off the table.
I would be open to an alternative sign_extend implementation.
The reason why I like the way it's written is it explicitly shows how two different behaviors occur based on whether the value is negative or positive.

But what is "sign-extending"? Although the immediate mode value has only 5 bits, it needs to be added to a 16-bit number. Those 5 bits need to be extended to 16 to match the other number. For positive numbers, we can fill in 0's for the additional bits and the value is the same. However, for negative numbers, this causes a problem. For example, -1 stored in 5 bits is 1 1111. If we just extended it with 0's, this is 0000 0000 0001 1111 which is equal to 31! Sign extension prevents this problem by filling in 0's for positive numbers and 1's for negative numbers.

Do you have any thoughts about this @mikey-b can we perhaps write a branchless version that illustrates that point?

In any event, I appreciate your comments.

Answer 3 · 2020-04-20T16:16:30.000Z

Hi all,

I completely understand the points raised and agree with most.

The thing with sign_extend is that it is everywhere, and while I suspect conditional moves will be faster than a branch - They both still have a dependancy, either a data or control dependancy. The other thing to note is that bit_count is known at compile time, a literal, in all uses. It is impossible for the compiler to elide the conditional check due to x, and using the fact that N | 0 = N means that we can actually do the same operation regardless of the predicate, allowing us to more effectively use bit_count.

uint16_t sign_extend(const uint16_t x)
{
    const int bit_count = 5;
    uint16_t mask = -((x >> (bit_count - 1)) & 1);
    return x | (mask << bit_count);
}

Will produce, (gcc -O2) One data dependency, and zero conditional dependencies.

sign_extend(unsigned short):
        mov     eax, edi
        sal     eax, 11
        sar     ax, 15
        sal     eax, 5
        or      eax, edi
        ret

Having said all that, This is an intrepeter, and the instruction loads are going to completely drawf any dependency bottlenecks. And an educational exercise/tutorial. I think this is an important lesson for people to know but perhaps not the place for it.

Which one is clearer, and how to make it clearer for educational purposes? I really don't know, I think the problem with the two implentations is that they require knowledge of two's complement. I know theres a link to more info on this but its a barrier.

If the goal is learning, Here's my stab at clarity.

uint16_t sign_extend(const uint16_t x, int bit_count)
{
    // Check to see if the last bit (located at position bit_count) is set to 1.
    if ( x & (1 << (bit_count - 1) ) ) {
        return x | (0x0000 << bit_count); // Make all bits positioned higher than bit_count 0's
    } else {
        return x | (0xFFFF << bit_count); // Make all bits positioned higher than bit_count 1's
    }    
}

I think its easier to explain to people that the predicate is looking at a bit at a particular position (Good luck explaining index's are 0-based and hence the -1), & at the location, rather than shifting left to make it the zero's bit is IMO clearer.

This will produce the optimal cmov code, I am convinced branchless will provide some % of performance - But its not going to be easier to learn so not worthwhile.

Answer 4 · 2020-04-20T19:04:51.000Z

Just to complete this thread, update_flags can be done branchless and to the existing standard format. But does suffer from the same readability issue.

void update_flags(const uint16_t r)
{
    reg[R_COND] = ((0b1001 \
        << (reg[r] == 0)) \
        >> (reg[r] >> 15)) \
        & 0b111;
}

Answer 5 · 2020-08-19T18:21:24.000Z

Thanks again. I decided to close those for now.