SimdMask

Vectorizing complex conditional code

This article is inspired by SSE - Vectorizing conditional code by Félix Abecassis.

When you only have a single condition, the approach proposed in that article works great.

Things get worse when you have multiple conditions in your code. With just 4 independent nested conditions you can end up with 2^4 = 16 different code paths, and purely mask-based conditions would waste too much time computing branches whose results are never needed.

The manual vectorization approach implemented in this demo relies on the movemask family of intrinsics. It packs multiple conditions into a bitmask held in a single scalar CPU register. That bitmask allows efficient branching based on a whole set of conditions. This makes it possible to vectorize condition-based code in a manner that’s both flexible (e.g. you can fall back to scalar code for rarely-taken branches, or for parts of the code which don’t benefit from SIMD) and lets you optimize the performance of specific frequent cases.
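
To make the idea concrete, here’s a minimal sketch (not the exact SimdMask API): two per-lane conditions are packed into one scalar bitmask with _mm_movemask_ps. The conditions and the bit layout are made up for the example.

```cpp
#include <cstdint>
#include <xmmintrin.h> // SSE1

// Pack two independent per-lane conditions into a single scalar bitmask.
// Bits [0..3]: "x < 0" per lane; bits [4..7]: "x > 1" per lane.
inline uint32_t packConditions( __m128 x )
{
    const __m128 isNegative = _mm_cmplt_ps( x, _mm_setzero_ps() );
    const __m128 isLarge = _mm_cmpgt_ps( x, _mm_set1_ps( 1.0f ) );
    return (uint32_t)_mm_movemask_ps( isNegative )
         | ( (uint32_t)_mm_movemask_ps( isLarge ) << 4 );
}
```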

If these conditions fire randomly, this approach probably won’t help much. But in many practical applications they aren’t random. If you’re tracing rays, shading triangles, or implementing some other geometry-based code, there’s a good chance all lanes in the wavefront need to take the same branches. The SimdMask template class lets you optimize code for such cases.
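
Here’s a sketch of that coherence fast path, assuming a 4-lane float condition mask; the branch bodies are placeholders.

```cpp
#include <xmmintrin.h> // SSE1

// Branch once for the whole vector when all 4 lanes agree on the condition;
// only the mixed case needs blending or a scalar fallback.
inline void handleBranch( __m128 condition )
{
    const int mask = _mm_movemask_ps( condition );
    if( mask == 0xF )
    {
        // All lanes take the branch: run the vectorized branch once.
    }
    else if( mask == 0 )
    {
        // No lane takes the branch: skip it entirely.
    }
    else
    {
        // Mixed lanes: blend both results, or fall back to scalar code.
    }
}
```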

The demo project is very simple: it just solves quadratic equations with SSE1 code. It doesn’t even measure performance; the task I’m actually solving is completely different, and I don’t care about performance in this particular case.

Q & A

Why are you publishing this?

To be able to use the SimdMask template class in multiple projects. See the license.

Which compilers/platforms are supported?

The current version was only tested with VC++ 2017, but you should be able to port it to anything else. However, I think this approach is only good for x86: NEON doesn’t have movemask instructions, and emulating them might be too slow.

Is this only for floats?

You can specialize the SimdMask template class with __m128i or __m256i vectors and a custom traits class. Casting between register types is free, so in your fromVector method movemask_ps will work great for 32-bit integer lanes, and movemask_pd for 64-bit integer lanes. For 16-bit lanes you can use movemask_epi8 plus extra code to drop every other bit; the fastest method for that is apparently _pext_u32 from the BMI2 instruction set, but you’ll probably need a software fallback. For 8-bit lanes just use movemask_epi8.
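
For illustration, here’s a sketch of the movemask part for the different integer lane widths. The function names are made up, the exact shape of the traits class is up to you, and the 16-bit version assumes BMI2 is available.

```cpp
#include <cstdint>
#include <emmintrin.h> // SSE2
#include <immintrin.h> // _pext_u32, BMI2

// 32-bit integer lanes: reuse the float movemask after a free cast.
inline uint32_t maskFromInt32( __m128i v )
{
    return (uint32_t)_mm_movemask_ps( _mm_castsi128_ps( v ) );
}

// 64-bit integer lanes: same trick with the double version.
inline uint32_t maskFromInt64( __m128i v )
{
    return (uint32_t)_mm_movemask_pd( _mm_castsi128_pd( v ) );
}

// 16-bit lanes: movemask_epi8 produces 2 identical bits per lane,
// so drop every other bit. Needs a software fallback on CPUs without BMI2.
inline uint32_t maskFromInt16( __m128i v )
{
    const uint32_t bytes = (uint32_t)_mm_movemask_epi8( v );
    return _pext_u32( bytes, 0xAAAAu );
}

// 8-bit lanes: movemask_epi8 already gives 1 bit per lane.
inline uint32_t maskFromInt8( __m128i v )
{
    return (uint32_t)_mm_movemask_epi8( v );
}
```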

Why pack these bits in a single scalar?

To save registers and RAM latency. But the main reason is testing for multiple conditions in a single operation. Most of the bit shifting happens at compile time; at runtime the tests become very simple code, e.g. and + cmp.
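
For example, with the hypothetical bit layout from the sketch above (bits [0..3] for one condition, bits [4..7] for another), checking both conditions across all lanes compiles down to a single and + cmp:

```cpp
#include <cstdint>

// Bit positions are compile-time constants, so the runtime test is one
// AND and one compare no matter how many conditions are packed.
inline bool bothConditionsOnAllLanes( uint32_t mask )
{
    constexpr uint32_t allLanesA = 0x0F; // condition A on lanes 0..3
    constexpr uint32_t allLanesB = 0xF0; // condition B on lanes 0..3
    constexpr uint32_t required = allLanesA | allLanesB;
    return ( mask & required ) == required;
}
```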

You said all lanes in the wavefront often need to take the same branches; why not use a GPU instead?

GPGPU is often indeed the best way to do that, but GPUs have their limitations. Some software needs to work on potato PCs: CPU code only depends on the CPU chip, while GPU code also requires GPU hardware, GPU drivers, and much more complicated OS support. For some code, non-vectorizable cases negate the difference in raw computational performance. And some code consumes lots of bandwidth compared to computations, and DDR is faster than PCI Express.

For the quadratic equation problem in the demo, isn’t it faster to always use the most generic method with blending, and vectorize the roots count calculation using bitwise tricks to convert comparison masks into lookup indices, then _mm_shuffle_epi8 to look them up from a constant vector?

Probably true. 4 tests = 16 combinations, just enough to look up from 16 constant bytes in a single register, each being between 0 and 2. But that isn’t the point; I’ve made this for more complex cases than solving quadratic equations. Also, _mm_shuffle_epi8 is SSSE3. AMD was manufacturing K10 CPUs without that support until 2012, and some people still use such systems. At the time of this writing, the Steam hardware survey says 2.9% of users don’t have SSSE3.
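
For reference, here’s a sketch of the lookup trick described in the question, with 4 hypothetical comparison masks c0..c3 and placeholder table values; it needs SSSE3 for _mm_shuffle_epi8.

```cpp
#include <tmmintrin.h> // SSSE3

// c0..c3 are comparison results, each 32-bit lane all-ones or all-zeros.
inline __m128i rootCounts( __m128i c0, __m128i c1, __m128i c2, __m128i c3 )
{
    // Combine the 4 masks into a per-lane index in [0..15],
    // stored in the low byte of every 32-bit lane.
    __m128i index = _mm_and_si128( c0, _mm_set1_epi32( 1 ) );
    index = _mm_or_si128( index, _mm_and_si128( c1, _mm_set1_epi32( 2 ) ) );
    index = _mm_or_si128( index, _mm_and_si128( c2, _mm_set1_epi32( 4 ) ) );
    index = _mm_or_si128( index, _mm_and_si128( c3, _mm_set1_epi32( 8 ) ) );

    // 16 constant bytes, each 0, 1 or 2; the actual values depend on what
    // the 4 conditions mean for the equation, these are placeholders.
    const __m128i table = _mm_setr_epi8(
        2, 2, 1, 1, 0, 0, 0, 0,
        2, 2, 1, 1, 0, 0, 0, 0 );

    // Per-lane byte lookup, then keep only the low byte of each 32-bit lane.
    return _mm_and_si128( _mm_shuffle_epi8( table, index ), _mm_set1_epi32( 0xFF ) );
}
```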