LoganDark/stackblur-iter

Implement fast division

LoganDark opened this issue · 1 comments

So there's this thing called libdivide that I don't fully understand, but basically it makes it possible to perform integer divides without actually using division instructions, which is a lot faster.

The fastdivide crate has a really good explanation:

Division is a very costly operation for your CPU. You may have noticed that when the divisor is known at compile time, your compiler transforms the operations into a cryptic combination of a multiplication and bitshift.

The key idea is that, rather than computing

N / D

It is faster to compute (with k sufficiently large)

N * ( 2^k / D ) / (2^k)

If D is known in advance, (2^k / D) can be precomputed by the compiler.

Unfortunately if the divisor is unknown at compile time, the compiler cannot use this trick.

The point of fastdivide is to apply the same trick by letting you precompute a DivideU64 object.

Dividing takes up a lot of the CPU time spent blurring. If divisions can be eliminated, the blur could become a lot faster.

To test how much performance can be had here, I did a test of the current stackblur-iter (well actually the rgb branch), and then a version with division completely stripped out (i.e. it is a no-op). Here are my results:

With division:

took 35022μs to generate & 11421μs to upload buffer
took 34211μs to generate & 6235μs to upload buffer
took 29929μs to generate & 5187μs to upload buffer
took 32854μs to generate & 10104μs to upload buffer
took 29785μs to generate & 5284μs to upload buffer
took 32468μs to generate & 21958μs to upload buffer
took 31371μs to generate & 7881μs to upload buffer
took 31785μs to generate & 9126μs to upload buffer
took 30348μs to generate & 9654μs to upload buffer
took 30948μs to generate & 6289μs to upload buffer
took 30430μs to generate & 5373μs to upload buffer
took 34881μs to generate & 14143μs to upload buffer
took 33193μs to generate & 3612μs to upload buffer
took 29994μs to generate & 9046μs to upload buffer
took 30341μs to generate & 10188μs to upload buffer
took 30294μs to generate & 4809μs to upload buffer
took 30880μs to generate & 8723μs to upload buffer
took 30332μs to generate & 8853μs to upload buffer
took 33023μs to generate & 5494μs to upload buffer
took 31356μs to generate & 9066μs to upload buffer

Looks like it usually takes around 31ms to blur the buffer on my machine. (ignore upload, that is for displaying it in the window)

Without division:

took 27420μs to generate & 5178μs to upload buffer
took 27724μs to generate & 9695μs to upload buffer
took 29814μs to generate & 11635μs to upload buffer
took 26112μs to generate & 10097μs to upload buffer
took 23530μs to generate & 4504μs to upload buffer
took 24017μs to generate & 8413μs to upload buffer
took 27231μs to generate & 6686μs to upload buffer
took 24829μs to generate & 6802μs to upload buffer
took 25670μs to generate & 4691μs to upload buffer
took 27677μs to generate & 3024μs to upload buffer
took 23661μs to generate & 7811μs to upload buffer
took 23697μs to generate & 5162μs to upload buffer
took 24433μs to generate & 5137μs to upload buffer
took 23965μs to generate & 4851μs to upload buffer
took 23506μs to generate & 6776μs to upload buffer

Looks much different - usually around 24-25ms bt sometimes up to 29ms. Generally looks like at least a 15% improvement in performance.

(this is with Rayon+SIMD)