Squeeze extra performance out of mixed-radix algorithm
calebzulawski opened this issue · 3 comments
When profiling with perf
, a huge amount of time (40-60% of the entire transform) seems to be spent in the very first "narrow SIMD" pass, where the stride isn't large enough to fill an entire SIMD vector. Right now there is an optimization for radix-4 on AVX, but even with that the performance is underwhelming.
Worth pointing out: The "prime factor algorithm" in Fourier isn't mathematically accurate: https://en.wikipedia.org/wiki/Prime-factor_FFT_algorithm
The prime factor algorithm is the name for a specific algorithm where no twiddle factors are necessary, because the inputs are reordered based on some number theory formulas. (Worth looking into btw - it could be faster because you significantly reduce the number of floating point operations, or slower because you have to do re-indexing).
A more accurate name for the algorithm in fourier might be mixed radix or cooley-tukey.
Right, I knew I had seen that name somewhere. I was looking for a more generic name since the distinction between Cooley-Tukey and Stockham autosort generally doesn't matter, I definitely meant mixed-radix! Thanks
The PFA algorithm does interest me, though I wonder if the effort/complexity is worth it compared to the existing Bluestein's algorithm implementation.