Poor performance on prime multipliers
vitalsong opened this issue · 0 comments
Hi, I was wondering what method you use for FFT size not equal to a power of two (chirp z-transform or factorization). I picked up even lengths with prime factors and ran a test.
The results confused me:
Run on (12 X 2600 MHz CPU s)
CPU Caches:
L1 Data 32 KiB
L1 Instruction 32 KiB
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB
Load Average: 2.70, 3.20, 3.63
--------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------
BM_FFT_KISS/1018 1387 us 1386 us 444
BM_FFT_KISS/2246 6771 us 6771 us 103
BM_FFT_KISS/11200 337 us 337 us 2090
BM_FFT_KISS/11202 62048 us 61918 us 12
BM_FFT_KISS/11182 171242 us 171165 us 4
BM_FFT_KISS/11146 169774 us 169752 us 4
BM_FFT_KISS/16384 201 us 201 us 3476
Factorization result:
factor(1018) = [2, 509]
factor(2246) = [2, 1123]
factor(11200) = [2, 2, 2, 2, 2, 2, 5, 5, 7]
factor(11202) = [2, 3, 1867]
factor(11182) = [2, 5591]
factor(11146) = [2, 5573]
factor(16384) = 2^14
The difference is 500-800 times between the worst and the best option. Most users don't think about implementation details and can get serious performance issues.
The fftw3 implementation does not have this problem:
BM_FFTW3_FLOAT/1018 11.7 us 11.7 us 59338
BM_FFTW3_FLOAT/2246 43.4 us 43.4 us 16293
BM_FFTW3_FLOAT/11200 24.2 us 24.2 us 29191
BM_FFTW3_FLOAT/11202 230 us 230 us 2918
BM_FFTW3_FLOAT/11182 257 us 257 us 2734
BM_FFTW3_FLOAT/11146 260 us 260 us 2727
BM_FFTW3_FLOAT/16384 36.7 us 36.7 us 19199
As I understand it, you use factorization and calculate the FFT for each factor, and then combine the results. It's probably worth adding a separate stage for large prime factors (e.g. czt ) to smooth out this problem.
In my implementation, I use the czt algorithm. This gives the worst result on good multipliers, but it has a fixed difficulty.
Here is an example of my benchmark:
BM_FFT_DSPLIB/1018 46.0 us 46.0 us 15287
BM_FFT_DSPLIB/2246 230 us 230 us 3031
BM_FFT_DSPLIB/11200 1100 us 1100 us 634
BM_FFT_DSPLIB/11202 1138 us 1138 us 636
BM_FFT_DSPLIB/11182 1132 us 1132 us 628
BM_FFT_DSPLIB/11146 1107 us 1107 us 618
BM_FFT_DSPLIB/16384 222 us 222 us 3236
PS: All bencmarks for complex type