mborgerding/kissfft

Poor performance on prime multipliers

vitalsong opened this issue · 0 comments

Hi, I was wondering what method you use for FFT size not equal to a power of two (chirp z-transform or factorization). I picked up even lengths with prime factors and ran a test.

The results confused me:

Run on (12 X 2600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB
  L1 Instruction 32 KiB
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB
Load Average: 2.70, 3.20, 3.63
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_FFT_KISS/1018          1387 us         1386 us          444
BM_FFT_KISS/2246          6771 us         6771 us          103
BM_FFT_KISS/11200          337 us          337 us         2090
BM_FFT_KISS/11202        62048 us        61918 us           12
BM_FFT_KISS/11182       171242 us       171165 us            4
BM_FFT_KISS/11146       169774 us       169752 us            4
BM_FFT_KISS/16384          201 us          201 us         3476

Factorization result:

factor(1018) = [2, 509]
factor(2246) = [2, 1123]
factor(11200) = [2, 2, 2, 2, 2, 2, 5, 5, 7]
factor(11202) = [2, 3, 1867]
factor(11182) = [2, 5591]
factor(11146) = [2, 5573]
factor(16384) = 2^14

The difference is 500-800 times between the worst and the best option. Most users don't think about implementation details and can get serious performance issues.

The fftw3 implementation does not have this problem:

BM_FFTW3_FLOAT/1018        11.7 us         11.7 us        59338
BM_FFTW3_FLOAT/2246        43.4 us         43.4 us        16293
BM_FFTW3_FLOAT/11200       24.2 us         24.2 us        29191
BM_FFTW3_FLOAT/11202        230 us          230 us         2918
BM_FFTW3_FLOAT/11182        257 us          257 us         2734
BM_FFTW3_FLOAT/11146        260 us          260 us         2727
BM_FFTW3_FLOAT/16384       36.7 us         36.7 us        19199

As I understand it, you use factorization and calculate the FFT for each factor, and then combine the results. It's probably worth adding a separate stage for large prime factors (e.g. czt ) to smooth out this problem.

In my implementation, I use the czt algorithm. This gives the worst result on good multipliers, but it has a fixed difficulty.
Here is an example of my benchmark:

BM_FFT_DSPLIB/1018         46.0 us         46.0 us        15287
BM_FFT_DSPLIB/2246          230 us          230 us         3031
BM_FFT_DSPLIB/11200        1100 us         1100 us          634
BM_FFT_DSPLIB/11202        1138 us         1138 us          636
BM_FFT_DSPLIB/11182        1132 us         1132 us          628
BM_FFT_DSPLIB/11146        1107 us         1107 us          618
BM_FFT_DSPLIB/16384         222 us          222 us         3236

PS: All bencmarks for complex type