ARM support

Question

Opened this issue 3 years ago · 4 comments

Has there been any work to enable ARM NEON support, either in this repo or in any forks?

Answer 1 · 2021-12-30T12:32:38.000Z

Do you have any plans to support ARM in the near future?

Answer 2 · 2022-11-18T23:33:11.000Z

I added ARM64 NEON support here: https://github.com/Pflugshaupt/muFFT
I used a different CMake setup, but the actual SIMD parts work; it would just need proper patches to the internal CMake files.
Unfortunately, I found the result to be slower than other NEON FFT libraries (especially pffft). My use case was 1D real-to-complex FFTs on macOS with M1 CPUs.
I guess the main reason for the slowness is the way the complex numbers are arranged in the registers. Packing two complex numbers into each 128-bit register leads to a lot of permutes and shuffles that could be avoided with a split layout, where one 128-bit register holds real parts and another holds imaginary parts.
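To make the two layouts concrete, here is a minimal sketch in portable scalar C. It is not code from muFFT or pffft; the function names `deinterleave` and `interleave` are hypothetical, and the arrays only model how the data would be laid out across SIMD registers.

```c
#include <assert.h>
#include <stddef.h>

/* Interleaved layout: re0, im0, re1, im1, ...
   A 128-bit register of 32-bit floats then holds 2 complex numbers.
   Split layout: real parts in one array, imaginary parts in another.
   Two 128-bit registers then hold 4 complex numbers. */

/* Hypothetical helper: interleaved -> split, n = number of complex values. */
void deinterleave(const float *interleaved, float *re, float *im, size_t n)
{
    assert(interleaved && re && im);
    for (size_t i = 0; i < n; ++i) {
        re[i] = interleaved[2 * i];
        im[i] = interleaved[2 * i + 1];
    }
}

/* Hypothetical helper: split -> interleaved, n = number of complex values. */
void interleave(const float *re, const float *im, float *interleaved, size_t n)
{
    assert(interleaved && re && im);
    for (size_t i = 0; i < n; ++i) {
        interleaved[2 * i]     = re[i];
        interleaved[2 * i + 1] = im[i];
    }
}
```

A library that works in the split layout internally would only pay this conversion cost at its input and output boundaries, not inside every butterfly.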

Answer 3 · 2022-11-19T09:50:56.000Z

I'm curious, did you also benchmark the vDSP FFT against pffft (and muFFT) on an M1 Mac?

Answer 4 · 2022-11-19T12:28:33.000Z

Yes, but I only benchmark inside my current project, so this is not general at all. I do heaps of 8192-point real-to-complex 1D FFTs. For this, pffft on ARM64 with NEON is faster than vDSP and faster than my patch of muFFT.
As far as I know, vDSP has an FFT weakness on M1 Macs. pffft with NEON is much faster than vDSP for 2^10 to 2^16 real-to-complex FFTs. It's possible vDSP has improved since I tested it on the initial M1 systems, but it definitely had a weakness when it comes to FFTs on ARM (using the old calls that allow for FFT sizes > 2^12).
My patch of muFFT probably falls somewhere in the middle. I hoped it would be faster, but it wasn't. Maybe it could be optimized more (for instance by using NEON FMA), but pffft doesn't use those either.
My guess is that the big difference comes from how the complex numbers are arranged in memory and in the registers. muFFT uses a strict interleaved scheme, where each 128-bit register holds 2 complex numbers with 32-bit real and imaginary parts. pffft uses two 128-bit registers to hold 4 complex numbers, with all real parts in one register and all imaginary parts in the other. The pffft scheme leads to fewer shuffle and permute operations, especially in complex multiplications, where the muFFT routine does more shuffling than calculating.
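The difference shows up even in a scalar model of the complex multiply. The sketch below is not taken from either library, and the names `cmul_interleaved` and `cmul_split` are made up for illustration. In the split version every statement acts on whole arrays of same-kind values, so each line maps to a plain vector multiply, add, or subtract. In the interleaved version, real and imaginary parts sit in neighboring lanes, so a vectorized form has to duplicate and swap lanes before it can multiply.

```c
#include <assert.h>
#include <stddef.h>

/* Interleaved layout: a = {re0, im0, re1, im1, ...}, n complex values.
   Each output lane mixes real and imaginary inputs from adjacent lanes,
   which is what forces shuffles in a SIMD version. */
void cmul_interleaved(const float *a, const float *b, float *out, size_t n)
{
    assert(a && b && out);
    for (size_t i = 0; i < n; ++i) {
        float ar = a[2 * i], ai = a[2 * i + 1];
        float br = b[2 * i], bi = b[2 * i + 1];
        out[2 * i]     = ar * br - ai * bi;  /* real part */
        out[2 * i + 1] = ar * bi + ai * br;  /* imaginary part */
    }
}

/* Split layout: separate arrays for real and imaginary parts.
   Lane i of every input only ever meets lane i of another input,
   so no lane rearrangement is needed. */
void cmul_split(const float *ar, const float *ai,
                const float *br, const float *bi,
                float *out_re, float *out_im, size_t n)
{
    assert(ar && ai && br && bi && out_re && out_im);
    for (size_t i = 0; i < n; ++i) {
        float re = ar[i] * br[i] - ai[i] * bi[i];
        float im = ar[i] * bi[i] + ai[i] * br[i];
        out_re[i] = re;
        out_im[i] = im;
    }
}
```

Both functions compute the same products; only the data layout differs. Whether the shuffle overhead actually explains the measured gap would need profiling of the real NEON code paths.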