pq-crystals/kyber

NTT ideas

catid opened this issue · 4 comments

catid commented

I recently wrote some NTT code for my Leopard project and found a 50% speed boost using a decimation-in-time version that takes groups of 4 positions and operates on them (rather than 2 at a time). This effectively runs two layers of the NTT/FFT/AFFT/etc. in registers at a time, rather than laboriously writing each layer out and reading it back in again. Your AVX2 qhasm code was hard for me to follow, but I don't think it does this, so there may be a good win there.

e.g. https://github.com/catid/leopard/blob/master/LeopardFF16.cpp#L616
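To make the idea concrete, here is a minimal scalar sketch of merging two butterfly layers into one pass. It is not the Leopard or Kyber code: the modulus `Q = 257`, the generator `G = 3`, and all function names are assumptions chosen only to keep the example small. `ntt_one_layer_per_pass` touches memory once per layer, while `ntt_two_layers_per_pass` loads four values, applies both layers to them in locals, and stores them once.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative parameters, assumed for this sketch only (not Kyber's modulus):
// Q = 257 is prime and 3 generates its multiplicative group, so a primitive
// n-th root of unity exists for every power of two n <= 256.
constexpr uint32_t Q = 257;
constexpr uint32_t G = 3;

uint32_t pow_mod(uint32_t b, uint32_t e) {
    uint32_t r = 1;
    for (b %= Q; e; e >>= 1, b = b * b % Q)
        if (e & 1) r = r * b % Q;
    return r;
}

// pw[k] = w^k, where w is a primitive n-th root of unity mod Q.
std::vector<uint32_t> twiddles(std::size_t n) {
    std::vector<uint32_t> pw(n, 1);
    const uint32_t w = pow_mod(G, static_cast<uint32_t>((Q - 1) / n));
    for (std::size_t k = 1; k < n; ++k) pw[k] = pw[k - 1] * w % Q;
    return pw;
}

void bit_reverse(std::vector<uint32_t>& a) {
    for (std::size_t i = 1, j = 0; i < a.size(); ++i) {
        std::size_t bit = a.size() >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
}

// Cooley-Tukey butterfly: (x, y) -> (x + w*y, x - w*y) mod Q.
inline void butterfly(uint32_t& x, uint32_t& y, uint32_t w) {
    const uint32_t t = y * w % Q;
    y = (x + Q - t) % Q;
    x = (x + t) % Q;
}

// Baseline: every layer reads and writes the whole array.
void ntt_one_layer_per_pass(std::vector<uint32_t>& a) {
    const std::size_t n = a.size();
    const auto pw = twiddles(n);
    bit_reverse(a);
    for (std::size_t len = 1; len < n; len <<= 1)
        for (std::size_t start = 0; start < n; start += 2 * len)
            for (std::size_t j = 0; j < len; ++j)
                butterfly(a[start + j], a[start + j + len], pw[j * (n / (2 * len))]);
}

// Merged: layers `len` and `2*len` are applied to four values held in locals.
void ntt_two_layers_per_pass(std::vector<uint32_t>& a) {
    const std::size_t n = a.size();
    const auto pw = twiddles(n);
    bit_reverse(a);
    std::size_t len = 1;
    for (; 4 * len <= n; len <<= 2) {
        const std::size_t s1 = n / (2 * len), s2 = n / (4 * len);
        for (std::size_t start = 0; start < n; start += 4 * len)
            for (std::size_t j = 0; j < len; ++j) {
                const std::size_t i = start + j;
                uint32_t x0 = a[i], x1 = a[i + len], x2 = a[i + 2 * len], x3 = a[i + 3 * len];
                butterfly(x0, x1, pw[j * s1]);          // first layer
                butterfly(x2, x3, pw[j * s1]);
                butterfly(x0, x2, pw[j * s2]);          // second layer
                butterfly(x1, x3, pw[(j + len) * s2]);
                a[i] = x0; a[i + len] = x1; a[i + 2 * len] = x2; a[i + 3 * len] = x3;
            }
    }
    // Odd number of layers: one ordinary layer is left over (len == n / 2).
    if (len < n)
        for (std::size_t j = 0; j < len; ++j) butterfly(a[j], a[j + len], pw[j]);
}

// O(n^2) reference: out[k] = sum_i a[i] * w^(i*k) mod Q.
std::vector<uint32_t> ntt_naive(const std::vector<uint32_t>& a) {
    const std::size_t n = a.size();
    const auto pw = twiddles(n);
    std::vector<uint32_t> out(n, 0);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = 0; i < n; ++i)
            out[k] = (out[k] + a[i] * pw[(i * k) % n]) % Q;
    return out;
}
```

Both routines perform the same butterflies with the same twiddles, so their outputs should match for any power-of-two length up to 256 with inputs already reduced mod `Q`. The merged version halves the number of passes over the array. Any actual speedup in a vectorized implementation would depend on register pressure and how the twiddles are loaded, which this sketch does not model.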

Another suggestion is to use ISPC instead of qhasm, since it appears to support ARM NEON as well, but I don't have any personal experience (yet) to back up that suggestion.

catid commented

Also, the FFT can run in parallel, so maybe try OpenMP too? :)

e.g. https://github.com/catid/leopard/blob/master/LeopardFF16.cpp#L121
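As a rough illustration of why this parallelizes, every butterfly inside one layer touches its own pair of elements, so the butterflies of a layer can be split across threads. The sketch below is not taken from Leopard or Kyber: the modulus `Q = 257`, generator `G = 3`, and the names `root_powers` and `ntt_parallel` are assumptions made for the example. It only uses the standard `#pragma omp parallel for` directive, and compiles to an ordinary serial loop when OpenMP is not enabled.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative parameters, assumed for this sketch only.
constexpr uint32_t Q = 257;  // prime modulus
constexpr uint32_t G = 3;    // generator of the multiplicative group mod Q

// pw[k] = w^k, where w = G^((Q-1)/n) is a primitive n-th root of unity.
std::vector<uint32_t> root_powers(long n) {
    uint32_t w = 1, b = G;
    for (uint32_t e = static_cast<uint32_t>((Q - 1) / n); e; e >>= 1, b = b * b % Q)
        if (e & 1) w = w * b % Q;
    std::vector<uint32_t> pw(static_cast<std::size_t>(n), 1);
    for (long k = 1; k < n; ++k) pw[k] = pw[k - 1] * w % Q;
    return pw;
}

void ntt_parallel(std::vector<uint32_t>& a) {
    const long n = static_cast<long>(a.size());
    const auto pw = root_powers(n);

    // Bit-reversal permutation (kept serial here).
    for (long i = 1, j = 0; i < n; ++i) {
        long bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }

    for (long len = 1; len < n; len <<= 1) {
        const long stride = n / (2 * len);
        // The n/2 butterflies of this layer are independent: each index maps
        // to a distinct (lo, hi) pair, so no two threads write the same slot.
        // The implicit barrier at the end of the loop separates the layers.
        #pragma omp parallel for
        for (long idx = 0; idx < n / 2; ++idx) {
            const long j = idx % len;
            const long lo = (idx / len) * 2 * len + j;
            const long hi = lo + len;
            const uint32_t t = a[hi] * pw[j * stride] % Q;
            const uint32_t x = a[lo];
            a[lo] = (x + t) % Q;
            a[hi] = (x + Q - t) % Q;
        }
    }
}
```

Flattening the layer into a single loop over `n / 2` butterflies keeps the work evenly divisible even in the last layers, where there are only a few large blocks. Whether threading pays off at all depends on the transform size, since each layer ends in a synchronization point.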

catid commented

Nice work!