A fast, accurate integer convolution

This is an entry for Code Golf challenge Compute convolution quickly and accurately.

The algorithm used is Number Theoretic Transform, which is a variation of FFT that does not suffer from floating-point inaccuracies. The modulo used is P = 71 * 2^57 + 1 which is slightly larger than 10^19, and I devised a specialized a * b % P function here, which gave ~4x speedup for the entire convolution compared to (a as u128 * b as u128 % P as u128) as u64.

The main entrypoint for the algorithm is solve_ntt.

Additionally, three variants of the same algorithm is implemented as solve_ntt_par, solve_ntt_par2, and solve_ntt_par3, which are some attempts at applying data-independent parallelism using rayon. (Trying to parallelize individual butterflying turns out to be harmful, apparently because it causes much more cache misses. Will it perform better if cache-line-sized chunks are used instead?)

Use cargo bench to run benchmarks locally.

I don't claim this to be the fastest code that supports up to 10^19 in the result of convolution. Even the core algorithm itself (Cooley-Tukey) can be substituted for different ones such as Stockham for significant improvement.

Bubbler-4/cg263748

A fast, accurate integer convolution