A Poly1305 implementation written in assembly. The assembly file is written so that it supports Linux, macOS, and Windows. Currently there is only a scalar implementation, but it is already extremely fast. I'm not sure how much faster SSE or AVX2 would be unless one were processing multiple data streams.
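
For checking the tags the assembly produces, a slow big-integer reference can be handy. The sketch below follows the Poly1305 definition in RFC 8439 and is not taken from the assembly file. The function name `poly1305_tag` is hypothetical, and Python's arbitrary-precision integers are not constant-time, so this is only for verification.

```python
def poly1305_tag(key: bytes, msg: bytes) -> bytes:
    """Reference Poly1305 (RFC 8439). Hypothetical helper for testing only."""
    if len(key) != 32:
        raise ValueError("Poly1305 key must be 32 bytes")

    # First 16 bytes: r, clamped as the RFC requires. Last 16 bytes: s.
    r = int.from_bytes(key[:16], "little") & 0x0FFFFFFC0FFFFFFC0FFFFFFC0FFFFFFF
    s = int.from_bytes(key[16:], "little")
    p = (1 << 130) - 5  # the prime 2^130 - 5

    acc = 0
    for i in range(0, len(msg), 16):
        # Each block (the last may be short) gets a 0x01 byte appended,
        # then is read as a little-endian integer.
        n = int.from_bytes(msg[i:i + 16] + b"\x01", "little")
        acc = ((acc + n) * r) % p

    # Add s and keep the low 128 bits as the tag.
    return ((acc + s) & ((1 << 128) - 1)).to_bytes(16, "little")


if __name__ == "__main__":
    # Test vector from RFC 8439, section 2.5.2.
    key = bytes.fromhex(
        "85d6be7857556d337f4452fe42d506a8"
        "0103808afb0db2fd4abff6af4149f51b"
    )
    print(poly1305_tag(key, b"Cryptographic Forum Research Group").hex())
```

Comparing this output against the assembly routine on the same key and message is a quick sanity check on any platform.
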
Computing the Poly1305 tag over 1 GB and 512 KB of random data gave the following results on average (the throughput figures are consistent with binary units, i.e. 1 GB = 2^30 bytes and 512 KB = 2^19 bytes):
Processor | 1 GB time | 1 GB throughput | 512 KB time | 512 KB throughput |
---|---|---|---|---|
Xeon E3-1230 v5 | 0.434929 s | 2.299 GB/s | 0.000238 s | 2.052 GB/s |
Ryzen 5 3600X | 0.302994 s | 3.300 GB/s | 0.000153 s | 3.191 GB/s |
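
The throughput columns can be reproduced from the timings. This is a small sketch that assumes binary units (1 GB = 2^30 bytes), which is what matches the reported figures. The helper name `throughput_gb_per_s` is made up for illustration.

```python
GB = 1 << 30   # assumption: binary gigabyte, 2^30 bytes
KB = 1 << 10   # assumption: binary kilobyte, 2^10 bytes


def throughput_gb_per_s(num_bytes: int, seconds: float) -> float:
    """Bytes processed per second, expressed in (binary) GB/s."""
    return num_bytes / seconds / GB


# (input size in bytes, average time in seconds) taken from the table above
results = {
    "Xeon E3-1230 v5": [(GB, 0.434929), (512 * KB, 0.000238)],
    "Ryzen 5 3600X": [(GB, 0.302994), (512 * KB, 0.000153)],
}

if __name__ == "__main__":
    for cpu, runs in results.items():
        for size, secs in runs:
            rate = throughput_gb_per_s(size, secs)
            print(f"{cpu}: {size} bytes in {secs} s = {rate:.3f} GB/s")
```
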