Hardware offload or general crypto optimizations for Cavium Octeon
mzpqnxow opened this issue · 5 comments
Hello there,
Some time back I began using WireGuard as both a server and client on a small fleet of Ubiquiti EdgeRouter devices, specifically the ones with the Cavium Octeon MIPS64 chip, which is practically all of them, save one or two models.
I did a little bit of reading, as I was curious about what options there were for acceleration to reduce CPU load, improve throughput, or (ideally) both.
I read this (very old) UBNT forum post which you participated in. To jog your memory, here was your comment:
I haven't even begun optimizing for the EdgeRouter's architecture. I'll need to write MIPS64 primitives and maybe even figure out how to utilize the offloading chip. The EdgeRouter kernel does not have CONFIG_PADATA, which means we're stuck to one CPU per flow, instead of nicely parallelizing encryption across all CPUs. I'll be able to get that aspect sorted eventually though. Completely unoptimized on my ERL3, I get around 80 mb/s, which isn't bad for a first run. But it's nowhere near the performance it should be getting and eventually will be getting. This benchmark will only get faster, of course.
I noticed a UBNT developer/rep replied essentially offering to enable various features in their kernel configuration if it facilitated this work, which seemed encouraging
I'll get to the point now :)
- Has there been any work done since that time to substantially optimize for Octeon platforms?
- Are you seriously planning any such work?
- If you are planning to look further into this, are there blockers (e.g. UBNT not playing along with their kernel build config) or is it just the usual case of no time to allocate?
The way I understand things, there are three ways to go about optimizing:
- Hand-tweaking the most expensive crypto implementations. I assume the most important for throughput are ChaCha20 & Poly1305. I'm also guessing, however, that your implementation is already a lean version of the reference implementation, compiled with the appropriate optimizations
- Utilizing the Cavium Octeon SDK compiler with any special optimization flags/features it may offer on top of vanilla gcc/clang. I don't know much about the Cavium Octeon SDK at all and I realize it's possible that vanilla gcc or clang already has this
- Taking advantage of hardware features, assuming any of them are relevant/useful for this application
Obviously, utilizing available hardware features is ideal, since it can reduce the load on the CPU. I'm not deeply familiar with what the Octeon offers, but I do know that there is a crypto co-processor, and that the UBNT devices offload NAT and "plain" packet routing (as well as IPsec and VLANs) to hardware
Any thoughts/comments are appreciated. I'm happy to sponsor the effort, but the sum of money I would offer (~$500) is probably a bit of a joke compared to the cost of the time to do this. Regardless, it's offered. I'm also happy to go through all of the prior work and reference manuals to cut down on the "grunt" work ;)
I'm self-interested here because of the UBNT devices I'm responsible for, but I know Cavium has a presence in a lot of the larger enterprise devices (layer-7 firewalls, VPNs, etc.) so maybe there's some value there. Though in that case it would be nice for one of those vendors to sponsor the work...
Thanks!
Also, I realize that the work for this would probably be done under the wireguard-linux project, but it seemed most appropriate to open the issue here first
There is already https://git.zx2c4.com/wireguard-linux/tree/arch/mips/crypto/poly1305-mips.pl
Feel free to additionally port https://git.zx2c4.com/wireguard-linux/tree/arch/mips/crypto/chacha-core.S to mips64.
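For anyone picking up that port: the operation such an assembly routine has to implement is the ChaCha quarter round, applied in column and diagonal passes over a 16-word state. Below is a minimal portable C sketch of that core, written from RFC 8439 and not taken from the wireguard-linux tree. The function names are hypothetical, and the self-test uses the quarter-round test vector from RFC 8439 section 2.1.1.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static inline uint32_t rotl32(uint32_t v, unsigned n)
{
    return (v << n) | (v >> (32 - n));
}

/* One ChaCha quarter round on state words a, b, c, d (RFC 8439, section 2.1).
 * Only 32-bit add, xor and rotate are used. */
static void chacha_quarter_round(uint32_t x[16], int a, int b, int c, int d)
{
    x[a] += x[b]; x[d] = rotl32(x[d] ^ x[a], 16);
    x[c] += x[d]; x[b] = rotl32(x[b] ^ x[c], 12);
    x[a] += x[b]; x[d] = rotl32(x[d] ^ x[a], 8);
    x[c] += x[d]; x[b] = rotl32(x[b] ^ x[c], 7);
}

/* One double round: four column rounds, then four diagonal rounds.
 * ChaCha20 runs ten of these per 64-byte block. */
static void chacha_double_round(uint32_t x[16])
{
    chacha_quarter_round(x, 0, 4,  8, 12);
    chacha_quarter_round(x, 1, 5,  9, 13);
    chacha_quarter_round(x, 2, 6, 10, 14);
    chacha_quarter_round(x, 3, 7, 11, 15);
    chacha_quarter_round(x, 0, 5, 10, 15);
    chacha_quarter_round(x, 1, 6, 11, 12);
    chacha_quarter_round(x, 2, 7,  8, 13);
    chacha_quarter_round(x, 3, 4,  9, 14);
}

/* Hypothetical self-test against the RFC 8439 section 2.1.1 vector. */
static void chacha_quarter_round_selftest(void)
{
    uint32_t x[16] = { 0x11111111u, 0x01020304u, 0x9b8d6f43u, 0x01234567u };

    chacha_quarter_round(x, 0, 1, 2, 3);
    assert(x[0] == 0xea2a92f4u);
    assert(x[1] == 0xcb1cf8ceu);
    assert(x[2] == 0x4581472eu);
    assert(x[3] == 0x5881c4bbu);
}
```

A reference like this is mainly useful for checking a hand-written MIPS64 version word for word. Whether a port beats compiler output on Octeon would still need to be measured on the actual hardware.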
Thank you! They told me MIPS asm would be a useless skill... :P
Well, increasingly arcane, but not quite useless yet. Odd MIPS knowledge floating around in my head has helped me out in all sorts of unforeseen ways over the years... It's also fun to write.