RustCrypto/block-ciphers

AES implementation should be chosen at runtime rather than compile time

upsuper opened this issue · 36 comments

AES implementation should be chosen at runtime rather than compile time; otherwise it is very hard for people to ship products with this crate, because they cannot choose the environment a product will run on.

Switching between implementations should be done at as high a level as possible, not at the block cipher level. There is a nocapture feature for this use case, but you will need to make sure the necessary target features are enabled for proper inlining.

I believe that ideally we need support for this at the language level. I've described my view here.

What I'd like in the short term:

  • On platforms that have no aesni impl, use the software fallback
  • On platforms that do (presently x86/x86_64), require that the appropriate target features are enabled (i.e. +aes,+ssse3) and refuse to compile if they are not (see the sketch below). Use runtime detection to determine if AES-NI is available, and if not, fall back to the software implementation
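
A minimal sketch of such a compile-time guard (the opt-out feature name here is hypothetical, not something the crate currently defines):

#[cfg(all(
    any(target_arch = "x86", target_arch = "x86_64"),
    not(all(target_feature = "aes", target_feature = "ssse3")),
    not(feature = "force-soft") // hypothetical opt-out feature
))]
compile_error!(
    "AES-NI support requires RUSTFLAGS=\"-Ctarget-feature=+aes,+ssse3\"; \
     enable the (hypothetical) force-soft feature to opt out"
);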

Why not #[target_feature(enable = "aes")]

Because it silently falls back to only using the software implementation of AES. In fact, if you compile with RUSTFLAGS=-Ctarget-feature=+aes today, that is exactly what happens: using aesni requires +aes,+ssse3, and +aes alone will silently fall back to aes-soft as the one and only option. It's the kind of "gotcha" users of this crate shouldn't have to worry about, and I think the best way to surface it at present is through a compile error.
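
The silent fallback comes from cfg-based backend selection along these lines (simplified; not the crate's exact code):

#[cfg(all(target_feature = "aes", target_feature = "ssse3"))]
pub use aesni::Aes128;

#[cfg(not(all(target_feature = "aes", target_feature = "ssse3")))]
pub use aes_soft::Aes128;

With only +aes enabled, the second cfg matches and aes-soft is used with no warning.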

I have plastered Miscreant with instructions to use RUSTFLAGS or ~/.cargo/config to enable the appropriate target features, but even then the required features changed, and I found myself having trouble reconfiguring all of the places which were using just +aes instead of +aes,+ssse3. Absent that, I am pretty sure nearly everyone using the aes crate is getting aes-soft instead of aesni.

It would be nice if the ergonomics of target features were better, but failing that, if the aes crate silently falls back to aes-soft with no runtime AES-NI detection unless the target features are correctly configured, then nobody is going to bother to configure their builds correctly, and everything will use the software implementation. I would consider that a bad outcome, as runtime AES-NI detection is the norm for crypto libraries outside the Rust ecosystem.

kazcw commented

On platforms that do (presently x86/x86_64) require the appropriate target features are enabled (i.e. +aes,+ssse3) and refuse to compile if they are not. Use runtime detection to determine if AES-NI is available, and if not, fall back to the software implementation

I don't think that's sound. Once you enable +ssse3 for the whole crate, the compiler could choose to use ssse3 anywhere.

For groestl-aesni, I used a combination of:

  • #[inline(always)] all the implementations
  • sets of wrappers for different #[target_feature(enable = ...)]
  • lazy_static selection between wrappers (see the sketch below)

It works fine; the version with AES-NI detected at runtime runs just as fast as compiling with -Ctarget-feature=+aes, and much faster than the aes-soft path.
kazcw/groestl-aesni@d9fd06f
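
That pattern, in a minimal hypothetical sketch using the lazy_static crate (function names, signatures, and the placeholder body are illustrative, not the actual groestl-aesni code; is_x86_feature_detected! requires std):

use lazy_static::lazy_static;

// The core routine is #[inline(always)] so its body is codegen'd with
// whatever target features the *caller* has enabled.
#[inline(always)]
unsafe fn compress_impl(state: &mut [u8; 64], block: &[u8; 64]) {
    // Placeholder body; the real code would use core::arch intrinsics here.
    for (s, b) in state.iter_mut().zip(block.iter()) {
        *s ^= *b;
    }
}

// Wrapper compiled with AES-NI/SSSE3 enabled; the inlined body gets the
// accelerated codegen.
#[target_feature(enable = "aes,ssse3")]
unsafe fn compress_aesni(state: &mut [u8; 64], block: &[u8; 64]) {
    compress_impl(state, block)
}

// Plain wrapper for CPUs without AES-NI.
unsafe fn compress_soft(state: &mut [u8; 64], block: &[u8; 64]) {
    compress_impl(state, block)
}

lazy_static! {
    // Detect once at first use; every later call goes through the cached
    // fn pointer. Call site: unsafe { (*COMPRESS)(&mut state, &block) }
    static ref COMPRESS: unsafe fn(&mut [u8; 64], &[u8; 64]) =
        if is_x86_feature_detected!("aes") && is_x86_feature_detected!("ssse3") {
            compress_aesni
        } else {
            compress_soft
        };
}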

That's better than what this crate presently does; however, unless I'm mistaken, it still appears to require the crate's consumer to enable the +aes target feature, and if they don't do that, it will omit AES-NI support. Is that the case?

I think all builds for all x86(_64) targets should always attempt to detect and use AES-NI, and if the required target features aren't enabled, the crate should refuse to compile. There could be a way to opt out for users who really don't want AES-NI, but the default should be to attempt to use it, IMO.

Barring that, I think no one will enable the target features and nothing will even attempt to use AES-NI.

kazcw commented

That's better than what this crate presently does; however, unless I'm mistaken, it still appears to require the crate's consumer to enable the +aes target feature, and if they don't do that, it will omit AES-NI support. Is that the case?

No, that's what #[target_feature(enable = ...)] does: it enables the feature for that one function, so the consumer doesn't need any crate-wide flags. If it doesn't work, it's because the functions containing the intrinsics didn't get inlined into the target_feature fn; to be sure it's effective, you have to use #[inline(always)] all the way down the call stack.

kazcw commented

(Btw, the #[cfg(target_feature...)] stuff in groestl is kind of a red herring: I did that so that if you set -Ctarget-cpu and -Ctarget-feature for a custom build, it'll skip building impls inferior to whatever is configured at compile time; it's just a compile-speed optimization.)

Interesting. I’ll have to look into it more.

kazcw commented

Gah. What I've been doing doesn't actually work. I was fooled by rust-lang/stdarch#323, and the functions I thought were LLVM's fallbacks are actually just Rust's pointless wrappers. The bug is here: cryptocorrosion/cryptocorrosion#7

@kazcw oof, sorry to hear, almost thought you had this one licked!

Any plans for how to address it?

kazcw commented

@tarcieri Yeah, it won't be too bad. The upshot is this is forcing me to put everything through traits, which was on the roadmap anyway (cryptocorrosion/cryptocorrosion#6), but I'd been putting it off because ad-hoc polymorphism by switching out imports took zero work. It's not turning out to be as hard as I thought, and the new API is cleaner: the user parameterizes code that will be compiled with different feature sets by a Machine trait, whose impls are ZSTs defining a set of types that indicate what kind of registers to use, what CPU flags are enabled, etc.; possessing a particular Machine instance serves as a marker that its associated types can safely be created. This also removes the barrier to AVX2; or rather, what I needed to do to get AVX2 working turns out to also be what I need to make everything else work anyway.
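
A hedged sketch of that Machine-trait shape, assuming x86_64 (all names are illustrative, not the actual ppv-lite86 API):

use core::arch::x86_64::__m128i;

// Impls of Machine are zero-sized markers; holding an instance asserts
// that the CPU features its associated types rely on are present, which
// is why constructing one is unsafe.
trait Machine: Copy {
    type U128: Copy; // e.g. a wrapper over __m128i on x86
    unsafe fn instance() -> Self;
}

#[derive(Copy, Clone)]
struct SseMachine; // marker: SSSE3/AES available

impl Machine for SseMachine {
    type U128 = __m128i;
    unsafe fn instance() -> Self {
        SseMachine
    }
}

// Generic code is parameterized by the machine; monomorphization yields
// one copy per feature set, each compiled with its own target features.
// e.g.: let m = unsafe { SseMachine::instance() }; run(m);
fn run<M: Machine>(_m: M) {
    // ... operate on M::U128 values ...
}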

I already have ports of JH and BLAKE working on the new API. I can probably do ChaCha later today, and then I'll release the new versions, with packed_simd disabled for now. Putting packed_simd through the traits will be about as much work as implementing them on coresimd has been, and packed_simd is less crucial now that ppv-lite86 can do AVX2.

We are hitting this issue on Debian when trying to package these RustCrypto AES crates, as Debian's amd64 baseline does not include AES-NI.

All you need to do is use this: https://doc.rust-lang.org/std/macro.is_x86_feature_detected.html

@infinity0 we are aware of that macro and have made past attempts to use it. Offhand I don’t remember what the objections were.

I saw the old discussion above about stdarch, and I believe today's stdarch uses the newer macros I mentioned. Hopefully the objections can be overcome straightforwardly?

See this PR: RustCrypto/universal-hashes#11

To summarize: it appears that using is_x86_feature_detected! in the hot path (which ultimately executes the CPUID instruction) has some potential performance problems, such as pipeline stalls(?).

I think it’d be good to confirm the performance impact empirically.

@infinity0 can you spell out Debian's requirements a little more specifically? Why can't you omit the RUSTFLAGS to enable AES-NI?

In order to support as many machines as possible, the Debian "amd64" architecture guarantees support even for x64 CPUs without AES-NI. That means we can't use these RUSTFLAGS. Debian also doesn't want to build everything twice just to support CPUs with AES-NI.

And I assume this is the same for pretty much any other FOSS OS distro and not just Debian.

That means we can't use these RUSTFLAGS.

Yes, that’s what I’m saying: to even use AES-NI you have to configure RUSTFLAGS to do so, so if you want to avoid AES-NI and fall back on the bitsliced implementation, simply don’t configure them.

simply don’t configure them.

Yes, that's what I'm already doing. However, this means no Debian users will be able to use AES-NI, even if their CPU supports it. It would be better to detect it at runtime and enable it on that condition.

Ok, yes I agree, but the tentative plan is to do that sort of runtime selection in higher-level crates like the AEAD crates (e.g. aes-gcm, chacha20poly1305)

Well, "how high do you want to go", why not an even higher-level application crate? If performance is a concern can't you use lazy_static or something to cache the detection result?

We need to benchmark the performance of various strategies

Fair enough, although I would expect/hope that the macro itself is already well-optimized, so that in practice every call except the first is as cheap as checking a boolean.

My understanding is it invokes the CPUID instruction every time, which is cheap but may have negative impacts on pipelining.

The CPUID instruction itself is certainly not cheap (measured cost can be as high as 1500 cycles), but is_x86_feature_detected! caches the result in an atomic variable, so it's relatively cheap after the first call; I am still hesitant to put it into a tight loop, though. It also probably will not play well with the optimized AES-CTR code, which was specifically written to keep all key data in XMM registers when possible (runtime detection will mean the CPU may have to unload key data into general-purpose registers). Another (minor) problem is that is_x86_feature_detected! is not available on no_std; we could implement our own detection, as was done in getrandom. It's certainly not ideal, but it should work fine.
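
A minimal sketch of that kind of hand-rolled, no_std-friendly caching, in the spirit of getrandom's approach (the function name is illustrative):

use core::sync::atomic::{AtomicU8, Ordering};

// 0 = not yet checked, 1 = AES-NI absent, 2 = AES-NI present
static AES_NI: AtomicU8 = AtomicU8::new(0);

#[cfg(target_arch = "x86_64")]
fn have_aesni() -> bool {
    match AES_NI.load(Ordering::Relaxed) {
        0 => {
            // First call only: run CPUID (the expensive part) and cache.
            // Leaf 1: ECX bit 25 = AES-NI, ECX bit 9 = SSSE3.
            let ecx = unsafe { core::arch::x86_64::__cpuid(1) }.ecx;
            let detected = ecx & (1 << 25) != 0 && ecx & (1 << 9) != 0;
            AES_NI.store(if detected { 2 } else { 1 }, Ordering::Relaxed);
            detected
        }
        v => v == 2, // every later call is a single atomic load
    }
}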

A much more serious problem is the incompatible layouts of the bitsliced and AES-NI implementations. Sure, we could use an enum, but aesni::Aes256 takes only 480 bytes, while aes_soft::Aes256 takes as much as 4320 bytes (for Aes128 it's 352 and 3168 bytes, respectively).

I guess we should benchmark runtime detection at the level of the aes crate, and if the performance impact is minor enough (say, less than 1%), go with that solution for users' convenience. But I am not sure what to do about the memory layout issue.

Another option to consider is gating runtime detection behind a cargo feature (which itself would depend on std). I don't think it would be too hard to add to the aes crate in its current form, since it's otherwise just a facade.

A much more serious problem is incompatible layouts between the bitsliced and AES-NI implementations.

Yes, that is unfortunate, but I'm not sure what else there is to be done (other than a more efficient bitsliced implementation).

Another option would be to detect at runtime once, then set some function pointers and call through those. Branch predictors are very good at noticing that every call through a function pointer goes to the same place.

As I wrote in my previous message, branching is the least of our problems here (and the pointer chasing will probably be worse than a simple branch on an enum tag). The main problem is that by default we keep cipher state on the stack, and states in aes-soft are nearly an order of magnitude bigger than states in aesni. If we kept cipher state on the heap, it wouldn't be a problem, but that would go against the goal of supporting bare no_std by default.

I guess we could optionally store aes-soft state on the heap, but ideally it would be better to use a more efficient implementation instead.

I think we could feature-gate autodetection. Yes, it would increase AES-NI stack usage considerably, but in some cases being able to ship a relocatable binary which supports both is more important than that.

@newpavlov So, the blocker is having a software implementation to fall back to that has a reasonably sized state? (As well as ensuring that aesni can maintain the desirable property of keeping all its state in registers?)

@tarcieri Yes, eating some extra stack or heap space (especially if it doesn't need initializing when not used) would be a small price to pay to be able to ship binaries that support systems without hardware acceleration but that can use hardware acceleration if available.

Also, I'd happily have a heap-based state for the software implementation if that means the hardware-accelerated implementation has less overhead and doesn't have a pile of unused memory.

If we kept cipher state on the heap, it wouldn't be a problem, but it would go against the goal of supporting bare no_std by default.

Why not let the user choose? If it's on "the stack", the user can simply move it to the heap by boxing it, right? This could be documented somewhere, and the default fn new() could return a Box<Aes> instead of a plain Aes (with a raw_new provided that returns a plain Aes, available in no_std). Embedded users will likely want to store the key state in static storage anyway, to keep it from taking up precious stack space.
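
A hypothetical sketch of that API shape (the type name, key size, and state size are illustrative only; the state size here just mirrors the 4320-byte aes_soft::Aes256 figure above):

pub struct Aes {
    // Expanded round keys; large for the software implementation.
    round_keys: [u32; 1080], // 4320 bytes
}

impl Aes {
    /// Constructs the state in place; usable on bare no_std.
    pub fn raw_new(_key: &[u8; 32]) -> Aes {
        Aes { round_keys: [0u32; 1080] } // real code would expand the key here
    }

    /// Convenience constructor that boxes the large state
    /// (under no_std + alloc this would be alloc::boxed::Box).
    #[cfg(feature = "alloc")]
    pub fn new(key: &[u8; 32]) -> Box<Aes> {
        Box::new(Self::raw_new(key))
    }
}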

Alternatively, maybe a mechanism like ManagedSlice could be investigated, where the backing storage of the object is selected by the user (either a reference, or an owned object if the alloc feature is enabled).

I think there's something even better we can do to get the best of both worlds.

First, add an autodetect Cargo feature which depends on aesni, aes-soft, and alloc.

Next, define enums for the various key sizes which look like this:

// Hypothetically we could wrap these enums in an opaque struct too
pub enum Aes128 {
    AesNi(aesni::Aes128),
    AesSoft(Box<aes_soft::Aes128>)
}

In other words, only box aes_soft, but let aesni remain on the stack. This would avoid heap allocations when aesni is available, and also avoid using massive amounts of stack space when it's not and aes_soft is used instead.
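
Construction could then dispatch once at instantiation time; a hedged sketch (constructor signatures are simplified and may not match the crates' exact APIs, and is_x86_feature_detected! requires std):

impl Aes128 {
    pub fn new(key: &[u8; 16]) -> Self {
        // Runtime detection happens once per cipher instantiation; the
        // hot encrypt/decrypt paths just match on the enum tag.
        if is_x86_feature_detected!("aes") && is_x86_feature_detected!("ssse3") {
            Aes128::AesNi(aesni::Aes128::new(key.into()))
        } else {
            Aes128::AesSoft(Box::new(aes_soft::Aes128::new(key.into())))
        }
    }
}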

Now that we've implemented AES fixslicing (see #176, #177, and #185 among others), the memory consumed by the aes::soft backend has changed significantly:

Key size   AES-NI size   32-bit fixsliced size   64-bit fixsliced size
AES-128    352 bytes     352 bytes               704 bytes
AES-192    416 bytes     416 bytes               832 bytes
AES-256    480 bytes     480 bytes               960 bytes

Or in other words, AES-NI and 32-bit fixslicing use the same amount of space, and 64-bit fixslicing needs twice as much as the other two. This is because AES-NI needs separate sets of encryption and decryption round keys, 32-bit fixslicing can share round keys but operates over 2 blocks in parallel, and 64-bit fixslicing can also share them but operates over 4 blocks in parallel.
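
To make the arithmetic concrete (using AES-128's 11 round keys of 16 bytes each): AES-NI stores separate encryption and decryption schedules, 2 × 11 × 16 = 352 bytes; 32-bit fixslicing shares one schedule but replicates each round key across its 2 parallel blocks, 11 × 2 × 16 = 352 bytes; 64-bit fixslicing replicates across 4 blocks, 11 × 4 × 16 = 704 bytes. The same pattern with 13 and 15 round keys yields the AES-192 and AES-256 rows.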

With the size difference eliminated/reduced, it seems like my suggestion above of using Box to reduce stack space is no longer necessary, or at least not as much of a consideration.

I've opened a PR which implements runtime autodetection for AES-NI: #208