hsivonen/simd

Would it make sense to enable -ffast-math for SIMD types?

Opened this issue · 7 comments

As discussed here:

rust-lang/rust#21690

-ffast-math can be very useful for speeding up floating-point operations, particularly by making vectorization easier. I'm seeing a ~30% runtime reduction for matrix multiplication in clang when compiling with -ffast-math in this benchmark:

https://github.com/pedrocr/rustc-math-bench

As mentioned in the Rust issue, the intrinsics already allow part of this, and a wrapper type for f32/f64 can already be implemented. Since SIMD types are already aimed at vectorization and the cost of wrapping/unwrapping is already there, would it make sense to enable -ffast-math for them anyway? Alternatively, if there are cases where that doesn't make sense, would it be useful to provide both slow and fast versions of all the types for convenience?

This won't be much of a response, but I personally don't know. I don't know anything about -ffast-math or what problems it's solving. I don't know what it means to "enable -ffast-math for them." Duplicating every single vector type seems rather extreme.

I think I'd need to see a lot more detail on this before I'd personally be comfortable doing anything.

I'm far from an expert on the topic, but my understanding is that the normal IEEE 754 precision guarantees don't allow certain arithmetic reorderings. This keeps LLVM from generating more efficient code even with normal floating-point instructions and can severely limit its ability to auto-vectorize. Enabling -ffast-math for the vector types would in essence mean using things like fadd_fast instead of the normal floating-point add, giving the compiler the freedom to rearrange the math. I suspect this is the right tradeoff for most SIMD applications but maybe not all. Having that be a feature in the simd crate instead of a different type might be a cleaner option.
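
To make the reordering point concrete, here is a minimal sketch in stable Rust showing that f32 addition is not associative, which is why the compiler can't regroup the operations on its own. The function names and the input values are made up for illustration:

```rust
/// Adds left to right: (a + b) + c.
pub fn sum_left_to_right(a: f32, b: f32, c: f32) -> f32 {
    (a + b) + c
}

/// Same three inputs, regrouped: a + (b + c).
pub fn sum_reassociated(a: f32, b: f32, c: f32) -> f32 {
    a + (b + c)
}

fn main() {
    // Near 1e8 the spacing between adjacent f32 values is 8,
    // so adding 1.0 to -1e8 rounds straight back to -1e8.
    let (a, b, c) = (1.0e8_f32, -1.0e8_f32, 1.0_f32);
    println!("left to right: {}", sum_left_to_right(a, b, c)); // prints 1
    println!("reassociated:  {}", sum_reassociated(a, b, c)); // prints 0
}
```

Both groupings are valid results under fast-math rules, but only the first one is what the source code asked for under strict IEEE 754 semantics.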

fadd_fast is a compiler intrinsic, which is basically permanently unstable. I'd rather not add a dependency on such things in this crate since there is no path to stability.

I feel like there is a much larger design space. For example, instead of duplicating all of the floating point vector types, we could just expose "fast" arithmetic operations as normal functions, kind of like how we have wrapping_add and saturating_add on the number types today.
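
A rough sketch of what the method-style option could look like, in the spirit of wrapping_add. The FastMath trait and its method names are hypothetical and not part of any existing crate, and the bodies use plain operators as stand-ins so that this compiles on stable; a real version would presumably call something like the fadd_fast intrinsic:

```rust
/// Hypothetical extension trait; the names are made up for illustration.
pub trait FastMath {
    fn fast_add(self, rhs: Self) -> Self;
    fn fast_mul(self, rhs: Self) -> Self;
}

impl FastMath for f32 {
    // Stand-in bodies: ordinary IEEE 754 operations, not real fast-math ones.
    fn fast_add(self, rhs: f32) -> f32 {
        self + rhs
    }
    fn fast_mul(self, rhs: f32) -> f32 {
        self * rhs
    }
}

/// Two-element dot product, a0*b0 + a1*b1, spelled with method calls.
pub fn dot2(a: [f32; 2], b: [f32; 2]) -> f32 {
    a[0].fast_mul(b[0]).fast_add(a[1].fast_mul(b[1]))
}

fn main() {
    println!("{}", dot2([1.0, 2.0], [3.0, 4.0])); // prints 11
}
```

The opt-in is explicit at every operation, at the price of losing the usual operator syntax.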

Another design point is to add a FFast<T> type, kind of like our std::num::Wrapping<T> type for wrapping arithmetic.
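
As a sketch of that wrapper idea, assuming a shape similar to Wrapping<T>: the FFast type below is hypothetical, and its operator impls forward to the ordinary operators as stand-ins so that it compiles on stable. A real version would presumably route through the fast intrinsics instead:

```rust
use std::ops::{Add, Mul};

/// Hypothetical wrapper, modeled loosely on std::num::Wrapping<T>.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct FFast<T>(pub T);

impl<T: Add<Output = T>> Add for FFast<T> {
    type Output = FFast<T>;
    fn add(self, rhs: Self) -> Self::Output {
        // Stand-in: plain addition, not a real fast-math add.
        FFast(self.0 + rhs.0)
    }
}

impl<T: Mul<Output = T>> Mul for FFast<T> {
    type Output = FFast<T>;
    fn mul(self, rhs: Self) -> Self::Output {
        // Stand-in: plain multiplication.
        FFast(self.0 * rhs.0)
    }
}

/// Computes a*b + c using ordinary operators on the wrapped values.
pub fn mul_add_fast(a: f32, b: f32, c: f32) -> f32 {
    (FFast(a) * FFast(b) + FFast(c)).0
}

fn main() {
    println!("{}", mul_add_fast(2.0, 3.0, 1.0)); // prints 7
}
```

The arithmetic keeps its normal operator syntax; the wrapping and unwrapping happen only at the edges.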

Finally, I'm not exactly sure why -ffast-math belongs with SIMD. They seem like orthogonal concerns to me? For example, it seems like you'd want to be able to do fast math on normal f32/f64 types and not just vector types. Could you please elaborate on this point?

@pedrocr My intuition here is that someone will need to champion this and propose an addition to the standard library that gives you access to -ffast-math. That means writing an RFC and thoroughly exploring the design space.

> I feel like there is a much larger design space. For example, instead of duplicating all of the floating point vector types, we could just expose "fast" arithmetic operations as normal functions, kind of like how we have wrapping_add and saturating_add on the number types today.

This would work but makes for really ugly code.

> Another design point is to add a FFast<T> type, kind of like our std::num::Wrapping<T> type for wrapping arithmetic.

This would indeed be a much better solution from a code clarity standpoint.

> Finally, I'm not exactly sure why -ffast-math belongs with SIMD. They seem like orthogonal concerns to me? For example, it seems like you'd want to be able to do fast math on normal f32/f64 types and not just vector types. Could you please elaborate on this point?

Doing it for normal types is indeed also useful. I see two reasons this connects with SIMD though. The first (and circumstantial) one is that the vector API is already a wrapper around the underlying types, so it's naturally easier to implement these things there than with f32, which is a primitive type. The wrapper solution works much more poorly for primitive types because it introduces the wrapping/unwrapping steps, whereas f32x4 usage already implies them, so code churn is minimal. The second is that SIMD auto-vectorization can work much better if -ffast-math is enabled, so it should be simple to enable it for something like f32x4 regardless of whether it's easy to use it for f32. Use of f32x4 implies the user is trying to go fast, whereas f32 can be performance-insensitive.
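
To illustrate the auto-vectorization point with a sketch (the function names are made up, and this is plain stable Rust rather than anything from the simd crate): a straightforward float sum has to add in source order, so the compiler generally can't split it across vector lanes by itself. Writing the reassociation by hand looks roughly like this, and it is the kind of transformation fast-math would let the compiler do automatically:

```rust
/// Strict left-to-right sum; the addition order is fixed by the source.
pub fn sum_strict(xs: &[f32]) -> f32 {
    let mut acc = 0.0_f32;
    for &x in xs {
        acc += x;
    }
    acc
}

/// Hand-reassociated sum with four independent accumulators,
/// mirroring the four lanes of something like f32x4.
pub fn sum_four_lanes(xs: &[f32]) -> f32 {
    let mut lanes = [0.0_f32; 4];
    let chunks = xs.chunks_exact(4);
    let tail = chunks.remainder();
    for chunk in chunks {
        for i in 0..4 {
            lanes[i] += chunk[i];
        }
    }
    // Combine the lanes, then add the leftover elements.
    let mut total = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    for &x in tail {
        total += x;
    }
    total
}

fn main() {
    let xs: Vec<f32> = (1..=10).map(|i| i as f32).collect();
    println!("{} {}", sum_strict(&xs), sum_four_lanes(&xs));
}
```

For small exactly representable inputs like these the two results agree; in general they can differ in the last bits, which is exactly the tradeoff under discussion.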

> My intuition here is that someone will need to champion this and propose an addition to the standard library that gives you access to -ffast-math. That means writing an RFC and thoroughly exploring the design space.

I've been following this issue:

rust-lang/rust#21690

I don't really understand the Rust design process. Are you suggesting that the next step should be to take that discussion and try to write an RFC? I don't think I know enough about the Rust conventions and this problem to write an RFC, but I can try to start a pre-RFC discussion in internals to get the ball rolling.

> The wrapper solution works much more poorly for primitive types because it introduces the wrapping/unwrapping steps

Could you expand on this? `let x = FFast(some_float)` and `x.0` should be zero cost.
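
A small sketch of why that should hold, assuming FFast is a single-field tuple struct (the type is hypothetical and repr(transparent) is an assumption here): the wrapper has the same size as the float it wraps, and wrapping followed by unwrapping is just a move of the value.

```rust
use std::mem::size_of;

/// Hypothetical wrapper; repr(transparent) gives it the same layout as f32.
#[repr(transparent)]
#[derive(Clone, Copy)]
pub struct FFast(pub f32);

/// Wraps and immediately unwraps; no conversion work is involved.
pub fn round_trip(x: f32) -> f32 {
    let wrapped = FFast(x);
    wrapped.0
}

fn main() {
    println!("{} {}", size_of::<FFast>(), size_of::<f32>());
    println!("{}", round_trip(1.5));
}
```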

It still seems to me like -ffast-math is orthogonal to SIMD, but that SIMD vector types might participate in it.

> I don't really understand the Rust design process. Are you suggesting that the next step should be to take that discussion and try to write an RFC? I don't think I know enough about the Rust conventions and this problem to write an RFC, but I can try to start a pre-RFC discussion in internals to get the ball rolling.

A pre-RFC would be good. I should have suggested that first. :-) I'd encourage you to give lots of examples.

> Could you expand on this? let x = FFast(some_float) and x.0 should be zero cost.

It's zero cost at run time but quite costly in programming time, and it produces quite ugly code. Here's what getting rid of OrderedFloat bought me in code simplification:

pedrocr/rawloader@c288697

> A pre-RFC would be good. I should have suggested that first. :-) I'd encourage you to give lots of examples.

Yeah, I think I'll do that. I have two or three options for how -ffast-math could work in Rust, and having that discussion would be nice.