rust-lang/rust

Imprecise floating point operations (fast-math)

mpdn opened this issue Β· 75 comments

mpdn commented

There should be a way to use imprecise floating point operations like GCC's and Clang's -ffast-math. The simplest approach would be to follow GCC and Clang and add a command-line flag, but I think a better way would be to create f32fast and f64fast types that call the fast LLVM math functions. This way you could easily mix fast and "slow" floating point operations.

I think this could be implemented as a library if LLVM assembly could be used in the asm macro.

Inline IR was discussed on #15180. Another option is extern "llvm-intrinsic" { ... } which I vaguely think we had at some point. If we added more intrinsics to std::intrinsics would that be sufficient?

mpdn commented

Yeah, adding it as a function in std::intrinsics could definitely work as well.

There are a few different fast-math flags, but the fast flag is probably the most important, as it implies all the other flags. Adding all of them as separate intrinsic functions would be unwieldy, but I don't think all of them are necessary.

bluss commented

This forum thread has examples of loops that llvm can vectorize well for integers, but not for floats (a dot product).

I've prototyped it using a newtype: https://gitlab.com/kornelski/ffast-math (https://play.rust-lang.org/?gist=d516771d1d002f740cc9bf6eb5cacdf0&version=nightly&backtrace=0)

It works in simple cases, but the newtype solution is insufficient:

  • it doesn't work with floating-point literals. That's a huge pain when converting programs to this newtype.
  • it doesn't work with the as operator, and a trait to make that possible has been rejected before.
  • the wrapper type and the extra level of indirection affect inlining of code using it. I've found some large functions where the newtype was slower than the regular float, not because of the float math itself, but because other structs and calls around it weren't as well optimized. I wasn't able to reproduce this in simple cases, so I'm not sure what exactly is going on.

So I'm very keen on seeing it supported natively in Rust.

bluss commented

@pornel The issue #24963 had a test case where a newtype impacted vectorization. That example was fixed (great!), but it sounds like the bug is probably still visible in similar code.

I've tried -ffast-math in my C vs Rust benchmark of some graphics code:

https://github.com/pedrocr/rustc-math-bench

In the C code it's a ~20% improvement in clang but no benefit with GCC. In both cases it returns a wrong result and the math is extremely simple (multiplying a vector by a matrix). According to this:

https://stackoverflow.com/questions/38978951/can-ffast-math-be-safely-used-on-a-typical-project#38981307

-ffast-math is generally too unsafe for normal usage as it implies some strange things (e.g., NaN checks always return false). So it seems sensible to have a way to opt-in only to the more benign ones.
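To illustrate the kind of breakage meant here, the standard self-comparison NaN test relies on exactly the IEEE semantics that the "no NaNs" flags assume away (a Rust rendering of the usual C example):

fn naive_is_nan(x: f64) -> bool {
    // IEEE 754: NaN != NaN, so this is true only for NaN.
    // Under a fast-math "no NaNs" assumption, a compiler may fold this
    // to `false`, silently disabling the check.
    x != x
}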

@pedrocr Your benchmark has a loss of precision in the sum regardless of fast-math mode. Both slow and fast give a wrong result compared to summation using a double sum.

With double for the sum, you'll get a correct result, even with -ffast-math.

You get a significantly different sum with a float sum because fast-math gives you a small systematic rounding error, which accumulates over 100 million additions.

All values from matrix multiplication are the same to at least 6 digits (I've diffed printf("%f", out[i]) of all values and they're all the same).

@pornel thanks, fixed here:

pedrocr/rustc-math-bench@8169fa3

The benchmark results are fine though, the sum is only used as a checksum. Here are the averages of three runs in ms/megapixel:

Compiler                   | -O3 -march=native | -O3 -march=native -ffast-math
clang 3.8.0-2ubuntu4       | 6,91              | 5,40 (-22%)
gcc 5.4.0-6ubuntu1~16.04.4 | 5,71              | 5,85 (+2%)

So, as I mentioned before, clang/llvm gets a good benefit from -ffast-math but gcc does not. I'd say making sure things like is_normal() still work is very important, but at least on llvm it helps to be able to enable -ffast-math.

I've suggested it would make sense to expose -ffast-math using the target-feature mechanisms:

https://internals.rust-lang.org/t/pre-rfc-stabilization-of-target-feature/5176/23

Rust has fast math intrinsics, so the fast math behavior could be limited to a specific type or selected functions, without forcing the whole program into it.

A usable solution for my use cases would probably be to have the vector types in the simd crate be the types that allow opting in to ffast-math. That way there's only one type I need to consciously convert the code to for speedups. For general code, having to swap types seems cumbersome, but maybe just doing return val as f32 when val is an f32fast type isn't that bad.

Created a pre-RFC discussion on internals to try and get a discussion on the best way to do this:

https://internals.rust-lang.org/t/pre-rfc-whats-the-best-way-to-implement-ffast-math/5740

Is there a current recommended approach to using fast-math optimizations in rust nightly?

If it helps, a good benchmark comparison article between C++ and Rust floating point optimizations (link) inside loops was written recently (Oct 19), with a good Hacker News discussion exploring this concept.

Personally, I think the key is that without specifying any (EDIT: floating-point specific) flags (and after using iterators), by default clang and gcc do more optimizations on float math than Rust currently does.

(EDIT: It seems that -fvectorize -Ofast was specified for clang to get gcc-comparable results, see the following comment)

Any discussion of optimized float math should keep this in mind: vectorization isn't always less precise - a commenter pointed out that a vectorized floating point sum is actually more accurate than the un-vectorized version. Also see Stack Overflow: https://stackoverflow.com/a/7455442

I'm curious what criteria clang (or gcc) uses when deciding which floating point optimizations to apply. I'm not enough of an expert in these areas to know the specifics, though. I'm also not sure what precision guarantees Rust makes for floating point math.

Personally, I think the key is that without specifying any flags (and after using iterators), by default clang and gcc do more optimizations on float math than Rust currently does.

That's not the case in the article. The clang compilation was using -Ofast which apparently enables -ffast-math.

As mentioned above, enabling fast math globally on a build is way too unsafe. Some things, like -ffp-contract=fast, make sense on by default. But others are necessarily situational. Manually calling intrinsics on every single call is tedious.

Perhaps it would be better to allow annotating a function, e.g.:

#[math(overflow="wrap", assumptions=("algebraic", "no-nan", "finite"))]
fn update_velocity(b1: &mut Body, b2: &mut Body, diff: f32, mag: f32) {
    b1.vel = b1.vel - diff * (b2.mass * mag);
    b2.vel = b2.vel + diff * (b1.mass * mag);
}

That would provide reasonable safety while being more convenient, doubly so if it could be applied to entire modules.

Enabling fast math by default is going to break a lot of floating-point code.

For example, chances are that somewhere in your floating-point program, something is computing the sum of an array of floating-point values. If that code is doing it right, it is probably going to be using a crate like accurate to compute the sum efficiently with a small error.

One of the algorithms that accurate implements is Kahan summation, which is roughly:

S = X[0]
C = 0
for i in [1..N]:
  Y = X[i] - C
  T = S + Y
  C = (T - S) - Y
  S = T

With -ffast-math, a compiler can replace T in C = (T - S) - Y with S + Y, which results in C = ((S + Y) - S) - Y, and optimize that to C = 0. Since C is then never modified, and adding zero does nothing with -ffast-math, the Kahan algorithm can be further optimized to:

S = X[0]
for i in [1..N]:
  S += X[i]

which defeats the point and produces quite inaccurate results.

This optimization is not theoretical: clang performs it when -ffast-math is enabled, e.g., see https://gcc.godbolt.org/z/8NLIdB , where a more accurate Kahan summation gets optimized to an inaccurate sum.
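For concreteness, here is the algorithm above as plain Rust (a minimal sketch; the accurate crate's actual implementation may differ):

fn kahan_sum(xs: &[f32]) -> f32 {
    let (mut s, mut c) = (0.0f32, 0.0f32);
    for &x in xs {
        let y = x - c;
        let t = s + y;
        c = (t - s) - y; // algebraically zero; in IEEE arithmetic it captures the rounding error
        s = t;
    }
    s
}

The (t - s) - y step is exactly what a fast-math compiler is licensed to fold away.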

Allowing users to enable -ffast-math globally is only going to destroy all properly-written floating-point code in libstd and elsewhere - code that Rust users building applications with more than 200 crates are probably relying on without knowing.

Even allowing this at the function scope seems like a footgun, e.g., imagine a user writes:

#[math(overflow="wrap", assumptions=("algebraic", "no-nan", "finite"))]
fn foo(b1: &[f32]) -> f32 {
    accurate::kahan_sum(b1)
}

If accurate gets inlined into foo, the same algorithm-destroying optimization shown above would apply. To avoid that, we would need to prevent functions with "incompatible" #[math(...)] annotations from being inlined into each other (which might be worse for perf than the "fast-math" wins). In this case, since accurate::kahan_sum wouldn't have a #[math(...)] annotation, it would not be inlineable into foo. This means, however, that the example used by @tkaitchuck above might not work, because the <f32 as Add<f32>>::add method does not have a #[math] annotation either (and wouldn't have it under any proposal AFAICT).

If accurate gets inlined into foo, the same algorithm-destroying optimization shown above would apply.

I'd consider that a compiler bug.

The whole dilemma seems to be coming from tying fp optimizations to scopes, rather than types. If there was f32 and fastf32, then inlining wouldn't be semantics-breaking.

@gnzlbg

To avoid that, we would need to prevent functions with "incompatible" #[math(...)] annotations from being inlined into each other (which might be worse for perf than the "fast-math" wins)

This is not at all necessary in LLVM IR, where fast-math flags apply to individual instructions, not whole functions. MIR could adopt the same basic idea to also enable MIR inlining between functions with different fast-math settings.

This means, however, that the example used by @tkaitchuck above might not work, because the <f32 as Add<f32>>::add method does not have a #[math] annotation either

The general question for how to propagate fast-math settings into other functions in a controlled and predictable fashion is a very tricky language design challenge, but this specific example is not affected. For primitive types, expressions like x + y (in contrast to x.add(y)) already get special-cased and lowered directly to built-in MIR operations, so no function call is involved anyway.

@rkruppe good points about inlining in MIR and about x + y being special cased.

So IIUC, inlining would still need to be very careful of, after inlining, not applying other optimizations that could change the fast-math settings of the inlined code. In a "fast-math" function, x + y would get fast-math flags, but would an x + y that gets inlined from a x.add(y) also get them?

The whole dilemma seems to be coming from tying fp optimizations to scopes, rather than types. If there was f32 and fastf32, then inlining wouldn't be semantics-breaking.

@kornelski What fastf32 achieves is tying the fp-arithmetic constraints to the operations on the memory it wraps. We already have some core::intrinsics for fast-math. I don't know if it is a good idea, but I would be ok with exposing a perma-unstable set of intrinsics that can express all these operations, since that should be enough to write types like fastf32 and other approaches like NonNan<NoSignedZero<Associative<T>>> in Rust libraries. Maybe something like:

mod core::intrinsics { // or somewhere else
    // bitflags for fast-math:
    const NonNan: u32 = 0b1;
    const NoSignedZero: u32 = 0b10;
    const Associative: u32 = 0b100;
    ...

    // fp arithmetic intrinsics taking a const bitset of fast-math flags
    fn fp_add<T>(T, T, const fast_math_flags: u32) -> T;
    fn fp_sub<T>(T, T, const fast_math_flags: u32) -> T;
    ...
    fn fp_sqrt<T>(T, const fast_math_flags: u32) -> T; // sqrt is unary
    ...
}

Alternatively, maybe we can just extend all the current floating-point core intrinsics with a bitset. With something like default function arguments, we would just set that bitset to 0 by default, preserving current behavior.

So IIUC, inlining would still need to be very careful of, after inlining, not applying other optimizations that could change the fast-math settings of the inlined code. In a "fast-math" function, x + y would get fast-math flags, but would an x + y that gets inlined from a x.add(y) also get them?

No, if fast-math flags are per-instruction, then inlining automatically does the correct thing: just copy each instructions around without changing the fast-math flags on it. So if without inlining x.add(y) calls a function that does a regular fadd, then you'll get the same after inlining.

On the other hand, optimizations that specifically rewrite floating point operations need to take care to correctly interpret and update the flags of all involved instructions (e.g., reassociating fadd x, (fadd reassoc y, z) is incorrect, because the outer add lacks the flag).

Ixrec commented

I would be ok with exposing a perma-unstable set of intrinsics that can express all these operations, since that should be enough to write types like fastf32 and other approaches like NonNan<NoSignedZero<Associative>> in Rust libraries

Was "perma-unstable" meant to be something else? Unless we think all libraries interested in fast-math are okay with being perma-unstable libraries.

@Ixrec a perma-unstable feature is enough to allow those interested in developing a stable-rust solution to prototype one in nightly, since, as mentioned, it allows people to implement types like fastf32, type wrappers like NonNan<T>, and even proc macros like #[fast_math(...)].

That would allow those interested in "fast-math" to explore the design space without having to hack on the compiler, and to submit RFCs that can be tried, since library APIs using unstable features can be exposed from libcore to stable Rust users, without having to stabilize the "perma-unstable" features themselves.

Unless we think all libraries interested in fast-math are okay with being perma-unstable libraries.

I have yet to see anybody expressing desire for this goal.

@gnzlbg Intrinsics for prototyping are OK, but they're not enough for practical use: #21690 (comment)

You mention three issues in that comment:

  • lack of literal support: this is an orthogonal problem that should be solved by "user-defined literals".
  • lack of as support: From, TryFrom, round, transmute, and similar APIs are usually (always?) clearer than as, particularly when combined with type ascription.
  • bad performance: sounds like a compiler bug, did you report it? Looking at your code, your ffast_math::fff type is repr(C) but should be repr(transparent) - notice that repr(C) inhibits the Scalar and ScalarPair repr(Rust) optimizations, forcing the ffast_math::fff type to use the Aggregate ABI class (see the sketch below).
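For reference, a minimal sketch of the repr(transparent) newtype approach, using the existing unstable fadd_fast intrinsic (names and paths as on current nightly, which may change):

#![feature(core_intrinsics)]
use std::intrinsics::fadd_fast;
use std::ops::Add;

// repr(transparent) guarantees the wrapper has exactly the ABI of f32,
// so it is passed in a float register rather than as an aggregate.
#[repr(transparent)]
#[derive(Copy, Clone, Debug, PartialEq)]
struct FastF32(f32);

impl Add for FastF32 {
    type Output = FastF32;
    fn add(self, rhs: FastF32) -> FastF32 {
        // SAFETY: fadd_fast is UB if an operand or the result is NaN or infinite.
        unsafe { FastF32(fadd_fast(self.0, rhs.0)) }
    }
}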

So from the issues you mention, the one I think has most weight is the lack of "user-defined literals". I don't think one needs to solve this problem to ship a fast-math feature. Today, we would need to write let x = NonNan(1.0);. IMO that is not that bad, definitely not bad enough to block an otherwise useful feature, in particular if we consider that with a "user-defined literals" feature we might need to write let x = 1.0_NonNan_f64; or similar (e.g. let x = 1_r64; for Real<f64>).

For game engines and similar real-time apps where floating-point accuracy is not that big of an issue (but speed is), this is quite important. The experience and steps needed to add this to Rust programs today are not at all great.

What would be the best practical next steps to get some traction on this issue?
What needs to be addressed and how can we help with this?

Potential questions that might need answers:
Does allowing -ffast-math have safety repercussions in safe Rust? I can't imagine how it would violate safety invariants, but an example in safe Rust would be great.

If it doesn't have any security/safety implications, my proposal would be for this be a compiler option like -Oz, etc. (exactly as clang has it?). Disabled by default, of course.

Does this need an RFC? The pre-RFC discussion from 2017 is now locked and in an inconclusive state.

This particular feature is a real need within the game development community for obvious performance reasons. Usually we'd want this for hot-spot optimizations; however, it's also common to blanket-enable -ffast-math for an entire (C++) game codebase.

I'd like to propose two things:

  • A per function syntax where we can opt-in to fast-math style optimizations along the lines of #[fast-math(no-trapping-math)] or #[fast-math(all)] so we can use it specifically on our math-heavy cpu hot-spots.
  • A per-crate override so we can enable-disable fast-math optimizations through the already existing profile overrides mechanism.

For example, to opt out of fast-math everywhere, one would simply write:

[profile.dev.package.'*']
fast-math = "none"

This would allow both the more scientific community to disable fast-math where they want (say if a crate they use enables it), and allow us to opt-in where needed (and it is needed).

@Jasper-Bekkers having it per-function (or per-crate) makes inlining difficult. Intrinsics or types are more predictable: #21690 (comment)

@kornelski LLVM doesn't allow for it?

@kornelski LLVM doesn't allow for it?

It looks like LLVM is at least fine with all of this - it allows specifying fast-math on a per-instruction level: http://blog.llvm.org/2019/03/llvm-numerics-blog.html & https://www.duskborn.com/posts/fpscev/ so that argument sounds like a bit of a red herring.

The per-function/per-crate problem is with MIR and just ambiguity of what that means when the code is moved between scopes and crates.

OTOH, the more precise solution of having fast-math functions or fast-math types would use the LLVM per-instruction mechanism you've pointed out.

The per-function/per-crate problem is with MIR and just ambiguity of what that means when the code is moved between scopes and crates.

MIR is under the control of the rust team so that should be fixable, and there are other situations where similar infrastructure is desirable as well (think per-function overrides of opt-level for debugging optimized builds which is also pretty common in the games industry).

OTOH, the more precise solution of having fast-math functions or fast-math types would use the LLVM per-instruction mechanism you've pointed out.

On the other hand, it's vastly un-ergonomic and impossible to override without running into similar problems.

IR representations for fine-grained opt-in are mostly a solved or solvable problem, the more serious issue is one of language and library design: per-function is often too fine-grained, some mechanism for selectively propagating FMFs across function boundaries is needed. For example, f32::sqrt is just ordinary library code (which wraps an unstable intrinsic that is the actual compiler magic), it won't by itself opt into any FMFs, but we still need code that opts into FMFs to obtain a version of f32::sqrt that also has the same FMFs applied. If we just consider f32::sqrt to be "a function that does not opt into any FMFs" then fine-grained per-instruction FMFs will lead to the sqrt operation not having any FMFs even after inlining. So something different needs to happen to propagate FMFs from callers to callees, but only some callees, otherwise we're back to the problems that motivate fine-grained FMFs in the first place (accidental contamination of adjacent code that doesn't opt into FMFs).

How can we achieve this? C and C++ do not really have a solution for this, other than by special-casing a few standard library functions in the compiler front-end. I think the Rust project has more aversion to such special-casing and more reliance on user-defined library code and abstractions, so the incentives are probably different here. New types offer a solution that leans on an existing mechanism (generics + monomorphization) but this comes with downsides (e.g., type incompatibilities, potential code size explosion from pointless monomorphizations). It's a tricky problem IMO.

@hanna-kruppe That's why I proposed the two-pronged approach above, both per function and per crate. Typically in a game there are a few sub-systems that, at their core, rely heavily on math operations. Those are the functions I'd like to be able to decorate with #[fast-math(all)] - think, for example, decorating the constraint solver of a physics engine with this - while leaving the larger part of the system and the rest of the engine unaffected.

#[fast-math(all)]
fn solve_my_physics_constraints() {
    ...
}

Solving this in the type system imho isn't the right approach; as you've pointed out, it's extremely hairy, and it's a huge annoyance when going in and trying things out speculatively - which, like it or not, is a very important part of the optimization workflow.

Also, having a few "fast math" intrinsics similar to what's in nightly at the moment isn't enough - by far.

I don't understand how either "prong" helps with the problem I described. f32::sqrt is neither in any module of your game engine, nor in any crate you control. If you write

let x = something.sqrt();

... then a localized #[fast_math(all)] on the surrounding function, module, or crate will not result in LLVM IR that contains a call fast @llvm.sqrt.f32(...) instruction, which is the IR you'd want to match sqrt(something) in C or C++ under -ffast-math. You'll just get the same old call @llvm.sqrt.f32(...) (no FMF) from the library function in src/libstd/f32.rs that has no opt-in applied. Something beyond attributes applied to limited lexical scopes is needed to avoid this, not necessarily something involving the type system, but certainly something beyond scoped attributes.

Something beyond attributes applied to limited lexical scopes is needed to avoid this, not necessarily something involving the type system, but certainly something beyond scoped attributes.

Wouldn't it work to have scoped attributes both for enabling fast-math stuff and for allowing it to pass through when called from such a context? That way the sqrt example would work something like:

#[fast-math(all)]
fn need_all_the_speed(val: f32) -> f32 {
    val.sqrt()
}

(...) somewhere in the stdlib

#[fast-math-passthrough-allow(all)]
fn sqrt(val: f32) -> f32 { ... }

When sqrt() gets called from a fast-math context it uses that, and if not it behaves normally. No idea if this is feasible in the compiler, as it requires compiling the stdlib sqrt() in as many ways as there are calling contexts with different fast-math settings.

This, together with other proposals for specifying subsets of fast-math, would allow specifying what is and isn't valid for each function, decided by whoever wrote the code - instead of doing what C/C++ does and allowing whoever builds the code to override that and generate broken code by fiddling with compiler flags. In all the discussions I've seen about this, it was fairly clear that that should not be done. Overriding fast-math on a per-project or per-crate level is a footgun.

Right, an attribute opting into "inheriting" FMFs from callers is a possible direction, but it's not clear how to best integrate it with Rust's (or most other language's, for that matter) compilation model. While f32::sqrt is a trivial wrapper function that should pretty much always be inlined anyway, for other functions that would benefit from pass-through, simply duplicating all code for every distinct set of FMFs used is not the right choice.

And even if it was, implementing it in the compiler requires the same machinery as monomorphization for type and constant parameters. Bolting yet another kind of parameter on this code seems potentially quite ugly, especially because it would not apply in other parts of the compiler that currently share concepts and code (the type system and everything that interacts with generic types). Leaning on the type system has its own drawbacks, but it neatly side-steps these concerns.

But why weird attribute pass-throughs when you can have f32::sqrt() that is always precise, and fastf32::sqrt() that is always fast?

Types match what LLVM is doing, so having a non-type layer of attributes on top of that, which Rust will have to translate back to types, is weird. It's like a #[u32_is_signed_here] attribute for a function that should be using i32.

Instead of putting an attribute, you can make your functions take the fast-float type. You could also make them generic over f32/fastf32.

Separate types also give fine-grained control, e.g. you may want the calculation to be fast, but some accumulator to use precise math. You can have f32 += fast32 to express that, instead of even more special attributes.

Types also match what Rust is already doing with the Wrapping type. There's no #[integers_can_wrap_here] or #[ambivalent_about_integer_wrapping], just precise per-operation control.
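For comparison, the integer precedent works like this today on stable (a tiny example):

use std::num::Wrapping;

fn wrapping_demo() {
    // Wrapping semantics are selected per value via the type,
    // not via a scoped attribute.
    let a = Wrapping(250u8);
    let b = Wrapping(10u8);
    assert_eq!((a + b).0, 4); // 260 mod 256
}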

Types could work, but you're going to need much more than one, as fast-math should not be a single toggle. It also makes the code uglier from all the conversions, but that can probably be worked around with macros that convert types automatically.

You'll just get the same old call @llvm.sqrt.f32(...) (no FMF) from the library function in src/libstd/f32.rs that has no opt-in applied.

The idea would be that the flag/attribute gets inherited; without that, it wouldn't work to begin with, for the reasons you've listed.

But why weird attribute pass-throughs when you can have f32::sqrt() that is always precise, and fastf32::sqrt() that is always fast?

Because:

  • This makes it extremely difficult & tedious to switch between the two
  • There's no easy opt-out for the scientific community
  • It's not even clear what the semantics would be then; do they opt in to all fast-math operations? Should we have a flush_to_zero_fastf32 as well?

There's no #[integers_can_wrap_here] or #[ambivalent_about_integer_wrapping], just precise per-operation control.

Except that precise control is more or less exactly what you're opting out of when compiling with fast math enabled.

bluss commented

Is the f32 vs fastf32 dichotomy enough? Has anyone looked at which fast-math flags agree with Rust's language guarantees and which do not? It looks to me like some are entirely harmless / only affect the value level, while others more or less have preconditions and may even need to be guarded by unsafe?

Compare for example the nnan flag and the reassoc flag in the LangRef: from just reading, nnan is potentially unsafe (I don't know offhand what a poison value means for Rust) and reassoc is not.

Hopefully, maybe, the "safe" flags are enough for most optimizations.

@bluss I don't think any of the flags would currently be seen as unsafe. The spec doesn't make any reference to f32 safety and explicitly mentions integer overflow as /not/ unsafe. As a matter of fact, there are only two references to IEEE 754 floats in the spec that I can find: one specifies the data storage as binary32 and binary64 for f32 and f64 respectively, and the other specifies what happens during certain casting operations.

However, from an ideological point of view, I have no objections to fast-math having to be unsafe (whatever that means from a language point of view).

bluss commented

I haven't followed the discussions unfortunately, but from a quick read it seems like anything that produces a poison value could be problematic. (See link rust-lang/unsafe-code-guidelines#6 (comment)) But the unsafe-c-g wg would be able to discuss it.

Yes, the nnan flag is very much unsafe, as it can result in several uses of the same value observing different values non-deterministically, which is a gadget that can be used to e.g. defeat bounds checks (example). I am not 100% sure whether any other flags are safe. It seems they are all at risk if a computation gets duplicated (which is generally allowed!) and the two copies then get optimized differently.

I personally would be fine with per function and per crate attributes that enable some fast math flags without the need for callees inheriting the flags if that simplifies things. It gives the game dev community something to work with without needing a more elaborate solution. Fast math attributes could be ignored by the compiler by default unless explicitly allowed by compiler flags. No surprise foot guns there, only the foot guns I've asked for.

I'm fine to call a fast sqrt from my fast math enabled code, often games write their own sqrt/sin/cos etc approximations anyway so I don't have an issue with being explicit. If the compiler switch isn't enabled then fast sqrt results in regular old sqrt.

I think this would be a lot more ergonomic to work with than new types and gives some level of control over float opts.

For safety concerns perhaps an initial implementation could not allow problematic flags or alternatively require functions to be labelled as unsafe. The latter would be a lot less ergonomic though.

I'm fine to call a fast sqrt from my fast math enabled code, often games write their own sqrt/sin/cos etc approximations anyway so I don't have an issue with being explicit. If the compiler switch isn't enabled then fast sqrt results in regular old sqrt.

Actually, for these I think it would be totally fine to just have a bunch of sqrt_native, sin_native functions on f32 directly. It's always been one of the few nice things about OpenCL as well.

https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/sqrt.html

@hanna-kruppe I tried coming up with a list of llvm builtin ops that can listen to the afn flag; the llvm reference docs seem to suggest there are only a few, however it looks like opencl has a bunch more native_ ops than this. (And I'm aware of fast-math style approximations for many more, as are you, most likely.) Could you help set up a list for this so we can make a proposal and evaluate having these as extensions on f32 and f64?

llvm.exp.*
llvm.exp2.*
llvm.log.*
llvm.log10.*
llvm.log2.*
llvm.fma.*
llvm.sqrt.*
llvm.sin.*
llvm.cos.*
llvm.pow.*

I can't invest significant time into this subject. But if it helps, I don't believe afn is the only or primary mechanism by which these and other functions get mapped to approximations. Depending on platform and operation, other options include:

  • gated on other LLVM IR attributes (e.g., "unsafe-fp-math" function attribute)
  • frontend emitting (or headers #define-ing) the symbols as different inline code or calls to different function symbols, possibly even bypassing the intrinsics altogether
  • just linking to a different libm implementation in the end

The whole subject is a real jungle of implementation details that still changes (example from less than a year ago) and varies between compilers, vendors and target platforms.

@rustbot modify labels: +A-floating-point

JRF63 commented

-ffast-math analysis also applies to vector intrinsics. I don't think we want to create a _fast version of every floating-point intrinsic for every architecture. Turns out this was clang lowering _mm_cmpord_ps to native LLVM IR fcmp.

Relevant part on LLVM. I read that as function calls (that return float types?) on floating-points or floating-point vectors being amenable to fast-math optimizations.

How would the attribute work if with_attribute called with_slow_sqrt or vice versa?

@LiamTheProgrammer The set of all the operations is vast because it includes intrinsics - anything here with __m512, __m512d, __m256, __m256d, __m128, __m128d, etc.

Anecdotally, in the C++ code bases I've worked on that used fast math, they've generally done the inverse: use fast math everywhere, then hand-annotate the specific functions where it would be problematic with pragmas that disable the optimization there. This works in practice because if you care strongly about floating point performance you're probably also using vector operations, and vector operations tend to mitigate precision problems (e.g., to implement a vectorized dot product you effectively have 8 separate accumulators instead of just the one in a scalar implementation).

It's almost guaranteed that in a system where you have to bless particular operations that somebody is going to ship a crate with a newtype wrapper for floats that just lists every single operation and that will become the way most people end up using the operations.
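A sketch of the multi-accumulator pattern described above, in plain stable Rust (the reassociation is done by hand, which is what fast-math would otherwise have to be allowed to do for you):

// Eight partial sums mirror an 8-lane vectorized reduction; besides
// enabling SIMD codegen, this often *reduces* rounding error compared
// to a single serial accumulator.
fn dot_8acc(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = [0.0f32; 8];
    let mut ca = a.chunks_exact(8);
    let mut cb = b.chunks_exact(8);
    for (xs, ys) in (&mut ca).zip(&mut cb) {
        for lane in 0..8 {
            acc[lane] += xs[lane] * ys[lane];
        }
    }
    let tail: f32 = ca.remainder().iter().zip(cb.remainder()).map(|(x, y)| x * y).sum();
    acc.iter().sum::<f32>() + tail
}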

Adding a wishlist item for when this eventually becomes real: Rust should get a way to temporarily prevent reordering in a certain part of an expression. Fortran compilers usually reorder as they please, but avoid breaking up parenthesized stuff. (I mean, I would love to have it in C too…)

However these optimizations are approached, the consequences can be quite dire: shared objects emitted by some compilers with this style of optimization can, upon being linked to, change the floating point CSR for the entire program and leave it that way, altering results in unrelated code. Rust is more commonly a standalone binary, and thus largely in control of its execution and not interfering with others, rather than a .so, .dll, or other dynamic library; but the day that people frequently link against Rust dylibs is not far off:

https://bugzilla.redhat.com/show_bug.cgi?id=1127544

Saw that thread too! Still not sure whether it affects Rust at all β€” crtfastmath.o is a linkage thing and the decision seems to be due to the compiler driver, not the backend codegen.

Everywhere I've worked using Rust, people have been linking against Rust dylibs, so it's already fairly common in my experience, just not when the program itself is written in Rust. That said, I don't think this needs to be worried that much about so long as we don't do the wrong thing when calling the linker.

I think this is a very tricky problem, and the right solution might be to have different scope-enabled fast math options. For a while I had an RFC I was working on that was going to propose enabling them in a manner similar to target_feature, which I think would be a good model for it. It explicitly allowed you to say "this function requires strict floating point semantics, even if relaxed semantics are allowed in the caller", but by default you'd inherit them from the caller...

There are a lot of cases where this can benefit stuff in portable_simd, and without some way of doing this, portable handling of edge cases like NaNs could easily make things... substantially slower (for example, fixing rust-lang/stdarch#1155 sped up my code in a tight loop by 50%, and the code had... more to do than just min)

That said, I'm very sympathetic to the point made here: https://twitter.com/johnregehr/status/1440090854383706116, that defining "fast-math" in terms optimizations performed rather than semantics is just going to be a mess.

Drawing some inspiration from the CakeML paper, perhaps we could have an annotation to mark possible values (ranges and inf/NaN), and have an annotation to allow any value in the range spanned by every combination of real number and floating point evaluation (this should allow widening and fma, I think? - it could require some tolerance for rounding in the wrong direction; perhaps returning an adjacent floating point value (1 ulp away) should be allowed), as well as some way to specify looser error bounds (e.g. within this factor of either end of the range described).
Additionally, to deal with the issue of a subexpression being optimized differently in different places, perhaps it would have to be treated as a variable (β€œlet binding”) when optimizing the outer expression, and meanwhile itself optimized independently. Also, should calling the same function multiple times with the same value be guaranteed to return the same output? This would probably require more attention around function inlining.

Still not sure whether it affects Rust at all β€” crtfastmath.o is a linkage thing and the decision seems to be due to the compiler driver, not the backend codegen.

Directly? No. I am more noting it as a potential consequence of missteps: we shouldn't allow Rust code to be a "bad citizen" and change the results in unaffected programs, so we should be careful about e.g. allowing changing the FPCSR as an optimization.

I agree with a scope-oriented model that would allow defining functions that have stricter or looser FP semantics, with a broadly similar model to target_feature. I envision something more equivalent to a particular kind of closure, but either way it would inherently describe the boundaries of such optimizations, and yes, it would be a massive boon to the portable SIMD planning, since such sites are likely going to see such annotations anyways.

https://simonbyrne.github.io/notes/fastmath/ makes some good points about the perils of fast-math.

JRF63 commented

There's quite a lot of comments about rounding modes. Is that about GCC and other backends? Pretty sure LLVM's fast-math flags don't even touch those, so there shouldn't be any problem of fast-math enabled Rust libraries messing up other code that links to them.

Besides, couldn't we already do the really dangerous floating-point environment stuff like

#![feature(link_llvm_intrinsics)]
extern {
    #[link_name="llvm.set.rounding"]
    fn set_rounding(mode: i32);
}

and also via the CSR intrinsics in core::arch?
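For the core::arch route, something like the following already compiles on stable for x86-64. Note that changing the MXCSR this way alters floating-point behavior for all subsequent code on the thread, which is exactly the "bad citizen" hazard discussed above:

#[cfg(target_arch = "x86_64")]
fn set_round_toward_zero() {
    use std::arch::x86_64::{_MM_ROUND_TOWARD_ZERO, _MM_SET_ROUNDING_MODE};
    // SAFETY: requires SSE, which is always present on x86-64;
    // this mutates per-thread global FP state.
    unsafe { _MM_SET_ROUNDING_MODE(_MM_ROUND_TOWARD_ZERO) };
}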

Just a note from a user here, attempting to make progress on Rust CV by optimizing the Akaze visual feature detector. The lack of even opt-in, location-specific fast math (such as fadd_fast and fmul_fast) on stable hinders the use of Rust in some key algorithms for computer vision, robotics and augmented reality applications. For example, in some cases the same simple filters are 5-7 times slower than they could be (see this comparison). An alternative is to use SIMD directly, but the portable SIMD initiative has not landed yet, and it is more work to rewrite the code in SIMD than to simply write reasonable loops that get auto-vectorized.

I hope that such a user's perspective can be considered when discussing the dangers of fast math, because for language adoption in several modern fields there is also something to lose by not having something like fmul_fast and fadd_fast (as unsafe operations, for example) on stable.
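For reference, this is what the nightly-only opt-in looks like today (a minimal sketch; fadd_fast and fmul_fast are UB on non-finite values, hence the unsafe):

#![feature(core_intrinsics)]
use std::intrinsics::{fadd_fast, fmul_fast};

// Inner loop of a 1D convolution; the relaxed operations let LLVM
// reassociate and auto-vectorize the accumulation.
// SAFETY contract: all inputs and intermediate values must be finite.
unsafe fn convolve(signal: &[f32], kernel: &[f32], out: &mut [f32]) {
    for (i, o) in out.iter_mut().enumerate() {
        let mut acc = 0.0f32;
        for (j, &k) in kernel.iter().enumerate() {
            acc = fadd_fast(acc, fmul_fast(signal[i + j], k));
        }
        *o = acc;
    }
}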

Are you able to probe (perhaps most easily in the C code since clang exposes flags for each) which of the flags are necessary? In particular, some like -ffinite-math-only must be unsafe in Rust, while others like -ffp-contract=fast can be made safe (with suitable intrinsics on stable).
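One per-operation opt-in of that kind already exists on stable: explicit fused multiply-add via f32::mul_add, which corresponds to what -ffp-contract enables automatically:

fn fma_example(a: f32, b: f32, c: f32) -> f32 {
    // Computes a * b + c with a single rounding step, using hardware FMA
    // where available; it can be slower than a * b + c on targets without FMA.
    a.mul_add(b, c)
}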

Ok, so using the C code in the comparison, enabling -Ofast leads to a factor-of-19 improvement compared to -O3! The smallest subset that produces the improvement is -fno-signed-zeros -fassociative-math together. Removing either of them cancels it.

The comparison code is likely an extreme case as it looks like the compiler could really inline a lot.

@stephanemagnenat I assume you're using x86? What about with RUSTFLAGS=-Ctarget-feature=+fma?

Does anyone know if the developers even remember that we need -ffast-math?

@workingjubilee yes, I'm using x86-64. Using that bench, I got worse results with -Ctarget-feature=+fma (except for the C version, which sees an 8% improvement):

cargo +nightly bench
[...]
test tests::bench_alice_convolution_parallel ... bench:      36,066 ns/iter (+/- 1,526)
test tests::bench_alice_convolution_serial   ... bench:       1,511 ns/iter (+/- 2)
test tests::bench_bjorn3_convolution         ... bench:       4,078 ns/iter (+/- 12)
test tests::bench_dodomorandi_convolution    ... bench:       3,459 ns/iter (+/- 17)
test tests::bench_pcpthm_convolution         ... bench:       1,507 ns/iter (+/- 5)
test tests::bench_zicog_convolution          ... bench:       1,595 ns/iter (+/- 3)
test tests::bench_zicog_convolution_fast     ... bench:       1,597 ns/iter (+/- 4)
test tests::bench_zicog_convolution_safe     ... bench:       3,456 ns/iter (+/- 12)
test tests::bench_zso_convolution            ... bench:      13,078 ns/iter (+/- 7)
test tests::bench_zso_convolution_ffi        ... bench:       1,209 ns/iter (+/- 48)

vs

RUSTFLAGS=-Ctarget-feature=+fma cargo +nightly bench
[...]
test tests::bench_alice_convolution_parallel ... bench:      36,376 ns/iter (+/- 1,729)
test tests::bench_alice_convolution_serial   ... bench:       6,499 ns/iter (+/- 74)
test tests::bench_bjorn3_convolution         ... bench:       4,057 ns/iter (+/- 24)
test tests::bench_dodomorandi_convolution    ... bench:       3,463 ns/iter (+/- 28)
test tests::bench_pcpthm_convolution         ... bench:       6,602 ns/iter (+/- 25)
test tests::bench_zicog_convolution          ... bench:       3,723 ns/iter (+/- 71)
test tests::bench_zicog_convolution_fast     ... bench:       6,787 ns/iter (+/- 40)
test tests::bench_zicog_convolution_safe     ... bench:       3,437 ns/iter (+/- 15)
test tests::bench_zso_convolution            ... bench:      13,073 ns/iter (+/- 86)
test tests::bench_zso_convolution_ffi        ... bench:       1,120 ns/iter (+/- 38)

Using an AMD Ryzen 9 7950X CPU. This is somewhat surprising. The +fma seems to break std::simd parallelization.

That's... Very Weird, given that usually it significantly improves it.

That's... Very Weird, given that usually it significantly improves it.

I fully agree. I don't think I made a mistake, but that's always a possibility. It would be interesting for others to try to replicate this little experiment; it is very easy to do: just clone the repo and run the benchmark with and without the +fma flag.

@stephanemagnenat I observe the same slowdown with -C target-cpu=native (CPU is 5800X)

$ cargo bench
...
test tests::bench_alice_convolution_parallel ... bench:      16,976 ns/iter (+/- 2,312)
test tests::bench_alice_convolution_serial   ... bench:       1,803 ns/iter (+/- 83)
test tests::bench_bjorn3_convolution         ... bench:       5,006 ns/iter (+/- 183)
test tests::bench_dodomorandi_convolution    ... bench:       4,033 ns/iter (+/- 124)
test tests::bench_pcpthm_convolution         ... bench:       1,570 ns/iter (+/- 23)
test tests::bench_zicog_convolution          ... bench:       1,800 ns/iter (+/- 34)
test tests::bench_zicog_convolution_fast     ... bench:       1,803 ns/iter (+/- 41)
test tests::bench_zicog_convolution_safe     ... bench:       4,207 ns/iter (+/- 106)
test tests::bench_zso_convolution            ... bench:      17,090 ns/iter (+/- 525)
test tests::bench_zso_convolution_ffi        ... bench:       1,750 ns/iter (+/- 21)

RUSTFLAGS='-C target-cpu=native' cargo bench
...
test tests::bench_alice_convolution_parallel ... bench:      18,092 ns/iter (+/- 2,504)
test tests::bench_alice_convolution_serial   ... bench:       5,406 ns/iter (+/- 94)
test tests::bench_bjorn3_convolution         ... bench:       5,298 ns/iter (+/- 111)
test tests::bench_dodomorandi_convolution    ... bench:       4,285 ns/iter (+/- 19)
test tests::bench_pcpthm_convolution         ... bench:       7,642 ns/iter (+/- 73)
test tests::bench_zicog_convolution          ... bench:       5,214 ns/iter (+/- 42)
test tests::bench_zicog_convolution_fast     ... bench:       4,979 ns/iter (+/- 90)
test tests::bench_zicog_convolution_safe     ... bench:       4,447 ns/iter (+/- 129)
test tests::bench_zso_convolution            ... bench:      17,767 ns/iter (+/- 799)
test tests::bench_zso_convolution_ffi        ... bench:       1,613 ns/iter (+/- 14)

... which in hindsight is obvious, as it enables the feature 🀦

While I do think that adding an option to enable fast-math in Rust is definitely desirable, I don't like the idea of making a new type for it.

I would rather make it an optional compiler flag that is not set by default in --release. This way I can run my existing code with fast-math enabled if I want to, and not use fast-math if I don't. Adding a new type would require me to either change all f64s to f64fast across my entire codebase, or go through every function, think about whether it makes sense to use f64fast there or not, and add var as f64 and var as f64fast all over the place.

Putting it in the profile, allowing each crate to set it, and allowing the binary crate to override it per crate seems to make sense.

You could then enable it for your library crate if you know it is safe, and the binary crate can disable it if it turns out not to work, or enable it if they know what they are doing.

Making it a compile flag that applies to other crates sounds like a terrible idea. When you download a crate from the internet, you can't know whether it was written in a way that is compatible with fast-math semantics. It is very important not to apply fast-math semantics to code that assumes IEEE semantics.

We could have a crate-level attribute meaning "all floating-point ops in this crate are fast-math", but under no circumstances should you be able to force fast-math on other people's code. That would ultimately even undermine Rust's safety promise.

We could have a crate-level attribute meaning "all floating-point ops in this crate are fast-math", but under no circumstances should you be able to force fast-math on other people's code.

That sounds like a good path to take, in my eyes. Being able to set an attribute in the Cargo.toml which basically means "This crate is fast-math-safe". Compiling your code with fast-math on would then check every dependency for whether it is fast-math-safe or not and compile it accordingly.
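A hypothetical sketch of what that might look like (the fast-math-safe key is invented for illustration; no such key exists today):

[package]
name = "my-math-crate"
version = "0.1.0"
# Hypothetical: declares that this crate's float code tolerates fast-math semantics.
fast-math-safe = true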

I use core::intrinsics::f*_fast or core::intrinsics::f*_algebraic to hint the compiler for auto-vectorization, and it totally works. My only complaint is that these functions are gated behind core_intrinsics, which seems quite awkward.
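For reference, a minimal sketch of that pattern (intrinsic names as on recent nightlies; unlike the f*_fast variants, the f*_algebraic ones are safe to call, because a relaxed result is non-deterministic but not undefined behavior):

#![feature(core_intrinsics)]
use std::intrinsics::{fadd_algebraic, fmul_algebraic};

// Reassociation is permitted, so LLVM can turn this into a
// multi-accumulator, vectorized reduction.
fn dot_algebraic(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .fold(0.0f32, |acc, (&x, &y)| fadd_algebraic(acc, fmul_algebraic(x, y)))
}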