rust-lang/rust

Imprecise floating point operations (fast-math)

mpdn opened this issue Β· 75 comments

mpdn commented

There should be a way to use imprecise floating point operations like GCC's and Clang's -ffast-math. The simplest approach would be to follow GCC and Clang and add a command-line flag, but I think a better way would be to create f32fast and f64fast types that call the fast LLVM math functions. This way you could easily mix fast and "slow" floating point operations.

I think this could be implemented as a library if LLVM assembly could be used in the asm macro.

Inline IR was discussed on #15180. Another option is extern "llvm-intrinsic" { ... } which I vaguely think we had at some point. If we added more intrinsics to std::intrinsics would that be sufficient?

mpdn commented

Yeah, adding it as a function in std::intrinsics could definitely work as well.

There are a few different fast-math flags, but the fast flag is probably the most important, as it implies all the other flags. Adding all of them as separate intrinsic functions would be unwieldy, but I don't think all of them are necessary.

bluss commented

This forum thread has examples of loops that llvm can vectorize well for integers, but not for floats (a dot product).

I've prototyped it using a newtype: https://gitlab.com/kornelski/ffast-math (https://play.rust-lang.org/?gist=d516771d1d002f740cc9bf6eb5cacdf0&version=nightly&backtrace=0)

It works in simple cases, but the newtype solution is insufficient:

  • it doesn't work with floating-point literals. That's a huge pain when converting programs to this newtype.
  • it doesn't work with the as operator, and a trait to make that possible has been rejected before.
  • the wrapper type and the extra level of indirection affect inlining of code using it. I've found some large functions where the newtype was slower than the regular float, not because of the float math itself, but because other structs and calls around it weren't as well optimized. I wasn't able to reproduce this in simple cases, so I'm not sure what exactly is going on.

So I'm very keen on seeing it supported natively in Rust.

bluss commented

@pornel The issue #24963 had a test case where a newtype impacted vectorization. That example was fixed (great!), but it sounds like the bug is probably still visible in similar code.

I've tried -ffast-math in my C vs Rust benchmark of some graphics code:

https://github.com/pedrocr/rustc-math-bench

In the C code it's a ~20% improvement in clang but no benefit with GCC. In both cases it returns a wrong result and the math is extremely simple (multiplying a vector by a matrix). According to this:

https://stackoverflow.com/questions/38978951/can-ffast-math-be-safely-used-on-a-typical-project#38981307

-ffast-math is generally too unsafe for normal usage as it implies some strange things (e.g., NaN checks always return false). So it seems sensible to have a way to opt-in only to the more benign ones.
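To illustrate the kind of breakage meant here, the standard self-comparison NaN test relies on exactly the IEEE semantics that the "no NaNs" flags assume away (a Rust rendering of the usual C example):

fn naive_is_nan(x: f64) -> bool {
    // IEEE 754: NaN != NaN, so this is true only for NaN.
    // Under a fast-math "no NaNs" assumption, a compiler may fold this
    // to `false`, silently disabling the check.
    x != x
}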

@pedrocr Your benchmark has a loss of precision in the sum regardless of fast-math mode. Both slow and fast give a wrong result compared to summation using a double sum.

With double for the sum, you'll get a correct result, even with -ffast-math.

You get a significantly different sum with a float sum because fast-math gives you a small systematic rounding error, which accumulates over 100 million additions.

All values from matrix multiplication are the same to at least 6 digits (I've diffed printf("%f", out[i]) of all values and they're all the same).

@pornel thanks, fixed here:

pedrocr/rustc-math-bench@8169fa3

The benchmark results are fine though, the sum is only used as a checksum. Here are the averages of three runs in ms/megapixel:

Compiler                   | -O3 -march=native | -O3 -march=native -ffast-math
clang 3.8.0-2ubuntu4       | 6,91              | 5,40 (-22%)
gcc 5.4.0-6ubuntu1~16.04.4 | 5,71              | 5,85 (+2%)

So, as I mentioned before, clang/llvm gets a good benefit from -ffast-math but gcc does not. I'd say making sure things like is_normal() still work is very important, but at least on llvm it helps to be able to enable -ffast-math.

I've suggested it would make sense to expose -ffast-math using the target-feature mechanisms:

https://internals.rust-lang.org/t/pre-rfc-stabilization-of-target-feature/5176/23

Rust has fast math intrinsics, so the fast math behavior could be limited to a specific type or selected functions, without forcing the whole program into it.

A usable solution for my use cases would probably be to have the vector types in the simd crate be the types that allow opting in to ffast-math. That way there's only one type I need to consciously convert the code to for speedups. For general code, having to swap types seems cumbersome, but maybe just doing return val as f32 when val is an f32fast type isn't that bad.

Created a pre-RFC discussion on internals to try and get a discussion on the best way to do this:

https://internals.rust-lang.org/t/pre-rfc-whats-the-best-way-to-implement-ffast-math/5740

Is there a current recommended approach to using fast-math optimizations in rust nightly?

If it helps, a good benchmark comparison article between C++ and Rust floating point optimizations (link) inside loops was written recently (Oct 19), with a good Hacker News discussion exploring this concept.

Personally, I think the key is that without specifying any (EDIT: floating-point specific) flags (and after using iterators), by default clang and gcc do more optimizations on float math than Rust currently does.

(EDIT: It seems that -fvectorize -Ofast was specified for clang to get gcc-comparable results, see the following comment)

Any discussion of optimized float math should keep this in mind: vectorization isn't always less precise - a commenter pointed out that a vectorized floating point sum is actually more accurate than the un-vectorized version. Also see Stack Overflow: https://stackoverflow.com/a/7455442

I'm curious what criteria clang (or gcc) uses when deciding which floating point optimizations to apply. I'm not enough of an expert in these areas to know the specifics, though. I'm also not sure what precision guarantees Rust makes for floating point math.

Personally, I think the key is that without specifying any flags (and after using iterators), by default clang and gcc do more optimizations on float math than Rust currently does.

That's not the case in the article. The clang compilation was using -Ofast which apparently enables -ffast-math.

As mentioned above, enabling fast math globally on a build is way too unsafe. Some things, like -ffp-contract=fast, make sense on by default. But others are necessarily situational. Manually calling intrinsics on every single call is tedious.

Perhaps it would be better to allow annotating a function, e.g.:

#[math(overflow="wrap", assumptions=("algebraic", "no-nan", "finite"))]
fn update_velocity(b1: &mut Body, b2: &mut Body, diff: f32, mag: f32) {
    b1.vel = b1.vel - diff * (b2.mass * mag);
    b2.vel = b2.vel + diff * (b1.mass * mag);
}

That would provide reasonable safety while being more convenient, doubly so if it could be applied to entire modules.

Enabling fast math by default is going to break a lot of floating-point code.

For example, chances are that somewhere in your floating-point program, something is computing the sum of an array of floating-point values. If that code is doing it right, it is probably going to be using a crate like accurate to compute the sum efficiently with a small error.

One of the algorithms that accurate implements is Kahan summation, which is roughly:

S = X[0]
C = 0
for i in [1..N]:
  Y = X[i] - C
  T = S + Y
  C = (T - S) - Y
  S = T

With -ffast-math, a compiler can replace T in C = (T - S) - Y with S + Y, which results in C = ((S + Y) - S) - Y, and optimize that to C = 0. Since C is then never modified, and adding zero does nothing with -ffast-math, the Kahan algorithm can be further optimized to:

S = X[0]
for i in [1..N]:
  S += X[i]

which defeats the point and produces quite inaccurate results.

This optimization is not theoretical: clang performs it when -ffast-math is enabled, e.g., see https://gcc.godbolt.org/z/8NLIdB , where a more accurate Kahan summation gets optimized to an inaccurate sum.
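For concreteness, here is the algorithm above as plain Rust (a minimal sketch; the accurate crate's actual implementation may differ):

fn kahan_sum(xs: &[f32]) -> f32 {
    let (mut s, mut c) = (0.0f32, 0.0f32);
    for &x in xs {
        let y = x - c;
        let t = s + y;
        c = (t - s) - y; // algebraically zero; in IEEE arithmetic it captures the rounding error
        s = t;
    }
    s
}

The (t - s) - y step is exactly what a fast-math compiler is licensed to fold away.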

Allowing users to enable -ffast-math globally is only going to destroy all properly-written floating-point code in libstd and elsewhere - code that Rust users building applications with more than 200 crates are probably relying on without knowing.

Even allowing this at the function scope seems like a footgun, e.g., imagine a user writes:

#[math(overflow="wrap", assumptions=("algebraic", "no-nan", "finite"))]
fn foo(b1: &[f32]) -> f32 {
    accurate::kahan_sum(b1)
}

If accurate gets inlined into foo, the same algorithm-destroying optimization shown above would apply. To avoid that, we would need to prevent functions with "incompatible" #[math(...)] annotations from being inlined into each other (which might be worse for perf than the "fast-math" wins). In this case, since accurate::kahan_sum wouldn't have a #[math(...)] annotation, it would not be inlineable into foo. This means, however, that the example used by @tkaitchuck above might not work, because the <f32 as Add<f32>>::add method does not have a #[math] annotation either (and wouldn't have it under any proposal AFAICT).

If accurate gets inlined into foo, the same algorithm-destroying optimization shown above would apply.

I'd consider that a compiler bug.

The whole dilemma seems to be coming from tying fp optimizations to scopes, rather than types. If there was f32 and fastf32, then inlining wouldn't be semantics-breaking.

@gnzlbg

To avoid that, we would need to prevent functions with "incompatible" #[math(...)] annotations from being inlined into each other (which might be worse for perf than the "fast-math" wins)

This is not at all necessary in LLVM IR, where fast-math flags apply to individual instructions, not whole functions. MIR could adopt the same basic idea to also enable MIR inlining between functions with different fast-math settings.

This means, however, that the example used by @tkaitchuck above might not work, because the <f32 as Add<f32>>::add method does not have a #[math] annotation either

The general question for how to propagate fast-math settings into other functions in a controlled and predictable fashion is a very tricky language design challenge, but this specific example is not affected. For primitive types, expressions like x + y (in contrast to x.add(y)) already get special-cased and lowered directly to built-in MIR operations, so no function call is involved anyway.

@rkruppe good points about inlining in MIR and about x + y being special cased.

So IIUC, inlining would still need to be very careful of, after inlining, not applying other optimizations that could change the fast-math settings of the inlined code. In a "fast-math" function, x + y would get fast-math flags, but would an x + y that gets inlined from a x.add(y) also get them?

The whole dilemma seems to be coming from tying fp optimizations to scopes, rather than types. If there was f32 and fastf32, then inlining wouldn't be semantics-breaking.

@kornelski What fastf32 achieves is tying the fp-arithmetic constraints to the operations on the memory it wraps. We already have some core::intrinsics for fast-math. I don't know if it is a good idea, but I would be ok with exposing a perma-unstable set of intrinsics that can express all these operations, since that should be enough to write types like fastf32 and other approaches like NonNan<NoSignedZero<Associative<T>>> in Rust libraries. Maybe something like:

mod core::intrinsics { // or somewhere else
    // bitflags for fast-math:
    const NonNan: u32 = 0b1;
    const NoSignedZero: u32 = 0b10;
    const Associative: u32 = 0b100;
    ...

    // fp arithmetic intrinsics taking a const bitset of fast-math flags
    fn fp_add<T>(T, T, const fast_math_flags: u32) -> T;
    fn fp_sub<T>(T, T, const fast_math_flags: u32) -> T;
    ...
    fn fp_sqrt<T>(T, const fast_math_flags: u32) -> T; // sqrt is unary
    ...
}

Alternatively, maybe we can just extend all the current floating-point core intrinsics with a bitset. With something like default function arguments, we would just set that bitset to 0 by default, preserving current behavior.

So IIUC, inlining would still need to be very careful of, after inlining, not applying other optimizations that could change the fast-math settings of the inlined code. In a "fast-math" function, x + y would get fast-math flags, but would an x + y that gets inlined from a x.add(y) also get them?

No, if fast-math flags are per-instruction, then inlining automatically does the correct thing: just copy each instructions around without changing the fast-math flags on it. So if without inlining x.add(y) calls a function that does a regular fadd, then you'll get the same after inlining.

On the other hand, optimizations that specifically rewrite floating point operations need to take care to correctly interpret and update the flags of all involved instructions (e.g., reassociating fadd x, (fadd reassoc y, z) is incorrect, because the outer add lacks the flag).

Ixrec commented

I would be ok with exposing a perma-unstable set of intrinsics that can express all these operations, since that should be enough to write types like fastf32 and other approaches like NonNan<NoSignedZero<Associative>> in Rust libraries

Was "perma-unstable" meant to be something else? Unless we think all libraries interested in fast-math are okay with being perma-unstable libraries.

@Ixrec a perma-unstable feature is enough to allow those interested in developing a stable-rust solution to prototype one in nightly, since, as mentioned, it allows people to implement types like fastf32, type wrappers like NonNan<T>, and even proc macros like #[fast_math(...)].

That would allow those interested in "fast-math" to explore the design space without having to hack on the compiler, and to submit RFCs that can be tried, since library APIs using unstable features can be exposed from libcore to stable Rust users, without having to stabilize the "perma-unstable" features themselves.

Unless we think all libraries interested in fast-math are okay with being perma-unstable libraries.

I have yet to see anybody expressing desire for this goal.

@gnzlbg Intrinsics for prototyping are OK, but they're not enough for practical use: #21690 (comment)

You mention three issues in that comment:

  • lack of literal support: this is an orthogonal problem that should be solved by "user-defined literals".
  • lack of as support: From, TryFrom, round, transmute, and similar APIs are usually (always?) clearer than as, particularly when combined with type ascription.
  • bad performance: sounds like a compiler bug, did you report it? Looking at your code, your ffast_math::fff type is repr(C) but should be repr(transparent) - notice that repr(C) inhibits the Scalar and ScalarPair repr(Rust) optimizations, forcing the ffast_math::fff type to use the Aggregate ABI class (see the sketch below).
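For reference, a minimal sketch of the repr(transparent) newtype approach, using the existing unstable fadd_fast intrinsic (names and paths as on current nightly, which may change):

#![feature(core_intrinsics)]
use std::intrinsics::fadd_fast;
use std::ops::Add;

// repr(transparent) guarantees the wrapper has exactly the ABI of f32,
// so it is passed in a float register rather than as an aggregate.
#[repr(transparent)]
#[derive(Copy, Clone, Debug, PartialEq)]
struct FastF32(f32);

impl Add for FastF32 {
    type Output = FastF32;
    fn add(self, rhs: FastF32) -> FastF32 {
        // SAFETY: fadd_fast is UB if an operand or the result is NaN or infinite.
        unsafe { FastF32(fadd_fast(self.0, rhs.0)) }
    }
}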

So from the issues you mention, the one I think has most weight is the lack of "user-defined literals". I don't think one needs to solve this problem to ship a fast-math feature. Today, we would need to write let x = NonNan(1.0);. IMO that is not that bad, definitely not bad enough to block an otherwise useful feature, in particular if we consider that with a "user-defined literals" feature we might need to write let x = 1.0_NonNan_f64; or similar (e.g. let x = 1_r64; for Real<f64>).

For game engines and similar real-time apps where floating-point accuracy is not that big of an issue (but speed is), this is quite important. The experience and steps needed to add this to Rust programs today are not at all great.

What would be the best practical next steps to get some traction on this issue?
What needs to be addressed and how can we help with this?

Potential questions that might need answers:
Does allowing -ffast-math have safety repercussions in safe Rust? I can't imagine how it would violate safety invariants, but an example in safe Rust would be great.

If it doesn't have any security/safety implications, my proposal would be for this be a compiler option like -Oz, etc. (exactly as clang has it?). Disabled by default, of course.

Does this need an RFC? The pre-RFC discussion from 2017 is now locked and in an inconclusive state.

This particular feature is a real need within the game development community for obvious performance reasons. Usually we'd want this for hot-spot optimizations; however, it's also common to blanket-enable -ffast-math for an entire (C++) game codebase.

I'd like to propose two things:

  • A per function syntax where we can opt-in to fast-math style optimizations along the lines of #[fast-math(no-trapping-math)] or #[fast-math(all)] so we can use it specifically on our math-heavy cpu hot-spots.
  • A per-crate override so we can enable-disable fast-math optimizations through the already existing profile overrides mechanism.

For example, to opt out of fast-math everywhere, one would simply write:

[profile.dev.package.'*']
fast-math = "none"

This would allow both the more scientific community to disable fast-math where they want (say if a crate they use enables it), and allow us to opt-in where needed (and it is needed).

@Jasper-Bekkers having it per-function (or per-crate) makes inlining difficult. Intrinsics or types are more predictable: #21690 (comment)

@kornelski LLVM doesn't allow for it?

@kornelski LLVM doesn't allow for it?

It looks like LLVM is at least fine with all of this - it allows specifying fast-math on a per-instruction level: http://blog.llvm.org/2019/03/llvm-numerics-blog.html & https://www.duskborn.com/posts/fpscev/ so that argument sounds like a bit of a red herring.

The per-function/per-crate problem is with MIR and just ambiguity of what that means when the code is moved between scopes and crates.

OTOH, the more precise solution of having fast-math functions or fast-math types would use the LLVM per-instruction mechanism you've pointed out.

The per-function/per-crate problem is with MIR and just ambiguity of what that means when the code is moved between scopes and crates.

MIR is under the control of the rust team so that should be fixable, and there are other situations where similar infrastructure is desirable as well (think per-function overrides of opt-level for debugging optimized builds which is also pretty common in the games industry).

OTOH, the more precise solution of having fast-math functions or fast-math types would use the LLVM per-instruction mechanism you've pointed out.

On the other hand, it's vastly un-ergonomic and impossible to override without running into similar problems.

IR representations for fine-grained opt-in are mostly a solved or solvable problem, the more serious issue is one of language and library design: per-function is often too fine-grained, some mechanism for selectively propagating FMFs across function boundaries is needed. For example, f32::sqrt is just ordinary library code (which wraps an unstable intrinsic that is the actual compiler magic), it won't by itself opt into any FMFs, but we still need code that opts into FMFs to obtain a version of f32::sqrt that also has the same FMFs applied. If we just consider f32::sqrt to be "a function that does not opt into any FMFs" then fine-grained per-instruction FMFs will lead to the sqrt operation not having any FMFs even after inlining. So something different needs to happen to propagate FMFs from callers to callees, but only some callees, otherwise we're back to the problems that motivate fine-grained FMFs in the first place (accidental contamination of adjacent code that doesn't opt into FMFs).

How can we achieve this? C and C++ do not really have a solution for this, other than by special-casing a few standard library functions in the compiler front-end. I think the Rust project has more aversion to such special-casing and more reliance on user-defined library code and abstractions, so the incentives are probably different here. New types offer a solution that leans on an existing mechanism (generics + monomorphization) but this comes with downsides (e.g., type incompatibilities, potential code size explosion from pointless monomorphizations). It's a tricky problem IMO.

@hanna-kruppe That's why I proposed the two-pronged approach above, both per function and per crate. Typically in a game there are a few sub-systems that, at their core, rely heavily on math operations. Those are the functions I'd like to be able to decorate with #[fast-math(all)] - think, for example, decorating the constraint solver of a physics engine with this - while leaving the larger part of the system and the rest of the engine unaffected.

#[fast-math(all)]
fn solve_my_physics_constraints() {
    ...
}

Solving this in the type system imho isn't the right approach; as you've pointed out, it's extremely hairy, and it's a huge annoyance when going in and trying things out speculatively - which, like it or not, is a very important part of the optimization workflow.

Also, having a few "fast math" intrinsics similar to what's in nightly at the moment isn't enough - by far.

I don't understand how either "prong" helps with the problem I described. f32::sqrt is neither in any module of your game engine, nor in any crate you control. If you write

let x = something.sqrt();

... then a localized #[fast_math(all)] on the surrounding function, module, or crate will not result in LLVM IR that contains a call fast @llvm.sqrt.f32(...) instruction, which is the IR you'd want to match sqrt(something) in C or C++ under -ffast-math. You'll just get the same old call @llvm.sqrt.f32(...) (no FMF) from the library function in src/libstd/f32.rs that has no opt-in applied. Something beyond attributes applied to limited lexical scopes is needed to avoid this, not necessarily something involving the type system, but certainly something beyond scoped attributes.

Something beyond attributes applied to limited lexical scopes is needed to avoid this, not necessarily something involving the type system, but certainly something beyond scoped attributes.

Wouldn't it work to have scoped attributes both for enabling fast-math stuff and for allowing it to pass through when called from such a context? That way the sqrt example would work something like:

#[fast-math(all)]
fn need_all_the_speed(val: f32) -> f32 {
    val.sqrt()
}

(...) somewhere in the stdlib

#[fast-math-passthrough-allow(all)]
fn sqrt(val: f32) -> f32 { ... }

When sqrt() gets called from a fast-math context it uses that, and if not it behaves normally. No idea if this is feasible in the compiler, as it requires compiling the stdlib sqrt() in as many ways as there are calling contexts with different fast-math settings.

This, together with other proposals for specifying subsets of fast-math, would allow specifying what is and isn't valid for each function, decided by whoever wrote the code - instead of doing what C/C++ does and allowing whoever builds the code to override that and generate broken code by fiddling with compiler flags. In all the discussions I've seen about this, it was fairly clear that that should not be done. Overriding fast-math on a per-project or per-crate level is a footgun.

Right, an attribute opting into "inheriting" FMFs from callers is a possible direction, but it's not clear how to best integrate it with Rust's (or most other language's, for that matter) compilation model. While f32::sqrt is a trivial wrapper function that should pretty much always be inlined anyway, for other functions that would benefit from pass-through, simply duplicating all code for every distinct set of FMFs used is not the right choice.

And even if it was, implementing it in the compiler requires the same machinery as monomorphization for type and constant parameters. Bolting yet another kind of parameter on this code seems potentially quite ugly, especially because it would not apply in other parts of the compiler that currently share concepts and code (the type system and everything that interacts with generic types). Leaning on the type system has its own drawbacks, but it neatly side-steps these concerns.

But why weird attribute pass-throughs when you can have f32::sqrt() that is always precise, and fastf32::sqrt() that is always fast?

Types match what LLVM is doing, so having a non-type layer of attributes on top of that, which Rust will have to translate back to types, is weird. It's like a #[u32_is_signed_here] attribute for a function that should be using i32.

Instead of putting an attribute, you can make your functions take the fast-float type. You could also make them generic over f32/fastf32.

Separate types also give fine-grained control, e.g. you may want the calculation to be fast, but some accumulator to use precise math. You can have f32 += fast32 to express that, instead of even more special attributes.

Types also match what Rust is already doing with the Wrapping type. There's no #[integers_can_wrap_here] or #[ambivalent_about_integer_wrapping], just precise per-operation control.
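For comparison, the integer precedent works like this today on stable (a tiny example):

use std::num::Wrapping;

fn wrapping_demo() {
    // Wrapping semantics are selected per value via the type,
    // not via a scoped attribute.
    let a = Wrapping(250u8);
    let b = Wrapping(10u8);
    assert_eq!((a + b).0, 4); // 260 mod 256
}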

Types could work, but you're going to need much more than one, as fast-math should not be a single toggle. It also makes the code uglier from all the conversions, but that can probably be worked around with macros that convert types automatically.

You'll just get the same old call @llvm.sqrt.f32(...) (no FMF) from the library function in src/libstd/f32.rs that has no opt-in applied.

The idea would be that the flag/attribute gets inherited; without that, it wouldn't work to begin with, for the reasons you've listed.

But why weird attribute pass-throughs when you can have f32::sqrt() that is always precise, and fastf32::sqrt() that is always fast?

Because:

  • This makes it extremely difficult & tedious to switch between the two
  • There's no easy opt-out for the scientific community
  • It's not even clear what the semantics would be then; do they opt in to all fast-math operations? Should we have a flush_to_zero_fastf32 as well?

There's no #[integers_can_wrap_here] or #[ambivalent_about_integer_wrapping], just precise per-operation control.

Except that precise control is more or less exactly what you're opting out of when compiling with fast math enabled.

bluss commented

Is the f32 vs fastf32 dichotomy enough? Has anyone looked at which fast-math flags agree with Rust's language guarantees and which do not? It looks to me like some are entirely harmless / only affect the value level, while others more or less have preconditions and may even need to be guarded by unsafe?

Compare for example the nnan flag and the reassoc flag in the LangRef: from just reading, nnan is potentially unsafe (I don't know offhand what a poison value means for Rust) and reassoc is not.

Hopefully, maybe, the "safe" flags are enough for most optimizations.

@bluss I don't think any of the flags would currently be seen as unsafe. The spec doesn't make any reference to f32 safety and explicitly mentions integer overflow as /not/ unsafe. As a matter of fact, there are only two references to IEEE 754 floats in the spec that I can find: one specifies the data storage as binary32 and binary64 for f32 and f64 respectively, and the other specifies what happens during certain casting operations.

However, from an ideological point of view, I have no objections to fast-math having to be unsafe (whatever that means from a language point of view).

bluss commented

I haven't followed the discussions unfortunately, but from a quick read it seems like anything that produces a poison value could be problematic. (See link rust-lang/unsafe-code-guidelines#6 (comment)) But the unsafe-c-g wg would be able to discuss it.

Yes, the nnan flag is very much unsafe, as it can result in several uses of the same value observing different values non-deterministically, which is a gadget that can be used to e.g. defeat bounds checks (example). I am not 100% sure whether any other flags are safe. It seems they are all at risk if a computation gets duplicated (which is generally allowed!) and the two copies then get optimized differently.

I personally would be fine with per function and per crate attributes that enable some fast math flags without the need for callees inheriting the flags if that simplifies things. It gives the game dev community something to work with without needing a more elaborate solution. Fast math attributes could be ignored by the compiler by default unless explicitly allowed by compiler flags. No surprise foot guns there, only the foot guns I've asked for.

I'm fine to call a fast sqrt from my fast math enabled code, often games write their own sqrt/sin/cos etc approximations anyway so I don't have an issue with being explicit. If the compiler switch isn't enabled then fast sqrt results in regular old sqrt.

I think this would be a lot more ergonomic to work with than new types and gives some level of control over float opts.

For safety concerns perhaps an initial implementation could not allow problematic flags or alternatively require functions to be labelled as unsafe. The latter would be a lot less ergonomic though.

I'm fine to call a fast sqrt from my fast math enabled code, often games write their own sqrt/sin/cos etc approximations anyway so I don't have an issue with being explicit. If the compiler switch isn't enabled then fast sqrt results in regular old sqrt.

Actually, for these I think it would be totally fine to just have a bunch of sqrt_native, sin_native functions on f32 directly. It's always been one of the few nice things about OpenCL as well.

https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/sqrt.html

@hanna-kruppe I tried coming up with a list of llvm builtin ops that can listen to the afn flag; the llvm reference docs seem to suggest there are only a few, however it looks like opencl has a bunch more native_ ops than this. (And I'm aware of fast-math style approximations for many more, as are you, most likely.) Could you help set up a list for this so we can make a proposal and evaluate having these as extensions on f32 and f64?

llvm.exp.*
llvm.exp2.*
llvm.log.*
llvm.log10.*
llvm.log2.*
llvm.fma.*
llvm.sqrt.*
llvm.sin.*
llvm.cos.*
llvm.pow.*

I can't invest significant time into this subject. But if it helps, I don't believe afn is the only or primary mechanism by which these and other functions get mapped to approximations. Depending on platform and operation, other options include:

  • gated on other LLVM IR attributes (e.g., "unsafe-fp-math" function attribute)
  • frontend emitting (or headers #define-ing) the symbols as different inline code or calls to different function symbols, possibly even bypassing the intrinsics altogether
  • just linking to a different libm implementation in the end

The whole subject is a real jungle of implementation details that still changes (example from less than a year ago) and varies between compilers, vendors and target platforms.

@rustbot modify labels: +A-floating-point

JRF63 commented

-ffast-math analysis also applies to vector intrinsics. I don't think we want to create a _fast version of every floating-point intrinsic for every architecture. Turns out this was clang lowering _mm_cmpord_ps to native LLVM IR fcmp.

Relevant part on LLVM. I read that as function calls (that return float types?) on floating-points or floating-point vectors being amenable to fast-math optimizations.

How would the attribute work if with_attribute called with_slow_sqrt or vice versa?

@LiamTheProgrammer The set of all the operations is vast because it includes intrinsics - anything here with __m512, __m512d, __m256, __m256d, __m128, __m128d, etc.

Anecdotally, in the C++ code bases I've worked on that used fast math, they've generally done the inverse: use fast math everywhere, then hand-annotate the specific functions where it would be problematic with pragmas that disable the optimization there. This works in practice because if you care strongly about floating point performance you're probably also using vector operations, and vector operations tend to mitigate precision problems (e.g., to implement a vectorized dot product you effectively have 8 separate accumulators instead of just the one in a scalar implementation).

It's almost guaranteed that in a system where you have to bless particular operations that somebody is going to ship a crate with a newtype wrapper for floats that just lists every single operation and that will become the way most people end up using the operations.
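A sketch of the multi-accumulator pattern described above, in plain stable Rust (the reassociation is done by hand, which is what fast-math would otherwise have to be allowed to do for you):

// Eight partial sums mirror an 8-lane vectorized reduction; besides
// enabling SIMD codegen, this often *reduces* rounding error compared
// to a single serial accumulator.
fn dot_8acc(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = [0.0f32; 8];
    let mut ca = a.chunks_exact(8);
    let mut cb = b.chunks_exact(8);
    for (xs, ys) in (&mut ca).zip(&mut cb) {
        for lane in 0..8 {
            acc[lane] += xs[lane] * ys[lane];
        }
    }
    let tail: f32 = ca.remainder().iter().zip(cb.remainder()).map(|(x, y)| x * y).sum();
    acc.iter().sum::<f32>() + tail
}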

Adding a wishlist item for when this eventually becomes real: Rust should get a way to temporarily prevent reordering in a certain part of an expression. Fortran compilers usually reorder as they please, but avoid breaking up parenthesized stuff. (I mean, I would love to have it in C too…)

However these optimizations are approached, the consequences can be quite dire: shared objects emitted by some compilers with this style of optimization can, upon being linked to, change the floating point CSR for the entire program and leave it that way, altering results in unrelated code. Rust is more commonly a standalone binary, and thus largely in control of its execution and not interfering with others, rather than a .so, .dll, or other dynamic library; but the day that people frequently link against Rust dylibs is not far off:

https://bugzilla.redhat.com/show_bug.cgi?id=1127544

Saw that thread too! Still not sure whether it affects Rust at all β€” crtfastmath.o is a linkage thing and the decision seems to be due to the compiler driver, not the backend codegen.

Everywhere I've worked using Rust, people have been linking against Rust dylibs, so it's already fairly common in my experience, just not when the program itself is written in Rust. That said, I don't think this needs to be worried that much about so long as we don't do the wrong thing when calling the linker.

I think this is a very tricky problem, and the right solution might be to have different scope-enabled fast math options. For a while I had an RFC I was working on that was going to propose enabling them in a manner similar to target_feature, which I think would be a good model for it. It explicitly allowed you to say "this function requires strict floating point semantics, even if relaxed semantics are allowed in the caller", but by default you'd inherit them from the caller...

There are a lot of cases where this can benefit stuff in portable_simd, and without some way of doing this, portable handling of edge cases like NaNs could easily make things... substantially slower (for example, fixing rust-lang/stdarch#1155 sped up my code in a tight loop by 50%, and the code had... more to do than just min)

That said, I'm very sympathetic to the point made here: https://twitter.com/johnregehr/status/1440090854383706116, that defining "fast-math" in terms optimizations performed rather than semantics is just going to be a mess.

Drawing some inspiration from the CakeML paper, perhaps we could have an annotation to mark possible values (ranges and inf/NaN), and have an annotation to allow any value in the range spanned by every combination of real number and floating point evaluation (this should allow widening and fma, I think? - it could require some tolerance for rounding in the wrong direction; perhaps returning an adjacent floating point value (1 ulp away) should be allowed), as well as some way to specify looser error bounds (e.g. within this factor of either end of the range described).
Additionally, to deal with the issue of a subexpression being optimized differently in different places, perhaps it would have to be treated as a variable (β€œlet binding”) when optimizing the outer expression, and meanwhile itself optimized independently. Also, should calling the same function multiple times with the same value be guaranteed to return the same output? This would probably require more attention around function inlining.

Still not sure whether it affects Rust at all β€” crtfastmath.o is a linkage thing and the decision seems to be due to the compiler driver, not the backend codegen.

Directly? No. I am more noting it as a potential consequence of missteps: we shouldn't allow Rust code to be a "bad citizen" and change the results in unaffected programs, so we should be careful about e.g. allowing changing the FPCSR as an optimization.

I agree with a scope-oriented model that would allow defining functions that have stricter or looser FP semantics, with a broadly similar model to target_feature. I envision something more equivalent to a particular kind of closure, but either way it would inherently describe the boundaries of such optimizations, and yes, it would be a massive boon to the portable SIMD planning, since such sites are likely going to see such annotations anyways.

https://simonbyrne.github.io/notes/fastmath/ makes some good points about the perils of fast-math.

JRF63 commented

There's quite a lot of comments about rounding modes. Is that about GCC and other backends? Pretty sure LLVM's fast-math flags don't even touch those, so there shouldn't be any problem of fast-math enabled Rust libraries messing up other code that links to them.

Besides, couldn't we already do the really dangerous floating-point environment stuff like

#![feature(link_llvm_intrinsics)]
extern {
    #[link_name="llvm.set.rounding"]
    fn set_rounding(mode: i32);
}

and also via the CSR intrinsics in core::arch?
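For the core::arch route, something like the following already compiles on stable for x86-64. Note that changing the MXCSR this way alters floating-point behavior for all subsequent code on the thread, which is exactly the "bad citizen" hazard discussed above:

#[cfg(target_arch = "x86_64")]
fn set_round_toward_zero() {
    use std::arch::x86_64::{_MM_ROUND_TOWARD_ZERO, _MM_SET_ROUNDING_MODE};
    // SAFETY: requires SSE, which is always present on x86-64;
    // this mutates per-thread global FP state.
    unsafe { _MM_SET_ROUNDING_MODE(_MM_ROUND_TOWARD_ZERO) };
}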

Just a note from a user here, attempting to make progress on Rust CV by optimizing the Akaze visual feature detector. The lack of even opt-in, location-specific fast math (such as fadd_fast and fmul_fast) on stable hinders the use of Rust in some key algorithms for computer vision, robotics and augmented reality applications. For example, in some cases the same simple filters are 5-7 times slower than they could be (see this comparison). An alternative is to use SIMD directly, but the portable SIMD initiative has not landed yet, and it is more work to rewrite the code in SIMD than to simply write reasonable loops that get auto-vectorized.

I hope that such a user's perspective can be considered when discussing the dangers of fast math, because for language adoption in several modern fields there is also something to lose by not having something like fmul_fast and fadd_fast (as unsafe operations, for example) on stable.
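For reference, this is what the nightly-only opt-in looks like today (a minimal sketch; fadd_fast and fmul_fast are UB on non-finite values, hence the unsafe):

#![feature(core_intrinsics)]
use std::intrinsics::{fadd_fast, fmul_fast};

// Inner loop of a 1D convolution; the relaxed operations let LLVM
// reassociate and auto-vectorize the accumulation.
// SAFETY contract: all inputs and intermediate values must be finite.
unsafe fn convolve(signal: &[f32], kernel: &[f32], out: &mut [f32]) {
    for (i, o) in out.iter_mut().enumerate() {
        let mut acc = 0.0f32;
        for (j, &k) in kernel.iter().enumerate() {
            acc = fadd_fast(acc, fmul_fast(signal[i + j], k));
        }
        *o = acc;
    }
}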

Are you able to probe (perhaps most easily in the C code since clang exposes flags for each) which of the flags are necessary? In particular, some like -ffinite-math-only must be unsafe in Rust, while others like -ffp-contract=fast can be made safe (with suitable intrinsics on stable).
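One per-operation opt-in of that kind already exists on stable: explicit fused multiply-add via f32::mul_add, which corresponds to what -ffp-contract enables automatically:

fn fma_example(a: f32, b: f32, c: f32) -> f32 {
    // Computes a * b + c with a single rounding step, using hardware FMA
    // where available; it can be slower than a * b + c on targets without FMA.
    a.mul_add(b, c)
}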

Ok, so using the C code in the comparison, enabling -Ofast leads to a factor-of-19 improvement compared to -O3! The smallest subset that produces the improvement is -fno-signed-zeros -fassociative-math together. Removing either of them cancels it.

The comparison code is likely an extreme case as it looks like the compiler could really inline a lot.

@stephanemagnenat I assume you're using x86? What about with RUSTFLAGS=-Ctarget-feature=+fma?

Does anyone know if the developers even remember that we need -ffast-math?

@workingjubilee yes, I'm using x86-64. Using that bench, I got worse results with -Ctarget-feature=+fma (except for the C version, which sees an 8% improvement):

cargo +nightly bench
[...]
test tests::bench_alice_convolution_parallel ... bench:      36,066 ns/iter (+/- 1,526)
test tests::bench_alice_convolution_serial   ... bench:       1,511 ns/iter (+/- 2)
test tests::bench_bjorn3_convolution         ... bench:       4,078 ns/iter (+/- 12)
test tests::bench_dodomorandi_convolution    ... bench:       3,459 ns/iter (+/- 17)
test tests::bench_pcpthm_convolution         ... bench:       1,507 ns/iter (+/- 5)
test tests::bench_zicog_convolution          ... bench:       1,595 ns/iter (+/- 3)
test tests::bench_zicog_convolution_fast     ... bench:       1,597 ns/iter (+/- 4)
test tests::bench_zicog_convolution_safe     ... bench:       3,456 ns/iter (+/- 12)
test tests::bench_zso_convolution            ... bench:      13,078 ns/iter (+/- 7)
test tests::bench_zso_convolution_ffi        ... bench:       1,209 ns/iter (+/- 48)

vs

RUSTFLAGS=-Ctarget-feature=+fma cargo +nightly bench
[...]
test tests::bench_alice_convolution_parallel ... bench:      36,376 ns/iter (+/- 1,729)
test tests::bench_alice_convolution_serial   ... bench:       6,499 ns/iter (+/- 74)
test tests::bench_bjorn3_convolution         ... bench:       4,057 ns/iter (+/- 24)
test tests::bench_dodomorandi_convolution    ... bench:       3,463 ns/iter (+/- 28)
test tests::bench_pcpthm_convolution         ... bench:       6,602 ns/iter (+/- 25)
test tests::bench_zicog_convolution          ... bench:       3,723 ns/iter (+/- 71)
test tests::bench_zicog_convolution_fast     ... bench:       6,787 ns/iter (+/- 40)
test tests::bench_zicog_convolution_safe     ... bench:       3,437 ns/iter (+/- 15)
test tests::bench_zso_convolution            ... bench:      13,073 ns/iter (+/- 86)
test tests::bench_zso_convolution_ffi        ... bench:       1,120 ns/iter (+/- 38)

Using an AMD Ryzen 9 7950X CPU. This is somewhat surprising. The +fma seems to break std::simd parallelization.

That's... Very Weird, given that usually it significantly improves it.

That's... Very Weird, given that usually it significantly improves it.

I fully agree. I don't think I made a mistake, but that's always a possibility. It would be interesting for others to try to replicate this little experiment; it is very easy to do: just clone the repo and run the benchmark with and without the +fma flag.

@stephanemagnenat I observe the same slowdown with -C target-cpu=native (CPU is 5800X)

$ cargo bench
...
test tests::bench_alice_convolution_parallel ... bench:      16,976 ns/iter (+/- 2,312)
test tests::bench_alice_convolution_serial   ... bench:       1,803 ns/iter (+/- 83)
test tests::bench_bjorn3_convolution         ... bench:       5,006 ns/iter (+/- 183)
test tests::bench_dodomorandi_convolution    ... bench:       4,033 ns/iter (+/- 124)
test tests::bench_pcpthm_convolution         ... bench:       1,570 ns/iter (+/- 23)
test tests::bench_zicog_convolution          ... bench:       1,800 ns/iter (+/- 34)
test tests::bench_zicog_convolution_fast     ... bench:       1,803 ns/iter (+/- 41)
test tests::bench_zicog_convolution_safe     ... bench:       4,207 ns/iter (+/- 106)
test tests::bench_zso_convolution            ... bench:      17,090 ns/iter (+/- 525)
test tests::bench_zso_convolution_ffi        ... bench:       1,750 ns/iter (+/- 21)

RUSTFLAGS='-C target-cpu=native' cargo bench
...
test tests::bench_alice_convolution_parallel ... bench:      18,092 ns/iter (+/- 2,504)
test tests::bench_alice_convolution_serial   ... bench:       5,406 ns/iter (+/- 94)
test tests::bench_bjorn3_convolution         ... bench:       5,298 ns/iter (+/- 111)
test tests::bench_dodomorandi_convolution    ... bench:       4,285 ns/iter (+/- 19)
test tests::bench_pcpthm_convolution         ... bench:       7,642 ns/iter (+/- 73)
test tests::bench_zicog_convolution          ... bench:       5,214 ns/iter (+/- 42)
test tests::bench_zicog_convolution_fast     ... bench:       4,979 ns/iter (+/- 90)
test tests::bench_zicog_convolution_safe     ... bench:       4,447 ns/iter (+/- 129)
test tests::bench_zso_convolution            ... bench:      17,767 ns/iter (+/- 799)
test tests::bench_zso_convolution_ffi        ... bench:       1,613 ns/iter (+/- 14)

... which in hindsight is obvious, as it enables the feature 🀦

While I do think that adding an option to enable fast-math in Rust is definitely desirable, I don't like the idea of making a new type for it.

I would rather make it an optional compiler flag that is not set by default in --release. This way I can run my existing code with fast-math enabled if I want to, and not use fast-math if I don't. Adding a new type would require me to either change all f64s to f64fast across my entire codebase, or go through every function, think about whether it makes sense to use f64fast there or not, and add var as f64 and var as f64fast all over the place.

Putting it in the profile, allowing each crate to set it, and allowing the binary crate to override it per crate seems to make sense.

You could then enable it for your library crate if you know it is safe, and the binary crate can disable it if it turns out not to work, or enable it if they know what they are doing.

Making it a compile flag that applies to other crates sounds like a terrible idea. When you download a crate from the internet, you can't know whether it was written in a way that is compatible with fast-math semantics. It is very important not to apply fast-math semantics to code that assumes IEEE semantics.

We could have a crate-level attribute meaning "all floating-point ops in this crate are fast-math", but under no circumstances should you be able to force fast-math on other people's code. That would ultimately even undermine Rust's safety promise.

We could have a crate-level attribute meaning "all floating-point ops in this crate are fast-math", but under no circumstances should you be able to force fast-math on other people's code.

That sounds like a good path to take, in my eyes. Being able to set an attribute in the Cargo.toml which basically means "This crate is fast-math-safe". Compiling your code with fast-math on would then check every dependency for whether it is fast-math-safe or not and compile it accordingly.
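A hypothetical sketch of what that might look like (the fast-math-safe key is invented for illustration; no such key exists today):

[package]
name = "my-math-crate"
version = "0.1.0"
# Hypothetical: declares that this crate's float code tolerates fast-math semantics.
fast-math-safe = true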

I use core::intrinsics::f*_fast or core::intrinsics::f*_algebraic to hint the compiler for auto-vectorization, and it totally works. My only complaint is that these functions are gated behind core_intrinsics, which seems quite awkward.
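For reference, a minimal sketch of that pattern (intrinsic names as on recent nightlies; unlike the f*_fast variants, the f*_algebraic ones are safe to call, because a relaxed result is non-deterministic but not undefined behavior):

#![feature(core_intrinsics)]
use std::intrinsics::{fadd_algebraic, fmul_algebraic};

// Reassociation is permitted, so LLVM can turn this into a
// multi-accumulator, vectorized reduction.
fn dot_algebraic(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .fold(0.0f32, |acc, (&x, &y)| fadd_algebraic(acc, fmul_algebraic(x, y)))
}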