rust-lang/rust

Document guarantees (or lack thereof) regarding sign, quietness, and payload of `NaN`s

Closed this issue · 45 comments

NaNs can behave in surprising ways. On top of that, a very common target is inherently buggy in more than one way. But on all other targets we actually follow fairly clear, if improperly documented, rules. See here for the current status.

Original issue

Several issues have been filed about surprising behavior of NaNs.

  • #55131, in which the sign of the result of 0.0 / 0.0 changed depending on whether the right-hand side came from a function argument or a literal.
  • #73288/#46948, in which the result of f32::from_bits(x).to_bits() was not always equal to x.

The root cause of these issues is that LLVM does not guarantee that NaN payload bits are preserved. Empirically, this applies to the signaling/quiet bit as well as (surprisingly) the sign bit. At least one LLVM developer seems open to changing this, although doing so may not be easy.
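For concreteness, here is a minimal sketch of the kind of round trip these reports are about (the bit pattern is just an illustrative signaling NaN, not taken from the original reports):

fn main() {
    // Sign 0, exponent all ones, quiet bit clear, non-zero payload:
    // a signaling NaN for f32.
    let snan_bits: u32 = 0x7F80_0001;
    let roundtripped = f32::from_bits(snan_bits).to_bits();
    // On most targets this prints the same value twice; on targets where the
    // quiet bit or payload is not preserved, the two values can differ.
    println!("{:#010x} -> {:#010x}", snan_bits, roundtripped);
}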

Unless we are prepared to guarantee more, we should do a better job of documenting that, besides having all 1s in the exponent and a non-zero significand, the bitwise value of a NaN is unspecified and may change at any point during program execution. In particular, the from_bits method on f32 and f64 types currently states:

This is currently identical to transmute::<u32, f32>(v) on all platforms.

and

this implementation favors preserving the exact bits. This means that any payloads encoded in NaNs will be preserved

These statements are misleading and should be changed.

We may also want to add documentation to {f32,f64}::NAN to this effect, see #52897 (comment).

cc #10186?

This also affects the documentation for the methods in #72568.

@ecstatic-morse wrote elsewhere

Indeed. The underlying cause is clear. I wonder what we should do here, though? Does Rust currently guarantee that extended precision is not used for operations on f64? If so, this is technically a miscompilation. However, I don't know whether it's worth fixing. Maybe we should just document the status quo and move on?

I don't think we can easily just "move on" -- as mentioned here, what LLVM currently does seems incoherent and is likely just plain unsound (but miscompilations are hard to trigger). In that sense this is similar to #28728: LLVM in its current state makes it impossible to build a safe language on top of it with reasonable effort, which means fixing this will be a lot of work, but from a Rust perspective that's nevertheless a critical soundness bug.

Cc @rust-lang/lang

That issue does not involve NaN, and that comment is not applicable here.

Fair. But I feel #72327 is related in the broader sense of "our FP semantics are a mess". Looks like we actually have two problems here:

  • LLVM doesn't preserve NaN bits (which is likely incoherent). That's this issue.
  • LLVM uses the x87 instructions on i686, which behave differently from how IEEE floats should behave. That's #72327.

I created rust-lang/unsafe-code-guidelines#237 to collect FP issues. That's indeed off-topic here, sorry for that.

Unless we are prepared to guarantee more, ... the bitwise value of a NaN is unspecified and may change at any point during program execution

This seems... way too conservative. I know it's trying to make the best of a bad situation, and I'm sympathetic here, but please realize how hard overly broad unspecified behavior like this makes it to write robust code. (As a user of Rust who came to it from C, this feels like the same kind of undefined behavior you see in the C standard in cases where all supported platforms disagree.)

So, my biggest concern is non-Wasm platforms. I think it would really be a huge blow to working with floats in Rust to have effectively zero guarantees around NaN. I don't really know a good solution here, but even just marking it as an LLVM bug on the problematic platforms (rather than deciding that this isn't a thing that Rust code gets to rely on ever) would be much better.

Just as an example, if NaN payload is totally unspecified and may change at any point, implementing any ordering stronger than PartialEq for floats is impossible (including #72599), as you cannot count on NaN bitwise values to be stable across two calls of to_bits() on the same float.

Same goes for things that stash an f32 in a u32 and then expect to get it out again and have it be the same (for example, I implemented an AtomicF32 at one point on top of AtomicU32 + from_bits/to_bits). If I can't rely on stable bit values through float => u32, things like compare_exchange loops are no longer guaranteed to ever terminate.
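For reference, a minimal sketch of the AtomicU32-backed pattern described above (the AtomicF32 type and fetch_add method are illustrative, not a real std API); the update loop is only well-behaved if from_bits/to_bits act as plain bit casts:

use std::sync::atomic::{AtomicU32, Ordering};

// Hypothetical wrapper, for illustration only.
struct AtomicF32(AtomicU32);

impl AtomicF32 {
    fn new(v: f32) -> Self {
        AtomicF32(AtomicU32::new(v.to_bits()))
    }

    fn fetch_add(&self, rhs: f32) -> f32 {
        let mut current = self.0.load(Ordering::Relaxed);
        loop {
            // Relies on `from_bits`/`to_bits` not altering the stored bits.
            let new = (f32::from_bits(current) + rhs).to_bits();
            match self.0.compare_exchange_weak(current, new, Ordering::Relaxed, Ordering::Relaxed) {
                Ok(prev) => return f32::from_bits(prev),
                Err(observed) => current = observed,
            }
        }
    }
}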


That said, I also think "totally unspecified behavior" is too conservative on Wasm -- I've done a bit of poking and it seems like the behavior is a lot more sane than suggested, although it does violate IEEE 754 and is probably not 100% intentional.

Basically: LLVM's behavior here is inherited from the Wasm/JS runtime, which canonicalizes NaNs whenever going from bits => float, as it wants to be able to guarantee certain things about which bit patterns can be in the float -- certain NaNs are off limits.

That means:

  • The bits => float operation is the only time the NaN payload can change (explaining the mentioned f32::from_bits(x).to_bits() round-trip failure)
  • Float => bits should be totally stable and consistent
  • After a float => bits operation, those bits are guaranteed not to change when going back to a float.
    • There is, admittedly, some dodginess here since perhaps LLVM optimizes a bits => float => bits into a no-op. Perhaps that can be addressed directly and more easily though?

This is non-ideal but is still way easier to reason about and build on top of than arbitrary unspecified behavior.


Yeah, that's the basic gist of my thoughts. Changing the documented guarantees of from_bits/to_bits globally like that would totally neuter those APIs. I'm sympathetic to the position you're in and to not having great choices, but that kind of change feels very much like the wrong call, and making the call be this kind of unspecified behavior feels really bad on any platform...

P.S. I accidentally posted an incomplete version of this comment by hitting ctrl+enter in the github text box, sorry if you saw that -- really should just do these in a text editor first.

I am open to better suggestions. I know hardly anything about floating point semantics, so "totally unspecified" is an easy and obviously "correct" choice for me to reach for. If someone with more in-depth knowledge can produce a spec that is consistent with LLVM behavior, I am sure this can be improved upon.

However, the core spec of Rust must be platform-independent, so unless we consider this a platform bug (which I think is what we do with the x87-induced issues on i686), whatever the spec is has to encompass all platforms.

In principle, certain platforms can decide to guarantee more than others, but that is a dangerous game as it risks code inadvertently becoming non-portable in the worst possible way -- usually "non-portable" means "fails to build on other platforms", now it would silently change behavior. Maybe we can handle this in a way similar to endianness, although the situation feels different.

And all of this is assuming that we can get LLVM to commit to preserving NaN payloads on these platforms. You are saying that this issue only affects wasm(-like) targets, but is there a document where LLVM otherwise makes stronger guarantees? The fact that issues have only been observed on these platforms does not help; we need an explicit statement by LLVM to establish and maintain this guarantee in the future.

Just as an example, if NaN payload is totally unspecified and may change at any point, implementing any ordering stronger than PartialEq for floats is impossible (including #72599), as you cannot count on NaN bitwise values to be stable across two calls of to_bits() on the same float.

So if I understand correctly, on wasm, the float => bit cast that is inherent in such a total order would canonicalize NaNs. This on its own is not a problem as this is a stable canonicalization, and that's why you think "unstable NaNs" are too broad. Is that accurate?

However, when you combine that with LLVM optimizing away "bit => float => bit" roundtrips (does it do that?), then this already brings us into an unstable situation. Some of the comparisons might have that optimization applied to them, and others not, so suddenly the same float (obtained via a bit => float cast) can compare in two different ways.

It is easy to make a target language spec such as wasm self-consistent, but to do the same on a heavily optimized IR like LLVM's or surface language like Rust is much harder.

So if I understand correctly, on wasm, the float => bit cast that is inherent in such a total order would canonicalize NaNs.

No, float => bit should always* be stable, it's bit => float that canonicalizes. This means it's possible to implement a robust totalOrder without issues on Wasm (just not if all nan payloads are unspecified values which may change at any time).

My point with that paragraph was not that the LLVM behavior is bad (although I am not a fan), but that changing Rust's guarantees to: "the bitwise value of a NaN is unspecified and may change at any point during program execution" is both

  • Stronger than needed for Wasm
  • Makes it so that no matter which operations happen to canonicalize and which do not, it's not possible to write a totalOrder.

* (always... except for what I say in my next response)
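(For reference, a minimal sketch of the usual totalOrder-via-bits trick -- essentially what the later-stabilized f32::total_cmp in the standard library does. It only yields a consistent order if to_bits returns the same bits for the same value every time, which is exactly the property being discussed.)

use std::cmp::Ordering;

fn total_cmp(a: f32, b: f32) -> Ordering {
    fn key(x: f32) -> i32 {
        let bits = x.to_bits() as i32;
        // Negative floats (sign bit set): flip all the other bits so larger
        // magnitudes sort lower. Non-negative floats are left unchanged and
        // already sort above all negatives as signed integers.
        bits ^ ((((bits >> 31) as u32) >> 1) as i32)
    }
    key(a).cmp(&key(b))
}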


However, when you combine that with LLVM optimizing away "bit => float => bit" round-trips (does it do that?)

I don't know if it does it on Wasm, but it's obviously free to do this on non-Wasm platforms (and I think I've seen it there, but it's hard to say and I don't have code I'm thinking of on hand).

I'd hope it wouldn't do this on Wasm, and would argue that if it does optimize that away it's an LLVM bug for that platform, but... yeah. Possible.


unless we consider this a platform bug (which I think is what we do with the x87-induced issues on i686)

Honestly that seems like the sanest decision to me, since the alternative is essentially saying that Rust code can't expect IEEE754-compliant floats anymore. And so, I think x87 is a good example because it's also an example of non-IEEE754 compliance, although probably a less annoying one in practice.

Concretely, I wouldn't have complained about this at all if it were listed as a platform bug.


Instead, my issue is entirely with all compliant Rust code losing the ability to reason about float binary layout, which has been extremely useful in stuff like scientific computing, game development, programming language runtimes, math libraries, ... All things Rust is well suited to do, by design.

This wouldn't cripple those by any means, but it would make things worse for several of them.

Admittedly, in practice, unless it's flat out UB, I suspect people will just code to their target and not to the spec, which isn't great either, but honestly to me it feels like it might be better than Rust genuinely inheriting this limitation from the web platform.

(Ironically, this would also prevent writing a runtime in Rust that does the optimization which is the reason Wasm and JS runtimes want to canonicalize their NaNs. Although that optimization was already fairly unportable anyway)

No, float => bit should always* be stable, it's bit => float that canonicalizes.

Oh I see... but that is not observable until you cast back? Or does wasm permit transmutation, like writing a float into memory and reading it back as an int without doing an explicit cast? (IIRC their memory is int-only so you'd have to cast before writing, but I might misremember.)

I don't know if it does it on Wasm, but it's obviously free to do this on non-Wasm platforms (and I think I've seen it there, but it's hard to say and I don't have code I'm thinking of on hand).

I'd hope it wouldn't do this on Wasm, and would argue that if it does optimize that away it's an LLVM bug for that platform, but... yeah. Possible.

Whether it can do that or not depends solely on the semantics of LLVM IR, which (as far as I know) are not affected by whether you are compiling to Wasm or not. That is the entire point of having a single uniform IR.

There is no good way to make optimizations in a highly optimized language like Rust or LLVM IR depend on target behavior -- given how they interact with all the other optimizations, that is basically guaranteed to introduce contradicting assumptions.

Also, I don't think there is much point in discussing what we wish LLVM would do. We first need to figure out what it is doing.

(Ironically, this would also prevent writing a runtime in Rust that does the optimization which is the reason Wasm and JS runtimes want to canonicalize their NaNs. Although that optimization was already fairly unportable anyway)

Ah, but this is getting to the heart of the problem -- what if you implement a wasm runtime in Rust which uses this optimization, and compile that to wasm? Clearly that cannot work as the host wasm is already "using those bits". So, it is fundamentally impossible to have a semantics that achieves all of

  • platform independence
  • supporting this optimization
  • correct compilation to wasm
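(For readers unfamiliar with the optimization being referred to, here is a minimal sketch of NaN-boxing, the trick JS/Wasm engines use and the reason they canonicalize NaNs; the tag constant is made up for illustration.)

// Non-float values are smuggled into the payload bits of quiet NaNs, so the
// engine must guarantee that "real" float NaNs never carry those payloads.
const QNAN_MASK: u64 = 0x7FF8_0000_0000_0000;
const TAG_INT: u64 = 0x0001_0000_0000_0000; // made-up tag bit

fn box_i32(v: i32) -> u64 {
    QNAN_MASK | TAG_INT | (v as u32 as u64)
}

fn box_f64(v: f64) -> u64 {
    // Only sound if the bits of a float value are under the program's control
    // -- exactly what breaks if NaN payloads can change underneath you.
    v.to_bits()
}

fn unbox_i32(bits: u64) -> Option<i32> {
    if (bits & (QNAN_MASK | TAG_INT)) == (QNAN_MASK | TAG_INT) {
        Some(bits as u32 as i32)
    } else {
        None // a plain f64
    }
}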

Instead, my issue is entirely with all compliant Rust code losing the ability to reason about float binary layout, which has been extremely useful in stuff like scientific computing, game development, programming language runtimes, math libraries, ... All things Rust is well suited to do, by design.

I do feel like it is slightly exaggerated to say that all these use cases rely on stable NaN payloads. That said, there seems to be a fundamental conflict here between having a good cross-platform story (consistent semantics everywhere) and supporting low-level floating point manipulation. FP behavior is just not consistent enough across platforms.

However, note that not just wasm has strange NaN behavior. We also have some bugs affecting x86_64: #55131, #69532. Both (I think) stem from the LLVM constant propagator (in one case its port to Rust) producing different NaN payloads than real CPUs. This means that if we guarantee stable NaN payloads in x86_64, we have to stop const-propagating unless all CPUs have consistent NaN payload (and then the const propagator needs to be fixed to match that).

So until LLVM commits to preserving NaN payloads on some targets, there is little we can do. It seems people already rely on that when compiling wasm runtimes in LLVM that use the NaN optimization, so maybe it would not be too hard to convince LLVM to commit to that?

That is the entire point of having a single uniform IR.

This isn't really right though, is it? LLVM IR includes tons of platform-specific information. The fact that making LLVM IR cross-platform was non-viable was even part of the motivation behind Wasm's current design.


From the other issue:

A less drastic alternative is to say that every single FP operation (arithmetic and intrinsics and whatnot, but not copying), when it returns a NaN, non-deterministically picks any NaN representation.

This would be totally fine with me FWIW -- as soon as you do arithmetic on NaN all portability is out the window in practice and in theory. My concern is largely with stuff like:

  • Stuff like https://searchfox.org/mozilla-central/source/js/rust/src/jsval.rs suddenly breaking -- just a file I remember from my last job that does stuff depending on this.

  • APIs like https://doc.rust-lang.org/core/arch/x86_64/fn._mm_cmpeq_ps.html being in a limbo where nothing guarantees that it works... even though it obviously must work or is a compiler bug.

    For context here: this API is one of many SIMD intrinsic apis where you have shortlived NaNs in float vectors where the payload is very important.

    Specifically, this function will return a float vector (yes, float -- __m128i would be the type for an int vector) with an all-bits-set f32 in every slot where the comparison succeeded. One of the ways you're intended to use the result is as a bitmask, to find the elements where the comparison succeeded/failed.

    Since all-bits-set is a NaN with a specific payload, this requires the payload to be preserved here (a sketch of this mask pattern follows below).
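A hedged sketch of that mask pattern, using the related _mm_cmpgt_ps comparison; the function is illustrative, not taken from any particular codebase:

#[cfg(target_arch = "x86_64")]
fn blend_greater(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    use std::arch::x86_64::*;
    // SSE is always available on x86_64, so calling these intrinsics is fine.
    unsafe {
        let va = _mm_loadu_ps(a.as_ptr());
        let vb = _mm_loadu_ps(b.as_ptr());
        // All-ones (a NaN bit pattern!) in every lane where a > b, all-zeros elsewhere.
        let mask = _mm_cmpgt_ps(va, vb);
        // Pick `a` where the mask is set and `b` where it is not. This only
        // works if the all-ones lanes survive bit-for-bit.
        let blended = _mm_or_ps(_mm_and_ps(mask, va), _mm_andnot_ps(mask, vb));
        let mut out = [0.0f32; 4];
        _mm_storeu_ps(out.as_mut_ptr(), blended);
        out
    }
}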

So, while I just gave you two examples of very much non-portable code...

  • The jsval code is probably more portable than you might expect (actually I have no idea what you might expect, but I believe it should support anything Firefox supports, and probably a little more).
  • Every target with vector registers does the same "it's really just a bag of bits" stuff somewhere in its intrinsic API (and the solution here shouldn't be to declare core::arch broken -- even if portable SIMD is on the way).

My big concern still comes back to the notion that these payloads are "unspecified values which may change at any time" according to Rust. The way I interpret that, and the general feeling of this conversation, means that there's no guarantee that target-specific things like these are even guaranteed to work reliably on the target in question.


I do feel like it is slightly exaggerated to say that all these use cases rely on stable NaN payloads

That's why I said "This wouldn't cripple those by any means", although honestly the SIMD stuff would be pretty bad if it were actually broken.

I also fully expect those cases to blindly continue doing things to NaN non-portably (and possibly non-deterministically).


This means that if we guarantee stable NaN payloads in x86_64, we have to stop const-propagating unless all CPUs have consistent NaN payload (and then the const propagator needs to be fixed to match that).

This is surprising, because I thought it was the whole point of LLVM's APFloat code (which even goes as far as to support like the horrible PowerPC long double type...). That said, it's not like I can argue with facts, if those bugs are happening, then they're happening... But are we sure those aren't just normal bugs in LLVM?

That said the only reason I wouldn't be willing to say "I don't care that much about what happens to NaN during const prop" is that you can't know when LLVM will happen to see enough to do more const prop.

That said, it seems totally unreasonable and very fragile to me to rely on things like:

  • A specific float expression (e.g. 0.0/0.0) producing a specific NaN.
  • Float numerical operations (arithmetic, math functions, etc) with NaN inputs doing anything beyond producing some arbitrary other NaN (except for sign manipulation -- neg/abs/copysign and the like just toggle the sign bit).
  • ...

That stuff is totally nonportable (IEEE754 recommends but doesn't require any of it) and unreliable both at compile time and at runtime. Again, my concern is more unexpected fallout here in stuff that expects NaN to go through smoothly.


Just took a peek at https://webassembly.github.io/spec/core/exec/numerics.html (and elsewhere in the spec) and regret not doing so sooner. In particular, there's a lot of mention on when canonicalization can happen, but none of the places are on load/reinterpret.

And so what's in there is pretty close to the suggestion you had earlier (the "less drastic alternative")... and to what I suggested as the things that are totally nonportable.

And it also definitely contradicts what I said before about when canonicalization happens (which mirrored what happens in asm.js, what I seemed to see in my testing earlier, and would have explained from_bits(x).to_bits() not round-tripping... but maybe all of that is the "native doubles used in LLVM MC code" bug? Needs more investigation). That said, this would make things a lot more tractable, since it brings Wasm up to par as a compliant IEEE-754 implementation, and (if true) just points the blame at LLVM for messing up...

Which would also (maybe?) explain why the bugs happen on all platforms, maybe?

...

Ugh, this is still a bit jumbled, sorry -- some of this needs to be unified and reordered, plus more digging into what the deal with the discrepancy is, but I have to run, unfortunately.

This isn't really right though, is it? LLVM IR includes tons of platform-specific information. The fact that making LLVM IR cross-platform was non-viable was even part of the motivation behind Wasm's current design.

It makes many platform-specific things such as pointer sizes etc explicit. But that is very different from an implicit change in behavior.

Your proposal would basically require many optimizations to have code like if (wasm) { one_thing; } else { another_thing; }. I do not think such code is common in LLVM today, if it exists at all. It is also very fragile as it is easy to forget to add this in all the right places. In contrast, the explicit reification of layout everywhere is impossible to ignore.

And this would affect many optimizations, as it makes floating-point operations and/or casts non-deterministic, which is a side effect! So everything that treats them as pure operations needs to be adjusted.

From the other issue:

There's like 5 other issues, which one do you mean?^^ You are quoting this comment I think.

This would be totally fine with me FWIW -- as soon as you do arithmetic on NaN all portability is out the window in practice and in theory.

(This was for making FP operations pick arbitrary NaNs.)
The problem is that this makes them non-deterministic. So e.g. if you have code like

let f = f1 / f2;   // if f1 and f2 are both 0.0, `f` is some NaN
function(f, f);    // both arguments are copies of the same value

then you are no longer allowed to "inline" the definition of f in both places, as that would change the function arguments from two values with definitely the same NaN payload to potentially different NaN payloads.

However, maybe we can make it deterministic but unspecified? As in, after each floating-point operation, if the result is NaN, something unspecified happens with the NaN bits, but given the same inputs there will definitely always be the same output?

The main issue with this is that it means that const-prop must exactly reproduce those NaN patterns (or refuse to const-prop if the result is a NaN).

My concern is largely with stuff like:

So is it the case that all that code would be okay with FP operations clobbering NaN bits?

My big concern still comes back to the notion that these payloads are "unspecified values which may change at any time" according to Rust.

Rust will probably just do whatever LLVM does, once they make up their mind and commit to a fixed and precise semantics. I think you are barking up the wrong tree here, I don't like unspecified values any more than you do. ;) I am just trying to come up with a consistent way to describe LLVM's behavior.

I'm a theoretical PL researcher, so that's something I have experience with that I am happy to lend here -- define a semantics that is consistent with optimizations and compilation to lower-level targets. However, not knowing much about floating-point makes this harder for me than it is for other topics. So I am relying on people like you to gather up the constraints to make sure the resulting semantics is not just consistent with LLVM but also useful. ;) It might turn out that that's impossible, in which case we can hopefully convince LLVM to change.

This is surprising, because I thought it was the whole point of LLVM's APFloat code (which even goes as far as to support like the horrible PowerPC long double type...). That said, it's not like I can argue with facts, if those bugs are happening, then they're happening... But are we sure those aren't just normal bugs in LLVM?

They might well be bugs! Since you seem to know a lot about floating-point, it would be great if you could help figure that out. :)

That said the only reason I wouldn't be willing to say "I don't care that much about what happens to NaN during const prop" is that you can't know when LLVM will happen to see enough to do more const prop.

Right, that's exactly the point -- const-prop must not change what the program does. So either it must produce the exact same results as hardware, or else we have to say that the involved operation is non-deterministic.

Just took a peek at https://webassembly.github.io/spec/core/exec/numerics.html (and elsewhere in the spec) and regret not doing so sooner. In particular, there's a lot of mention on when canonicalization can happen, but none of the places are on load/reinterpret.

So what is the executive summary?

A quick glance shows that these operations are definitely non-deterministic. So scratch all I said about this above, this basically forces LLVM to never ever duplicate floating-point instructions. Any proposals for (a) figuring out if they are doing this right and (b) documenting this in the LLVM LangRef to make sure they are aware of the problem?

@ecstatic-morse you listed #73288 in the original issue here, but isn't that a different problem? Namely, this issue here is about NaN bits in general, whereas #73288 is specific to i686 and thus seems more related to #72327. (I don't think we have a meta-issue for "x87 floating point problems", but maybe we should.)

#72327 affects only i586 targets (x86 without SSE2). This is a tier 2 platform, and the last x86 processor without SSE2 left the plant about 20 years ago, so I would have no problem exempting it from whatever guarantees around NaN payloads we wish to make. However, #73288 affects i686 (the latest 32-bit x86 target) as well, which is tier 1. Obviously, we could (and maybe should) exempt all 32-bit x86 targets from the NaN payload guarantees, but I consider #73288 to be of greater importance than issues only affecting i586.

As an aside, I will note that "Unless we are prepared to guarantee more" was doing a lot of work in the OP. I'd be very happy if we came up with a stricter set of semantics that we can support across tier 1 platforms (possibly exempting 32-bit x86) and implemented them. However, doing so will require a non-trivial amount of work, much of it on the LLVM side. I think that, in the meantime, we should explicitly state where we currently fall short in the documentation of affected APIs, similar to #10184. That's what this issue is about.

Also, look out for my latest crate, AtomicNanCanonicalizingF32, on crates.io.

#72327 affects only i586 targets (x86 without SSE2). This is a tier 2 platform, and the last x86 processor without SSE2 left the plant about 20 years ago, so I would have no problem exempting it from whatever guarantees around NaN payloads we wish to make. However, #73288 affects i686 (the latest 32-bit x86 target) as well, which is tier 1. Obviously, we could (and maybe should) exempt all 32-bit x86 targets from the NaN payload guarantees, but I consider #73288 to be of greater importance than issues only affecting i586.

Wait, so there are x87-specific bugs even when using SSE2? 😢 And here I was thinking that SSE2 solves the i586 mess.

Yes. The x86 calling convention mandates that floating point values are returned on the FPU stack. Values on the FPU stack are extended-precision, so storing them into an 8-byte f64 involves truncation and thus is an "arithmetic operation", which canonicalizes NaNs, according to the x86 manual.
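(A hedged sketch of the kind of code this affects; passthrough is just an illustrative non-inlined function, and whether the bits actually change depends on the target and on the value really travelling through the x87 stack.)

#[inline(never)]
fn passthrough(x: f64) -> f64 {
    x
}

fn main() {
    // A signaling NaN: exponent all ones, quiet bit clear, payload 1.
    let snan = f64::from_bits(0x7FF0_0000_0000_0001);
    let returned = passthrough(snan);
    // On x86_64 both lines print the same bits; on i686 the NaN can come back
    // quieted because of the FPU-stack return described above.
    println!("{:#018x}", snan.to_bits());
    println!("{:#018x}", returned.to_bits());
}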

However, #73288 affects i686 (the latest 32-bit x86 target) as well, which is tier 1.

<whisper> let's move all 32 bit targets to tier 2 </whisper>

I said this in Zulip but it belonged here probably.

I came across https://github.com/WebAssembly/design/blob/master/Rationale.md#nan-bit-pattern-nondeterminism (and also see https://github.com/WebAssembly/design/blob/master/Nondeterminism.md), which is interesting.

IEEE 754-2008 6.2 says that instructions returning a NaN should return one of their input NaNs. In WebAssembly, implementations may do this, however they are not required to. Since IEEE 754-2008 states this as a "should" (as opposed to a "shall"), it isn't a requirement for IEEE 754-2008 conformance.

This answers a lot of questions for #73328

Specifically, the way it works is that certain instructions are not guaranteed to preserve the payload bit pattern (in practice you can't rely on this portably anyway, so it seems fine not to guarantee anything here). Namely:

  • the instructions fsqrt, fceil, ffloor, ftrunc, fnearest, fadd, fsub, fmul, fdiv, fmin, fmax, promote (f32 as f64) and demote (f64 as f32) do not preserve the payload or sign bits for non-canonical NaNs, and do not preserve the sign bit for canonical NaNs (where "do not preserve" means "set to a nondeterministic value").

  • the instructions fneg, fabs, and fcopysign (the "sign bit operations", in IEEE 754 terms) fully preserve the NaN payload, only modify the sign bit where expected for the operation, and introduce no nondeterminism (this is actually a hard requirement of IEEE 754, so it's not surprising if they're going with the "technically compliant" argument, lol).

All other operations, such as copying values around, loading/storing them to memory, roundtripping arbitrary bitpatterns (including the patterns of noncanonical nans) through float values, using them as args, returning them from functions... these should all preserve sign and payload of nans.

As I mentioned before you can't portably rely on what happens to these NaN payloads if you do math on them, so I don't think what's there is a big deal if this is followed. My big concern was mostly that that last set of things wouldn't work.
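(A hedged sketch of what that summary means in Rust terms, assuming those guarantees hold on the target; on the i686/x87 setups discussed earlier the exact-bits assertions may not hold.)

fn main() {
    // A quiet NaN with a recognizable, non-canonical payload.
    let weird_nan = f64::from_bits(0x7FF8_0000_0000_0123);

    // Sign-bit operations (neg/abs/copysign) and plain copies preserve the
    // payload bits exactly, per the summary above.
    assert_eq!((-weird_nan).to_bits(), weird_nan.to_bits() | (1u64 << 63));
    assert_eq!(weird_nan.abs().to_bits(), weird_nan.to_bits());

    // Arithmetic on a NaN input only promises *some* NaN back; the payload
    // may be anything, so the only portable check is `is_nan()`.
    let after_add = weird_nan + 1.0;
    assert!(after_add.is_nan());
}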

A couple of additional notes:

  • There are probably more LLVM bugs beyond this (we hit a value-changing optimization in the portable SIMD group yesterday...)

  • This is not very different from platforms that turn on "flush subnormal numbers to zero" by default (like arm32), although I feel absurdly strongly that we should not adopt that nonsense just because one platform does it.

Did some fiddling with bit patterns and NaN; things look better now on my x86_64 machine, but I haven't exactly turned this kind of thing into a test that runs on all platforms, and LLVM might be cheating by knowing my inputs already (and thus, that I'm watching it):

https://play.rust-lang.org/?version=nightly&mode=release&edition=2018&gist=6c25cb08877ac25b3247c400d73db17d
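(One way to reduce the "LLVM already knows my inputs" effect is to launder the inputs through std::hint::black_box, which is stable these days; a minimal sketch:)

use std::hint::black_box;

fn main() {
    // `black_box` hides the value from the optimizer, so the round trip is
    // actually performed at run time instead of being const-folded away.
    let bits: u32 = black_box(0xFFC0_0001); // a negative quiet NaN with a payload
    let roundtripped = f32::from_bits(bits).to_bits();
    println!("{:#010x} -> {:#010x}", bits, roundtripped);
}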

@thomcc thanks! I have now opened a thread in the LLVM forum asking about the LLVM NaN semantics.

Some remarks re: IEEE754-{2008,2019}, following from the other thread:

  • Depending on the sign bit of a NaN is considered to create a "non-reproducible" program.
  • Actions that preserve the payload of a NaN may not have to preserve the sign.
  • A "value-changing optimization" such as LLVM's must preserve the literal meaning of the source code... but this is allowed to alter the sign of a NaN.

Depending on the sign bit of a NaN is considered to create a "non-reproducible" program.

What is the consequence of being "non-reproducible"? This is possible in safe code so it cannot do anything funny in Rust. In particular it may not introduce "unstable values" due to inconsistently applied compiler transformations.

A "reproducible" program always gets the same output for a given input across all possible IEEE-754 implementations. A "non-reproducable" program doesn't have that assurance.

What Lokathor said.

In particular it may not introduce "unstable values" due to inconsistently applied compiler transformations.

Ah, does this mean the situation in #81261 where two logically identical versions of the code yield different actual results is not something Rust should permit?

Ah, does this mean the situation in #81261 where two logically identical versions of the code yield different actual results is not something Rust should permit?

This can be okay if the rest of the compiler is sufficiently careful. ;)
But what must not happen is e.g.

// `test_bit_eq` stands in for some bit-level equality check.
let x = NEG_INFINITY.mul(0.0);
test_bit_eq(x, x) // always returns `true`

getting optimized to

test_bit_eq(NEG_INFINITY.mul(0.0), NEG_INFINITY.mul(0.0))

and then subsequently optimizing one operand to NEG_INFINITY * 0.0, such that the return value ends up changing to false. The original program would never return false, so the compiled program may not do that, either.

Basically, if floating-point operations are observably non-deterministic (can have different results depending on arbitrary circumstances, even without anything else observably changing), then they are not pure operations, and hence it is not correct for the compiler to "duplicate" them like is done in the first step of my example.

#81261 basically says that NEG_INFINITY * 0.0 is non-deterministic, and as such must be treated as an impure operation: evaluating it twice can yield different results. It is okay for operations to be impure (evaluating malloc twice also yields different results), but the rest of the optimizations must handle this correctly.

Ah, indeed! Yes, that would Not be okay.

Insofar as the standard is concerned, to my reading and understanding:

  • If all inputs to an op are non-NaN, then there are only a few sets of input values which can yield a NaN float, which do include mul(NEG_INFINITY, 0.0).
  • A NaN float is a bitstring with some bits set and others in an undetermined state. Their state can be revealed, however, by:
  • Operations that only examine a NaN float (e.g. partial_cmp) or interact with it solely as a bitstring (abs, neg, copysign, and Copy) are deterministic.

Most of the LLVM value-changing optimizations are noted as permissible to some degree by the IEEE754-2019 standard if offered as opt-ins, except for the "no signed zeros" marker, which the standard does not recognize as a valid optimization.

#81261 basically says that NEG_INFINITY * 0.0 is non-deterministic

That's not quite right. The issue is that when evaluated at compile time, it produces one result, and at runtime, another. Evaluating either at compiletime or runtime is fully deterministic (modulo wasm, where I guess it's explicitly nondeterministic).

That's not quite right. The issue is that when evaluated at compile time, it produces one result, and at runtime, another.

The only way this is not a bug is if evaluation is non-deterministic. Rust has the same evaluation rules for compile-time and run-time. Otherwise there'd be two Rust languages and we'd have a horrible mess...

Evaluating either at compiletime or runtime is fully deterministic (modulo wasm, where I guess it's explicitly nondeterministic).

Of course, the actual implementation is never non-deterministic. But the specification of Rust has to be non-deterministic here, or we have to change either compile-time or run-time behavior.

The only way this is not a bug is if evaluation is non-deterministic

IMO it is a bug.

Of course, the actual implementation is never non-deterministic. But the specification of Rust has to be non-deterministic here, or we have to change either compile-time or run-time behavior.

I mean, it's really easy for me to argue that changing the compile-time behavior is the right fix. Unfortunately, that's difficult because it requires changing how APFloat works in LLVM, and it's not a trivial change either.

That said, IMO the solution to hard, low-impact bugs shouldn't be to rework the language so that they're not bugs. Eventually they should be fixed, even if it's not a high priority.

Additionally, a different Rust compiler probably wouldn't have the same difficulty here.

That's not quite right. The issue is that when evaluated at compile time, it produces one result, and at runtime, another.

Also, that's not even true. The original code sample in that issue shows two different behaviors at runtime:

use std::ops::Mul;

fn main() {
    // One of these logically identical assertions passes and the other fails
    // (reportedly because only one is affected by LLVM's constant propagation).
    assert_eq!(1.0f64.copysign(f64::NEG_INFINITY.mul(0.0)), -1.0f64);
    assert_eq!(1.0f64.copysign(f64::NEG_INFINITY * 0.0), -1.0f64);
}

What is the consequence of being "non-reproducible"? This is possible in safe code so it cannot do anything funny in Rust. In particular it may not introduce "unstable values" due to inconsistently applied compiler transformations.

I've been meaning to say this, but the reproducibility rules are probably a bit of a red herring. They're only really meant to apply to programs that opt into a subset of floating point semantics.

Also, that's not even true. The original code sample in that issue shows two different behaviors at runtime:

I believe this is due to one of these being impacted by LLVM's constant propagation and the other not.

I believe this is due to one of these being impacted by LLVM's constant propagation and the other not.

Sure. But that doesn't change the fact that this is runtime code. And to my knowledge, LLVM doesn't consider this optimization a bug, since the result produced by LLVM is legal according to the IEEE floating-point spec. There isn't even an LLVM bug report for the f64::NEG_INFINITY * 0.0 case, is there?

That said, IMO the solution to hard, low-impact bugs shouldn't be to rework the language so that they're not bugs. Eventually they should be fixed, even if it's not a high priority.

It is my understanding that some aspects of the bitwise results of floating-point operations (in particular for NaNs) are inherently not defined in the LLVM IR semantics (or in the IEEE semantics, which LLVM [mostly?] follows). This is not a bug, it is part of their spec. So if we want to use LLVM as the backend, we have no choice but to also incorporate a similar kind of non-determinism into the Rust semantics (or lobby for LLVM to change their spec).

This is not reworking the language, it is properly understanding the consequences of what it means to say that Rust uses IEEE floating-point semantics. I agree that it would be nice to have deterministic floating-point operations, but that's just not realistic when LLVM (and WebAssembly) made a different choice.

Put differently: a bug usually means that something is not working according to spec. I don't see that happen here (but I keep getting lost in the details of FP semantics). My understanding is that this issue is about better documenting the Rust spec, not about changing the behavior of rustc.

One could argue that the spec has a bug due to being too liberal, but given that the spec we are talking about here is the LLVM IR spec and by extension the IEEE FP spec, that does not seem like a particularly useful or constructive approach. (Specs can certainly have bugs when they fail to be self-consistent or when they do not adequately reflect intended behavior, but that does not seem to be the case here.)

I do not think lobbying LLVM for hardware-respecting behavior is that unlikely to succeed. It may make some proofs regarding optimizations easier, for one.

It may make some proofs regarding optimizations easier, for one.

I don't see how that would be the case.

I do not think lobbying LLVM for hardware-respecting behavior is that unlikely to succeed.

Fair. But this is the wrong forum to do so. ;)

What about refusing to constant-evaluate any operation that is non-reproducible?

By const-evaluate I assume you mean constant propagation / constant folding, i.e., the optimization pass that tries to avoid redundant computations at runtime? That is distinct from CTFE (compile-time function evaluation, also sometimes called const evaluation), which is about computations that the spec says happen at compile-time (such as the initial values of a const, array sizes, or enum discriminant values).

We could do that in rustc, but can we convince LLVM to stop folding f64::NEG_INFINITY * 0.0?

By const-evaluate I assume you mean constant propagation / constant folding, i.e., the optimization pass that tries to avoid redundant computations at runtime? That is distinct from CTFE (compile-time function evaluation, also sometimes called const evaluation), which is about computations that the spec says happen at compile-time (such as the initial values of a const, array sizes, or enum discriminant values).

We could do that in rustc, but can we convince LLVM to stop folding f64::NEG_INFINITY * 0.0?

File a bug against LLVM? I don’t know 🙂

I have written a Pre-RFC on our floating-point guarantees, which is almost exclusively about NaNs. That document describes what are currently the best possible guarantees we can provide, given LLVM's documentation. However, LLVM also seems to be open to providing stronger guarantees.

and the last x86 processor without SSE2 left the plant about 20 years ago

To be pedantic, the Vortex86DX3 is still being made and only supports SSE.
And they claim Linux support. Some poor soul out there may still be compiling x86-no-SSE2 code for Linux shipped on "new" hardware. That said, I'm not aware of any instances of this actually happening, just raising the possibility.

Edit: #35045 (comment) mentioned in 2016 that he's using a Vortex86

I'm more concerned about someone using -C target-cpu=pentium on one of our tier 1 i686 targets and expecting that to work properly. Maybe we should just forbid disabling SSE2 support...

The RFC rust-lang/rfcs#3514 makes a concrete proposal for our guarantees for the bits of NaNs.

I think this was resolved by #129559.