rust-lang/rust

Wrong signs on division producing NaN

dtolnay opened this issue · 16 comments

Noticed this while playing with #54235.

fn f(x: f64) -> f64 {
    0f64 / x
}

fn main() {
    println!("{:?}", (0f64 / 0f64).is_sign_negative());
    println!("{:?}", f(0f64).is_sign_negative());
}

As of rustc 1.31.0-nightly (46880f4 2018-10-15) on x86_64-unknown-linux-gnu, in debug mode this program prints false true and in release mode prints false false. Two of my expectations are violated:

  • The output should be consistent between debug mode and release mode.
  • The first and second println should print the same value.

(Happy to reconsider if these expectations are unfounded.)

Compilers 1.19 and older consistently print false false which aligns with my expectations; 1.20 and newer behave as above.

Evidently LLVM does not guarantee the sign of NaNs, just as it does not guarantee the signaling bit or payload. I can't say I would have known that, but it doesn't surprise me either.

Two observations that explain these discrepancies:

  • (0f64 / 0f64) is constant folded even in debug mode (by IRBuilder), while f(0f64) obviously is only constant folded when inlined, i.e., in release mode.
  • When constant folding a floating point computation that results in a NaN, LLVM prefers 0x7FF8000000000000 (which has positive sign). Apparently your CPU differs and produces a negative NaN for the runtime division.
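To see the two observations side by side, a minimal sketch that prints the raw bits of both divisions. The helper names `sign_and_bits` and `runtime_zero_div` are made up for illustration, and `std::hint::black_box` is only a best-effort hint to the optimizer, so which path is actually folded (and which bits are printed) may vary by compiler version, optimization level, and target.

```rust
use std::hint::black_box;

/// Returns the sign flag and raw bit pattern of `x`.
fn sign_and_bits(x: f64) -> (bool, u64) {
    (x.is_sign_negative(), x.to_bits())
}

/// 0/0 with both operands passed through `black_box`, which asks the
/// optimizer (best effort only) not to see through them.
fn runtime_zero_div() -> f64 {
    black_box(0f64) / black_box(0f64)
}

fn main() {
    // Literal operands: likely folded by the compiler.
    let (folded_neg, folded_bits) = sign_and_bits(0f64 / 0f64);
    // Opaque operands: likely computed by the hardware at run time.
    let (runtime_neg, runtime_bits) = sign_and_bits(runtime_zero_div());

    println!("folded:  negative={folded_neg} bits={folded_bits:#018x}");
    println!("runtime: negative={runtime_neg} bits={runtime_bits:#018x}");
}
```

Both results are NaN either way; only the sign bit (and possibly other bits) may differ between the two lines.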

In other words, the semantics of floating point operations would be something like "if the result is a NaN, non-deterministically pick any legal NaN representation". This non-determinism explains why debug and release builds differ in behavior.

I wonder if we should make Miri pick a random NaN payload and sign and signalling bit, just to drive home this point...

@hanna-kruppe notes that "NaNs are unstable under copying" seems rather excessive and in fact people might rely on NaN payloads being preserved on copy.

A less drastic alternative is to say that every single FP operation (arithmetic and intrinsics and whatnot, but not copying), when it returns a NaN, non-deterministically picks any NaN representation.
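A rough sketch of what that alternative would mean for user code, assuming a typical target such as x86_64 where `f64::from_bits` / `to_bits` round-trip NaN bits unchanged. The helpers `nan_with_payload` and `payload_of` and the two constants are hypothetical names for illustration. Under the proposed semantics, the copy keeps the payload, while the result of the addition may be any NaN.

```rust
/// Exponent all ones plus the top mantissa bit: a quiet NaN with empty payload.
const QUIET_NAN: u64 = 0x7ff8_0000_0000_0000;
/// The remaining 51 mantissa bits, used here as the payload.
const PAYLOAD_MASK: u64 = 0x0007_ffff_ffff_ffff;

/// Builds a quiet NaN carrying `payload` in its low mantissa bits.
fn nan_with_payload(payload: u64) -> f64 {
    f64::from_bits(QUIET_NAN | (payload & PAYLOAD_MASK))
}

/// Extracts the payload bits from any f64 bit pattern.
fn payload_of(x: f64) -> u64 {
    x.to_bits() & PAYLOAD_MASK
}

fn main() {
    let tagged = nan_with_payload(0x1234);

    // A plain copy is not an FP operation, so the payload should survive.
    let copied = tagged;
    println!("payload after copy: {:#x}", payload_of(copied));

    // An arithmetic operation returning NaN may pick any NaN representation,
    // so nothing should be assumed about this payload or sign.
    let computed = copied + 1.0;
    println!("payload after add:  {:#x}", payload_of(computed));
}
```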

I believe it was @Lokathor who made this point, though I don't disagree.

However, I have doubts whether either option is enough to explain away the behavior LLVM can produce today. Don't have time to summarize but here's a link to the Zulip discussion for future reference: https://rust-lang.zulipchat.com/#narrow/stream/213817-t-lang/topic/floating.20point.20semantics

> However, I have doubts whether either option is enough to explain away the behavior LLVM can produce today.

I did not see anything in that discussion that makes it sound like either option wouldn't work -- my current impression is that both correctly describe LLVM behavior. What did I miss? (Not urgent, just respond when you have time again.)

Specifically https://rust-lang.zulipchat.com/#narrow/stream/213817-t-lang/topic/floating.20point.20semantics/near/194786318 and the whole earlier discussion about how combinations of other optimizations can result in different uses of the same value (in Rust / the initial LLVM IR) observing different results. We talked about how maybe floats should be "frozen" when moving into the integer domain, but this does not currently happen. And as I said in https://rust-lang.zulipchat.com/#narrow/stream/213817-t-lang/topic/floating.20point.20semantics/near/194786318, LLVM can currently eliminate the float<->int bitcasts/transmutes/etc. that we do have (even if one might argue that it shouldn't).

Hm, okay, if LLVM will duplicate casts then that would indeed contradict a "typed copy messes up NaN" semantics.

For the "FP operations pick arbitrary NaN" semantics, I suppose LLVM will also happily duplicate floating point operations since it considers them deterministic?

But together with "NaNs are not preserved", that actually leads to a contradiction, and if we can make LLVM do the right optimizations in the right order we can likely show a miscompilation from this.

Right, I believe there are potential miscompilations lurking there, but they're probably very difficult to tease out -- maybe even impossible today, if the stars don't align.

Would it be worth bringing this up with LLVM? Seems like either they should clarify that NaN payloads are not preserved by some of their FP operations, or else they should consider this a bug. The former might be a problem because people compile browsers with LLVM and those browsers' JS/wasm runtimes might want to actually carry data in NaN payloads...

> In other words, the semantics of floating point operations would be something like "if the result is a NaN, non-deterministically pick any legal NaN representation". This non-determinism explains why debug and release builds differ in behavior.
>
> I wonder if we should make Miri pick a random NaN payload and sign and signalling bit, just to drive home this point...

One note: the IEEE 754 fp standard requires the result of arithmetic operations to not be signaling NaNs.

What the IEEE 754 FP standard says and what implementations do are very different things in practice, per #10186.

If we follow wasm, then Miri could pick any arithmetic NaN. Whether and how that aligns with being signalling or not, I do not know.

Based on this I am inclined to declare this not-a-bug: NaN-producing operations do not have a well-defined sign, so there cannot be a 'wrong' sign. This is the semantics both in LLVM and wasm. I think Rust should follow suit.

Muon commented

This is definitely permissible according to IEEE 754. The only guarantee is that the result of 0/0 is a quiet NaN. The sign bit is not required to be the same between two divisions. Although the target FPU usually produces only specific NaNs, Rust does not (presently) promise that it upholds the semantics of the target FPU.
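For illustration, a small sketch of checking just that guarantee: the result is a NaN and it is quiet, with nothing asserted about the sign. `is_quiet_nan` is a hypothetical helper, and it assumes the common convention (used on x86, ARM, and by the 2008 revision of IEEE 754) that the most significant mantissa bit being set marks a quiet NaN; some legacy targets use the opposite convention.

```rust
use std::hint::black_box;

/// Most significant mantissa bit of an f64. On most targets, set means quiet.
const QUIET_BIT: u64 = 0x0008_0000_0000_0000;

/// True if `x` is a NaN whose quiet bit is set (see the assumption above).
fn is_quiet_nan(x: f64) -> bool {
    x.is_nan() && (x.to_bits() & QUIET_BIT) != 0
}

fn main() {
    let folded = 0f64 / 0f64;
    let runtime = black_box(0f64) / black_box(0f64);

    // Both should be quiet NaNs, whatever their sign bits turn out to be.
    println!("folded quiet NaN:  {}", is_quiet_nan(folded));
    println!("runtime quiet NaN: {}", is_quiet_nan(runtime));
    println!("signs equal:       {}", folded.is_sign_negative() == runtime.is_sign_negative());
}
```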

Closing in favor of #73328: we are not guaranteeing anything about the sign of a NaN produced by 0.0 / 0.0. (This matches, for instance, the WebAssembly specification.) Better documentation of all this is clearly required, that's what the other issue is about.
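For code that needs deterministic output despite this (hashing, serialization, snapshot tests), one possible approach is to test with `is_nan()` and collapse every NaN to a single bit pattern before looking at the bits. This is only a sketch; `canonical_bits` is a hypothetical helper, not something from the standard library.

```rust
/// Returns the bit pattern of `x`, with every NaN mapped to the bits of
/// `f64::NAN` so that sign and payload differences disappear.
fn canonical_bits(x: f64) -> u64 {
    if x.is_nan() {
        f64::NAN.to_bits()
    } else {
        x.to_bits()
    }
}

fn main() {
    let a = 0f64 / 0f64;
    let b = -a; // Negation flips the sign bit, so `a` and `b` differ in bits.

    println!("raw bits equal:       {}", a.to_bits() == b.to_bits());
    println!("canonical bits equal: {}", canonical_bits(a) == canonical_bits(b));
}
```

Non-NaN values pass through unchanged, so `0.0` and `-0.0` still have distinct canonical bits.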