rust-lang/rust

Tracking issue: 32bit x86 targets without SSE2 have unsound floating point behavior

RalfJung opened this issue Β· 35 comments

On x86 (32bit) targets that cannot use SSE2 instructions (this includes the tier 1 i686 targets with flags that disable SSE2 support, such as -C target-cpu=pentium), floating-point operations can return results that are rounded differently than they should be, and results can be "inconsistent": depending on whether const-propagation happened, the same computation can produce different results, leading to a program that seemingly contradicts itself. This is caused by using x87 instructions to perform floating-point arithmetic, which do not accurately implement IEEE floating-point semantics (not at the right precision, anyway). The test tests/ui/numbers-arithmetic/issue-105626.rs has an example of such a problem.
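
For illustration (a hand-written sketch, not the contents of that test), this is the kind of double-rounding discrepancy involved, assuming the addition below is lowered to an x87 fadd at the default extended precision:

fn main() {
    // b = 2^-53 + 2^-78 is exactly representable in f64 (26 significand bits).
    let a = std::hint::black_box(1.0_f64);
    let b = std::hint::black_box(2f64.powi(-53) + 2f64.powi(-78));
    let sum = a + b;
    // IEEE 754 f64 semantics (and const evaluation): the exact sum lies just above
    // the midpoint between 1.0 and 1.0 + 2^-52, so it rounds up to 1.0 + 2^-52.
    // x87 at extended precision: the intermediate rounds to 1.0 + 2^-53, and the
    // subsequent store to f64 rounds that tie down to exactly 1.0 (ties-to-even).
    println!("{}", sum == 1.0 + 2f64.powi(-52)); // true per IEEE 754; may print false via x87
}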

Worse, LLVM can use x87 registers to store values it thinks are floats, which quiets signaling NaNs and thus alters the value -- leading to miscompilations.

This is an LLVM bug: rustc is generating LLVM IR with the intended semantics, but LLVM does not compile that code in the way that the LLVM LangRef describes. This is a known and long-standing problem, and very hard to fix. The affected targets are so niche these days that that is nobody's priority. The purpose of this issue mostly is to document its existence and to give it a URL that can be referenced.

Some ideas that have been floated for fixing this problem:

  • We could emit different instruction sequences for floating-point operations that emulate the expected rounding behavior using x87 instructions. This will likely require changes deep in LLVM's x86 backend. This is what Java does.
  • We could use softfloats.
  • We could set the FPU control register to 64bit precision for Rust programs, and require other code to set the register in that way before calling into a Rust library. (This does not work; see the discussion below.)

Related issues:

  • llvm/llvm-project#44218: basically the same bug on the LLVM side, and pointing out the potential for soundness issues
  • #115567: that's a different problem, also affecting x86-32 but unrelated to what happens when an FP operation is executed; it is about the behavior of NaN bits as they get returned from a function

Prior issues:

Can we just drop support for x86 without SSE2 or fall back to software floating point?

Can we just drop support for x86 without SSE2

We currently have the following targets in that category:

  • i586-pc-windows-msvc
  • i586-unknown-linux-gnu
  • i586-unknown-linux-musl

They are all tier 2. I assume people added them for a reason, so I doubt they will be happy about having them dropped. Not sure to what extent floating point support is needed on those targets, but I don't think we have a concept of a "target without FP support".

or fall back to software floating point?

In theory of course we can, not sure if that would be any easier than following the Java approach.

or fall back to software floating point?

In theory of course we can, not sure if that would be any easier than following the Java approach.

it should be much easier since LLVM already supports that, unlike the Java scheme (afaik)

https://gcc.godbolt.org/z/14WdnKhhP

FWIW f32 on x86-32-noSSE should actually be fine, since double-rounding is okay as long as the precision gap between the two modes is big enough. Only f64 has a problem since it is "too close" to the 80-bit precision of the x87 FPU.

On another note, we even have code in the standard library that temporarily alters the x87 FPU control word to ensure exact 64bit precision...
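
For reference, here is a minimal sketch of what such a control-word switch can look like (illustrative only, not the standard library's actual code; the function name is made up, and it assumes x86 inline asm):

// Run `f` with the x87 precision-control field (control word bits 8-9) set to
// 0b10 = 53-bit significands, i.e. the "64bit precision" discussed above,
// restoring the previous control word afterwards.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn with_53bit_precision<R>(f: impl FnOnce() -> R) -> R {
    use std::arch::asm;
    let mut saved: u16 = 0;
    unsafe {
        asm!("fnstcw word ptr [{}]", in(reg) &mut saved as *mut u16, options(nostack));
        let modified = (saved & !(0b11u16 << 8)) | (0b10 << 8);
        asm!("fldcw word ptr [{}]", in(reg) &modified as *const u16, options(nostack));
    }
    let result = f();
    unsafe {
        asm!("fldcw word ptr [{}]", in(reg) &saved as *const u16, options(nostack));
    }
    result
}

fn main() {
    // On SSE targets this division never touches the x87, so the switch is a no-op there.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    println!("{}", with_53bit_precision(|| std::hint::black_box(1.0_f64) / 3.0));
}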

I wonder if there's something that could be done about the fact that this also affects tier 1 targets with custom (stable) flags such as -C target-cpu=pentium. Do we still want to consider that a tier 1 target in itself? Is there a way that we can just reject flags that would disable SSE2 support, and tell people to use a different target instead?

@RalfJung What about forcing non-SSE2 targets to software floating point? That must be supported anyway because of kernel code.

Yeah that's an option listed above, since you already proposed it before. I have no idea how feasible it is. People are very concerned about the softfloat support for f16/f128 leading to code bloat and whatnot, so the same concerns would likely also apply here.

AFAIK the kernel code just doesn't use floats, I don't think they have softfloats?

Just to clarify, this is really only for 32-bit x86 non-SSE targets, and doesn't affect x86-64 non-SSE2 targets like x86-64-unknown-none?

Can we just drop support for x86 without SSE2

I would guess we'll need to support i686-unknown-none like x86-64-unknown-none for use in operating system kernels like Linux that don't allow vector registers to be used (without extra work) even when the hardware has them.

or fall back to software floating point?

That appears to be what x86_64-unknown-none does, one of the non-SSE2-by-default x86-64 targets. At least the table at https://doc.rust-lang.org/beta/rustc/platform-support.html says "softfloat", though the detailed target documentation at https://doc.rust-lang.org/beta/rustc/platform-support/x86_64-unknown-none.html doesn't mention softfloat.

Just to clarify, this is really only for 32-bit x86 non-SSE targets, and doesn't affect x86-64 non-SSE2 targets like x86-64-unknown-none?

x86-64-unknown-none uses softfloats so it should not be affected.
I don't think there are any x86-64 hardfloat targets without SSE.

I would guess we'll need to support i686-unknown-none like x86-64-unknown-none for use in operating system kernels like Linux that don't allow vector registers to be used (without extra work) even when the hardware has them.

I think they were asking about targets where the hardware doesn't have SSE.
Softfloat targets are fine, assuming the softfloat libraries implement the IEEE spec correctly.

We could set the FPU control register to 64bit precision for Rust programs, and require other code to set the register in that way before calling into a Rust library.

This is not sufficient to ensure full IEEE754 compliance (for example see here).

What's described there sounds different. That is using volatile to force rounding from "extended precision" to "double precision" after each operation. But of course the actual operation is still done with extended precision, which leads to double-rounding, which leads to wrong results.

The proposal you quoted was to switch x87 precision such that the operation is performed with double-precision to begin with, entirely avoiding double rounding.

It is possible that the x87 has further issues that make this not work, but the link you posted does not even seem to mention the idea of changing the FPU control register to get a different precision, so as far as I can see it doesn't provide any evidence that setting the x87 to 64bit precision would lead to incorrect results.

From the post (emphasis mine):

Floating-point calculations done with x87 FPU instructions are done in extended-precision registers, even when the processor is in double-precision mode and the variables being operated on are double-precision.

Hm, my understanding was that switching the FPU mode to 64bit would solve the problem. But of course it's possible that x87 screws up even when explicitly asked for IEEE 754 64bit arithmetic. 🀷

Requiring the FPU to be in 64bit mode is anyway not a realistic option; I listed it just for completeness' sake.

running x87 in 53-bit mode works except that denormal f64 results have too much precision and the exponent range is too big (e.g. it can express 2^-8000 but f64 can't)

Does the Java encoding behave correctly for denormals?

That it can express too many exponents wouldn't be a problem if the extra precision doesn't lead to different rounding results.

Yes, Java behaves correctly in all cases. It scales the exponent of one of the arguments before multiplication and division to ensure that where a 64-bit op would evaluate to a denormal, the 80-bit op does the same (and then it scales the exponent of the result back to what it should be). This is all described in the PDF linked from OP.

Unfortunately, this doesn't just affect floating point arithmetic, as LLVM will load to and store from the x87 floating point stack even when just moving floats around. As loading and storing f32s and f64s to and from the x87 floating point stack quiets signalling NaNs, but LLVM assumes that it does not, this can lead to miscompilations. For instance, the following program will segfault when run after compiling with optimisations for an i586 target (e.g. rustc -O --target=i586-unknown-linux-gnu code.rs).

#[derive(Copy, Clone)]
#[repr(u32)]
enum Inner {
    // Same bit pattern as a signalling NaN `f32`.
    A = (u32::MAX << 23) | 1,
    B,
}

#[derive(Copy, Clone)]
enum Data {
    I(Inner),
    F(f32),
}

#[inline(never)]
fn store_data(data: Data, data_out: &mut Data) {
    // Suggest to LLVM that the data payload is a float.
    std::hint::black_box(match data {
        Data::I(x) => 0.0,
        Data::F(x) => x,
    });
    // LLVM will optimise this to a float load and store (with a separate load/store for the discriminant).
    *data_out = match data {
        Data::I(x) => Data::I(x),
        Data::F(x) => Data::F(x),
    };
}

fn main() {
    let mut res = Data::I(Inner::A);
    store_data(Data::I(Inner::A), &mut res);
    if let Data::I(res) = res {
        // LLVM will optimise out the bounds check as the index should always be in range.
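        // Without the miscompilation, `res` keeps the `Inner::A` bit pattern 0xFF80_0001,
        // so `index` is 1 and this prints 2. If an x87 load/store has quieted the NaN
        // payload to 0xFFC0_0001, `index` becomes 0x0040_0001 and the access is far out
        // of bounds -- with the bounds check gone, that is the segfault.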
        let index = (res as u32 - (u32::MAX << 23)) as usize;
        dbg!([1, 2, 3, 4, 5][index]); // This will segfault.
    } else {
        unreachable!();
    }
}

Wow, that's a great example. It even works in entirely safe code. Impressive.

We probably have to upgrade this issue to I-unsound then. Is there an upstream LLVM issue for this miscompilation?

Muon commented

That is impressive. It looks closely related to llvm/llvm-project#44218. Perhaps it's yet another example to toss on that pile? I'm honestly starting to think LLVM should just deprecate f32 and f64 support on x87 targets.

As loading and storing f32s and f64s to and from the x87 floating point stack quiets signalling NaNs

(I have not tested it, but I wonder if this x87 behavior actually affects not only floats, but also the i586's 64-bit atomic load/store, which uses x87 load/store instructions.) EDIT: sorry, my comment is not correct: see #114479 (comment)

Atomic loads/stores use the integer-to-float/float-to-integer load/store instructions (fild/fistp) as opposed to the float-to-float load/store instructions (fld/fstp) and are therefore unaffected.

The above example will segfault on current stable (1.77.2), but not current nightly (2024-04-22). Making the data argument of the store_data function an &mut reference makes it segfault on both:

#[derive(Copy, Clone)]
#[repr(u32)]
enum Inner {
    // Same bit pattern as a signalling NaN `f32`.
    A = (u32::MAX << 23) | 1,
    B,
}

#[derive(Copy, Clone)]
enum Data {
    I(Inner),
    F(f32),
}

#[inline(never)]
fn store_data(data: &mut Data, data_out: &mut Data) {
    // Suggest to LLVM that the data payload is a float.
    std::hint::black_box(match *data {
        Data::I(x) => 0.0,
        Data::F(x) => x,
    });
    // LLVM will optimise this to a float load and store (with a separate load/store for the discriminant).
    *data_out = match *data {
        Data::I(x) => Data::I(x),
        Data::F(x) => Data::F(x),
    };
}

fn main() {
    let mut res = Data::I(Inner::A);
    store_data(&mut Data::I(Inner::A), &mut res);
    if let Data::I(res) = res {
        // LLVM will optimise out the bounds check as the index should always be in range.
        let index = (res as u32 - (u32::MAX << 23)) as usize;
        dbg!([1, 2, 3, 4, 5][index]); // This will segfault.
    } else {
        unreachable!();
    }
}

It's also possible to cause miscompilations due to the difference between what floats evaluate to at compile-time vs. at runtime. The following program, which is a very lightly modified version of @comex's example from a different issue (rust-lang/unsafe-code-guidelines#471 (comment), see that comment for more details on how this works), will segfault when run after compiling with optimisations on i586 targets:

#[inline(never)]
fn print_vals(x: f32, i: usize, vals_i: u32) {
    println!("x={x} i={i} vals[i]={vals_i}");
}

#[inline(never)]
pub fn evil(vals: &[u32; 300]) {
    // Loop variables:
    let mut x: f32 = 0.0; // increments by 1-and-a-bit every time
    let mut i: usize = 0; // increments by 2 every time

    while x != 90.0 {
        // LLVM will do a brute-force evaluation of this loop for up to 100
        // iterations to try to calculate an iteration count.  (See
        // `llvm/lib/Analysis/ScalarEvolution.cpp`.)  Under normal floating
        // point semantics, `x` will equal exactly 90.0 after 90 iterations;
        // LLVM discovers this by brute-force evaluation and concludes that the
        // iteration count is always 90.

        // Now, if this loop executes 90 times, then `i` must be in the range
        // `0..180`, so the bounds check in `vals[i]` should always pass, so
        // LLVM eliminates it.
        print_vals(x, i, vals[i]);

        // Update `x`.  The exact computation doesn't matter that much; it just
        // needs to:
        //   (a) be possible to constant-evaluate by brute force (i.e. by going
        //       through each iteration one at a time);
        //   (b) be too complex for IndVarSimplifyPass to simplify *without*
        //       brute force;
        //   (c) depend on floating point accuracy.

        // First increment `x`, to make sure it's not just the same value every
        // time (either in LLVM's opinion or in reality):
        x += 1.0;

        // This adds a small float to `x`. This should get rounded to no change
        // as the float being added is too small to make a difference to `f32`'s
        // 23-bit fraction. However, it will make a difference to the value of
        // the `f80` on the x87 floating point stack. This means that `x` will
        // no longer be a whole number and will never hit exactly 90.0.
        x += (1.0_f32 / 2.0_f32.powi(25));

        // Update `i`, the integer we use to index into `vals`.  Why increment
        // by 2 instead of 1?  Because if we increment by 1, then LLVM notices
        // that `i` happens to be equal to the loop count, and therefore it can
        // replace the loop condition with `while i != 90`.  With `i` as-is,
        // LLVM could hypothetically replace the loop condition with
        // `while i != 180`, but it doesn't.
        i += 2;

    }
}

pub fn main() {
    // Make an array on the stack:
    let mut vals: [u32; 300] = [0; 300];
    for i in 0..300 { vals[i as usize] = i; }
    evil(&vals);
}

Nominating for t-compiler discussion.

This tracking issue shows that we have targets that intersect our platform support tiers in different ways. For example, i686 targets are tier 1, but the "non-SSE2" targets are tier 2 (and suffer from codegen unsoundness). These differences are not apparent in our documentation.

So as discussed on Zulip there are probably a number of questions:

  • targets that just don't have SSE2, like i586 -- these are tier 2, maybe critical codegen bugs are "fine" there?
  • tier 1 targets with SSE2 disabled: we don't have a way to say "that's actually a tier 2 situation", so maybe we should just emit a hard error? or at least a warning?
  • (to which I also add): if these issues stem from LLVM, can we faithfully represent a realistic situation or should we just point at LLVM and say "we just do what they do"?

(please feel free to add context/correct how I represent the problem, thanks!)

@rustbot label I-compiler-nominated

In a sense the issue stems from LLVM, yeah -- x86 without SSE seems to be a poorly supported target. Even in the best of cases, f64 just behaves incorrectly on that target (and f32, to a lesser extent), and then it turns out things are actually a lot worse.

Altogether I think there are enough issues with floating point on x86 without SSE (this one, and also #115567) that IMO we should say that tier 1 hardfloat targets require SSE, period. It is already the case that using feature flags to turn a hardfloat target into a softfloat target is unsound (Cc #116344), and we should simply hard-error in those cases (e.g. disabling the x87 feature on any hardfloat x86 target). IMO we should do the same when disabling SSE/SSE2 on an i686 target.

@rustbot label -I-compiler-nominated

Discussed in compiler triage (on Zulip). Copying here the summary:

  • we acknowledge this is a problem
  • we should update target docs to contain a mention of this issue (non-SSE2 x86 codegen is basically unsound)
  • a MIR lint to detect float ops when using these broken targets would help users
  • a lint to detect mismatched -Ctarget-features between dependencies would help with -Ctarget-feature=-sse and similar cases for other platforms
  • Current Tier 1 x86 targets require SSE-based floats at minimum (i.e. not softfloats)

Does Rust expose the difference between representation and evaluation method? In C, `FLT_EVAL_METHOD` is used by applications to detect platforms on which the computations are performed at a higher precision than the underlying value representations.

This doesn't completely solve the problem, as the 8087 ABI also performs representation conversion across function call boundaries, so passing (or returning) an sNaN will have it transformed into a qNaN and raise an exception.

It's a stretch to claim that the 8087 FPU has 'unsound' floating point behavior; it's all quite compliant with the IEEE 754 specification, if you squint hard enough. Changing the ABI to pass float/double without a representation change would resolve the worst of the problems I've found, and that's not a hardware issue.

In Rust, all operations are individually evaluated at the precision of the type being used (which I believe is the equivalent of `FLT_EVAL_METHOD` 0) using the default round-to-nearest, ties-to-even rounding mode (except for methods which explicitly document a different rounding mode like .round()). The basic `+`, `-`, `*` and `/` operators are guaranteed to give correctly rounded results, and methods which have an unspecified precision are documented as such.
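
For illustration only (a sketch of mine, not something from the Rust documentation): one consequence of "individually evaluated" is that a * b + c is never contracted into a fused multiply-add, so the first line below must print 0. On the affected x87 targets it may nevertheless print the nonzero value, which is exactly the kind of excess-precision bug this issue is about.

fn main() {
    let a = std::hint::black_box(1.0_f64 + 2f64.powi(-30));
    let c = std::hint::black_box(-(1.0_f64 + 2f64.powi(-29)));
    // Exact a * a = 1 + 2^-29 + 2^-60. Rounding the product to f64 drops the
    // 2^-60 term, so the separately rounded expression cancels to exactly 0.0.
    println!("{}", a * a + c);
    // f64::mul_add rounds only once, so the 2^-60 term survives (~8.67e-19).
    println!("{}", f64::mul_add(a, a, c));
}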

The unsoundness is not just theoretical; the LLVM IR Rust compiles f32 and f64 operations to has the desired Rust semantics, whereas the LLVM non-SSE x86 backend compiles that IR to machine code that violates the semantics of the IR. This means e.g. LLVM optimisations that (correctly) operate presuming the LLVM IR semantics with regards to evaluation precision can cause the LLVM non-SSE x86 backend to introduce out-of-bounds reads/writes in safe code (see this earlier comment for a code example). The NaN quietening issue also violates the semantics of LLVM IR and can cause the emitted binary to mutate the value of non-float types (see this earlier comment for the code sample, details in this comment).

Because this is a miscompilation at the LLVM IR -> machine code stage, as opposed to the Rust -> LLVM IR stage, miscompilations can occur in other programming languages that use LLVM as a codegen backend. For example, llvm/llvm-project#89885 contains an example of a miscompilation from C. Ultimately what matters are the semantics of the LLVM IR; not everything that is permitted by the IEEE 754 specification is permitted by LLVM IR (and vice versa).

The return value ABI issue is tracked separately in #115567, and affects all 32-bit x86 targets, not just those with SSE/SSE2 disabled. It is possible to manually load/store a f32/f64 signalling NaN to/from an x87 register without quietening it (see e.g. the code in #115567 (comment)), but currently neither LLVM nor Rust do so. The "Rust" ABI (which doesn't have any stability guarantees) is being changed to avoid x87 registers completely in #123351.

Yeah, it sounds like Rust cannot support other values for FLT_EVAL_METHOD, and so doesn't support these kinds of corner cases in the IEEE 754 spec. I suspect Rust and the x87 FPU are just not going to get along. Even if you kludge around the sNaN->qNaN adventures, you're still going to suffer from double rounding, as every computation is first rounded to 80 bits and then to 32 or 64.

Having Rust produce results that depend on the underlying hardware seems antithetical to some pretty fundamental language goals. But avoiding that would mean abandoning the x87 FPU entirely, and that doesn't seem feasible, even though SSE2, which provides the necessary native 32/64-bit binary format support, is nearly old enough to serve in the US House of Representatives.

I'd encourage someone to figure out how to tell applications that the underlying hardware doesn't work the way they expect. Mirroring some version of FLT_EVAL_METHOD from C would at least be easy and sufficient for this hardware.

It's a stretch to claim that the 8087 FPU has 'unsound' floating point behavior;

That's not the claim (so if it sounds like that is our claim we should fix that). The claim is that the behavior of LLVM's optimizer and backend in combination with that of the x87 FPU is unsound.

Having hardware-dependent behavior in an operation that is specified to be portable and deterministic would also still be unsound, but could conceivably be fixed by changing the spec. But an internally inconsistent backend is wrong for every reasonable spec.

(Whether we'd really want such a target-specific spec is a different question. Rust code is pretty portable by default. Do we really want to require all Rust code to have to deal with non-standard precision float arithmetic? Or should code have some explicit opt-in / opt-out for niche targets with non-standard behavior? For this thread the question is rather moot as reliably implementing consistent non-standard behavior would at best be a lot of work [probably involving using LLVM's strictfp operations everywhere, but I am not sure if even that would be enough] with little pay-off. That work is IMO better spent implementing the standard semantics on x87. A very similar question comes up with CHERI hardware; in that case, the intended approach for now is to experiment with a tier 3 target. In the future we might need crates to be able to indicate whether they support such targets.)

I'd encourage someone to figure out how to tell applications that the underlying hardware doesn't work the way they expect. Mirroring some version of FLT_EVAL_METHOD from C would at least be easy and sufficient for this hardware.

As @beetrees explained, it's not just that the underlying hardware works differently and that bleeds into language semantics. It's that Rust's primary backend, LLVM, assumes the underlying hardware to work the standard way -- the examples @beetrees referenced demonstrate that there is no reliable way to program against this hardware in any compiler that uses LLVM as its backend. (At least not if the compiler uses the standard LLVM float types and operations.)

To my knowledge, nobody on the LLVM side really cares about this. So until that changes it is unlikely that we'll be able to improve things in Rust here. Telling programmers "on this hardware your program may randomly explode for no fault of your own" is not very useful. (I mean, it is a useful errata of course, but it doesn't make sense to make this the spec.)

Even if you kludge around the sNaN->qNaN adventures, you're still going to suffer from double rounding as every computation is first rounded to 80 bits and then to 32 or 64.

For f32, double-rounding does not affect results. For f64, it's possible to get around that using the approach Java used, IIUC. It does cost some performance but it avoids the semantic inconsistencies.

We could also say that Rust code expects the FPU to be set to 64bit precision on such targets. That seems a lot easier...

There are legitimate use-cases for running with the FPU in flush-to-zero + denormals-are-zero mode, even on mainstream targets like x86-64. Processing denormals outside of FTZ+DAZ mode requires a microcode assist that costs ~100 cycles IIRC. In this application, denormal numbers correspond to sounds that are (IIUC) below the noise floor, so flushing them to zero is harmless. Failing to meet deadlines because of the microcode assist is a bug.

That’s separate but related to this issue.

Muon commented

For f32, double-rounding does not affect results.

This is only true for some operations, and then only if a single step is performed. It is not true for addition/subtraction, and it is not true even for other operations if multiple operations are performed at 80-bit precision.

There are legitimate use-cases for running with the FPU in flush-to-zero + denormals-are-zero mode, even on mainstream targets like x86-64.

Yes, but LLVM's support for anything except standard IEEE floating point arithmetic is broken or absent, so Rust can't hope to support this until LLVM does.

It is not true for addition/subtraction

For f32, a single a + b always gives the same result as (a as f80 + b as f80) as f32 per this paper.
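
Not a proof, but a quick empirical sketch of that claim. (Hedged: Rust has no f80 type, so this substitutes an f64 intermediate; if I recall the paper's condition correctly, any format with at least 2*24+2 = 50 significand bits makes double rounding of a single f32 addition innocuous, and both f64 and the x87's 80-bit format qualify.)

fn main() {
    // Tiny xorshift PRNG so the sketch needs no external crates.
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    let mut next = || {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        (state >> 32) as u32
    };
    for _ in 0..1_000_000 {
        let a = f32::from_bits(next());
        let b = f32::from_bits(next());
        let direct = a + b; // rounded once, straight to f32
        let via_wide = (a as f64 + b as f64) as f32; // rounded to f64, then to f32
        assert!(direct == via_wide || (direct.is_nan() && via_wide.is_nan()));
    }
    println!("no double-rounding mismatches found");
}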

There are legitimate use-cases for running with the FPU in flush-to-zero + denormals-are-zero mode, even on mainstream targets like x86-64.

Flushing denormals isn't even permitted by the IEEE standard. So while I am not doubting that that is something people may want to do, supporting such non-standards-compliant hardware requires inline assembly or new language features (and, depending on the constraints, work on the LLVM side). This is related to supporting floating point exception flags and non-default rounding modes, which is part of the standard but not supported in Rust. Please take that to another thread.

and it is not true even for other operations if multiple operations are performed at 80-bit precision.

Indeed the rounding down to f32 has to happen after each operation.

WG-prioritization assigning priority (see Zulip discussion for the related issue #129880).

@rustbot label -I-prioritize +P-medium