rust-lang/rust

floating point to integer casts can cause undefined behaviour

thestinger opened this issue · 234 comments

Status as of 2020-04-18

We intend to stabilize the saturating-float-casts behavior for as, and have stabilized unsafe library functions that handle the previous behavior. See #71269 for the latest discussion on that stabilization process.

Status as of 2018-11-05

A flag has been implemented in the compiler, -Zsaturating-float-casts, which causes all float-to-integer casts to have "saturating" behavior: if the value is out of bounds, it is clamped to the nearest bound. A call for benchmarking of this change went out a while ago. The results, while positive in many projects, are quite negative for some projects, which indicates that we're not done here.
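For illustration, the saturating semantics in a nutshell (NaN additionally maps to zero); these assertions hold under -Zsaturating-float-casts and match the behavior that was eventually stabilized for as:

assert_eq!(f32::MAX as u8, 255);            // clamped to u8::MAX
assert_eq!(-1.0f32 as u8, 0);               // clamped to u8::MIN
assert_eq!(f64::INFINITY as i32, i32::MAX); // infinities clamp to the bounds
assert_eq!(f32::NAN as i32, 0);             // NaN maps to zero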

The next steps are figuring out how to recover performance for these cases:

  • One option is to take today's as cast behavior (which is UB in some cases) and add unsafe functions for the relevant types and such.
  • Another is to wait for LLVM to add a freeze concept which means that we get a garbage bit pattern, but it's at least not UB
  • Another is to implement casts via inline assembly in LLVM IR, as the current codegen is not heavily optimized.

Old status

UPDATE (by @nikomatsakis): After much discussion, we've got the rudiments of a plan for how to address this problem. But we need some help with actually investigating the performance impact and working out the final details!


ORIGINAL ISSUE FOLLOWS:

If the value cannot fit in ty2, the results are undefined.

1.04E+17 as u8

Nominating

accepted for P-high, same reasoning as #10183

I don't think this is backwards incompatible at a language level. It will not cause code that was working OK to stop working. Nominating.

changing to P-high, same reasoning as #10183

nrc commented

How do we propose to solve this and #10185? Since whether behaviour is defined or not depends on the dynamic value of the number being cast, it seems the only solution is to insert dynamic checks. We seem to agree we do not want to do that for arithmetic overflow, are we happy to do it for cast overflow?

We could add an intrinsic to LLVM that performs a "safe conversion". @zwarich may have other ideas.

AFAIK the only solution at the moment is to use the target-specific intrinsics. That's what JavaScriptCore does, at least according to someone I asked.

Oh, that's easy enough then.

nrc commented

ping @pnkfelix is this covered by the new overflow checking stuff?

bluss commented

These casts are not checked by rustc with debug assertions.

Aatch commented

I'm happy to handle this, but I need a concrete solution. I personally think that it should be checked along with overflowing integer arithmetic, as it's a very similar issue. I don't really mind what we do though.

Note that this issue is currently causing an ICE when used in certain constant expressions.

bluss commented

This allows violating memory safety in safe Rust, example from this forum post:

Undefs, huh? Undefs are fun. They tend to propagate. After a few minutes of wrangling...

#[inline(never)]
pub fn f(ary: &[u8; 5]) -> &[u8] {
    // 1e100 is far out of range for usize, so this cast yields undef,
    // which lets LLVM elide the bounds check on the slice below.
    let idx = 1e100f64 as usize;
    &ary[idx..]
}

fn main() {
    // The wildly out-of-bounds index is never caught, so this reads
    // arbitrary memory.
    println!("{}", f(&[1; 5])[0xdeadbeef]);
}

segfaults on my system (latest nightly) with -O.

Marking with I-unsound given the violation of memory safety in safe Rust.

@bluss, this does not segfault for me, it just gives an assertion error. Untagging since I was the one who added it.

Sigh, I forgot the -O, re-tagging.

re-nominating for P-high. Apparently this was at some point P-high but got lower over time. This seems pretty important for correctness.

EDIT: didn't react to triage comment, adding label manually.

It seems like the precedent from the overflow stuff (e.g. for shifting) is to just settle on some behavior. Java seems to produce the result modulo the range, which seems not unreasonable; I'm not sure just what kind of LLVM code we'd need to handle that.

According to https://docs.oracle.com/javase/specs/jls/se7/html/jls-5.html#jls-5.1.3 Java also guarantees that NaN values are mapped to 0 and infinities to the minimum/maximum representable integer. Moreover the Java rule for the conversion is more complex than just wrapping, it can be a combination of saturation (for the conversion to int or long) and wrapping (for the conversion to smaller integral types, if needed). Replicating the whole conversion algorithm from Java is certainly possible, but it would require a fair amount of operations for every cast. In particular, in order to ensure that the result of a fpto[us]i operation in LLVM does not exhibit undefined behaviour, a range check would be needed.
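A rough sketch of those Java rules in Rust, for concreteness (the function name is illustrative, not a proposed API):

// NaN -> 0; saturate when converting to the wide integer type; then
// wrap (truncate) when narrowing, as Java does for byte/short.
fn java_style_f64_to_i8(v: f64) -> i8 {
    let wide: i64 = if v.is_nan() {
        0
    } else if v >= i64::MAX as f64 {
        i64::MAX
    } else if v <= i64::MIN as f64 {
        i64::MIN
    } else {
        v as i64 // in range: truncates toward zero
    };
    wide as i8 // integer narrowing in Rust wraps modulo 2^8
}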

As an alternative, I would suggest that float->int casts are guaranteed to only be valid if the truncation of the original value can be represented as a value of the destination type (or maybe as [iu]size?) and to have assertions on debug builds that trigger a panic when the value has not been represented faithfully.

The main advantage of the Java approach is that the conversion function is total, but this also means that unexpected behaviour might creep in: it would prevent undefined behaviour, but it would be easy to be tricked into not checking whether the cast actually made any sense (this is unfortunately also true for the other as casts 😟).

The other approach matches the one currently used for arithmetic operations: simple & efficient implementation in release, panics triggered by range checking in debug. Unfortunately unlike other as casts, this would make such conversion checked, which can be surprising to the user (although maybe the analogy to arithmetic operations can help here). This would also break some code, but AFAICT it should only happen for code which is currently relying on undefined behaviour (i.e. it would replace the undefined behaviour "let's return any integer, you obviously don't care which" with a panic).

The problem isn't "let's return any integer, you obviously don't care which"; it is that it causes an undef, which isn't a random value but rather a nasal-demon value, and LLVM is allowed to assume the undef never occurs, enabling optimizations that do horrible, incorrect things. If it was a random value, but crucially not undef, then that would be enough to fix the soundness issues. We don't need to define how unrepresentable values are represented, we just need to prevent undef.

Discussed in @rust-lang/compiler meeting. The most consistent course of action remains:

  1. when overflow checks are enabled, check for illegal casts and panic.
  2. otherwise, we need a fallback behavior, it should be something which has minimal (ideally, zero) runtime cost for valid values, but the precise behavior is not that important, so long as it is not LLVM undef.

The main problem is that we need a concrete suggestion for option 2.

triage: P-medium

@nikomatsakis Does as ever currently panic in debug builds? If it doesn't, for consistency and predictability it seems preferable to keep it that way. (I think it should have, just like arithmetic, but that's a separate, and past, debate.)

otherwise, we need a fallback behavior, it should be something which has minimal (ideally, zero) runtime cost for valid values, but the precise behavior is not that important, so long as it is not LLVM undef.

Concrete suggestion: extract digits and exponent as u64 and bitshift digits by exponent.

fn f64_as_u64(f: f64) -> u64 {
    // integer_decode() (from the then-unstable Float trait) splits a
    // float into (mantissa, exponent, sign), with value = mantissa * 2^exponent.
    let (mantissa, exponent, _sign) = f.integer_decode();
    // Reverse the exponent's scaling with a right shift (see the
    // explanation further down in the thread).
    mantissa >> ((-exponent) & 63)
}

Yes it's not zero cost, but it's somewhat optimizable (would be better if we marked integer_decode inline) and at least deterministic. A future MIR-pass that expands a float->int cast could probably analyze whether the float is guaranteed to be ok to cast and skip this heavy conversion.

eddyb commented

Does LLVM not have platform intrinsics for the conversion functions?

EDIT: @zwarich said (a long time ago):

AFAIK the only solution at the moment is to use the target-specific intrinsics. That's what JavaScriptCore does, at least according to someone I asked.

Why even bother panicking? AFAIK, @glaebhoerl is correct, as is supposed to truncate/extend, not check the operands.

On Sat, Mar 05, 2016 at 03:47:55AM -0800, Gábor Lehel wrote:

@nikomatsakis Does as ever currently panic in debug builds? If it doesn't, for consistency and predictability it seems preferable to keep it that way. (I think it should have, just like arithmetic, but that's a separate, and past, debate.)

True. I find that persuasive.

On Wed, Mar 09, 2016 at 02:31:05AM -0800, Eduard-Mihai Burtescu wrote:

Does LLVM not have platform intrinsics for the conversion functions?

EDIT:

AFAIK the only solution at the moment is to use the target-specific intrinsics. That's what JavaScriptCore does, at least according to someone I asked.

Why even bother panicking? AFAIK, @glaebhoerl is correct, as is supposed to truncate/extend, not check the operands.

Yes, I think I was mistaken before. as is the "unchecked truncation"
operator, for better or worse, and it seems best to stay consistent
with that philosophy. Using target-specific intrinsics may be a perfectly
fine solution though?

@nikomatsakis: it seems the behavior hasn't been defined yet? Can you give an update about the planning regarding that?

Just ran into this with much smaller numbers

    let x: f64 = -1.0;
    x as u8

Results in 0, 16, etc., depending on optimizations. I was hoping it would be defined as 255 so I don't have to write x as i16 as u8.

vks commented

@gmorenz Did you try !0u8?

In context that wouldn't make sense, I was getting the f64 from a transformation on data sent over the network, with a range of [-255, 255]. I was hoping it would wrap nicely (in the exact way that <i32> as u8 wraps).

Here's a recent LLVM proposal to "kill undef" http://lists.llvm.org/pipermail/llvm-dev/2016-October/106182.html , though I'm hardly knowledgeable enough to know whether or not this would automagically resolve this issue.

They're replacing undef with poison; the semantics are slightly different. It's not going to make float -> int casts defined behavior.

We probably should provide some explicit way to do a saturating cast? I've wanted that exact behaviour just now.

Seems like this should be marked I-crash, given #10184 (comment) .

We had a question about this in #rust-beginners today, someone ran into it in the wild.

The book I'm writing with @jimblandy, Programming Rust, mentions this bug.

Several kinds of casts are permitted.

  • Numbers may be cast from any of the built-in numeric types to any other.

    (...)

    However, as of this writing, casting a large floating-point value to an integer type that is too small to represent it can lead to undefined behavior. This can cause crashes even in safe Rust. It is a bug in the compiler, github.com/rust-lang/rust/issues/10184.

Our deadline for this chapter is May 19. I'd love to delete that last paragraph, but I feel like we should at least have some kind of plan here first.

Apparently current JavaScriptCore uses an interesting hack on x86. They use the CVTTSD2SI instruction, then fall back on some hairy C++ if the value is out of range. Since out-of-range values currently explode, using that instruction (with no fallback!) would be an improvement on what we have now, albeit only for one architecture.

vks commented

Honestly I think we should deprecate numeric casts with as and use From and TryFrom or something like the conv crate instead.

Maybe so, but that seems orthogonal to me.

OK, I've just re-read this whole conversation. I think there is agreement that this operation should not panic (for general consistency with as). There are two leading contenders for what the behavior ought to be:

  • Some sort of defined result
    • Pro: This I think is maximally consistent with our general philosophy thus far.
    • Con: There doesn't seem to be a truly portable way to produce any particular defined result in this case. This implies that we would be using platform-specific intrinsics with some sort of fallback for out-of-range values (for example, falling back to saturation, to the function that @oli-obk proposed, to Java's definition, or to whatever "hairy C++" JSC uses).
    • At worst, we can just insert some ifs for the "out of range" cases.
  • An undefined value (not undefined behavior)
    • Pro: This lets us just use the platform-specific intrinsics that are available on each platform.
    • Con: It's a portability hazard. In general, I feel like we've not made use of undefined results very often, at least in the language (I'm sure we do in the libs in various places).

It's not clear to me if there is a clear precedent for what the result ought to be in the first case?

After having written that out, my preference would be to maintain a deterministic result. I feel like every place that we can hold the line on determinism is a win. I am not really sure what the result ought to be though.

I like saturation because I can understand it and it seems useful, but it seems somehow incongruent with the way that u64 as u32 does truncation. So perhaps some sort of result based on truncation makes sense, which I guess is probably what @oli-obk proposed -- I don't fully understand what that code is intended to do. =)

My code gives the correct value for things in the range 0..2^64 and deterministic but bogus values for everything else.

floats are represented as mantissa × 2^exponent, e.g. 1.0 is (1 << 52) × 2^-52, and since a power-of-two exponent is just a bitshift in binary, we can reverse the scaling with a shift (thus the negation of the exponent and the right shift).

+1 for determinism.

I see two semantics that make good sense for humans, and I think we should pick whichever one is faster for values that are in range, when the compiler can't optimize away any of the computation. (When the compiler knows that a value is in range, both options give the same results, so they are equally optimize-able.)

  • Saturation (out-of-range values become IntType::max_value()/min_value())
  • Modulo (out-of-range values are treated as if by converting to a bigint first, then truncating)

The table below is meant to specify both options fully. T is any machine integer type. Tmin and Tmax are T::min_value() and T::max_value(). RTZ(v) means take the mathematical value of v and Round Toward Zero to get a mathematical integer.

| v | v as T (saturation) | v as T (modulo) |
| --- | --- | --- |
| in range (Tmin <= v <= Tmax) | RTZ(v) | RTZ(v) |
| negative zero | 0 | 0 |
| NaN | 0 | 0 |
| Infinity | Tmax | 0 |
| -Infinity | Tmin | 0 |
| v > Tmax | Tmax | RTZ(v) truncated to fit T |
| v < Tmin | Tmin | RTZ(v) truncated to fit T |

The ECMAScript standard specifies operations ToInt32, ToUint32, ToInt16, ToUint16, ToInt8, ToUint8, and my intent with the "modulo" option above is to match those operations in every case.

ECMAScript also specifies ToInt8Clamp which does not match either case above: it does "round half to even" rounding on fractional values rather than "round to zero".

@oli-obk's suggestion is a third way, worth considering if it's faster to compute, for values that are in range.
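For concreteness, a direct reading of the saturation column might look like this in Rust (a sketch, not a proposed implementation; note that i32::MIN and i32::MAX are both exactly representable in f64, so the comparisons below are exact):

fn saturating_f64_to_i32(v: f64) -> i32 {
    if v.is_nan() {
        0 // NaN -> 0
    } else if v >= i32::MAX as f64 {
        i32::MAX // v > Tmax, including +Infinity
    } else if v <= i32::MIN as f64 {
        i32::MIN // v < Tmin, including -Infinity
    } else {
        v as i32 // in range: RTZ(v)
    }
}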

@oli-obk What about signed integer types?

est31 commented

Related #41799

Throwing another proposal into the mix: Mark u128 casts to floats as unsafe and force folks to explicitly choose a way of handling it. u128 is pretty rare currently.

@Manishearth I'd hope for similar semantics for integer → float as for float → integer. Since both are UB-ful, and we cannot make float → integer unsafe anymore, we should probably avoid making integer → float unsafe as well.

For float → integer, saturating will be faster AFAICT (resulting in a sequence of and, test+jump float comparison and jump, all for 2-3 cycles on modern arches). I personally couldn't care less what exact behaviour we decide on, as long as the in-range values are as fast as they possibly could be.

CryZe commented

Wouldn't it make sense to make it behave like overflow? So in a debug build it would panic if you do a cast with undefined behaviour. Then you could have methods for specifying the casting behaviour like 1.04E+17.saturating_cast::<u8>(), unsafe { 1.04E+17.unsafe_cast::<u8>() } and potentially others.

Oh, I thought the issue was only for u128, and we can make that unsafe both ways.

@CryZe UB should not exist even in release mode in safe code. The overflow stuff is still defined behavior.

That said, panic on debug, and on release would be great.

This affects:

  • f32 -> u8, u16, u32, u64, u128, usize (-1f32 as _ for all, f32::MAX as _ for all but u128)
  • f32 -> i8, i16, i32, i64, i128, isize (f32::MAX as _ for all)
  • f64 -> all ints (f64::MAX as _ for all)

f32::INFINITY as u128 is also UB

@CryZe

Wouldn't it make sense to make it behave like overflow? So in a debug build it would panic if you do a cast with undefined behaviour.

This is what I thought initially, but I was reminded that as conversions never panic at present (we don't do overflow checking with as, for better or worse). So the most analogous thing is for it to "do something defined".

comex commented

FWIW, the "kill undef" thing would, in fact, provide a way to fix the memory unsafety, but leaving the result nondeterministic. One of the key components is:

  1. Create a new instruction, '%y = freeze %x', that stops propagation of
    poison. If the input is poison, then it returns an arbitrary, but fixed,
    value. (like old undef, but each use gets the same value), otherwise it
    just returns its input value.

The reason undefs can be used to violate memory safety today is that they can magically change values between uses: in particular, between a bounds check and subsequent pointer arithmetic. If rustc added a freeze after every dangerous cast, you would just get an unknown but otherwise well-behaved value. Performance-wise, freeze is basically free here, since of course the machine instruction corresponding to the cast produces a single value, not a fluctuating one; even if the optimizer feels like duplicating the cast instruction for some reason, it should be safe to do so because the result for out-of-range inputs is usually deterministic on a given architecture.

...But not deterministic across architectures, if anyone was wondering. x86 returns 0x80000000 for all bad inputs; ARM saturates for out-of-range inputs and (if I'm reading this pseudocode right) returns 0 for NaN. So if the goal is to produce a deterministic and platform-independent result, it's not enough to just use the platform's fp-to-int intrinsic; at least on ARM you also need to check the status register for an exception. This may have some overhead in itself, and certainly prevents autovectorization in the unlikely event that using the intrinsic didn't already. Alternately, I guess you could explicitly test for in-range values using regular compare operations and then use a regular float-to-int. That sounds a lot nicer on the optimizer...

as conversions never panic at present

At some point we changed + to panic (in debug mode). I wouldn't be shocked to see as panic in cases that were previously UB.

If we care about checking (which we should), then we should either deprecate as (is there any use case where it's the only good option?) or at least advise against using it, and move people onto things like TryFrom and TryInto instead, which is what we said we were planning to do back when it was decided to leave as as-is. I don't feel that the cases under discussion are qualitatively different, in the abstract, from the cases where as is already defined not to do any checks. The difference is just that in practice the implementation for these cases is currently incomplete and has UB. A world where you can't rely on as doing checks (because for most types, it doesn't), and you can't rely on it not panicking (because for some types, it would), and it's not consistent, and we still haven't deprecated it seems like the worst of all of them to me.

So, I think at this point @jorendorff basically enumerated what seems to me to be the best plan:

  • as will have some deterministic behavior;
  • we'll pick a behavior based on a combination of how sensible it is, and how efficient it is

He enumerated three possibilities. I think the remaining work is to investigate those possibilities -- or at least investigate one of them. That is, actually implement it, and try to get some feeling for how "slow" or "fast" it is.

Is there anyone out there who feels motivated to take a stab at that? I'm going to tag this as E-help-wanted in hopes of attracting a person. (@oli-obk?)

Uh, I'd rather not pay the price for cross-platform consistency :/ It's garbage in, I don't care what garbage comes out (however, a debug assertion would be super helpful).

Currently all rounding/truncating functions in Rust are very slow (function calls with painstakingly precise implementations), so as is my last-resort hack for fast float rounding.

If you're going to make as anything more than bare cvttss2si, please also add a stable alternative that is just that.

est31 commented

@pornel this isn't just UB of the theoretical kind where stuff is okay if you ignore that it is UB; it has real-world implications. I've extracted #41799 from a real-world code example.

@est31 I agree that leaving it as UB is wrong, but I've seen freeze proposed as a solution to UB. AFAIK that makes it a defined deterministic value, you just don't get to say which. That behavior is fine with me.

So I'd be fine if e.g. u128::MAX as f32 deterministically produced 17.5 on x86, and 999.0 on x86-64, and -555 on ARM.

freeze would not produce a defined, deterministic, unspecified value. Its result is still "any bit pattern the compiler likes", and it is consistent only across uses of the same operation. This may sidestep the UB-producing examples people have collected above, but it would not give this:

u128::MAX as f32 deterministically produced 17.5 on x86, and 999.0 on x86-64, and -555 on ARM.

For example, if LLVM notices that u128::MAX as f32 overflows and replaces it with freeze poison, a valid lowering of fn foo() -> f32 { u128::MAX as f32 } on x86_64 might be this:

foo:
  ret

(that is, just return whatever was last stored in the return register)

I see. That's still acceptable for my uses (for cases where I expect out of range values, I do clamping beforehand. Where I expect values in range, but they aren't, then I'm not going to get a correct result no matter what).

I have no problem with out of range float casts returning arbitrary values as long as the values are frozen so they can't cause further undefined behavior.

Is something like freeze available on LLVM? I thought that was a purely theoretical construct.

eddyb commented

@nikomatsakis I've never seen it used like that (unlike poison) - it's a planned revamp of poison/undef.

freeze does not exist at all in LLVM today. It's only been proposed (this PLDI paper is a self-contained version, but it's also been discussed a lot on the mailing list). The proposal seems to have considerable buy-in, but of course that is no guarantee it will be adopted, much less adopted in a timely manner. (Removing the pointee types from pointer types has been accepted for years and it's still not done.)

Do we want to open up an RFC to get more widespread discussion on the changes being proposed here? IMO, anything that potentially impacts the performance of as is going to be contentious, but it will be doubly contentious if we don't give people the chance to make their voice heard.

I'm a Julia developer, and I've been following this issue for a while, as we share the same LLVM backend and so have similar issues. In case it's of interest, here is what we have settled on (with approximate timings for a single function on my machine):

  • unsafe_trunc(Int64, x) maps directly to the corresponding LLVM intrinsic fptosi (1.5 ns)
  • trunc(Int64, x) throws an exception for out of range values (3 ns)
  • convert(Int64, x) throws an exception for out of range or non-integer values (6 ns)

Also, I have asked on the mailing list about making the undefined behaviour a little more defined, but did not receive a very promising response.

@bstrie I'm fine with an RFC, but I think it'd definitely be useful to have data! @simonbyrne's comment is super helpful in that regard, however.

I've toyed around with JS semantics (the modulo @jorendorff mentioned) and the Java semantics which appear to be the "saturation" column. In case those links expire it's JS and Java.

I also whipped up a quick implementation of saturation in Rust which I think (?) is correct. And got some benchmark numbers as well. Interestingly I'm seeing that the saturating implementation is 2-3x slower than the intrinsic, which is different from what @simonbyrne found with only 2x slower.

I'm not entirely sure how to implement the "mod" semantics in Rust...
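For reference, one way to sketch the "modulo" column for f64 -> u32 (using f64::rem_euclid, which did not exist at the time of this discussion; NaN and infinities map to 0 as in the table above):

fn f64_to_u32_mod(v: f64) -> u32 {
    if !v.is_finite() {
        return 0; // NaN and infinities -> 0
    }
    let t = v.trunc(); // round toward zero
    // Float remainder is computed exactly, and the result lies in
    // [0, 2^32), so the final cast is always in range.
    t.rem_euclid(4_294_967_296.0) as u32 // reduce mod 2^32
}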

To me, though, it seems clear that we'll need a slew of f32::as_u32_unchecked() methods and such for those who need the performance.

eddyb commented

it seems clear that we'll need a slew of f32::as_u32_unchecked() methods and such for those who need the performance.

That's a bummer - or do you mean a safe but implementation-defined variant?

Is there no option for an implementation defined fast default?

@eddyb I was thinking that we'd just have unsafe fn as_u32_unchecked(self) -> u32 on f32 and such which is a direct analog to what as is today.

I'm certainly not going to claim that the Rust implementation I wrote is optimal, but I was under the impression that when reading this thread determinism and safety were more important than speed in this context most of the time. The unsafe escape hatch is for those on the other side of the fence.

eddyb commented

So there's no cheap platform-dependent variant? I want something that's fast, gives an unspecified value when outside of bounds and is safe. I don't want UB for some inputs and I think that's too dangerous for common use, if we can do better.

As far as I am aware, on most if not all platforms the canonical way to implement this conversion does something to out-of-range inputs that isn't UB. But LLVM does not seem to have any way to pick that option (whatever it may be) over UB. If we could convince LLVM developers to introduce an intrinsic that yields an "unspecified but not undef/poison" result on out-of-range inputs, we could use that.

But I'd estimate that someone in this thread would have to write a convincing RFC (on the llvm-dev list), get buy-in, and implement it (in backends we care about, and with a fall-back implementation for other targets). Probably easier than convincing llvm-dev to make the existing casts not-UB (because it side-steps questions like "will this make any C and C++ programs slower"), but still not very easy.

Just in case you will be choosing between these:

Saturation (out-of-range values become IntType::max_value()/min_value())
Modulo (out-of-range values are treated as if by converting to a bigint first, then truncating)

IMO only saturation would make sense here, because absolute precision of floating point quickly drops as values get large, so at some point the modulo would be something useless like all zeroes.

I marked this as E-needs-mentor and tagged it with WG-compiler-middle, since it seems the impl period might be a great time to investigate this further! My existing notes on the plan of record are pretty sparse though, so it'd be great if someone from @rust-lang/compiler wanted to help elaborate those a bit further!

@nikomatsakis

IIRC LLVM are planning to eventually implement freeze, which should allow us to deal with the UB by doing a freeze.

My results so far: https://gist.github.com/s3bk/4bdfbe2acca30fcf587006ebb4811744

The _array variants run a loop of 1024 values.
_cast: x as i32
_clip: x.min(MAX).max(MIN) as i32
_panic: panics if x is out of bounds
_zero: sets the result to zero if out of bounds

test bench_array_cast       ... bench:       1,840 ns/iter (+/- 37)
test bench_array_cast_clip  ... bench:       2,657 ns/iter (+/- 13)
test bench_array_cast_panic ... bench:       2,397 ns/iter (+/- 20)
test bench_array_cast_zero  ... bench:       2,671 ns/iter (+/- 19)
test bench_cast             ... bench:           2 ns/iter (+/- 0)
test bench_cast_clip        ... bench:           2 ns/iter (+/- 0)
test bench_cast_panic       ... bench:           2 ns/iter (+/- 0)
test bench_cast_zero        ... bench:           2 ns/iter (+/- 0)

Perhaps the results should not be rounded to whole nanoseconds for individual operations. Clearly there must be some difference behind these 2 ns/iter. Or is it really like this, exactly 2 ns for all 4 variants?

eddyb commented

@sp-1234 I wonder if it's partially optimized out.

@sp-1234 It is too fast to measure. The non-array benchmarks are basically useless.
If you force the single-value functions to be functions via #[inline(never)], you get 2ns vs 3ns.

@arielb1
I have some reservations regarding freeze. If I understand correctly, a frozen undef can still contain any arbitrary value, it just won't change between uses. In practice, the compiler will probably reuse a register or stack slot.

However this means that we can now read uninitialized memory from safe code. This could lead to secret data being leaked, somewhat like Heartbleed. It's debatable whether this is truly considered UB from the point of view of Rust, but it clearly seems undesirable.

I ran @s3bk's benchmark locally. I can confirm the scalar versions are optimized out completely, and the asm for the array variants also looks suspiciously well-optimized: for example, the loops are vectorized, which is nice but makes it hard to extrapolate performance to scalar code.

Unfortunately, spamming black_box doesn't seem to help. I do see the asm doing useful work, but running the benchmark still consistently gives 0ns for the scalar benchmarks (except cast_zero, which shows 1ns). I see that @alexcrichton performed the comparison 100 times in their benchmarks, so I adopted the same hack. I'm now seeing these numbers (source code):

test bench_cast             ... bench:          53 ns/iter (+/- 0)
test bench_cast_clip        ... bench:         164 ns/iter (+/- 1)
test bench_cast_panic       ... bench:         172 ns/iter (+/- 2)
test bench_cast_zero        ... bench:         100 ns/iter (+/- 0)

The array benchmarks vary too much for me to trust them. Well, truth be told, I'm skeptical of the test benchmarking infrastructure anyway, especially after seeing the above numbers compared to the flat 0ns I got previously. Furthermore, even just 100 iterations of black_box(x); (as a baseline) take 34ns, which makes it even harder to interpret those numbers reliably.

Two points worth noting:

  • Despite not handling NaN specifically (it returns -inf instead of 0?), the cast_clip implementation appears to be slower than @alexcrichton's saturating cast (note that their run and mine have roughly the same timing for as casts, 53-54ns).
  • Unlike @s3bk's array results, I'm seeing cast_panic being slower than the other checked casts. I also see an even greater slowdown on the array benchmarks. Maybe these things are just highly dependent on microarchitectural details and/or optimizer mood?

For the record, I've measured with rustc 1.21.0-nightly (d692a91 2017-08-04), -C opt-level=3, on an i7-6700K under light load.


In conclusion, we don't have reliable data so far, and getting more reliable data seems hard. Furthermore, I strongly doubt that any real application spends even 1% of its wall-clock time on this operation. Therefore, I would suggest moving forward by implementing saturating as casts in rustc, behind a -Z flag, and then running some non-artificial benchmarks with and without this flag to determine the impact on realistic applications.

Edit: I would also recommend to run such benchmarks on a variety of architectures (e.g., including ARM) and microarchitectures, if at all possible.

I admit I'm not that familiar with Rust, but I think this line is subtly incorrect: std::i32::MAX (2^31-1) is not exactly representable as a Float32, so std::i32::MAX as f32 will be rounded to the nearest representable value (2^31). If this value is used as the argument x, the result is technically undefined. Replacing it with a strict inequality should fix this case.

Yeah, we had exactly that problem in Servo before. The final solution was to cast to f64 and then clamp.

There are other solutions but they're pretty tricky and rust doesn't expose nice APIs for dealing with this well.

Using 0x7FFF_FF80i32 as the upper limit and -0x8000_0000i32 as the lower limit should solve this without casting to f64.
edit: use the correct value.

I think you mean 0x7fff_ff80, but simply using a strict inequality would probably make the intention of the code clearer.

as in x < 0x8000_0000u32 as f32 ? That would probably be a good idea.
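To make the boundary subtlety concrete (a sketch; i32::MIN == -2^31 is exactly representable in f32, while i32::MAX is not):

// i32::MAX == 2^31 - 1 rounds up to 2^31 when converted to f32:
assert_eq!(i32::MAX as f32, 2_147_483_648.0);
// so `x <= i32::MAX as f32` wrongly accepts x == 2^31.0. A correct
// range check uses a strict upper bound (NaN fails both comparisons):
fn fits_in_i32(x: f32) -> bool {
    x >= -2_147_483_648.0 && x < 2_147_483_648.0
}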

Of all the suggested deterministic options, I think clamping is generally the most useful one, because I think it's done often anyway. If the conversion were actually documented to be saturating, manual clamping would become unnecessary.

I am only a little worried about the suggested implementation, because it doesn't translate directly to machine instructions and relies heavily on branching. Branching makes the performance dependent on specific data patterns. In the test cases given above everything looks (comparatively) fast because the same branch is always taken and the processor has good branch prediction data from many previous loop iterations. The real world will probably not look like that. Additionally, branching hurts the compiler's ability to vectorize the code. I disagree with @rkruppe's opinion that the operation shouldn't also be tested in combination with vectorization. Vectorization is important in high-performance code, and being able to vectorize simple casts on common architectures should be a crucial requirement.

For the reasons given above, I played around with an alternative branchless, data-flow-oriented version of @alexcrichton's cast with saturation semantics and @simonbyrne's fix. I implemented it for u16, i16 and i32, since they all have to cover slightly different cases, which results in varying performance.
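The core idea, sketched here for f32 -> i16 (illustrative only, not the author's exact code; the NaN handling is the part discussed further down in the thread):

fn f32_to_i16_sat(x: f32) -> i16 {
    if x.is_nan() {
        return 0;
    }
    // f32::max/min compile to maxss/minss on x86, so the clamping
    // itself involves no data-dependent branches.
    let clamped = x.max(i16::MIN as f32).min(i16::MAX as f32);
    clamped as i16 // clamped is always in range, so truncation is safe
}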

The results:

test i16_bench_array_cast       ... bench:          99 ns/iter (+/- 2)
test i16_bench_array_cast_clip  ... bench:         197 ns/iter (+/- 3)
test i16_bench_array_cast_clip2 ... bench:         113 ns/iter (+/- 3)
test i16_bench_cast             ... bench:          76 ns/iter (+/- 1)
test i16_bench_cast_clip        ... bench:         218 ns/iter (+/- 25)
test i16_bench_cast_clip2       ... bench:         148 ns/iter (+/- 4)
test i16_bench_rng_cast         ... bench:       1,181 ns/iter (+/- 17)
test i16_bench_rng_cast_clip    ... bench:       1,952 ns/iter (+/- 27)
test i16_bench_rng_cast_clip2   ... bench:       1,287 ns/iter (+/- 19)

test i32_bench_array_cast       ... bench:         114 ns/iter (+/- 1)
test i32_bench_array_cast_clip  ... bench:         200 ns/iter (+/- 3)
test i32_bench_array_cast_clip2 ... bench:         128 ns/iter (+/- 3)
test i32_bench_cast             ... bench:          74 ns/iter (+/- 1)
test i32_bench_cast_clip        ... bench:         168 ns/iter (+/- 3)
test i32_bench_cast_clip2       ... bench:         189 ns/iter (+/- 3)
test i32_bench_rng_cast         ... bench:       1,184 ns/iter (+/- 13)
test i32_bench_rng_cast_clip    ... bench:       2,398 ns/iter (+/- 41)
test i32_bench_rng_cast_clip2   ... bench:       1,349 ns/iter (+/- 19)

test u16_bench_array_cast       ... bench:          99 ns/iter (+/- 1)
test u16_bench_array_cast_clip  ... bench:         136 ns/iter (+/- 3)
test u16_bench_array_cast_clip2 ... bench:         105 ns/iter (+/- 3)
test u16_bench_cast             ... bench:          76 ns/iter (+/- 2)
test u16_bench_cast_clip        ... bench:         184 ns/iter (+/- 7)
test u16_bench_cast_clip2       ... bench:         110 ns/iter (+/- 0)
test u16_bench_rng_cast         ... bench:       1,178 ns/iter (+/- 22)
test u16_bench_rng_cast_clip    ... bench:       1,336 ns/iter (+/- 26)
test u16_bench_rng_cast_clip2   ... bench:       1,207 ns/iter (+/- 21)

The test was run on an Intel Haswell i5-4570 CPU and Rust 1.22.0-nightly.
clip2 is the new branchless implementation. It agrees with clip on all 2^32 possible f32 input values.

For the rng benchmarks, random input values are used that hit different cases often. This uncovers the extreme performance cost (roughly 10 times the normal cost!!!) that occurs if branch prediction fails. I think it is very important to consider this. It's not the average real world performance either, but it's still a possible case and some applications will hit this. People will expect a f32 cast to have consistent performance.

Assembly comparison on x86: https://godbolt.org/g/AhdF71
The branchless version very nicely maps to the minss/maxss instructions.

Unfortunately I wasn't able to make godbolt generate ARM assembly from Rust, but here is an ARM comparison of the methods with Clang: https://godbolt.org/g/s7ronw
Without being able to test the code or knowing much about ARM: the code size seems smaller too, and LLVM mostly generates vmax/vmin, which looks promising. Maybe LLVM could eventually be taught to fold most of the code into a single instruction?

@ActuallyaDeviloper The asm and the benchmark results look very good! Furthermore, branchless code like yours is probably easier to generate in rustc than the nested conditionals of other solutions (for the record, I am assuming we want to generate inline IR instead of calling a lang item function). Thank you so much for writing this.

I have a question about u16_cast_clip2: it doesn't seem to handle NaN?! There is a comment talking about NaN, but I believe the function will pass NaN through unmodified and attempt to cast it to f32 (and even if it didn't, it would produce one of the boundary values rather than 0).

PS: To be clear, I was not trying to imply that it's unimportant whether the cast can be vectorized. It clearly is important if the surrounding code is otherwise vectorizable. But scalar performance is also important, as vectorization is often not applicable, and the benchmarks I was commenting on were not making any statement about scalar performance. Out of interest, have you checked the asm of the *array* benchmarks to see if they're still vectorized with your implementation?

@rkruppe You are right, I accidentally swapped the sides of the if and forgot about that. f32 as u16 happened to do the right thing by truncating the upper 0x8000 away, so the tests didn't catch it either. I fixed the problem now by swapping the branches again and testing all methods with if (y.is_nan()) { panic!("NaN"); } this time.

I updated my previous post. The x86 code did not change significantly at all, but unfortunately the change stops LLVM from generating vmax in the u16 ARM case for some reason. I assume this has to do with some details of that ARM instruction's NaN handling, or maybe it's an LLVM limitation.

For why it works, notice that the lower boundary value is actually 0 for unsigned values, so NaN and the lower bound can be caught at the same time.

The array versions are vectorized.
Godbolt: https://godbolt.org/g/HnmsSV

Re: the ARM asm, I believe the reason vmax is not used any more is that it returns NaN if either operand is NaN. The code is still branchless, though; it just uses predicated moves (vmovgt, referring to the result of the earlier vcmp with 0).

For why it works, notice that the lower boundary value is actually 0 for unsigned values. So NaN and the lower bound can be catched at the same time.

Ohhh, right. Nice.

I would suggest to move forward by implementing saturating as casts in rustc, behind a -Z flag

I have implemented this and will file a PR once I have also fixed #41799 and written a lot more tests.

#45134 has pointed out a code path that I missed (generation of LLVM constant expressions; this is separate from rustc's own constant evaluation). I'll roll a fix for that into the same PR, but it will take a little while longer.

eddyb commented

@rkruppe You should coordinate with @oli-obk so that miri ends up with the same changes.

Pull request is up: #45205

#45205 has been merged, so anyone can now (well, starting with the next nightly) measure the performance impact of saturation by passing -Z saturating-float-casts via RUSTFLAGS. [1] Such measurements would be very valuable for deciding how to proceed with this issue.

[1] Strictly speaking, this won't affect the non-generic, non-#[inline] portions of the standard library, so to be 100% accurate you'd want to locally build std with Xargo. However, I don't expect that there will be a lot of code affected by this (the various conversion trait impls are #[inline], for example).
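For example, assuming a nightly toolchain, benchmarks can be run with the flag via something like RUSTFLAGS="-Z saturating-float-casts" cargo bench.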

@rkruppe I suggest starting an internals/users page to collect data, in the same vein as https://internals.rust-lang.org/t/help-us-benchmark-incremental-compilation/6153/ (we can then also link people to that, rather than some random comments in our issue tracker)

est31 commented

@rkruppe you should create a tracking issue. This discussion is split up into two issues already. That's not good!

@gankro Yeah I agree, but it may be a few days before I find the time to write that post properly, so I figured I'd solicit feedback from the people subscribed to this issue in the mean time.

@est31 Hmm. Although the -Z flag covers both cast directions (which may have been a mistake, in retrospect), it seems unlikely that we'll flip the switch on both at the same time, and there's little overlap between the two in terms of what must be discussed (e.g., this issue hinges on the performance of saturation, while in #41799 it's agreed upon what the right solution is).
It is a bit silly that benchmarks primarily targeted at this issue would also measure the impact of the fix to #41799, but that can at most lead to overreporting of performance regressions, so I'm sort of okay with that. (But if anyone is motivated to split the -Z flag into two, go ahead.)

I've considered a tracking issue for the task of removing the flag once it has outlived its usefulness, but I don't see the need to merge the discussions occurring here and in #41799.

I have drafted up an internals post: https://gist.github.com/Gankro/feab9fb0c42881984caf93c7ad494ebd

Feel free to copy that, or just give me notes so I can post it. (note I'm a bit confused about the const fn behaviour)