rust-lang/rust

floating point to integer casts can cause undefined behaviour

thestinger opened this issue · 234 comments

Status as of 2020-04-18

We intend to stabilize the saturating-float-casts behavior for as, and have stabilized unsafe library functions that handle the previous behavior. See #71269 for the latest discussion on that stabilization process.

Status as of 2018-11-05

A flag has been implemented in the compiler, -Zsaturating-float-casts, which causes all float-to-integer casts to have "saturating" behavior: if the value is out of bounds, it is clamped to the nearest bound. A call for benchmarking of this change went out a while ago. The results, while positive in many projects, are quite negative for some projects, which indicates that we're not done here.
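For illustration, the saturating semantics in a nutshell (NaN additionally maps to zero); these assertions hold under -Zsaturating-float-casts and match the behavior that was eventually stabilized for as:

assert_eq!(f32::MAX as u8, 255);            // clamped to u8::MAX
assert_eq!(-1.0f32 as u8, 0);               // clamped to u8::MIN
assert_eq!(f64::INFINITY as i32, i32::MAX); // infinities clamp to the bounds
assert_eq!(f32::NAN as i32, 0);             // NaN maps to zero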

The next steps are figuring out how to recover performance for these cases:

  • One option is to take today's as cast behavior (which is UB in some cases) and add unsafe functions for the relevant types and such.
  • Another is to wait for LLVM to add a freeze concept which means that we get a garbage bit pattern, but it's at least not UB
  • Another is to implement casts via inline assembly in LLVM IR, as the current codegen is not heavily optimized.

Old status

UPDATE (by @nikomatsakis): After much discussion, we've got the rudiments of a plan for how to address this problem. But we need some help with actually investigating the performance impact and working out the final details!


ORIGINAL ISSUE FOLLOWS:

If the value cannot fit in ty2, the results are undefined.

1.04E+17 as u8

Nominating

accepted for P-high, same reasoning as #10183

I don't think this is backwards incompatible at a language level. It will not cause code that was working OK to stop working. Nominating.

changing to P-high, same reasoning as #10183

nrc commented

How do we propose to solve this and #10185? Since whether behaviour is defined or not depends on the dynamic value of the number being cast, it seems the only solution is to insert dynamic checks. We seem to agree we do not want to do that for arithmetic overflow, are we happy to do it for cast overflow?

We could add an intrinsic to LLVM that performs a "safe conversion". @zwarich may have other ideas.

AFAIK the only solution at the moment is to use the target-specific intrinsics. That's what JavaScriptCore does, at least according to someone I asked.

Oh, that's easy enough then.

nrc commented

ping @pnkfelix is this covered by the new overflow checking stuff?

bluss commented

These casts are not checked by rustc with debug assertions.

Aatch commented

I'm happy to handle this, but I need a concrete solution. I personally think that it should be checked along with overflowing integer arithmetic, as it's a very similar issue. I don't really mind what we do though.

Note that this issue is currently causing an ICE when used in certain constant expressions.

bluss commented

This allows violating memory safety in safe Rust, example from this forum post:

Undefs, huh? Undefs are fun. They tend to propagate. After a few minutes of wrangling...

#[inline(never)]
pub fn f(ary: &[u8; 5]) -> &[u8] {
    // 1e100 is far out of range for usize, so this cast yields undef,
    // which lets LLVM elide the bounds check on the slice below.
    let idx = 1e100f64 as usize;
    &ary[idx..]
}

fn main() {
    // The wildly out-of-bounds index is never caught, so this reads
    // arbitrary memory.
    println!("{}", f(&[1; 5])[0xdeadbeef]);
}

segfaults on my system (latest nightly) with -O.

Marking with I-unsound given the violation of memory safety in safe Rust.

@bluss, this does not segfault for me, it just gives an assertion error. Untagging since I was the one who added it.

Sigh, I forgot the -O, re-tagging.

re-nominating for P-high. Apparently this was at some point P-high but got lower over time. This seems pretty important for correctness.

EDIT: didn't react to triage comment, adding label manually.

It seems like the precedent from the overflow stuff (e.g. for shifting) is to just settle on some behavior. Java seems to produce the result modulo the range, which seems not unreasonable; I'm not sure just what kind of LLVM code we'd need to handle that.

According to https://docs.oracle.com/javase/specs/jls/se7/html/jls-5.html#jls-5.1.3 Java also guarantees that NaN values are mapped to 0 and infinities to the minimum/maximum representable integer. Moreover the Java rule for the conversion is more complex than just wrapping, it can be a combination of saturation (for the conversion to int or long) and wrapping (for the conversion to smaller integral types, if needed). Replicating the whole conversion algorithm from Java is certainly possible, but it would require a fair amount of operations for every cast. In particular, in order to ensure that the result of a fpto[us]i operation in LLVM does not exhibit undefined behaviour, a range check would be needed.
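A rough sketch of those Java rules in Rust, for concreteness (the function name is illustrative, not a proposed API):

// NaN -> 0; saturate when converting to the wide integer type; then
// wrap (truncate) when narrowing, as Java does for byte/short.
fn java_style_f64_to_i8(v: f64) -> i8 {
    let wide: i64 = if v.is_nan() {
        0
    } else if v >= i64::MAX as f64 {
        i64::MAX
    } else if v <= i64::MIN as f64 {
        i64::MIN
    } else {
        v as i64 // in range: truncates toward zero
    };
    wide as i8 // integer narrowing in Rust wraps modulo 2^8
}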

As an alternative, I would suggest that float->int casts are guaranteed to only be valid if the truncation of the original value can be represented as a value of the destination type (or maybe as [iu]size?) and to have assertions on debug builds that trigger a panic when the value has not been represented faithfully.

The main advantage of the Java approach is that the conversion function is total, but this also means that unexpected behaviour might creep in: it would prevent undefined behaviour, but it would be easy to be tricked into not checking whether the cast actually made any sense (this is unfortunately also true for the other as casts 😟).

The other approach matches the one currently used for arithmetic operations: simple & efficient implementation in release, panics triggered by range checking in debug. Unfortunately unlike other as casts, this would make such conversion checked, which can be surprising to the user (although maybe the analogy to arithmetic operations can help here). This would also break some code, but AFAICT it should only happen for code which is currently relying on undefined behaviour (i.e. it would replace the undefined behaviour "let's return any integer, you obviously don't care which" with a panic).

The problem isn't "let's return any integer, you obviously don't care which"; it is that it causes an undef, which isn't a random value but rather a nasal-demon value, and LLVM is allowed to assume the undef never occurs, enabling optimizations that do horrible, incorrect things. If it was a random value, but crucially not undef, then that would be enough to fix the soundness issues. We don't need to define how unrepresentable values are represented, we just need to prevent undef.

Discussed in @rust-lang/compiler meeting. The most consistent course of action remains:

  1. when overflow checks are enabled, check for illegal casts and panic.
  2. otherwise, we need a fallback behavior, it should be something which has minimal (ideally, zero) runtime cost for valid values, but the precise behavior is not that important, so long as it is not LLVM undef.

The main problem is that we need a concrete suggestion for option 2.

triage: P-medium

@nikomatsakis Does as ever currently panic in debug builds? If it doesn't, for consistency and predictability it seems preferable to keep it that way. (I think it should have, just like arithmetic, but that's a separate, and past, debate.)

otherwise, we need a fallback behavior, it should be something which has minimal (ideally, zero) runtime cost for valid values, but the precise behavior is not that important, so long as it is not LLVM undef.

Concrete suggestion: extract digits and exponent as u64 and bitshift digits by exponent.

fn f64_as_u64(f: f64) -> u64 {
    // integer_decode() (from the then-unstable Float trait) splits a
    // float into (mantissa, exponent, sign), with value = mantissa * 2^exponent.
    let (mantissa, exponent, _sign) = f.integer_decode();
    // Reverse the exponent's scaling with a right shift (see the
    // explanation further down in the thread).
    mantissa >> ((-exponent) & 63)
}

Yes it's not zero cost, but it's somewhat optimizable (would be better if we marked integer_decode inline) and at least deterministic. A future MIR-pass that expands a float->int cast could probably analyze whether the float is guaranteed to be ok to cast and skip this heavy conversion.

eddyb commented

Does LLVM not have platform intrinsics for the conversion functions?

EDIT: @zwarich said (a long time ago):

AFAIK the only solution at the moment is to use the target-specific intrinsics. That's what JavaScriptCore does, at least according to someone I asked.

Why even bother panicking? AFAIK, @glaebhoerl is correct, as is supposed to truncate/extend, not check the operands.

On Sat, Mar 05, 2016 at 03:47:55AM -0800, Gábor Lehel wrote:

@nikomatsakis Does as ever currently panic in debug builds? If it doesn't, for consistency and predictability it seems preferable to keep it that way. (I think it should have, just like arithmetic, but that's a separate, and past, debate.)

True. I find that persuasive.

On Wed, Mar 09, 2016 at 02:31:05AM -0800, Eduard-Mihai Burtescu wrote:

Does LLVM not have platform intrinsics for the conversion functions?

EDIT:

AFAIK the only solution at the moment is to use the target-specific intrinsics. That's what JavaScriptCore does, at least according to someone I asked.

Why even bother panicking? AFAIK, @glaebhoerl is correct, as is supposed to truncate/extend, not check the operands.

Yes, I think I was mistaken before. as is the "unchecked truncation"
operator, for better or worse, and it seems best to stay consistent
with that philosophy. Using target-specific intrinsics may be a perfectly
fine solution though?

@nikomatsakis: it seems the behavior hasn't been defined yet? Can you give an update about the planning regarding that?

Just ran into this with much smaller numbers

    let x: f64 = -1.0;
    x as u8

Results in 0, 16, etc., depending on optimizations. I was hoping it would be defined as 255 so I don't have to write x as i16 as u8.

vks commented

@gmorenz Did you try !0u8?

In context that wouldn't make sense, I was getting the f64 from a transformation on data sent over the network, with a range of [-255, 255]. I was hoping it would wrap nicely (in the exact way that <i32> as u8 wraps).

Here's a recent LLVM proposal to "kill undef" http://lists.llvm.org/pipermail/llvm-dev/2016-October/106182.html , though I'm hardly knowledgeable enough to know whether or not this would automagically resolve this issue.

They're replacing undef with poison; the semantics are slightly different. It's not going to make float -> int casts defined behavior.

We probably should provide some explicit way to do a saturating cast? I've wanted that exact behaviour just now.

Seems like this should be marked I-crash, given #10184 (comment) .

We had a question about this in #rust-beginners today, someone ran into it in the wild.

The book I'm writing with @jimblandy, Programming Rust, mentions this bug.

Several kinds of casts are permitted.

  • Numbers may be cast from any of the built-in numeric types to any other.

    (...)

    However, as of this writing, casting a large floating-point value to an integer type that is too small to represent it can lead to undefined behavior. This can cause crashes even in safe Rust. It is a bug in the compiler, github.com/rust-lang/rust/issues/10184.

Our deadline for this chapter is May 19. I'd love to delete that last paragraph, but I feel like we should at least have some kind of plan here first.

Apparently current JavaScriptCore uses an interesting hack on x86. They use the CVTTSD2SI instruction, then fall back on some hairy C++ if the value is out of range. Since out-of-range values currently explode, using that instruction (with no fallback!) would be an improvement on what we have now, albeit only for one architecture.

vks commented

Honestly I think we should deprecate numeric casts with as and use From and TryFrom or something like the conv crate instead.

Maybe so, but that seems orthogonal to me.

OK, I've just re-read this whole conversation. I think there is agreement that this operation should not panic (for general consistency with as). There are two leading contenders for what the behavior ought to be:

  • Some sort of defined result
    • Pro: This I think is maximally consistent with our general philosophy thus far.
    • Con: There doesn't seem to be a truly portable way to produce any particular defined result in this case. This implies that we would be using platform-specific intrinsics with some sort of fallback for out-of-range values (for example, falling back to saturation, to the function that @oli-obk proposed, to Java's definition, or to whatever "hairy C++" JSC uses).
    • At worst, we can just insert some ifs for the "out of range" cases.
  • An undefined value (not undefined behavior)
    • Pro: This lets us just use the platform-specific intrinsics that are available on each platform.
    • Con: It's a portability hazard. In general, I feel like we've not made use of undefined results very often, at least in the language (I'm sure we do in the libs in various places).

It's not clear to me if there is a clear precedent for what the result ought to be in the first case?

After having written that out, my preference would be to maintain a deterministic result. I feel like every place that we can hold the line on determinism is a win. I am not really sure what the result ought to be though.

I like saturation because I can understand it and it seems useful, but it seems somehow incongruent with the way that u64 as u32 does truncation. So perhaps some sort of result based on truncation makes sense, which I guess is probably what @oli-obk proposed -- I don't fully understand what that code is intended to do. =)

My code gives the correct value for things in the range 0..2^64 and deterministic but bogus values for everything else.

floats are represented as mantissa × 2^exponent, e.g. 1.0 is (1 << 52) × 2^-52, and since a power-of-two exponent is just a bitshift in binary, we can reverse the scaling with a shift (thus the negation of the exponent and the right shift).

+1 for determinism.

I see two semantics that make good sense for humans, and I think we should pick whichever one is faster for values that are in range, when the compiler can't optimize away any of the computation. (When the compiler knows that a value is in range, both options give the same results, so they are equally optimize-able.)

  • Saturation (out-of-range values become IntType::max_value()/min_value())
  • Modulo (out-of-range values are treated as if by converting to a bigint first, then truncating)

The table below is meant to specify both options fully. T is any machine integer type. Tmin and Tmax are T::min_value() and T::max_value(). RTZ(v) means take the mathematical value of v and Round Toward Zero to get a mathematical integer.

| v | v as T (saturation) | v as T (modulo) |
| --- | --- | --- |
| in range (Tmin <= v <= Tmax) | RTZ(v) | RTZ(v) |
| negative zero | 0 | 0 |
| NaN | 0 | 0 |
| Infinity | Tmax | 0 |
| -Infinity | Tmin | 0 |
| v > Tmax | Tmax | RTZ(v) truncated to fit T |
| v < Tmin | Tmin | RTZ(v) truncated to fit T |

The ECMAScript standard specifies operations ToInt32, ToUint32, ToInt16, ToUint16, ToInt8, ToUint8, and my intent with the "modulo" option above is to match those operations in every case.

ECMAScript also specifies ToInt8Clamp which does not match either case above: it does "round half to even" rounding on fractional values rather than "round to zero".

@oli-obk's suggestion is a third way, worth considering if it's faster to compute, for values that are in range.
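For concreteness, a direct reading of the saturation column might look like this in Rust (a sketch, not a proposed implementation; note that i32::MIN and i32::MAX are both exactly representable in f64, so the comparisons below are exact):

fn saturating_f64_to_i32(v: f64) -> i32 {
    if v.is_nan() {
        0 // NaN -> 0
    } else if v >= i32::MAX as f64 {
        i32::MAX // v > Tmax, including +Infinity
    } else if v <= i32::MIN as f64 {
        i32::MIN // v < Tmin, including -Infinity
    } else {
        v as i32 // in range: RTZ(v)
    }
}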

@oli-obk What about signed integer types?

est31 commented

Related #41799

Throwing another proposal into the mix: Mark u128 casts to floats as unsafe and force folks to explicitly choose a way of handling it. u128 is pretty rare currently.

@Manishearth I'd hope for similar semantics for integer → float as for float → integer. Since both are UB-ful, and we cannot make float → integer unsafe anymore, we should probably avoid making integer → float unsafe as well.

For float → integer, saturating will be faster AFAICT (resulting in a sequence of and, test+jump float comparison and jump, all for 2-3 cycles on modern arches). I personally couldn't care less what exact behaviour we decide on, as long as the in-range values are as fast as they possibly could be.

CryZe commented

Wouldn't it make sense to make it behave like overflow? So in a debug build it would panic if you do a cast with undefined behaviour. Then you could have methods for specifying the casting behaviour like 1.04E+17.saturating_cast::<u8>(), unsafe { 1.04E+17.unsafe_cast::<u8>() } and potentially others.

Oh, I thought the issue was only for u128, and we can make that unsafe both ways.

@CryZe UB should not exist even in release mode in safe code. The overflow stuff is still defined behavior.

That said, panic on debug, and on release would be great.

This affects:

  • f32 -> u8, u16, u32, u64, u128, usize (-1f32 as _ for all, f32::MAX as _ for all but u128)
  • f32 -> i8, i16, i32, i64, i128, isize (f32::MAX as _ for all)
  • f64 -> all ints (f64::MAX as _ for all)

f32::INFINITY as u128 is also UB

@CryZe

Wouldn't it make sense to make it behave like overflow? So in a debug build it would panic if you do a cast with undefined behaviour.

This is what I thought initially, but I was reminded that as conversions never panic at present (we don't do overflow checking with as, for better or worse). So the most analogous thing is for it to "do something defined".

comex commented

FWIW, the "kill undef" thing would, in fact, provide a way to fix the memory unsafety, but leaving the result nondeterministic. One of the key components is:

  1. Create a new instruction, '%y = freeze %x', that stops propagation of
    poison. If the input is poison, then it returns an arbitrary, but fixed,
    value. (like old undef, but each use gets the same value), otherwise it
    just returns its input value.

The reason undefs can be used to violate memory safety today is that they can magically change values between uses: in particular, between a bounds check and subsequent pointer arithmetic. If rustc added a freeze after every dangerous cast, you would just get an unknown but otherwise well-behaved value. Performance-wise, freeze is basically free here, since of course the machine instruction corresponding to the cast produces a single value, not a fluctuating one; even if the optimizer feels like duplicating the cast instruction for some reason, it should be safe to do so because the result for out-of-range inputs is usually deterministic on a given architecture.

...But not deterministic across architectures, if anyone was wondering. x86 returns 0x80000000 for all bad inputs; ARM saturates for out-of-range inputs and (if I'm reading this pseudocode right) returns 0 for NaN. So if the goal is to produce a deterministic and platform-independent result, it's not enough to just use the platform's fp-to-int intrinsic; at least on ARM you also need to check the status register for an exception. This may have some overhead in itself, and certainly prevents autovectorization in the unlikely event that using the intrinsic didn't already. Alternately, I guess you could explicitly test for in-range values using regular compare operations and then use a regular float-to-int. That sounds a lot nicer on the optimizer...

as conversions never panic at present

At some point we changed + to panic (in debug mode). I wouldn't be shocked to see as panic in cases that were previously UB.

If we care about checking (which we should), then we should either deprecate as (is there any use case where it's the only good option?) or at least advise against using it, and move people onto things like TryFrom and TryInto instead, which is what we said we were planning to do back when it was decided to leave as as-is. I don't feel that the cases under discussion are qualitatively different, in the abstract, from the cases where as is already defined not to do any checks. The difference is just that in practice the implementation for these cases is currently incomplete and has UB. A world where you can't rely on as doing checks (because for most types, it doesn't), and you can't rely on it not panicking (because for some types, it would), and it's not consistent, and we still haven't deprecated it seems like the worst of all of them to me.

So, I think at this point @jorendorff basically enumerated what seems to me to be the best plan:

  • as will have some deterministic behavior;
  • we'll pick a behavior based on a combination of how sensible it is, and how efficient it is

He enumerated three possibilities. I think the remaining work is to investigate those possibilities -- or at least investigate one of them. That is, actually implement it, and try to get some feeling for how "slow" or "fast" it is.

Is there anyone out there who feels motivated to take a stab at that? I'm going to tag this as E-help-wanted in hopes of attracting a person. (@oli-obk?)

Uh, I'd rather not pay the price for cross-platform consistency :/ It's garbage in, I don't care what garbage comes out (however, a debug assertion would be super helpful).

Currently all rounding/truncating functions in Rust are very slow (function calls with painstakingly precise implementations), so as is my last-resort hack for fast float rounding.

If you're going to make as anything more than bare cvttss2si, please also add a stable alternative that is just that.

est31 commented

@pornel this isn't just UB of the theoretical kind where stuff is okay if you ignore that it is UB; it has real-world implications. I've extracted #41799 from a real-world code example.

@est31 I agree that leaving it as UB is wrong, but I've seen freeze proposed as a solution to UB. AFAIK that makes it a defined deterministic value, you just don't get to say which. That behavior is fine with me.

So I'd be fine if e.g. u128::MAX as f32 deterministically produced 17.5 on x86, and 999.0 on x86-64, and -555 on ARM.

freeze would not produce a defined, deterministic, unspecified value. Its result is still "any bit pattern the compiler likes", and it is consistent only across uses of the same operation. This may sidestep the UB-producing examples people have collected above, but it would not give this:

u128::MAX as f32 deterministically produced 17.5 on x86, and 999.0 on x86-64, and -555 on ARM.

For example, if LLVM notices that u128::MAX as f32 overflows and replaces it with freeze poison, a valid lowering of fn foo() -> f32 { u128::MAX as f32 } on x86_64 might be this:

foo:
  ret

(that is, just return whatever was last stored in the return register)

I see. That's still acceptable for my uses (for cases where I expect out of range values, I do clamping beforehand. Where I expect values in range, but they aren't, then I'm not going to get a correct result no matter what).

I have no problem with out of range float casts returning arbitrary values as long as the values are frozen so they can't cause further undefined behavior.

Is something like freeze available on LLVM? I thought that was a purely theoretical construct.

eddyb commented

@nikomatsakis I've never seen it used like that (unlike poison) - it's a planned revamp of poison/undef.

freeze does not exist at all in LLVM today. It's only been proposed (this PLDI paper is a self-contained version, but it's also been discussed a lot on the mailing list). The proposal seems to have considerable buy-in, but of course that is no guarantee it will be adopted, much less adopted in a timely manner. (Removing the pointee types from pointer types has been accepted for years and it's still not done.)

Do we want to open up an RFC to get more widespread discussion on the changes being proposed here? IMO, anything that potentially impacts the performance of as is going to be contentious, but it will be doubly contentious if we don't give people the chance to make their voice heard.

I'm a Julia developer, and I've been following this issue for a while, as we share the same LLVM backend and so have similar issues. In case it's of interest, here is what we have settled on (with approximate timings for a single function on my machine):

  • unsafe_trunc(Int64, x) maps directly to the corresponding LLVM intrinsic fptosi (1.5 ns)
  • trunc(Int64, x) throws an exception for out of range values (3 ns)
  • convert(Int64, x) throws an exception for out of range or non-integer values (6 ns)

Also, I have asked on the mailing list about making the undefined behaviour a little more defined, but did not receive a very promising response.

@bstrie I'm fine with an RFC, but I think it'd definitely be useful to have data! @simonbyrne's comment is super helpful in that regard, however.

I've toyed around with JS semantics (the modulo @jorendorff mentioned) and the Java semantics which appear to be the "saturation" column. In case those links expire it's JS and Java.

I also whipped up a quick implementation of saturation in Rust which I think (?) is correct. And got some benchmark numbers as well. Interestingly I'm seeing that the saturating implementation is 2-3x slower than the intrinsic, which is different from what @simonbyrne found with only 2x slower.

I'm not entirely sure how to implement the "mod" semantics in Rust...
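For reference, one way to sketch the "modulo" column for f64 -> u32 (using f64::rem_euclid, which did not exist at the time of this discussion; NaN and infinities map to 0 as in the table above):

fn f64_to_u32_mod(v: f64) -> u32 {
    if !v.is_finite() {
        return 0; // NaN and infinities -> 0
    }
    let t = v.trunc(); // round toward zero
    // Float remainder is computed exactly, and the result lies in
    // [0, 2^32), so the final cast is always in range.
    t.rem_euclid(4_294_967_296.0) as u32 // reduce mod 2^32
}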

To me, though, it seems clear that we'll need a slew of f32::as_u32_unchecked() methods and such for those who need the performance.

eddyb commented

it seems clear that we'll need a slew of f32::as_u32_unchecked() methods and such for those who need the performance.

That's a bummer - or do you mean a safe but implementation-defined variant?

Is there no option for an implementation defined fast default?

@eddyb I was thinking that we'd just have unsafe fn as_u32_unchecked(self) -> u32 on f32 and such which is a direct analog to what as is today.

I'm certainly not going to claim that the Rust implementation I wrote is optimal, but I was under the impression that when reading this thread determinism and safety were more important than speed in this context most of the time. The unsafe escape hatch is for those on the other side of the fence.

eddyb commented

So there's no cheap platform-dependent variant? I want something that's fast, gives an unspecified value when outside of bounds and is safe. I don't want UB for some inputs and I think that's too dangerous for common use, if we can do better.

As far as I am aware, on most if not all platforms the canonical way to implement this conversion does something to out-of-range inputs that isn't UB. But LLVM does not seem to have any way to pick that option (whatever it may be) over UB. If we could convince LLVM developers to introduce an intrinsic that yields an "unspecified but not undef/poison" result on out-of-range inputs, we could use that.

But I'd estimate that someone in this thread would have to write a convincing RFC (on the llvm-dev list), get buy-in, and implement it (in backends we care about, and with a fall-back implementation for other targets). Probably easier than convincing llvm-dev to make the existing casts not-UB (because it side-steps questions like "will this make any C and C++ programs slower"), but still not very easy.

Just in case you will be choosing between these:

Saturation (out-of-range values become IntType::max_value()/min_value())
Modulo (out-of-range values are treated as if by converting to a bigint first, then truncating)

IMO only saturation would make sense here, because absolute precision of floating point quickly drops as values get large, so at some point the modulo would be something useless like all zeroes.

I marked this as E-needs-mentor and tagged it with WG-compiler-middle, since it seems the impl period might be a great time to investigate this further! My existing notes on the plan of record are pretty sparse though, so it'd be great if someone from @rust-lang/compiler wanted to help elaborate those a bit further!

@nikomatsakis

IIRC LLVM are planning to eventually implement freeze, which should allow us to deal with the UB by doing a freeze.

My results so far: https://gist.github.com/s3bk/4bdfbe2acca30fcf587006ebb4811744

The _array variants run a loop of 1024 values.
_cast: x as i32
_clip: x.min(MAX).max(MIN) as i32
_panic: panics if x is out of bounds
_zero: sets the result to zero if out of bounds

test bench_array_cast       ... bench:       1,840 ns/iter (+/- 37)
test bench_array_cast_clip  ... bench:       2,657 ns/iter (+/- 13)
test bench_array_cast_panic ... bench:       2,397 ns/iter (+/- 20)
test bench_array_cast_zero  ... bench:       2,671 ns/iter (+/- 19)
test bench_cast             ... bench:           2 ns/iter (+/- 0)
test bench_cast_clip        ... bench:           2 ns/iter (+/- 0)
test bench_cast_panic       ... bench:           2 ns/iter (+/- 0)
test bench_cast_zero        ... bench:           2 ns/iter (+/- 0)

Perhaps the results should not be rounded to whole nanoseconds for individual operations. Clearly there must be some difference behind these 2 ns/iter. Or is it really like this, exactly 2 ns for all 4 variants?

eddyb commented

@sp-1234 I wonder if it's partially optimized out.

@sp-1234 It is too fast to measure. The non-array benchmarks are basically useless.
If you force the single-value functions to be functions via #[inline(never)], you get 2ns vs 3ns.

@arielb1
I have some reservations regarding freeze. If I understand correctly, a frozen undef can still contain any arbitrary value, it just won't change between uses. In practice, the compiler will probably reuse a register or stack slot.

However this means that we can now read uninitialized memory from safe code. This could lead to secret data being leaked, somewhat like Heartbleed. It's debatable whether this is truly considered UB from the point of view of Rust, but it clearly seems undesirable.

I ran @s3bk's benchmark locally. I can confirm the scalar versions are optimized out completely, and the asm for the array variants also looks suspiciously well-optimized: for example, the loops are vectorized, which is nice but makes it hard to extrapolate performance to scalar code.

Unfortunately, spamming black_box doesn't seem to help. I do see the asm doing useful work, but running the benchmark still consistently gives 0ns for the scalar benchmarks (except cast_zero, which shows 1ns). I see that @alexcrichton performed the comparison 100 times in their benchmarks, so I adopted the same hack. I'm now seeing these numbers (source code):

test bench_cast             ... bench:          53 ns/iter (+/- 0)
test bench_cast_clip        ... bench:         164 ns/iter (+/- 1)
test bench_cast_panic       ... bench:         172 ns/iter (+/- 2)
test bench_cast_zero        ... bench:         100 ns/iter (+/- 0)

The array benchmarks vary too much for me to trust them. Well, truth be told, I'm skeptical of the test benchmarking infrastructure anyway, especially after seeing the above numbers compared to the flat 0ns I got previously. Furthermore, even just 100 iterations of black_box(x); (as a baseline) take 34ns, which makes it even harder to interpret those numbers reliably.

Two points worth noting:

  • Despite not handling NaN specifically (it returns -inf instead of 0?), the cast_clip implementation appears to be slower than @alexcrichton's saturating cast (note that their run and mine have roughly the same timing for as casts, 53-54ns).
  • Unlike @s3bk's array results, I'm seeing cast_panic being slower than the other checked casts. I also see an even greater slowdown on the array benchmarks. Maybe these things are just highly dependent on microarchitectural details and/or optimizer mood?

For the record, I've measured with rustc 1.21.0-nightly (d692a91 2017-08-04), -C opt-level=3, on an i7-6700K under light load.


In conclusion, we don't have reliable data so far, and getting more reliable data seems hard. Furthermore, I strongly doubt that any real application spends even 1% of its wall-clock time on this operation. Therefore, I would suggest moving forward by implementing saturating as casts in rustc, behind a -Z flag, and then running some non-artificial benchmarks with and without this flag to determine the impact on realistic applications.

Edit: I would also recommend to run such benchmarks on a variety of architectures (e.g., including ARM) and microarchitectures, if at all possible.

I admit I'm not that familiar with Rust, but I think this line is subtly incorrect: std::i32::MAX (2^31-1) is not exactly representable as a Float32, so std::i32::MAX as f32 will be rounded to the nearest representable value (2^31). If this value is used as the argument x, the result is technically undefined. Replacing it with a strict inequality should fix this case.

Yeah, we had exactly that problem in Servo before. The final solution was to cast to f64 and then clamp.

There are other solutions but they're pretty tricky and rust doesn't expose nice APIs for dealing with this well.

Using 0x7FFF_FF80i32 as the upper limit and -0x8000_0000i32 as the lower limit should solve this without casting to f64.
edit: use the correct value.

I think you mean 0x7fff_ff80, but simply using a strict inequality would probably make the intention of the code clearer.

as in x < 0x8000_0000u32 as f32 ? That would probably be a good idea.
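To make the boundary subtlety concrete (a sketch; i32::MIN == -2^31 is exactly representable in f32, while i32::MAX is not):

// i32::MAX == 2^31 - 1 rounds up to 2^31 when converted to f32:
assert_eq!(i32::MAX as f32, 2_147_483_648.0);
// so `x <= i32::MAX as f32` wrongly accepts x == 2^31.0. A correct
// range check uses a strict upper bound (NaN fails both comparisons):
fn fits_in_i32(x: f32) -> bool {
    x >= -2_147_483_648.0 && x < 2_147_483_648.0
}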

Of all the suggested deterministic options, I think clamping is generally the most useful one, because I think it's done often anyway. If the conversion were actually documented to be saturating, manual clamping would become unnecessary.

I am only a little worried about the suggested implementation, because it doesn't translate directly to machine instructions and relies heavily on branching. Branching makes the performance dependent on specific data patterns. In the test cases given above everything looks (comparatively) fast because the same branch is always taken and the processor has good branch prediction data from many previous loop iterations. The real world will probably not look like that. Additionally, branching hurts the compiler's ability to vectorize the code. I disagree with @rkruppe's opinion that the operation shouldn't also be tested in combination with vectorization. Vectorization is important in high-performance code, and being able to vectorize simple casts on common architectures should be a crucial requirement.

For the reasons given above, I played around with an alternative branchless, data-flow-oriented version of @alexcrichton's cast with saturation semantics and @simonbyrne's fix. I implemented it for u16, i16 and i32, since they all have to cover slightly different cases, which results in varying performance.
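The core idea, sketched here for f32 -> i16 (illustrative only, not the author's exact code; the NaN handling is the part discussed further down in the thread):

fn f32_to_i16_sat(x: f32) -> i16 {
    if x.is_nan() {
        return 0;
    }
    // f32::max/min compile to maxss/minss on x86, so the clamping
    // itself involves no data-dependent branches.
    let clamped = x.max(i16::MIN as f32).min(i16::MAX as f32);
    clamped as i16 // clamped is always in range, so truncation is safe
}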

The results:

test i16_bench_array_cast       ... bench:          99 ns/iter (+/- 2)
test i16_bench_array_cast_clip  ... bench:         197 ns/iter (+/- 3)
test i16_bench_array_cast_clip2 ... bench:         113 ns/iter (+/- 3)
test i16_bench_cast             ... bench:          76 ns/iter (+/- 1)
test i16_bench_cast_clip        ... bench:         218 ns/iter (+/- 25)
test i16_bench_cast_clip2       ... bench:         148 ns/iter (+/- 4)
test i16_bench_rng_cast         ... bench:       1,181 ns/iter (+/- 17)
test i16_bench_rng_cast_clip    ... bench:       1,952 ns/iter (+/- 27)
test i16_bench_rng_cast_clip2   ... bench:       1,287 ns/iter (+/- 19)

test i32_bench_array_cast       ... bench:         114 ns/iter (+/- 1)
test i32_bench_array_cast_clip  ... bench:         200 ns/iter (+/- 3)
test i32_bench_array_cast_clip2 ... bench:         128 ns/iter (+/- 3)
test i32_bench_cast             ... bench:          74 ns/iter (+/- 1)
test i32_bench_cast_clip        ... bench:         168 ns/iter (+/- 3)
test i32_bench_cast_clip2       ... bench:         189 ns/iter (+/- 3)
test i32_bench_rng_cast         ... bench:       1,184 ns/iter (+/- 13)
test i32_bench_rng_cast_clip    ... bench:       2,398 ns/iter (+/- 41)
test i32_bench_rng_cast_clip2   ... bench:       1,349 ns/iter (+/- 19)

test u16_bench_array_cast       ... bench:          99 ns/iter (+/- 1)
test u16_bench_array_cast_clip  ... bench:         136 ns/iter (+/- 3)
test u16_bench_array_cast_clip2 ... bench:         105 ns/iter (+/- 3)
test u16_bench_cast             ... bench:          76 ns/iter (+/- 2)
test u16_bench_cast_clip        ... bench:         184 ns/iter (+/- 7)
test u16_bench_cast_clip2       ... bench:         110 ns/iter (+/- 0)
test u16_bench_rng_cast         ... bench:       1,178 ns/iter (+/- 22)
test u16_bench_rng_cast_clip    ... bench:       1,336 ns/iter (+/- 26)
test u16_bench_rng_cast_clip2   ... bench:       1,207 ns/iter (+/- 21)

The test was run on an Intel Haswell i5-4570 CPU and Rust 1.22.0-nightly.
clip2 is the new branchless implementation. It agrees with clip on all 2^32 possible f32 input values.

For the rng benchmarks, random input values are used that hit different cases often. This uncovers the extreme performance cost (roughly 10 times the normal cost!!!) that occurs if branch prediction fails. I think it is very important to consider this. It's not the average real world performance either, but it's still a possible case and some applications will hit this. People will expect a f32 cast to have consistent performance.

Assembly comparison on x86: https://godbolt.org/g/AhdF71
The branchless version very nicely maps to the minss/maxss instructions.

Unfortunately I wasn't able to make godbolt generate ARM assembly from Rust, but here is an ARM comparison of the methods with Clang: https://godbolt.org/g/s7ronw
Without being able to test the code or knowing much about ARM: the code size seems smaller too, and LLVM mostly generates vmax/vmin, which looks promising. Maybe LLVM could eventually be taught to fold most of the code into a single instruction?

@ActuallyaDeviloper The asm and the benchmark results look very good! Furthermore, branchless code like yours is probably easier to generate in rustc than the nested conditionals of other solutions (for the record, I am assuming we want to generate inline IR instead of calling a lang item function). Thank you so much for writing this.

I have a question about u16_cast_clip2: it doesn't seem to handle NaN?! There is a comment talking about NaN, but I believe the function will pass NaN through unmodified and attempt to cast it to f32 (and even if it didn't, it would produce one of the boundary values rather than 0).

PS: To be clear, I was not trying to imply that it's unimportant whether the cast can be vectorized. It clearly is important if the surrounding code is otherwise vectorizable. But scalar performance is also important, as vectorization is often not applicable, and the benchmarks I was commenting on were not making any statement about scalar performance. Out of interest, have you checked the asm of the *array* benchmarks to see if they're still vectorized with your implementation?

@rkruppe You are right, I accidentally swapped the sides of the if and forgot about that. f32 as u16 happened to do the right thing by truncating the upper 0x8000 away, so the tests didn't catch it either. I fixed the problem now by swapping the branches again and testing all methods with if (y.is_nan()) { panic!("NaN"); } this time.

I updated my previous post. The x86 code did not change significantly at all, but unfortunately the change stops LLVM from generating vmax in the u16 ARM case for some reason. I assume this has to do with some details of that ARM instruction's NaN handling, or maybe it's an LLVM limitation.

For why it works, notice that the lower boundary value is actually 0 for unsigned values, so NaN and the lower bound can be caught at the same time.

The array versions are vectorized.
Godbolt: https://godbolt.org/g/HnmsSV

Re: the ARM asm, I believe the reason vmax is not used any more is that it returns NaN if either operand is NaN. The code is still branchless, though; it just uses predicated moves (vmovgt, referring to the result of the earlier vcmp with 0).

For why it works, notice that the lower boundary value is actually 0 for unsigned values. So NaN and the lower bound can be catched at the same time.

Ohhh, right. Nice.

I would suggest to move forward by implementing saturating as casts in rustc, behind a -Z flag

I have implemented this and will file a PR once I have also fixed #41799 and written a lot more tests.

#45134 has pointed out a code path that I missed (generation of LLVM constant expressions; this is separate from rustc's own constant evaluation). I'll roll a fix for that into the same PR, but it will take a little while longer.

eddyb commented

@rkruppe You should coordinate with @oli-obk so that miri ends up with the same changes.

Pull request is up: #45205

#45205 has been merged, so anyone can now (well, starting with the next nightly) measure the performance impact of saturation by passing -Z saturating-float-casts via RUSTFLAGS. [1] Such measurements would be very valuable for deciding how to proceed with this issue.

[1] Strictly speaking, this won't affect the non-generic, non-#[inline] portions of the standard library, so to be 100% accurate you'd want to locally build std with Xargo. However, I don't expect that there will be a lot of code affected by this (the various conversion trait impls are #[inline], for example).
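For example, assuming a nightly toolchain, benchmarks can be run with the flag via something like RUSTFLAGS="-Z saturating-float-casts" cargo bench.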

@rkruppe I suggest starting an internals/users page to collect data, in the same vein as https://internals.rust-lang.org/t/help-us-benchmark-incremental-compilation/6153/ (we can then also link people to that, rather than some random comments in our issue tracker)

est31 commented

@rkruppe you should create a tracking issue. This discussion is split up into two issues already. That's not good!

@gankro Yeah I agree, but it may be a few days before I find the time to write that post properly, so I figured I'd solicit feedback from the people subscribed to this issue in the mean time.

@est31 Hmm. Although the -Z flag covers both cast directions (which may have been a mistake, in retrospect), it seems unlikely that we'll flip the switch on both at the same time, and there's little overlap between the two in terms of what must be discussed (e.g., this issue hinges on the performance of saturation, while in #41799 it's agreed upon what the right solution is).
It is a bit silly that benchmarks primarily targeted at this issue would also measure the impact of the fix to #41799, but that can at most lead to overreporting of performance regressions, so I'm sort of okay with that. (But if anyone is motivated to split the -Z flag into two, go ahead.)

I've considered a tracking issue for the task of removing the flag once it has outlived its usefulness, but I don't see the need to merge the discussions occurring here and in #41799.

I have drafted up an internals post: https://gist.github.com/Gankro/feab9fb0c42881984caf93c7ad494ebd

Feel free to copy that, or just give me notes so I can post it. (note I'm a bit confused about the const fn behaviour)