rust-lang/unsafe-code-guidelines

Validity of integers and floating point

RalfJung opened this issue · 99 comments

Discussing the validity invariant of integer and floating point types.

Clearly, every possible bit pattern is allowed. For integers they all have a distinct and meaningful interpretation, and we have a safe no-op conversion between f32 and u32, and between f64 and u64, through to_bits and from_bits.
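That lossless round trip can be demonstrated directly (the constant below is just the IEEE-754 encoding of 1.5):

```rust
fn main() {
    let f = 1.5f32;
    // `to_bits` exposes the raw IEEE-754 representation as a u32.
    let bits: u32 = f.to_bits();
    assert_eq!(bits, 0x3FC0_0000); // sign 0, exponent 127, mantissa 0.5
    // `from_bits` is the exact inverse, for every bit pattern.
    assert_eq!(f32::from_bits(bits), 1.5f32);
}
```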

The remaining open question is: is it ever allowed to have an uninitialized bit in an integer or floating point value? We could reasonably decide either way. Also, when an integer is partially uninitialized, does that "infect" the entire integer or do we exactly preserve which byte is initialized?

2022-09-07: This has now pretty much been answered.

Arguments for allowing uninitialized bits:

  • There does not seem to be a reasonable optimization that is broken by this.
  • It does not contradict anything we currently tell LLVM.
  • It simplifies writing code with mem::uninitialized()/MaybeUninit::uninitialized() when the type is known to consist of integers only.

Arguments for disallowing uninitialized bits:

  • If we allow uninitialized bits, we have to define what happens with arithmetic operations when an input contains an uninitialized bit. Is that insta-UB? Is the result fully uninitialized? Does uninitializedness somehow propagate bitwise? These are hard choices, because it is easy to break arithmetic equations here. By disallowing uninitialized bits, we successfully avoid the question.
  • "No uninitialized data outside of unions" is a nice and and easy to teach principle.
  • Ruling out uninitialized bits in integers and floating points makes it easier to find bugs where people accidentally have such variables not properly initialized: A bug-finding tool can flag any occurrence of uninitialized memory in integers/floating points as a bug immediately. Also see https://www.youtube.com/watch?v=yG1OZ69H_-o: The fact that C makes integer overflow on unsigned variables defined to wrap-around actually has lead to real-world bugs (caused by unintended overflow) not being detected by bug-finding tools, because you cannot just mark every overflow as a bug. If there is a rare exception to a common principle ("integer overflow is a bug", "uninitialized data is a bug"), then it is preferable to have the programmer state their intent of violating that principle explicitly (use explicitly overflowing arithmetic operations like wrapping_add, use types like MaybeUninit for explicitly handling uninitialized data).

Here's an example of a piece of code that relies on uninitialized integers (and raw pointers, and AtomicPtr) being valid: https://github.com/carllerche/bytes/blob/456221d16521cf54cea0e6569669e47120a1b738/src/bytes.rs#L1810

An interesting use of uninitialized bits in integers is in crossbeam's AtomicCell: for types of certain sizes, it will use the AtomicU* types.

But this means that with a type like (u8, u16), it will use AtomicU32 and thus carry uninitialized bits in a u32. Also see crossbeam-rs/crossbeam#315.
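A minimal sketch of that dispatch technique (not crossbeam's actual code): a 4-byte type is stored in an AtomicU32 by transmuting its bytes. The sketch deliberately uses the padding-free type (u16, u16), so every byte of the u32 is initialized; the case this issue worries about is exactly a type like (u8, u16), where one byte of the u32 would be uninitialized padding.

```rust
use std::mem;
use std::sync::atomic::{AtomicU32, Ordering};

// Hypothetical illustration of the size-based trick: back a 4-byte
// value with an AtomicU32. Safe here only because (u16, u16) has no
// padding bytes.
struct Cell32 {
    inner: AtomicU32,
}

impl Cell32 {
    fn new(v: (u16, u16)) -> Self {
        assert_eq!(mem::size_of::<(u16, u16)>(), 4);
        Cell32 { inner: AtomicU32::new(unsafe { mem::transmute(v) }) }
    }
    fn load(&self) -> (u16, u16) {
        unsafe { mem::transmute(self.inner.load(Ordering::SeqCst)) }
    }
    fn store(&self, v: (u16, u16)) {
        self.inner.store(unsafe { mem::transmute(v) }, Ordering::SeqCst);
    }
}

fn main() {
    let c = Cell32::new((1, 2));
    c.store((3, 4));
    assert_eq!(c.load(), (3, 4));
}
```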

Cc @stjepang

If we allow uninitialized bits, we have to define what happens with arithmetic operations when an input contains an uninitialized bit. Is that insta-UB? Is the result fully uninitialized? Does uninitializedness somehow propagate bitwise? These are hard choices, because it is easy to break arithmetic equations here. By disallowing uninitialized bits, we successfully avoid the question.

If we were to allow uninitialized bits, it might be reasonable to say, at least initially, that any operation that's not a read or a write is UB. That would allow us to define those operations later on. For example, that would mean that adding an initialized integer with an uninitialized one is UB, but later on, we could define that to something else, like the result of that operation being uninitialized.

That is, we wouldn't need to answer these hard questions right now.


I wonder whether it is possible to allow uninitialized bits later on in a backwards-compatible way, or whether we have to make this decision right now?

I think we have to make this decision right now, because, e.g. if we forbid uninitialized bits, all unsafe code assumes that integers are always initialized, and we can't change that later to support uninitialized bits without breaking that assumption.

In the same way, if we allow uninitialized bits, we can't later on disallow them without breaking code that uses them.


"No uninitialized data outside of unions" is a nice and and easy to teach principle.

Agreed.

A bug-finding tool can flag any occurrence of uninitialized memory in integers/floating points as a bug immediately.

I don't think it can flag every occurrence -- for example, extern fn foo() -> MaybeUninit<i32> -- but it would at least be able to do so for all code that the tool is able to instrument.

I think we have to make this decision right now, because, e.g. if we forbid uninitialized bits, all unsafe code assumes that integers are always initialized, and we can't change that later to support uninitialized bits without breaking that assumption.

I don't see how. If a type's validity invariant rules out certain initialized bit patterns, unsafe code can use those bit patterns for its own purposes, which will clash with later allowing those bit patterns in the validity invariant. However, uninitialized bits -- even if they are valid -- cannot be detected by the program: reading them is either UB or perhaps produces some string of zeros and ones (possibly a non-deterministic one). So that kind of counter-example is right out.

Moreover, because uninitialized bits will never be safe, unsafe code can't be fed them from unknown/external sources. IOW the only way a library/module/etc. will ever see uninitialized bits is if it produces them itself or obtains them from its "trusted base" (e.g., a function whose functional correctness is relevant to the soundness of the library/module/etc.), in which case it's UB today and its problem is an internal bug, not the interface with the rest of the world.

I would be inclined to permit uninitialized data in integers.

My reasoning is as follows:

I think that backwards compatibility around mem::uninitialized is a real concern. It is a very common pattern to allocate an uninitialized [u8] array in practice, and I am reluctant to declare all of that pre-existing code UB (even if we deprecate uninitialized).

Second, I think that the crossbeam example which @RalfJung raised earlier feels like the kind of tricky thing that people shouldn't have to fret about. In particular, if you have uninitialized padding bits in structs and things like that which are known to be less than word size, I think you should be able to treat them as integers for convenience, and it seems like that would be Insta-UB under this proposal.

Basically, I think people are likely to do things (like crossbeam) that wind up using uninitialized bits in integers but which don't necessarily "feel" like cases where MaybeUninit should be required. I also think that people will commonly reason about uninitialized data informally and hence write code that is UB -- of course, a tool that detects such usage can help, but it's also nice if "common things" people write are not UB even if they neglect to run the tool.

That said, I find the most compelling argument against permitting uninitialized bits to be that it would allow us to declare such uses an error. But it seems like we could still have a sort of "lint", where we say "ah, you are using uninitialized data outside of a MaybeUninit struct -- while not technically UB, it is recommended to use MaybeUninit."

One meta-question:

Suppose that I do have a clever algorithm that makes use of uninitialized integers. For example, a trick I have used in a past life was to have an integer set that had O(1) initialization cost regardless of its capacity. This worked by having two vectors of integers. One of them started out with size N but had uninitialized data. The other started out as size 0 but had initialized data. To add to the set, you checked one against the other (I can give details if desired). The key point is that you had to read and use an uninitialized integer in the process and compare it against initialized data -- is reading such an integer UB?
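That trick is the classic two-array sparse set (often attributed to Briggs and Torczon). Here is a sketch that zero-initializes the `sparse` array so it stays within defined behavior; the O(1)-initialization variant simply skips the `vec![0; ...]` step, which is exactly where the read of an uninitialized integer comes in:

```rust
/// Two-array sparse set sketch. `sparse[x]` is a candidate index into
/// `dense`; membership is confirmed by the double check in `contains`,
/// which makes any garbage value in `sparse` harmless.
struct SparseSet {
    sparse: Vec<usize>, // zero-initialized here; uninitialized in the trick
    dense: Vec<usize>,  // the members, in insertion order
}

impl SparseSet {
    fn new(capacity: usize) -> Self {
        SparseSet { sparse: vec![0; capacity], dense: Vec::new() }
    }
    fn contains(&self, x: usize) -> bool {
        // In the uninitialized variant this read of `self.sparse[x]`
        // may observe garbage -- the point under discussion.
        let i = self.sparse[x];
        i < self.dense.len() && self.dense[i] == x
    }
    fn insert(&mut self, x: usize) {
        if !self.contains(x) {
            self.sparse[x] = self.dense.len();
            self.dense.push(x);
        }
    }
}

fn main() {
    let mut s = SparseSet::new(100);
    s.insert(42);
    s.insert(7);
    assert!(s.contains(42) && s.contains(7));
    assert!(!s.contains(13));
}
```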

The crossbeam example and the "O(1) initialization" integer set, as well as all other clever uses of uninitialized memory that I'm aware of (i.e., not just allocating uninitialized memory and initializing it at your own pace before using it), require operating on uninitialized bits. So if we want to allow them, we need to not only allow reading uninitialized memory but also make arithmetic and comparisons on it defined (and reasonably deterministic! undef results are worse than useless). That is not possible at all with current LLVM (except by initializing all memory, or pretending to do so in ways that will block lots of unrelated optimizations), and even if/when freeze and poison are adopted, it'll have some negative codegen impact and I see no way to restrict that impact only to modules that really need it.

So even though I agree that it would be best to support these things, I do not see a reasonable way to achieve that.

The crossbeam example and the "O(1) initialization" integer set, as well as all other clever uses of uninitialized memory that I'm aware of (i.e., not just allocating uninitialized memory and initializing it at your own pace before using it), require operating on uninitialized bits.

Notice that this only applies to the compare_exchange part of crossbeam. Loads and stores do not operate on uninitialized data and hence are mostly fine.

So we have a two-level discussion here:

  • Can we load/store uninitialized bits into integers?
  • If yes, is there any other operation we can perform on them?

I'd prefer the answer to the first question to be "no" so that the second question doesn't even need answering, but unfortunately @nikomatsakis has some pretty good arguments. ;)
Just one word on these: at least under some proposed poison semantics for LLVM, poison acts on a value. That means if you load a (u8, u16) into a u32, and the padding was indeed uninitialized, you now have a fully uninitialized u32 -- if one byte of the memory is poison, the entire u32 becomes poison. So that would still not allow what crossbeam does. This behavior is, from what I know, one of the reasons why the proposal isn't officially adopted yet -- but if you don't do this, you lose some other optimizations.
However, this question of the "granularity" of poison/uninit (per-value, per-byte, per-bit) is somewhat orthogonal to whether or not a u32 must be initialized, and there are uses of uninitialized integers that never run into this.

Coming to the second question, as @rkruppe said these patterns (I think you are talking about https://research.swtch.com/sparse, right?) are still not legal. Comparing an uninitialized integer with anything is either UB or produces an uninitialized boolean, branching on which is UB. But once LLVM has freeze, we could make them legal very easily and (I think) without bad impact elsewhere by adding a freeze intrinsic and telling people to use that.

But once LLVM has freeze, we could make them legal very easily and (I think) without bad impact elsewhere by adding a freeze intrinsic and telling people to use that.

Hm, if getting people to use such an intrinsic is acceptable (I think the biggest source of worry is people reasoning naively about uninitialized as "initialized to an arbitrary bit string" and not even checking), then we can already build such an intrinsic today; it just has to do something that all LLVM optimizations have to assume could initialize the memory (e.g., some inline asm). This will have some unfortunate impact on optimizations unrelated to uninitialized memory, but it will still be localized to uses of that intrinsic.

I think the biggest source of worry is people reasoning naively about uninitialized as "initialized to an arbitrary bit string" and not even checking

Fair, but I see no model that makes this work.

This is actually also a reason why I'd prefer to not allow uninitialized data in integers -- that may be more surprising, but it is easier to explain and very concise: "No uninitialized data outside MaybeUninit".

If we allow uninitialized bits but then almost all operations are UB, it becomes something more like "No uninitialized data outside MaybeUninit, except if it is an integer or a raw pointer and you are not actually looking at the data, just loading and storing" and then we still have to explain that x * 0 is also UB even though it doesn't really have to "look at" x -- and then people will be "screw this, that's too complicated, I will just do whatever and write some tests".

Basically, we are breaking expectations anyway, and maybe it is better to break them more but in simpler ways, than to figure out how to break them as little as possible while still breaking them in ways that are much more complicated to explain. That seems plausible to me. Not sure if it makes any sense.^^

CAD97 commented

If uninitialized bits in an integer are made instantly invalid, is it possible to do the crossbeam::AtomicCell trick of making an atomic (u8, u16) by treating it as AtomicU32 at all? Is it possible to do that minus the CAS operation?

It seems like it should be possible to use an atomic (u8, u16) on a platform supporting 32 bit atomics. The only way I can see without requiring valid(T) \in initialized(u32) (and thus a very unsafe AtomicCell) is to provide some sort of mem::initialize_padding(&T).

I'm personally on the side of making uninitialized integers invalid so long as we don't lose anything (but mem::uninitialized) for it.

The case of AtomicCell<(u8, u16)> does indeed depend on UB at the moment. C++ solved this by essentially making it the responsibility of std::atomic<T> to magically ignore padding bytes (see this page, at the bottom). Note that this magic only works for structs: compare_exchange_strong is not guaranteed to ever converge with unions.

I've been toying with the idea of a clear_padding_bytes intrinsic that would support this use case, as well as things like writing a struct to a file without leaking information through the padding bytes:

/// Writes 0 to all padding bytes in `val` and returns a mutable slice of the bytes in `val`.
fn clear_padding_bytes<T>(val: &mut T) -> &mut [u8];
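For one concrete type, the effect of such an intrinsic can be hand-rolled today (the generic version would need compiler knowledge of where the padding holes are): copy the fields into a zeroed buffer, so the padding byte is deterministically 0 and bitwise comparison behaves.

```rust
// A hand-rolled, per-type stand-in for the proposed `clear_padding_bytes`.
#[repr(C)]
struct Pair {
    a: u8,  // offset 0
    b: u16, // offset 2; one padding byte sits at offset 1
}

fn pack(v: &Pair) -> u32 {
    let mut bytes = [0u8; 4]; // the padding byte stays 0
    bytes[0] = v.a;
    bytes[2..4].copy_from_slice(&v.b.to_ne_bytes());
    u32::from_ne_bytes(bytes)
}

fn main() {
    // Semantically equal values now compare equal bit-for-bit, which is
    // what a compare_exchange on the underlying u32 needs.
    let a = pack(&Pair { a: 1, b: 2 });
    let b = pack(&Pair { a: 1, b: 2 });
    assert_eq!(a, b);
}
```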

If uninitialized bits in an integer are made instantly invalid, is it possible to do the crossbeam::AtomicCell trick of making an atomic (u8, u16) by treating it as AtomicU32 at all? Is it possible to do that minus the CAS operation?

No.

Notice though that the state of this trick (even the load/store part) is dubious in LLVM as well -- poison is infecting the entire value when just one of the bytes loaded from memory is poisoned, and there are proposals to replace undef by poison.

C++ solved this by essentially making it the responsibility of std::atomic to magically ignore padding bytes

Well, "solved". As usual with C++, it is somewhat unclear what this actually means in an operational way. I would not call this a solution.
One possible solution is to say that values get frozen before being compared; that at least removes the UB -- but it doesn't guarantee that a compare-exchange loop will ever terminate, as the values could get frozen to a different bit pattern each time. Or maybe they actually get frozen in memory, but that conflicts with tons of real optimizations.

Writes 0 to all padding bytes in val and returns a mutable slice of the bytes in val.

I don't think this is implementable -- at run-time there is no way to distinguish padding bytes from initialized bytes. But a version of this which just picks arbitrary bit patterns for all uninitialized and padding bytes would be implementable, that's exactly what freeze does.

I don't think this is implementable -- at run-time there is no way to distinguish padding bytes from initialized bytes.

I don't see what the problem is? The compiler knows the layout of type T and therefore knows where all the padding holes are.

Well, "solved". As usual with C++, it is somewhat unclear what this actually means in an operational way. I would not call this a solution.

In an operational sense this would mean setting the padding bytes to 0 on all input values to an atomic operation. This will cause padding bytes to be "ignored" by the compare_exchange hardware instruction since they will always be 0.

The compiler knows the layout of type T and therefore knows where all the padding holes are.

Oh, you mean statically -- well, for enums that'll require dynamic checks as well. And of course this is hopeless for unions.

In an operational sense this would mean setting the padding bytes to 0 on all input values to an atomic operation.

That would also guarantee that the value you read has 0 for all padding bytes, which I am fairly sure they do not want to guarantee. And anyway I see no way to implement this behavior even remotely efficiently.

I think a more reasonable operational version amounts to saying that you freeze both values before comparing -- that at least avoids comparing uninitialized bytes, and it makes compiling to a simple CAS correct. But it allows spurious comparison failures and makes no guarantees that your retrying CAS loop will ever succeed, because you could see a different frozen value each time around the loop.

Also this assumes the atomic operation knows the correct type, whereas AFAIK for LLVM atomic operations only work on integer types -- at which point you cannot know which bytes are padding.

But it allows spurious comparison failures and makes no guarantees that your retrying CAS loop will ever succeed, because you could see a different frozen value each time around the loop.

I believe spurious comparison failures can be fixed like so:

// Assume `T` can be transmuted into `usize`.
fn compare_and_swap(&self, current: T, new: T) -> T {
    // Freeze `current` and `new` and transmute them into `usize`s.
    let mut current: usize = freeze_and_transmute(current);
    let new: usize = freeze_and_transmute(new);

    loop {
        unsafe {
            // `previous` is already frozen because we only store frozen values into `inner`.
            let previous = self.inner.compare_and_swap(current, new, SeqCst);

            // If `previous` and `current` are byte-equal, then CAS succeeded.
            if previous == current {
                return transmute(previous);
            }

            // If `previous` and `current` are semantically equal, but differ in uninitialized bits...
            let previous_t: T = transmute(previous);
            let current_t: T = transmute(current);
            if previous_t == current_t {
                // Then try again, but substitute `current` for `previous`.
                current = previous;
                continue;
            }

            // Otherwise, CAS failed and we return `previous`.
            return transmute(previous);
        }
    }
}

Now it's still possible to have a spurious failure in the first iteration of the loop, but the second one will definitely succeed (unless the atomic was concurrently modified). In fact, this is exactly how CAS in AtomicCell<T> already works, except we don't have a way to freeze values.

previous is already frozen because we only store frozen values into inner.

I think this is the key invariant here -- if you can make that work, then yes I can think there is a consistent semantics here.
This requires making sure the contents are initialized, and also freezing before writing in store (and any other modifying instruction).

However, notice that you have AtomicCell::get_mut, so I do not believe it is possible to maintain the invariant that inner will always be frozen.

We seem to be trying to answer two different questions here that are entangled.

One is whether integers and such can be uninitialized or not.

The other one, which is most fundamental, is whether we want to support using uninitialized memory outside unions or not, in general. I think we should answer this question first, and use the answer to drive the rest of the design.

We can't think of allowing uninitialized memory on integers without also considering that the validity of integers and raw pointers is going to be alike, so we are also allowing uninitialized memory on raw pointers. Raw pointers can point to DSTs, so we also need to be thinking whether we want to allow uninitialized memory in pointers to DSTs (for the whole pointer, some part of it, etc.).

I agree with @nikomatsakis that a lot of code is using mem::uninitialized today and we should weigh the impact of that. I don't share their reluctance (yet), because I think deprecating mem::uninitialized doesn't break the world (the compiler and tools are going to be the same as the day before the deprecation). People's code will continue working "as is", and they will have time to upgrade. I don't know what the cost of the upgrade will be, but reasoning about mem::uninitialized is hard, and the rule "no uninitialized memory outside unions" turns something that is hard into something that's relatively easy. It also potentially simplifies our "rules" for all other types (e.g. raw pointers, integers, etc.), and there is value in that.

The issue that some algorithms require uninitialized memory has shown up. The question that hasn't been answered yet AFAICT is whether MaybeUninit can be used to implement those algorithms or not. If it can't, then IMO that might hint at a design flaw in MaybeUninit (or maybe we need more helper types, e.g., AtomicMaybeUninit or similar), but that doesn't necessarily imply that banning uninitialized memory outside unions is a bad choice.

However, notice that you have AtomicCell::get_mut, so I do not believe it is possible to maintain the invariant that inner will always be frozen.

Just to clarify for everyone else following this thread: @RalfJung and I discussed this and the conclusion was that we'll simply remove get_mut() if that is necessary to make AtomicCell sound.

About whether uninitialized data is okay in integers at all: we discussed this at the all-hands.

The general consensus seems to be that we should permit uninitialized data in integers and raw pointers. There is just too much existing code doing stuff like

let x: [u8; 256] = mem::uninitialized();
// go on

or

let x: SomeFfiStruct = mem::uninitialized();
// go on

Both of these patterns would be insta-UB if we disallow uninitialized integers. That doesn't seem worth the benefit of better error-checking with Miri.

Incidentally, that matches what Miri already currently implements, mostly for pragmatic reasons (libstd already violated the rules about uninitialized integers when I wrote the checks -- I think I have since moved it to MaybeUninit).
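For reference, the migration target for the first of those two patterns is MaybeUninit (the FFI-struct case is analogous). A sketch of the byte-buffer version:

```rust
use std::mem::MaybeUninit;

fn main() {
    // Replacement for `let x: [u8; 256] = mem::uninitialized();`:
    // MaybeUninit<u8> is allowed to hold uninitialized bytes.
    let mut buf: [MaybeUninit<u8>; 256] = [MaybeUninit::uninit(); 256];
    for (i, b) in buf.iter_mut().enumerate() {
        b.write(i as u8);
    }
    // Every element has been written, so viewing the array as [u8; 256]
    // is sound.
    let buf: [u8; 256] = unsafe { std::mem::transmute(buf) };
    assert_eq!(buf[255], 255);
}
```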

Both of these patterns would be insta-UB if we disallow uninitialized integers. That doesn't seem worth the benefit of better error-checking with Miri.

Was it also discussed whether this was worth the benefit of a simpler and more teachable correctness model for unsafe code ? (independently of whether this model can be better checked with miri or not?).

I think I have since moved it to MaybeUninit.

I think so too.

Is "integers must be initialized" really that much simpler than "integers are allowed to not be initialized"?

Is "integers must be initialized" really that much simpler than "integers are allowed to not be initialized"?

No, but I do think that "uninitialized memory is only allowed inside unions" is much much simpler and teachable than all other alternatives that are currently being discussed.

CAD97 commented

Another (drastic?) option to consider that allows let x: u32 = mem::uninitialized();:

Disallow uninitialized (poison) bits in integers (and pointers?). Make mem::uninitialized return freeze(poison). This makes mem::uninitialized behave as most initially intuit (a constant nondeterministic bit string) and gains us the "(true) uninitialized memory only inside unions (or padding)" property.

This may degrade some performance around uses of mem::uninitialized, but the intent is to move those over to mem::MaybeUninit anyway, right? And as previously mentioned, mem::uninitialized is a very dangerous tool to start with.

Also, it could muddy the term "uninitialized memory", because it then introduces both the "frozen uninitialized" behind mem::uninitialized and the "poison uninitialized" behind mem::MaybeUninit and padding, with both being called "uninitialized" in the language. Most of that could be won back by just documenting it on the deprecated, big-fat-warning-label mem::uninitialized, though. Example:

mem::uninitialized

Deprecated, do not use. Use mem::MaybeUninit for its clearer semantics instead.

For safety reasons, this function now returns "frozen" memory -- memory set to
nondeterministic but consistent bits -- rather than truly uninitialized memory.
Otherwise, merely storing the returned value in a variable would immediately
break the validity invariants of any type except mem::MaybeUninit.

@CAD97 Good idea! Basically, once we have rust-lang/rust#58363, we could reimplement mem::uninitialized in terms of MaybeUninit::uninitialized and then call ptr::freeze. That would make all the existing code that creates uninitialized integer arrays or FFI structs valid even if we decide that integers must be properly initialized. (I guess we could say we require them to be "frozen".)

@Amanieu suggested we should allow uninitialized values in integers only for "backwards compatibility reasons". This would have a similar effect. It might, however, incur a performance cost on such existing code. Porting code to MaybeUninit enables it to use "unfrozen memory" and gain back the performance.

However, I suspect people will still want to keep memory unfrozen when e.g. calling a known Read::read implementation. That would mean that if we disallow uninitialized integers, we have to make the validity invariant of references shallow. (I am in favor of that anyway, but this is a contentious point.)

Good idea! Basically, once we have rust-lang/rust#58363, we could reimplement mem::uninitialized in terms of MaybeUninit::initialized and then call ptr::freeze.

We should definitely do this.

That would mean that if we disallow uninitialized integers, we have to make the validity invariant of references shallow. (I am in favor of that anyway, but this is a contentious point.)

+1.

There is just too much existing code doing stuff like

let x: [u8; 256] = mem::uninitialized();

The freeze fix makes this "work" for Rust 2015 and Rust 2018, but unless I missed something the plan is still to deprecate mem::uninitialized. If that's the case, we could prevent Rust 2021 crates from accessing it (even though the std lib still needs to ship it for inter-edition compatibility).

So I find the argument that "integers and pointers shall support uninitialized bit-patterns because there is too much code using mem::uninitialized()" weak. In Rust < 2021 uninitialized would be changed to freeze, and in Rust >= 2021 there could be no mem::uninitialized.

CAD97 commented

What about let x: [u8; mem::size_of::<T>()] = mem::transmute(t);? That's the other way to easily accidentally put padding bytes in an integer.

Do we care about that use pattern? (What's the sound equivalent if we don't allow this? [MaybeUninit<u8>; mem::size_of::<T>()]? The sooner MaybeUninit stabilizes the better.)
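The `[MaybeUninit<u8>; mem::size_of::<T>()]` form needs const generics, but for a concrete padded type the sound version of that transmute can be sketched today (`Padded` is an illustrative type, not from the thread):

```rust
use std::mem::{self, MaybeUninit};

#[repr(C)]
struct Padded {
    a: u8,  // offset 0
    b: u16, // offset 2; one padding byte at offset 1
}

fn main() {
    let t = Padded { a: 1, b: 2 };
    // Sound counterpart of `mem::transmute::<_, [u8; 4]>(t)`:
    // MaybeUninit<u8> tolerates the uninitialized padding byte.
    let bytes: [MaybeUninit<u8>; 4] = unsafe { mem::transmute(t) };
    // The bytes that were initialized can still be read individually.
    assert_eq!(unsafe { bytes[0].assume_init() }, 1);
}
```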

As another example for a use-case for moving around (references to) uninitialized integers: https://github.com/tbu-/buffer

The longer this goes on the more I get convinced that if we can avoid making things UB, we should. And if we can make something a requirement only "when the data actually gets used", we should (see #133 for a similar problem in a totally different context). Both of these point towards not making uninitialized bits in an integer insta-UB.

In particular, when a disappointed unsafe code author asks "but why oh WHY did you make this UB", I think it is good to be able to point to something. And for ruling out integers with uninitialized bits in them, we can't point at a lot.

From what I remember, we have "The formal semantics are nicer" and "Some sanitizers get more effective". I expect many people will not be impressed... worst-case, they'll just say "well but it works in practice so I will just ignore your UB" (and to my knowledge, it will work in practice, even if we rule it out).

I think at this point I'd like to see the wording for the "integers with uninitialized bits are not UB" case.

The main advantage I see of making integers with uninitialized bits UB, is that miri and similar tools will point users at the source of the UB: where did that integer become uninitialized.

Without this, we have to define what users are allowed to do with an uninitialized integer and what they are not allowed to do. A tool will trivially report UB when the operations that are not allowed happen, but tracking and pinpointing the place in the user's source code that made that integer UB can be tricky.

they'll just say "well but it works in practice so I will just ignore your UB" (and to my knowledge, it will work in practice, even if we rule it out).

A huge downside of making uninitialized integers UB is that there is no way for us to, e.g., diagnose that in debug builds. We can't insert debug_assert!s or similar to check for that.

Wait, I think this is wrong:

The main advantage I see of making integers with uninitialized bits UB, is that miri and similar tools will point users at the source of the UB: where did that integer become uninitialized.

How would miri diagnose, e.g.,

let mut v = Vec::<u8>::with_capacity(100);
unsafe { v.set_len(100) } // UB

If uninitialized integers are UB, then the Vec::set_len invokes UB because it creates 100 uninitialized integers, but it does so by changing the Vec's len field, so how would miri diagnose that? That appears impossibly hard.

My understanding is that miri wouldn't check validity of the Vec's elements in that code snippet (regardless of whether in that specific case that includes checking for uninitialized-ness). It might when coercing the Vec to a slice (depending on how we define validity for references), but certainly not just from changing the len field.

That's a bit of an interesting side question: Vec::with_capacity calls RawVec::with_capacity, which calls alloc on the allocator that the RawVec is using. And the returned pointer from alloc "may or may not have its contents initialized.", so isn't with_capacity at fault there? If there's any fault to be had at all, that's it.

@rkruppe

My understanding is that miri wouldn't check validity of the Vec's elements in that code snippet (regardless of whether in that specific case that includes checking for uninitialized-ness). It might when coercing the Vec to a slice (depending on how we define validity for references), but certainly not just from changing the len field.

So IIUC, set_len doesn't create any values. When the *mut T inside the vector is dereferenced is when the values are created (e.g. that would happen if one accesses the vector by index: v[0]). So if uninitialized integers are not ok, dereferencing the pointer would trigger the UB. If uninitialized integers are ok, dereferencing the pointer would not trigger the UB, but then taking a reference to the value would.

I think at this point I'd like to see the wording for the "integers with uninitialized bits are not UB" case.

One of my points is that the burden should generally be on whoever wants more UB to justify their desire. ;) The default should be to err on the side of less UB, all else being equal.

That said, my case is basically: the benefits are small, but there is a real cost.

On the benefit side, we don't get any new optimizations. We get slightly better "blame tracking" for when an incorrect use of an uninitialized integer does cause UB, but only with a very expensive analysis (valgrind/Miri). We also get slightly simpler formal rules. Don't get me wrong, I love simpler formal rules, but they don't outweigh any possible other consideration. ;)

On the cost side, we have the problem that many programmers have the intuition, for better or worse, that "any bit pattern is allowed in an integer", and hence you can e.g. transmute any 8-byte-sized datum to u64 for purpose of atomic accesses or serialization or whatever. Where possible, there is a lot of benefit in aligning our rules with programmer's intuition. Moreover, it's also useful to be able to use integers for small fixed-size chunks of arbitrary data: the alternatives are either unergonomic (MaybeUninit<[u8; 8]> or so) or non-existent (for atomic memory accesses). And there is also the cost of people basically giving up on UB and considering it a "purely theoretical concept" if we tell them that yes, their code has UB, but no, there's no way that will break their code (because no optimizations actually exploit it).

Also see Niko's case here.


Without this, we have to define what are users allowed to do with an uninitialized integer and what they are not allowed to do.

Literally anything except for a memcpy. Which in terms of MIR means: any cast, as well as applying any unary or binary operator to it, and using it as an index in a slice/array access.
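(To illustrate the "memcpy is fine, everything else is not" rule with stable APIs: copying uninitialized bytes around at type `MaybeUninit<u8>` is always allowed; what would be UB under the rule discussed here is asserting them initialized and then operating on them. A hedged sketch:)

```rust
use std::mem::MaybeUninit;

fn main() {
    let src: [MaybeUninit<u8>; 4] = [MaybeUninit::uninit(); 4];
    // A plain copy (a memcpy) of uninitialized bytes: always fine.
    let dst = src;
    // What is *not* fine: `unsafe { dst[0].assume_init() } + 2`
    // would create an uninit u8 and then do arithmetic on it -- UB.
    assert_eq!(dst.len(), 4);
}
```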

A tool will trivially report UB when the operations that are not allowed happen, but tracking and pinpointing the place in the user's source code that made that integer UB can be tricky.

You mean "made the integer undef", I assume?
So basically, you are saying that "blaming" whoever is responsible for creating the "bad" integer is harder when we only disallow it once an actual operation happened? I agree. That's what I meant above when I referred to "more effective sanitizers".

If uninitialized integers are UB, then the Vec::set_len invokes UB because it creates 100 uninitialized integers, but it does so by changing the Vec's len field, so how would miri diagnose that? That appears impossibly hard.

Uh... I have no idea how you got there. Our memory is untyped. Only operations are typed. When you change the length of the vector, that's the only thing that happens. When you actually access the vector, that's when you "create an integer" (you are loading a range of bytes from memory at integer type).

The following code does not have UB:

let v: Vec<!> = Vec::new();
v.set_len(100);
mem::forget(v);

So IIUC, set_len doesn't create any values. When the *mut T inside the vector is dereferenced is when the values are created (e.g. that would happen if one accesses the vector by index: v[0]). So if uninitialized integers are not ok, dereferencing the pointer would trigger the UB. If uninitialized integers are ok, dereferencing the pointer would not trigger the UB, but then taking a reference to the value would.

Exactly. :) Except for the "taking a reference" part. If uninitialized integers are valid, then certainly references to uninitialized integers are, too.

But then adding 2 to that integer is UB.

Exactly. :) Except for the "taking a reference" part. If uninitialized integers are valid, then certainly references to uninitialized integers are, too.

Yes, that's wrong, @rkruppe raised it on Discord. I was thinking "reference to uninitialized is invalid", but that's not correct if uninitialized is valid.

Looks like the widely-used bytes crate also deals with references to uninitialized integers, which would be made legal by deciding uninitialized integers are legal (taking some load off the reference discussion). [Source]

However, another thing I realized is that if we allow uninitialized integers, then calling this the "initialization invariant" becomes... weird.^^

Given that on Itanium, "uninitialized" (NaT, not-a-thing) registers can cause traps and whatnot, I wonder if saying that integers have to be initialized would be necessary and/or sufficient to make Rust compatible with hardware that can trap when using uninitialized registers the wrong way.

MaybeUninit<u*> can still hold uninitialized data, so it is not clear that disallowing this only for integers helps. And conversely, I suppose the compiler could make sure that the program does not ever see a "NaT" on Itanium, though I am not sure.

If we ever have to care about an Itanium-like design (I really hope we won't), I don't think the resolution of this question will have any impact on that. As you note, "uninitialized data" is definitely allowed from other sources (and LLVM declares it legal to store undef/poison to memory, too) and so "uninitalized data" must not be NaT in any case. From the other direction, if necessary it should be possible to generate code that initialized possibly-NaT registers where necessary, and hopefully the costs of this or other workarounds will persuade architects that this is a misfeature.

Since there appears to be consensus that uninitialized integer and floating-point values should be valid, I'm marking this as writeup-needed.

Ideally, the first step of the write up would be the rationale section. There has been a lot of discussion here and on internals, so we need to go through all of it, collecting pros / cons for allowing / not allowing valid uninitialized integers, and summarize those up.

Ideally we'll post such a summary here first so that everybody can see it and make sure that we do have strong consensus about uninitialized integers being valid.

Afterwards we can write down the value representation relation for these types.

One thing I am not sure we talked about: if we allow uninitialized bytes in integers, do we specify them to preserve which bytes are uninitialized? Like, when reading a u32 of which 3 bytes are initialized but 1 is not, and then storing it elsewhere, are the same 3 bytes initialized now or are all bytes uninitialized?

Some proposals for poison in LLVM involve making the entire integer poison if any input byte is poison; that would force us to say that the entire integer is uninitialized if any input byte is.
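(For concreteness: per-byte initialization tracking is observable today as long as the data stays at type `MaybeUninit<u8>`; it is only a typed `u32` load of those bytes that raises the all-or-nothing question. A sketch:)

```rust
use std::mem::MaybeUninit;

fn main() {
    let buf: [MaybeUninit<u8>; 4] = [
        MaybeUninit::new(1),
        MaybeUninit::uninit(), // byte 1 stays uninitialized
        MaybeUninit::new(3),
        MaybeUninit::new(4),
    ];
    // Copying at this type preserves which bytes are initialized...
    let copy = buf;
    // ...so the initialized bytes can still be read back.
    let (b0, b2, b3) = unsafe {
        (copy[0].assume_init(), copy[2].assume_init(), copy[3].assume_init())
    };
    assert_eq!((b0, b2, b3), (1, 3, 4));
    // Loading all 4 bytes at type u32 is where the question in this
    // thread arises: does the whole value then become uninit?
}
```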

(EDIT: stupid GH decided to submit my post)

Good question. I don't know, but one consequence is that, if a single uninitialized byte makes the whole integer uninitialized, the values an integer can take are 0..=int::MAX plus uninitialized. If it doesn't, what would happen? For a given fully-initialized integer, would the different combinations of uninitialized bytes of it that one could construct be distinct values?

Is there any reason not to go with "if any of the bytes in memory is uninit, the whole integer value read out is uninit" for now? It seems strictly more conservative (in particular re: LLVM poison semantics), and since programs can't test whether something is uninit (other than by causing UB if it is), I expect it would be backwards compatible to move to partially-uninit-integers later.

For a given fully-initialized integer, would the different combinations of uninitialized bytes of it that one could construct be distinct values?

They would have to, to preserve that information, yes.

Is there any reason not to go with "if any of the bytes in memory is uninit, the whole integer value read out is uninit" for now? It seems strictly more conservative (in particular re: LLVM poison semantics), and since programs can't test whether something is uninit (other than by causing UB if it is), I expect it would be backwards compatible to move to partially-uninit-integers later.

Agreed. I just wanted to bring it up.

Would this mean that a) creating a &mut [u8] from uninitialized memory would be fine, and b) even reading data from there would be ok? That would seem rather unfortunate as that would mean that safe Rust code would not always be valgrind-clean anymore.

I'm thinking of the case of passing an uninitialized buffer to Read::read() or AsyncRead for optimization purposes, where the implementer of the traits could do anything it wants with the passed in buffer, including to read from it.

Would this mean that a) creating a &mut [u8] from uninitialized memory would be fine, and b) even reading data from there would be ok? That would seem rather unfortunate as that would mean that safe Rust code would not always be valgrind-clean anymore.

Since "reading" means different things to different people, please give an example. :)

But notice that this is possible in safe Rust already, and might be considered "reading uninit memory" (I don't know what valgrind has to say about this):

let array = [MaybeUninit::<u8>::uninit(); 128];
let elem = array[42];

I'm thinking of the case of passing an uninitialized buffer to Read::read() or AsyncRead for optimization purposes, where the implementer of the traits could do anything it wants with the passed in buffer, including to read from it.

That is unsound and will be unsound under any option discussed here. To make this sound, we need something like "freeze". See rust-lang/rust#58363.
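(Until something like freeze, or a dedicated buffer abstraction, exists, the sound way to hand a caller-provided buffer to `Read::read` is to initialize it first, e.g. by zeroing. A sketch; the buffer size here is an arbitrary choice:)

```rust
use std::io::Read;

fn read_some<R: Read>(r: &mut R) -> std::io::Result<Vec<u8>> {
    // Zero-initialize instead of passing uninitialized memory:
    // `read` receives a fully valid &mut [u8].
    let mut buf = vec![0u8; 64];
    let n = r.read(&mut buf)?;
    buf.truncate(n);
    Ok(buf)
}

fn main() -> std::io::Result<()> {
    // &[u8] implements Read, so a byte slice works as a test reader.
    let mut src: &[u8] = b"hello";
    assert_eq!(read_some(&mut src)?, b"hello");
    Ok(())
}
```

The zeroing has a cost, which is exactly why people want to pass uninitialized buffers in the first place; but without freeze it is the price of soundness.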

Since "reading" means different things to different people, please give an example. :)

Actually reading the data behind the uninitialized memory :)

let array = [MaybeUninit::<u8>::uninit(); 128];
let elem = &*(array[42].as_ptr());
println!("{}", elem);

As long as it's passed around as a MaybeUninit<u8> I assume it's all fine, but creating a &u8 from it (or a &[u8] from the whole array) is not fine anymore unless it's pointing at initialized data? Or is that also still fine as long as no dereferencing happens (I hope not)?

But notice that this is possible in safe Rust already, and might be considered "reading uninit memory" (I don't know what valgrind has to say about this):

valgrind is fine with that until we do something like ptr::read() or what my example above does.

(I don't know what valgrind has to say about this)

Valgrind only complains if the program performs a conditional branch which depends on uninitialized data.

Actually reading the data behind the uninitialized memory :)

Okay so you mean actually performing arithmetic/logic operations on it, not just reading it, stuffing it into a u8 and never using that again. I mean, memcpy is arguably reading, right...?

As long as it's passed around as a MaybeUninit<u8> I assume it's all fine, but creating a &u8 from it (or a &[u8] from the whole array) is not fine anymore unless it's pointing at initialized data? Or is that also still fine as long as no dereferencing happens (I hope not)?

Right now, conservatively, creating a &[u8] to uninit memory is UB.

The result of the discussion in this issue was that most likely this should not be UB.

However, either way, your code is UB. There are no plans to allow code to do anything with uninit integers besides loading and storing them from/in memory. This is where freeze comes in -- but that's a whole separate discussion, not fitting into this issue. (But yes, with "freeze" one could likely write safe Rust code that valgrind complains about. That is one of its many drawbacks. It also has some big advantages though.)

However, either way, your code is UB.

Great, thanks a lot for the clarification. That's exactly what I would expect!

Right now, conservatively, creating a &[u8] to uninit memory is UB.

This SHOULD be changed.
There is nothing wrong with an uninit & or &[T] as long as the user doesn't read it or attempt a mutation that would lead to Drop.
It SHOULD be UB only when you read uninit memory.
Creating a reference or slice should never result in UB.

A reference is always allowed to be read from. No exceptions! In fact we explicitly tell LLVM this using dereferenceable. LLVM is thus always allowed to insert reads to &[u8].

@bjorn3 doesn't matter whether it is allowed to read or not.
What matters that &T creation is fine even if it points to uninit storage.
UB should happen if you read from such &T

@DoumanAsh that discussion is happening in #76 and #77, let's keep this issue on topic.

I would like to backpedal a bit on my previous agreement with "uninitialized integers are fine" that I raised in this thread. Yeah, sorry...

My concern is related to what I already mentioned in the OP:

when an integer is partially uninitialized, does that "infect" the entire integer or do we exactly preserve which byte is initialized?

Let's assume we say an uninitialized u32 is okay. Then what would people expect the following program to do:

let zero = MaybeUninit::new(0u8);
let uninit = MaybeUninit::<u8>::uninit();
let x: u32 = mem::transmute([zero, uninit, zero, zero]);
let y: [MaybeUninit<u8>; 4] = mem::transmute(x);
println!("{}", y[0].assume_init());

Currently in Miri, this is UB as y[0] is not initialized. The reason for this is that an integer must be either fully initialized or it is considered not initialized at all, so when x gets created from partially initialized data, it becomes fully uninitialized. This actually (to my knowledge) matches LLVM's treatment of poison: when you do an i32 load of 4 bytes of memory and at least one byte is poison, the result is a poison integer. So fixing this would require changing more than just Rust documentation and Miri.

I am also not sure that conceptually, we really want integers to be able to hold partially initialized data. At that point, the Rust abstract domain of "values that can be stored in a u32" really has nothing to do with mathematical integers modulo 2^32 any more, and becomes just "4 Rust Abstract Machine Bytes with all their weird extra states". That's not very... clean. When integers can just be fully uninitialized, at least the abstract domain is roughly Option<MathematicalIntMod2Pow32>, which is not quite so awful.

But then if we are willing to accept that an integer cannot be partially initialized, saying "integers can be uninitialized" could be really confusing, and maybe we are better off just teaching people to never use integers for (partially or entirely) uninitialized data, period.

We could give privileged status to u8.

But uninit seems of low value to me, so to me just saying "use MaybeUninit that's what it's there for" is also fine.

Is there anyone who still strongly wants to have uninit integers, and also do any such people strongly want uninit integers other than u8?

I do think it would be confusing if it was OK to copy uninit memory byte-by-byte, but copying it word-by-word was UB (requiring that you use MaybeUninit).

Actually that's a good point

Another possible argument for disallowing uninit/poison in integers would be rust-lang/rust#74378: it would let us use the noundef attribute in more places.

Of course, that raises the question of how beneficial that attribute really is.

I do think it would be confusing if it was OK to copy uninit memory byte-by-byte, but copying it word-by-word was UB (requiring that you use MaybeUninit).

Would it really tho'? My impression based on what I've read so far is that the idea is that the Rust Abstract Machine's memory array is untyped, but accessing it IS typed, so that when you copy data out of the RAM's RAM using u8s you are making a different type assertion. It is potentially harder to teach at first, but if it's more correct, it could be easier in the long run.

CAD97 commented

I believe @Diggsey is referring to if it would be ok to memcpy uninitialized memory using u8 but not usize.

Of course it would be allowed with MaybeUninit<u8> and MaybeUninit<usize>; the potential confusion mentioned there would be if it were allowed to do so with u8 but not with usize. The more correct way of requiring MaybeUninit is the less confusing way, compared to the more potentially confusing world with a privileged u8.

I think the current RAM proposal is only harder to teach to some portion of C++ devs (and possibly other langs) who have the pre-conceived notion of memory being typed at all.

All of memory being a bag of bytes, with only accesses being typed, is much easier to think about and to teach to beginners.

Oh, no then, I might have gotten a bit confused here? My assertion is mostly

0 1 2 3
5 3 4 UN

accessing 0-2 as u8s is fine.
accessing 0-3 as a u32 is obviously UB
accessing 3 as a u8 is also UB.
I would find this pretty teachable.

None of the proposals on the table meaningfully differentiate u8 and usize. They should certainly follow the exact same rules, no strange C-style exception for our 1-byte type.

@workingjubilee that is exactly what I am proposing when I say "integers must be fully initialized".

Currently in Miri, this is UB as y[0] is not initialized.

I assume it'd be fine if, instead of going through u32, it went through MaybeUninit<u32> or Simd<u8, 4>?

(Hmm, this also makes me wonder whether it's ok to go u32 -> MaybeUninit<(u8, u16)> -> u32...)

I assume it'd be fine if, instead of going through u32, it went through MaybeUninit<u32> or Simd<u8, 4>?

For the latter, yes. (Well, you are creating an uninitialized u8 which is wrong, but assuming we permit uninitialized integers that is not a problem.) For the former, Miri complains but that is a bug (rust-lang/rust#69488).

(Hmm, this also makes me wonder whether it's ok to go u32 -> MaybeUninit<(u8, u16)> -> u32...)

Ah... that's another huge discussion, see #156 and #73. The summary is that I'd prefer it to be okay, but the way real calling conventions work, if we want MaybeUninit to be repr(transparent) (and we do, for perf reasons), we probably have to declare that MaybeUninit<T> does not preserve values stored in the padding of T.

None of the proposals on the table meaningfully differentiate u8 and usize. They should certainly follow the exact same rules, no strange C-style exception for our 1-byte type.
that is exactly what I am proposing when I say "integers must be fully initialized".

Ah, OK. Then now I understand and agree.

It seems like the remaining catch with resolving towards poison is that there's a strong desire to have the ability to take a given type, turn it into a bitfield, shove that bitfield inside a uint type, and then upon extracting it, to deliberately ignore an unused part of the bitfield for e.g. alignment purposes, akin to making something like this valid:

    let x = [5, 3, U, 4];
    let y = [5, 3, U, 4];
    
    unsafe {
        assert_eq!(
            std::mem::transmute::<[u8; 4], u32>(x) & 0xFF00FFFF, // lol little-endian
            std::mem::transmute::<[u8; 4], u32>(y) & 0xFF00FFFF,
        );
    }

which is basically a desire to specify that a given uint is now going to be treated as a byte array and that specific precise accesses against the byte array are valid and not UB, so actually a bit more like

    let x = [5, 3, U, 4];
    let y = [5, 3, U, 4];
    unsafe /* maybe? */ {
      let [x0, x1, _, x3] = x;
      let [y0, y1, _, y3] = y;
      assert_eq!([x0, x1, x3], [y0, y1, y3]);
    }
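(A runnable version of the sketch above, under the "keep it as bytes" approach: if the data stays at type `MaybeUninit<u8>`, the comparison over only the initialized positions is well-defined today, no transmute-to-u32 needed:)

```rust
use std::mem::MaybeUninit;

fn main() {
    let u = MaybeUninit::<u8>::uninit();
    let x = [MaybeUninit::new(5u8), MaybeUninit::new(3), u, MaybeUninit::new(4)];
    let y = [MaybeUninit::new(5u8), MaybeUninit::new(3), u, MaybeUninit::new(4)];
    unsafe {
        // Only ever assert the bytes we know are initialized.
        let (x0, x1, x3) = (x[0].assume_init(), x[1].assume_init(), x[3].assume_init());
        let (y0, y1, y3) = (y[0].assume_init(), y[1].assume_init(), y[3].assume_init());
        assert_eq!([x0, x1, x3], [y0, y1, y3]);
    }
}
```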

Given that this is an explicit desire in the cases where it arises, it seems like making this explicitly expressible in those cases could be done without pessimizing the general case.

Well, this line is already UB: let x = [5, 3, U, 4];. You have an uninitialized u8 here.

I am not sure what the best way would be to express that pattern, or how common it really is to do bit-level operations with partially initialized data. As far as I know, anything like this is also UB in C.

Qwaz commented

I think the complexity of corner cases that arise by allowing uninitialized values outweighs the gain. There were great comments explaining the trickiness of undefinedness propagation. Another thing that I would like to bring up is the location of unsafe keyword for uninitialized values.

  • It is safe to construct a raw pointer, but it is unsafe to dereference them.
  • It is safe to call MaybeUninit::uninit(), but it is unsafe to call MaybeUninit::assume_init().
  • It is unsafe to create uninitialized value, but it is safe to call arithmetic operations on them.

Not only is the location of the unsafe keyword the opposite of other language constructs, it's also bad in terms of locality and reasoning. When handling raw pointers, programmers can check whether all invariants are fulfilled before dereferencing them. However, in the case of uninitialized values, the unsafe block guarantees up front that all future uses of this value will not cause undefined behavior. This looks too error-prone, and allowing uninitialized values behind numeric types doesn't seem to match well with other parts of unsafe Rust's design.

nikic commented

I'd like to give this topic a bump. For correctness reasons, LLVM is increasingly blocking or pessimizing optimizations on values that aren't known to be well-defined. As such, this bullet point...

There does not seem to be a reasonable optimization that is broken by this.

...is no longer true.

With the LLVM 14 release, clang will mark ~all parameters and return values as noundef, and you can expect that LLVM will take less care about preserving optimization quality in the absence of noundef now.

Ah, good point, I hadn't realized this connection yet.

Out of curiosity, does "all" include char? If yes that would make it impossible to implement memcpy in C which, uh, I don't think is a good idea...

Yes: https://clang.godbolt.org/z/898Y1YzPn. memcpy doesn't require char arguments and return values; it takes pointer arguments. A simple memcpy implementation I wrote doesn't put noundef on the loads and stores of the copied bytes: https://clang.godbolt.org/z/cd6Y48KhW

Wait so it makes a difference whether the char is just kept in a local variable vs passed as an argument to another function? Ugh...

Honestly I am not convinced that the C standard allows them to put noundef on char arguments. But anyway, that is not really our business.

nikic commented

@RalfJung Clang currently only implements noundef annotations for arguments and return values, though I expect that it will also emit !noundef load metadata in the future.

As for char (or rather, unsigned char), the standard does treat this type specially in [basic.indet] such that a memcpy implementation would be allowed (paragraph 2.2). I don't believe the type is special for the purposes of argument passing.

As for char (or rather, unsigned char), the standard does treat this type specially in [basic.indet] such that a memcpy implementation would be allowed (paragraph 2.2). I don't believe the type is special for the purposes of argument passing.

Indeed -- which is why I think noundef on char is wrong, including for arguments of that type.

Lucky enough, in Rust we don't treat u8 special, so this does not affect us. However it means that MaybeUninit<u8> must have different attributes than u8 even though MaybeUninit is repr(transparent).

Hi, is there any updates?

As far as I am concerned, this is pretty much resolved: integers and floats may not be uninitialized. That is also what Miri now implements by default.

For those who might be a bit confused because the conversation bounced around to lots of different places, a lot of the interesting cases are spelled out very explicitly in MaybeUninit's docs. For instance, the "I want to Read into an uninitialized buffer" case is explicitly called out as incorrect here:

https://doc.rust-lang.org/std/mem/union.MaybeUninit.html#incorrect-usages-of-this-method-1

@Gankra That is more a question for #77, and I happen to think that we should allow the examples shown in the docs...^^ (and Miri currently allows them).

In that issue when someone asked about the exact situation I referenced you told them it was off topic and to discuss it in this issue.

You are quoting me incorrectly. :)
That is a reply to this example which uses uninit ints by-value. However, the examples you pointed to above do not do that, they only use uninit memory behind a reference. So they are two very different (but related questions).

So, no, this is not "the exact situation" you referenced. It is a very different situation.
To wit: that other example errors in Miri, the examples you referenced do not.

rust-lang/rust#98919 asks the lang team to decide that uninit integers (and floats and raw pointers) are UB.

98919 was closed with "yes, uninit ints/floats/raw pointers are instant UB to move" so the answer to

The remaining open question is: is it ever allowed to have an uninitialized bit in an integer or floating point value?

is "no", so we don't need to answer how it behaves. Should this get closed, then? Not sure when UCG issues get closed, but this looks like it's been decided by the lang team.

They get closed when we have done the writeup.

... which we haven't done in years...

From reading through this thread, there seems to have been a pretty dramatic reversal from the original loose consensus of "it isn't worth effort/complexity to make uninitialized integers UB". As far as I can tell, this was partially due to the fact that LLVM is making its optimization passes more conservative, and will require an explicit noundef to keep the codegen that we currently expect.

I'm not trying to re-litigate the already-completed FCP. However, I'm concerned that this is going to be difficult to teach - I suspect people will fall back to their (incorrect) mental model of "a u8 can hold any initialized byte, and an uninitialized byte is 'just an arbitrary byte', so an uninitialized u8 is fine".

Is there a concrete example of an optimization we can point to that explains why this behavior is useful? Ideally, it would:

  • Break if any of the values involved are uninitialized
  • Work if the values came from frozen uninitialized memory
  • Have some sort of performance issue if we made all MaybeUninit::assume_init have an implicit freeze - I think many people are going to wonder why we don't "just" do this.

My overall concern is that people might (implicitly) decide that this is too "weird" or "theoretical" without a concrete justification for this "weirdness", and write incorrect code that never gets run under Miri. Of course, undefined behavior is defined by the behavior of the Abstract Machine, but I think being able to say "this 'weirdness' is required for the following useful snippet to be optimized well" will really help to reinforce that in practice.

CAD97 commented

The obvious example is the Itanium Not a Thing, where using uninitialized data will actually cause a CPU-level trap.

The usual concrete example of uninitialized memory is reading from MADV_FREEd pages returning nondeterministic results based on what the state of the page allocation is.

AIUI, the main benefit of assume_init not doing a freeze is being able to optimize it out. If you do *p = freeze(*p), that isn't a removable no-op read/write pair, because it could actually change the state of memory. memcpy elision is very important, and freeze basically means to not do that.

This probably wouldn't be that big of a deal for just assume_init, since those will probably be comparatively rare β€” but the real problem is normal pointer reads. Introducing a freeze on every pointer read is almost certainly a nonstarter.

There's also the fact that we will replace by-value passing with by-ref passing when the value is large. This can be optimized into a continued move β€” but if one of the steps is freeze, that forces an actual memcpy at that point, rather than continued reference passing.

And it's not a super strong motivator, but it'd be an annoying footgun if MaybeUninit::assume_init froze, but Box::assume_init doesn't (because it can't; freeze is a by-value operation). At least part of the reasoning is simpler overall semantics of the language.

I don't really find your examples that convincing, TBH:

  • That's not quite how NaT works on Itanium -- a load of uninit data into a NaT register is considered initialized from the perspective of NaT. I also don't think we should base our semantics on what would be good for a mostly-unused architecture (and I say that as someone who owned an Itanium in the past).
  • We can already eliminate *a = freeze(*a) usually (due to &mut T and &T), and memcpy in most cases (where src and dst cannot overlap).
  • MADV_FREE is not relevant for uninitialized values on the stack, which is what this is largely talking about.
  • For the same reason, your statement about being unable to remove by-ref passing seems incorrect to me.

One example is running the code under Valgrind. Since freeze generates no actual instruction, this info is not conveyed to Valgrind. For the "uninitialized u8 is fine" model to work with Valgrind we would have to generate a lot of calls of VALGRIND_MAKE_MEM_DEFINED to make Valgrind happy.

"a u8 can hold any initialized byte, and an uninitialized byte is 'just an arbitrary byte', so an uninitialized u8 is fine"

Note that this is wrong even if we ignore uninit memory -- u8 cannot hold a byte with provenance. If you try to put provenance into u8, it will be stripped. As a consequence, a memcpy implemented with u8 is incorrect. I don't think there is any way to salvage the model of u8 being a universal byte type -- that ship sailed before Rust even started.
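(One consequence: a memcpy that is sound for arbitrary bytes, including uninit bytes and bytes carrying provenance, has to be untyped. `ptr::copy_nonoverlapping` provides such an untyped copy, whereas a loop over `u8` values would strip provenance. A sketch:)

```rust
use std::ptr;

/// Untyped byte copy: sound even if `src` points at uninitialized
/// bytes or bytes with pointer provenance, unlike a `u8`-by-`u8` loop.
unsafe fn byte_copy(dst: *mut u8, src: *const u8, n: usize) {
    // ptr::copy_nonoverlapping is a memcpy on the Rust Abstract
    // Machine: it copies bytes without loading them at type u8.
    ptr::copy_nonoverlapping(src, dst, n);
}

fn main() {
    let src = [1u8, 2, 3, 4];
    let mut dst = [0u8; 4];
    unsafe { byte_copy(dst.as_mut_ptr(), src.as_ptr(), 4) };
    assert_eq!(dst, src);
}
```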

The model of u8 being a universal byte type does actually exist in C/C++, so I wouldn't call it a nonstarter exactly, but it leads to a lot of weird special cases where memcpy of char* is magic compared to any other type.

But I assume that people who make the assumption that u8 can store all possible AM bytes would make the same assumption about u16, so the C/C++ solution isn't desirable even for those people.

The model of u8 being a universal byte type does actually exist in C/C++,

Not quite -- clang adds noundef for char arguments in C/C++. See llvm/llvm-project#56551 for how that is justified. At last for C++, their argument seems solid. For C, the standard is too murky to tell.

At this point I've mostly given up on this topic and accepted that it's going to be UB, but I do feel like it is very close to a breaking change -- it feels like there's almost no way that someone could predict in, say, 2016 that uninitialized [u8; N] would be UB, given the existence of mem::uninitialized() (and I don't think our 0x01-initialization fixes this, since IIUC it's still considered UB, for example).

Even after that, for a long time (for most of the period this thread has been open, for example), the consensus was on the side of "this is probably fine" too... It does feel like soon after LLVM added optimizations based on noundef, opinions changed to it needing to be UB. That's pretty frustrating, and is part of why it's still very hard (or impossible) to write unsafe code that can last for a long time. This is part of why I try to support the work to hammer these things out, so that in the future we can't make the decision based on what's most convenient at the time, and can instead have firmer semantics that can be programmed against.

(Separately, I do (strongly) think there is a need to support an unsafe MaybeUninit::<T>::freeze(self) -> T function though, but that's separate as it would not produce uninit bytes from the perspective of the AM)

If someone had designed a proper op.sem in 2016, I think they would have quickly come to the conclusion that at least this is an unanswered question and we should clearly tell people to avoid doing this until there is an answer. But sadly unsafe Rust started with a pre-rigorous phase and it'll take a while for that to shake out. This was definitely predictable when this paper got published, but it took a while for that realization to spread and even longer until suggesting 'uninit integers are UB' seemed like a proposal that has realistic chances of being accepted in Rust. I thought for a long time that maybe we can allow 'fully uninit integers' without too much cost, but then it turns out that even that already blocks a bunch of optimizations in safe code that we really want. I know it's painful but I also don't know what else we could have done, other than stabilizing MaybeUninit sooner and pushing people harder to use it. Maybe the lesson is that we should have been more aggressive at pushing the notion that uninit integers are UB -- I was going slow here for fear of losing everyone if I go too quickly, and due to having a small hope that maybe this can be avoided (which was clearly a misjudgement on my side).

Regarding optimizations, note that this is less "LLVM becoming more conservative" and more "LLVM fixing optimizations that caused real-world miscompilations". I don't know why you framed this in such a strange way, @Aaron1011.
@nikic would know better than me where those noundef annotations are crucial, but @scottmcm did have a concrete example during the invalid-value-lint-FCP.

@RalfJung: I didn't mean to imply that the reasons for noundef were purely theoretical, or that there were no actual miscompilations caused by the current behavior. I only meant that LLVM is going to (correctly) stop performing certain optimizations on the IR we currently emit.

We can already eliminate *a = freeze(*a) usually (due to &mut T and &T)

I don't think we can, what does this have to do with references?

MADV_FREE is not relevant for uninitialized values on the stack, which is what this is largely talking about.

So should we say that uninit integers are fine only on the stack but not on the heap? That sounds quite terrible.

We could declare MADV_FREE memory entirely unsupported in Rust. But this is going down the path of not having poison/undef in the language, which I think is the wrong direction -- there's a reason basically every compiler has such a concept: they are crucial for getting good codegen.

I also think a magic assume_init would have caused people to think that this is how all uninit memory works, so when people use unions or raw pointers they would then still get things wrong. There's a benefit to having a clear line that can be taught consistently.

This is part of why I try to support the work to hammer these things out, so that in the future we can't make the decision based on what's most convenient at the time, and can instead have firmer semantics that can be programmed against.

I think it's a bit unfair to say that we are doing this based on what's "most convenient". If we had declared that integer types can hold uninit memory, we would have had bugreports saying "I wrote this safe code and codegen didn't produce the results I wanted", and we'd have to explain each time that it's because other people write unsafe code that we wanted to allow and we can't have our cake and eat it, too. There's no easy answer here, and "make safe code pay in performance for things unsafe code sometimes wants to do" is not obviously a good outcome, either.

I fully agree though that we need to have firmer semantics that people can rely on, that's the entire point of the vast majority of my work on Rust.

nikic commented

I'm not trying to re-litigate the already-completed FCP. However, I'm concerned that this is going to be difficult to teach - I suspect people will fall back to their (incorrect) mental model of "a u8 can hold any initialized byte, and an uninitialized byte is 'just an arbitrary byte', so an uninitialized u8 is fine".

FWIW, I think it's kind of the contrary: I find the MaybeUninit model that Rust ended up adopting to be very easy to understand, because you explicitly say where uninitialized memory is possible and where it's not. There's a very clear, easy-to-reason-about dividing line. You don't have to think about which types support uninitialized values and which don't, or about what behavior you get when you try to perform operations on uninitialized values.

Anyway, regarding freeze, I think something that people don't realize is that freeze poison is, for most practical purposes, equivalent to zeroinitializer, because that's what LLVM will convert it into. Making mem::uninitialized() return frozen memory is effectively equivalent to returning mem::zeroed(). (And this is kinda what we ended up doing, just with a different, safer fixed bit pattern.)

Doing a freeze in assume_init() would have similar effects. In the lucky case, it would zero-initialize everything that has not been explicitly initialized. In the unlucky case (which is actually more likely for non-trivial initialization) it would perform a memcpy. The whole point of using MaybeUninit is to prevent these things from happening.

There is, at present, no way to support "uninitialized memory" in LLVM that is both efficient (i.e. does not end up just initializing memory anyway) and does not leak things like undef/poison semantics into Rust's semantics (with all the weirdness that comes with it, like x == x returning false for an uninitialized u8). And I don't think there is any interest in making this possible on the LLVM side either.

So just to spell this out clearly, I believe the options you get (from an LLVM perspective) are:

  • There is no efficient uninitialized memory.
  • There are poison semantics in the Rust AM.
  • Uninitialized memory is nicely encapsulated in MaybeUninit.

Closing as answered.