rust-lang/unsafe-code-guidelines

What about: Pointer-to-integer transmutes?

Opened this issue · 130 comments

Transmuting pointers to integers (i.e., not going through the regular cast) is a problem. This is demonstrated by the following silly example:

// `external_function` is assumed to be declared elsewhere (an opaque call).
fn example(ptr: *const i32, cmp: usize) -> usize { unsafe {
  let mut storage: usize = 0;
  *(&mut storage as *mut _ as *mut *const i32) = ptr; // write at ptr type
  let val = storage; // read at int type (0)
  storage = val; // redundant write back (1)
  external_function(&storage); // just making sure the value in `storage` can be observed
  if val == cmp {
    return cmp; // could exploit integer equivalence (2)
  }
  return 0;
} }

Imagine executing this code on the Abstract Machine, taking into account that pointers have provenance, i.e., a ptr-to-int conversion loses information. Now what happens at point (0)? Here we read the data stored in storage at type usize. That data however is the ptr ptr, i.e., it has provenance. What should happen with that provenance at (0)?

  1. We could drop the provenance. That would basically mean that the load of storage acts like an implicit ptr-to-int cast. The problem with this approach is that we cannot remove the redundant write at (1): the value in val is different from what is stored in storage, since val has no provenance but the ptr stored in storage does! This is basically another version of https://bugs.llvm.org/show_bug.cgi?id=34548: ptr-to-int casts are not NOPs, and a ptr-int-ptr roundtrip cannot be optimized away. If a load, like at (0), can perform a ptr-to-int cast, now the same concerns apply here.
  2. We could preserve the provenance. Then, however, we end up with val having type usize and also having provenance, which is a big problem: the compiler might decide, at program point (2), to return val instead of return cmp (based on the fact that val == cmp), but if val could have provenance then this transformation is wrong! This is basically the issue at the heart of my blog post on provenance: == ignores provenance, so just because two values are equal according to == does not mean they can be used interchangeably in all circumstances.
  3. What other option is there? Well, we might make the load return poison -- effectively declaring ptr-to-int transmutes as UB.

The last option is what is being proposed to LLVM, along with a new "byte" type such that loading at type bN would preserve provenance, but loading at type iN would turn bytes with provenance into poison. On the flipside, no arithmetic or logical operations are possible on bN; that type represents "opaque bytes" with the only possible operations being load and store (and explicit casts to remove any provenance that might exist). This leads to a consistent model in which both redundant store elimination and GVN substitution on integer types (the optimizations mentioned above) are possible. I don't know any other way to resolve the contradiction that otherwise arises from doing both of these optimizations. However, the LLVM discussion is still in its early stages, and there were already a lot of responses that I have not read in detail yet. If this ends up being accepted, we on the Rust side will have to figure out if and how we can make use of the new "byte" type and its explicit casts (to pointers or integers).

This thread is about discussing how we need to restrict ptr-to-int transmutes when pointers have provenance but integers do not. See #287 for a discussion with the goal of avoiding provenance in the first place.

(And both violate the implication in the docs that this is well-defined, just using unnecessary power compared to as.)

Where do the docs imply that this is well-defined? Many transmutes are not well-defined, including in my view some transmutes to integer type (e.g., transmuting (u16, u8) to u32 is UB due to the uninitialized padding byte).

https://doc.rust-lang.org/1.59.0/std/mem/fn.transmute.html#alternatives really strongly implies that all of the examples before the Vec one are well-defined, just bad style.

comex commented

I posit that ptr-to-int transmutes are UB, and that's why the program is wrong. Anyone who wants to allow ptr-to-int transmutes has to find some other reason for why this program has UB.

Which part of this argument fails if you replace the transmutes with casts?

Which part of this argument fails if you replace the transmutes with casts?

Casts don't leave the underlying data unchanged, and they have side-effects: ptr2int strips provenance and has the global side-effect of 'broadcasting' or 'exposing' that provenance, while int2ptr determines a suitable previously exposed provenance for the resulting pointer.
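As a concrete illustration (a sketch assuming the expose/acquire semantics of today's `as` casts, with hypothetical function names), the cast roundtrip is defined precisely because of that side-effect:

```rust
// Sketch: an `as`-cast roundtrip. The ptr2int cast exposes the pointer's
// provenance as a side-effect; the int2ptr cast may then pick that exposed
// provenance back up, so dereferencing the result is defined.
fn cast_roundtrip(x: &i32) -> i32 {
    let addr = x as *const i32 as usize; // ptr2int: strips + exposes provenance
    let p = addr as *const i32;          // int2ptr: reacquires exposed provenance
    unsafe { *p }                        // OK under the expose/acquire model
}
```

A transmute performs neither the strip nor the expose, which is exactly what raises the questions discussed in this thread.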

https://doc.rust-lang.org/1.59.0/std/mem/fn.transmute.html#alternatives really strongly implies that all of the examples before the Vec one are well-defined, just bad style.

Ugh, I didn't know someone added ptr-to-usize there. That's no good. :(

Whether it is what we want is a different question. This model still means that a memcpy implemented with u64 chunks is buggy, since it loses the provenance of pointers in the copied data. IOW, this still does not let you use integer types as "universal containers" for losslessly transferring any kind of data (including pointers with their provenance).

On CHERI that is buggy anyway, as you have to copy 128-bit chunks at a time in a way that preserves physical provenance, right? Can we just give up on the requirement that memcpy should be user-implementable in C/Rust and require it to be provided by either libc or compiler-builtins? (Using a special platform-specific attribute or inline asm in the case of compiler-builtins.)

I don't think we can treat transmutes of structs maybe containing pointers to/from bytes as causing UB (whether this is immediate or delayed via a strict provenance interpretation of the bytes, such that reconstituting them into pointers yields unusable zombie pointers pretending to be the originals). This kind of code is widespread, and we have no suitable replacement for people who want to do this. It seems like a variation on the evergreen C issue of "how to write memcpy".

How feasible would it be to allow "integers with provenance" but only in the weak sense that reading a pointer at integer type yields a pointer value with provenance intact and no escaping, but doing anything with this value other than storing it back into memory results in a ptr2int (not sure if broadcasting or not is more appropriate here). You could call this a "lazy" ptr2int, and it would be good enough to support use cases like FFI as well as memcpy (and serialization, if it broadcasts).

but doing anything with this value other than storing it back into memory results in a ptr2int (not sure if broadcasting or not is more appropriate here)

If it results in a ptr2int with broadcasting it becomes impossible to optimize *foo + 1 away even if the whole expression is unused as it might broadcast provenance if *foo has provenance.

This is often said, but the sense of it seems to be wrong to me. It can certainly be optimized, to broadcast_if_ptr(*foo) (which broadcasts *foo if it is a (piece of a) pointer value and does nothing if it is an integer), which can be eliminated in a late IR pass. It will be an optimization barrier for other things though, for sure, although the examples I can think of usually involve passing a pointer to a function that is doing this *foo + 1 operation which would normally be treated as potentially broadcasting the address anyway, assuming that it can't be fully analyzed and inlined.

Actually on second thought, Ralf's example above shows why even lazy ptr2int isn't good enough: it means that GVN style reasoning about equality of integers can lead to replacing one pointer-masquerading-as-integer with another, resulting in UB. So it's probably not a viable option.

I don't think we can treat transmutes of structs maybe containing pointers to/from bytes as causing UB (whether this is immediate or delayed via a strict provenance interpretation of the bytes, such that reconstituting them into pointers yields unusable zombie pointers pretending to be the originals). This kind of code is widespread, and we have no suitable replacement for people who want to do this. It seems like a variation on the evergreen C issue of "how to write memcpy".

My take on this is, if you want to do this, use MaybeUninit<u8> (or whatever integer type matches your chunk size). You need to do that anyway for memcpy because of padding. That will also fix all provenance concerns.
(So, we do have a suitable replacement.)
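A minimal sketch of that replacement (the helper name `copy_bytes` is hypothetical): copying through MaybeUninit<u8> instead of usize chunks preserves both uninitialized (padding) bytes and any pointer provenance in the copied data.

```rust
use std::mem::MaybeUninit;

// Sketch of a memcpy-style copy through MaybeUninit<u8>.
// Unlike a copy in usize chunks, MaybeUninit<u8> loads and stores
// preserve padding (uninit bytes) and pointer provenance.
//
// Safety: `src` must be valid for reads of size_of::<T>() bytes,
// `dst` must be valid for writes of the same size, and the two
// regions must not overlap.
unsafe fn copy_bytes<T>(src: *const T, dst: *mut T) {
    let s = src.cast::<MaybeUninit<u8>>();
    let d = dst.cast::<MaybeUninit<u8>>();
    for i in 0..std::mem::size_of::<T>() {
        d.add(i).write(s.add(i).read());
    }
}
```

The same trick works with larger chunk sizes (MaybeUninit<u64> etc.) as long as alignment is handled.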

I don't see any reason (except for legacy code) for why we should allow transmutes of pointers to integer types.

Given all these problems with ptr2int transmutes, I suggest we start cautioning against them in the mem::transmute docs: rust-lang/rust#95547.

If just using byte reinterpretation is a valid way to "transmute" things with pointers in them, couldn't we implement transmute for things containing pointers (either automatically via transmute or as a separate, explicit function) as that?

Additionally, I still have user-side concerns: a top priority should be protecting users from accidentally doing this seemingly correct thing. I think that transmute should emit a compile error when used on pointers, or on structs containing pointers. We should also consider special-casing transmute::<ptr, int>() (and vice versa) as ptrtoint and inttoptr, since it's an easy protection for the user.

Compile time guards seem fine, but the execution semantics of a transmute that's allowed to compile should stay as simple as

(&t as *const T).cast::<U>().read()
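Spelled out as a generic helper (the name `transmute_via_read` is hypothetical), that formulation reads T's bytes at type U without any integer-specific provenance stripping:

```rust
// Sketch: transmute expressed as a pointer cast plus typed read,
// matching the one-liner quoted above. The size check is done at
// runtime here only to keep the sketch simple.
//
// Safety: T's bytes must form a valid U, and U's alignment must not
// exceed T's (since `.read()` is an aligned read).
unsafe fn transmute_via_read<T, U>(t: T) -> U {
    assert_eq!(std::mem::size_of::<T>(), std::mem::size_of::<U>());
    let u = (&t as *const T).cast::<U>().read();
    std::mem::forget(t); // `u` now owns the bytes; don't drop `t`
    u
}
```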

If just using byte reinterpretation is a valid way to "transmute" things with pointers in them, couldn't we implement transmute for things containing pointers (either automatically via transmute or as a separate, explicit function) as that?

The point of the last example is that byte reinterpretation is not a valid way to 'transmute'. Other parts of the compiler assume that bytes at integer type do not have any provenance.

One thing to mention again with this being discussed again and general agreement on the other int/ptr casts is that the C provenance draft explicitly allows some of these. In particular, it explicitly calls out

uintptr_t p2i_union(void *p) {
    union {
        void *p;
        uintptr_t i;
    } foo;
    foo.p = p;
    return foo.i;
}

as permitted for PNVI variants (PVI impl defined), and

uintptr_t p2i_char(void *p) {
    uintptr_t ret;
    char *i = (char*)&p;
    char *o = (char*)&ret;
    char *e = i + sizeof(p);
    while (i < e)
        *o++ = *i++;
    return ret;
}

as ok for all of them (the example in the draft is "read the low byte, set a bit, write it back, later mask it out", and a user_memcpy is also called out as exposing in PNVI*).

Explicit literal memcpy(&int, &ptr, size) is not mentioned anywhere that I can find. (Though there are various callouts for some way to annotate user_memcpy as non-exposing or otherwise preserve provenance in char in very limited ways even in PNVI*).

Since clang lowers that union pun as exactly the T**->int* bitcast that is called out here, either clang will need to change its lowering (somehow), or any problem is an LLVM bug.

@talchas I am very curious how backends are going to support union-based type punning between integers and pointers. What clang and LLVM currently do is definitely very wrong. Basically, this requires every load from a union that might load a ptr at int type to also have the 'expose' side-effect, and thus not be removed even when it is a dead load. (Yes it can be turned into a mere 'expose' marker but it remains an optimization barrier for any potentially-aliasing store.)

Even worse than that, p2i_char means that every load of a char type also needs that treatment. Now remember that Rust doesn't have strict aliasing so for Rust this becomes every load period. ("preserve provenance in char" is even wilder, it means LLVM must represent char as b8 with the byte type proposal -- and again to do the same in Rust we'd have to use bN for all our integer types. I don't think anyone wants that.)

So, until LLVM makes indications that they intend to support these patterns, I would rather not bet on anything like that and instead ask people to use MaybeUninit or casts instead. So far nobody has provided any good reason for why one would write such code given these alternatives. So I'd rather treat that code as explicitly outside the realm of well-defined Rust, and handle breakage of such code in the real world on a case-by-case basis.

Basically, C doesn't really have a choice here as they have way more legacy code than we do and fewer abilities to do anything else (e.g., no MaybeUninit). So out of sheer desperation, they declare those patterns legal, no matter the cost for backends. It remains to be seen how much backends are willing to sacrifice in order to be PNVI-ae-udi-compliant. But in Rust we have a very different situation; we do have good alternatives and we have much less legacy code. We thus can make a serious attempt at doing something here that actually works well for codegen. If it turns out backends are becoming serious about guaranteeing all corner cases of PNVI and they manage to do that without significant optimization cost -- then we can always revisit our decision and allow more code. But given that this is global optimization cost, not limited to the code that does these cursed things, I would rather not bet on that happening.

Another way to put this is that I don't think we have to just repeat every mistake that C made.

Yeah, the "preserve provenance in char" bit they mentioned is just for memcpy stuff, and the vague suggestion in the draft is just dropping it the moment you do anything integer-like, which is fine for a primarily performance focused thing. (And of course, this is something implementations can partially implement to whatever extent they want)

The main obvious (still meh) use case is &[u8] -> &ActualType where you don't have &[MaybeUninit<u8>] because Read/AsyncRead/etc (and the better solution there is on the other end anyways). And doing this with pointers involved is still ugly, but semi-reasonable for things like "load VM core image"

And the other use case is other sorts of "library/compiler is missing something", like I just saw in portable_simd, where the intrinsics just don't exist for Simd<*const (), N> but do for Simd<usize, N>. (The workaround I guess is to provide to/from usize implementations for every N as part of the trait on N). These cases are of course better solved in the library/whatever, but also are never going to all go away.

That's just a case of "language missing something", and that needs to be added. Using transmute to make the type checker happy is not actually a substitute for equipping the language with the operations it needs. When the language lacks async fn we need to do the work to add support for them; when the language lacks simd_offset we need to do the work to add support for that.

So I would say these cases are only served by adding a new suitable primitive. (This is similar to extern fn unwinding. The language was missing a way to import or export functions that can unwind. Some people 'worked around' that limitation in creative ways but really they were just causing UB and the only actual fix is to add support for such functions to the language.)

The point of the example in #286 (comment) is that byte reinterpretation is not a valid way to 'transmute'. Other parts of the compiler assume that bytes at integer type do not have any provenance.

It also wouldn't work at runtime for CHERI with permissive provenance. To make *mut usize as *mut *const () work, we'd have to load the integer, resolve into a capability, and write it back to memory. This side-effect is observable to other parts of the program via the CLoadTags instruction.

(I know not everyone cares about permissive provenance on CHERI, but I think there'll be demand and it's feasible and useful)

My newest blog post also talks quite a bit about pointer-integer transmutation. :)

One new insight I had while chatting with Zoxc and @JakobDegen on Zulip is that making pointer-integer transmutes UB violates the "monotonicity property" that adding more provenance to a datum (in the sense of provenance that can access more memory) can only make your program "more defined": by adding provenance to an integer, we can introduce UB!

This makes me slightly more partial towards saying that pointer-integer transmutation strips provenance (without exposing it), at least when designing an IR intended for optimizations. On the level of the surface language, the argument still applies but is IMO less strong.

(Interestingly, this option relies on that very same monotonicity property if we want to still be allowed to eliminate redundant self-assignments like x=x;. Those self-assignments could strip provenance, so removing them means not stripping provenance and thus that has to be a legal transformation.)

My main problem with stripping provenance implicitly is that it breaks the entire framework I have set up here: in that framework, if a list of bytes (call it bytes) can be "decoded" to a value val (e.g., a list of bytes with provenance can be decoded to an integer value), then when "encoding" val, it is legal for the machine to pick bytes as the encoding. There is a perfect symmetry here. I quite like that.

If we don't have that symmetry, we double the size of that file by having to define "encode" and "decode" separately for each type. That also means more work if we have to ensure any properties that relate these operations.

Note that having the symmetry while doing implicit provenance stripping is not an option: that means encoding an integer could synthesize arbitrary provenance, which violates the most fundamental property of provenance -- that it is "private" and can be "owned" by a function. This property is the reason why provenance exists in the first place (from a program logic / reasoning perspective), it is not negotiable.

making pointer-integer transmutes UB violates the "monotonicity property" that adding more provenance to a datum (in the sense of provenance that can access more memory) can only make your program "more defined": by adding provenance to an integer, we can introduce UB!

I also just now wondered if we can even have that property. Specifically, the property I was thinking of is: given an Abstract Machine state, if we alter that state by changing a single byte in memory to "increase" its provenance, then all executions that are possible from the altered machine state were also possible from the original machine state. This property certainly holds (has to hold!) when replacing an "uninit" byte by an arbitrary byte; do we also have it for increasing provenance?

I think the answer is no. The key question is how we define the "decode" operation for pointers when not all bytes have the same provenance:

  • We could say that such a decode is invalid. (That's what MiniRust does for now.) However, this means that if we start with 8 bytes that have no provenance, and then add some provenance to one of those bytes, we have turned a valid encoding of a ptr into an invalid encoding, thus violating the above property.
  • We could say that if there exists a "joint" provenance that merges them, then that is the provenance we use. As a special case, let's just say if some bytes have provenance and others do not, but for the bytes that do have provenance it's all the same provenance, then we use that provenance for the decoded ptr. But now we again violate the property: start with 8 bytes where the first 4 have no provenance and the last 4 have some provenance p1. Now change the machine state to give the first byte some provenance p2 (that is incompatible with p1, i.e., there exists no "joint" provenance). This turns a valid encoding of a ptr into an invalid one, thus violating the property.

If we truly don't have this property, then I am feeling strongly that we can not implicitly strip provenance when decoding integers. The combination of not having the property and implicit provenance stripping means that we cannot remove x=x; redundant assignments, and that sounds bad.

@digama0 proposed a different "decode" operation that uses the meet of all provenances (the "least common provenance") when decoding a pointer and the bytes do not all have the same provenance. (Basically: if any byte's provenance is None, the resulting pointer has None provenance.) That "decode" I think indeed satisfies the monotonicity property: if we increase the provenance in a byte, then either the meet with the other involved provenances will stay the same, or it will increase itself, which is fine.

Algebraically this means we have to require a "meet" to exist on Option<Provenance>, but since one can always use None, that does not seem like a strong restriction.
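The meet described above can be modeled as a toy sketch (all names hypothetical; Provenance stands in for whatever the Abstract Machine uses, treated here as a flat lattice where distinct provenances meet only at None):

```rust
// Toy model of the "meet" on Option<Provenance>.
// None is the bottom element, so a meet always exists: any byte with
// None provenance drags the decoded pointer's provenance down to None.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Provenance(u32); // hypothetical allocation id

fn meet(a: Option<Provenance>, b: Option<Provenance>) -> Option<Provenance> {
    match (a, b) {
        (Some(x), Some(y)) if x == y => Some(x), // equal provenances: keep it
        _ => None, // any None, or two distinct provenances: fall to None
    }
}

// Decoding a pointer from bytes: fold the per-byte provenances with `meet`.
fn decode_provenance(bytes: &[Option<Provenance>]) -> Option<Provenance> {
    let mut acc = bytes[0];
    for &b in &bytes[1..] {
        acc = meet(acc, b);
    }
    acc
}
```

Monotonicity then holds by construction: raising one byte's provenance can only keep the fold result the same or raise it, never turn a valid decode into an invalid one.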

So maybe we have to choose between "encode and decode form an adjoint" and "provenance monotonicity".
I cannot make up my mind wrt which property I consider more natural / more important.

So maybe we have to choose between "encode and decode form an adjoint" and "provenance monotonicity".
I cannot make up my mind wrt which property I consider more natural / more important.

Turns out there is a way to also get an adjoint property with implicit provenance stripping. So the algebraic property is not at risk.

What remains is the question whether having provenance stripped implicitly is a good idea. (Also see this Zulip discussion.)
Both @JakobDegen and @digama0 are in favor of that. What is less clear to me is the motivation for this: is it "just" to make fewer programs UB? That's a good reason of course, but in this case I am not sure if those programs are actually very relevant. Ptr-int-ptr transmutation roundtrips are UB as there is no avoiding that, so the programs we are saving here are those that do such transmutes but never deref the resulting pointer.
Specifically, the counterexample against roundtrips here has to be UB, the question is just:

  • are the ptr-to-int transmutes UB (when computing left_int, right_int)
  • or is dereferencing left_ptr UB (because the pointer has None provenance)

I am slightly worried we will be confusing people if we tell them these transmutes are okay but the resulting pointer is garbage. Though maybe I am overestimating the difference between that, and telling them the transmutes are UB -- many will also find that confusing.
On the other hand, I do like the appeal of implementing addr via transmute, which "transmutation strips provenance" would let us do. It is nicely symmetric with implementing ptr::invalid via transmute as well. Basically this would mean the strict provenance operations can all be implemented with the base features of the language, no new intrinsic required. For me personally that is so far the best argument in favor of implicit provenance stripping.
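Under the "transmutation strips provenance" semantics, both strict-provenance primitives could be spelled as plain transmutes. This is a sketch of that idea with hypothetical helper names, not how the standard library actually defines these operations:

```rust
// Sketch: if ptr-to-int transmutes strip provenance (without exposing it),
// then `addr` is just a transmute to usize.
fn addr_via_transmute<T>(p: *const T) -> usize {
    // Strips provenance under the proposed semantics; equivalent to p.addr().
    unsafe { std::mem::transmute(p) }
}

// And the int-to-ptr direction yields a pointer with no provenance:
// usable for address arithmetic/inspection, not for dereferencing.
fn invalid_via_transmute<T>(a: usize) -> *const T {
    unsafe { std::mem::transmute(a) }
}
```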

If I had to pick, "the transmute is UB" is probably easier to teach.

This could be a good case to apply https://gankra.github.io/blah/tower-of-weakenings/ though, since it sounds a lot like the difference between "a painfully strict and simple model that you can teach and check" and "messier to allow for Useful Crimes".

Yeah, I had the same thought -- we can tell people ptr2int transmutes are "tricky, don't do them", and then if people care about the fine-print we can explain the details of what exactly is and is not allowed. In this case we don't know the Useful Crimes yet, but who knows what people will come up with.

After all these discussions, I think I now agree with those who said that ptr2int transmutation should strip provenance, i.e., it should be equivalent to ptr.addr(). I have updated MiniRust accordingly.

So, is this pretty much settled as "pointers->integers is not expected to be UB, but you should assume it strips provenance", or is it more up in the air than that?

I have personally pretty much settled on that, yes. It's also what Miri implements now, modulo rust-lang/miri#2182.

I'd love to make this "official", but I am not sure what the right process is for that. Probably an RFC? Or just FCP'ing the strict provenance APIs into stabilization?