rust-lang/rust

Add `get` and `get_unchecked` to `*const [T]` and `*mut [T]`

oli-obk opened this issue ยท 23 comments

If one has a *const [T] and wants a pointer to a specific element of that slice, one needs to go through a reference and index, or cast the pointer to *const T and offset that by the number of elements (which doesn't do a length check, so you need that, too). I propose that we add two new methods to *const [T] and *mut [T]:

impl<T> *const [T] {
    fn get(self, idx: usize) -> Option<*const T>;
    unsafe fn get_unchecked(self, idx: usize) -> *const T;
} 

get_unchecked is unsafe, because if a large index is used, one may overflow the pointer, which is UB

To make the implementation of get simpler, I propose to additionally add a len method:

impl<T> *const [T] {
    fn len(self) -> usize;
}

cc @RalfJung @shepmaster

My specific case would have been for *mut [T] โ€” is that deliberately omitted?

no, I'm just lazy and didn't want to type everything twice. There should be an equivalent of all the above for *mut [T]

Pardon me if It's stupid but why not just let the user go from *const [T] to &[T] ?

fn main() {
    let x_ptr = &[1, 2, 4] as *const [_];
    
    unsafe {
        let x = &*x_ptr;
        println!("{:?}", x.get(1).map(|x| x as *const _));
        println!("{:?}", x.get_unchecked(1) as *const _);
    }
}

Is that really to long to write ? Also, want to ask, when do you need a *const [T] ?

The original context was this code:

use std::collections::BTreeSet;

fn uniq_refs<'i, 'd: 'i, T>(
    data: &'d mut [T],
    indices: &'i BTreeSet<usize>,
) -> impl Iterator<Item = &'d mut T> + 'i {
    let in_bounds_indices = indices.range(0..data.len());

    in_bounds_indices.map(move |&i| {
        // I copied this from a Stack Overflow answer 
        // without reading the text that explains why this is safe
        unsafe {
            let r = data.get_unchecked_mut(i);
            let p: *mut T = r;
            &mut *p
        }
    })
}

Miri detects that this usage triggers undefined behavior:

use std::iter::FromIterator;

fn main() {
    let mut scores = vec![1, 2, 3];
    let indices = BTreeSet::from_iter(vec![0, 2]);

    let selected_scores: Vec<_> = uniq_refs(&mut scores, &indices).collect();
    for score in selected_scores {
        *score += 1;
    }
}

I wanted to make use of get_unchecked_mut, but that function requires unique access to data. That's invalid here because there's an outstanding mutable reference to a T from inside the slice. The non-UB form of this involves getting a raw pointer to the data and performing offsets:

fn uniq_refs<'i, 'd: 'i, T>(
    data: &'d mut [T],
    indices: &'i BTreeSet<usize>,
) -> impl Iterator<Item = &'d mut T> + 'i {
    let start = data.as_mut_ptr();
    let in_bounds_indices = indices.range(0..data.len());
    in_bounds_indices.map(move |&i| unsafe { &mut *start.add(i) })
}

With the proposed functions, that code could look like

fn uniq_refs<'i, 'd: 'i, T>(
    data: &'d mut [T],
    indices: &'i BTreeSet<usize>,
) -> impl Iterator<Item = &'d mut T> + 'i {
    let start = data as *mut [T];
    let in_bounds_indices = indices.range(0..data.len());
    in_bounds_indices.map(move |&i| unsafe { &mut *data.get_unchecked(i) })
}

Yeah, this seems reasonable. However, I think it should be seen in the context of our general need for a better story for raw pointers.

impl<T> *const [T] {
    fn get(self, idx: usize) -> Option<*const T>;
}

Wouldn't that need to be unsafe for the same reason as offset?

Since it would be bounds-checked, you can't overflow, so get should be safe

Ah no, you're right. One can create a wrong slice that would overflow for some indices...

It could use wrapping_offset, though. Then it could be safe.

just to note that in C, it's UB to create a out of bound pointer of more than len, for exemple:

char *ptr = malloc(42); // assert no null
char *ub = ptr + 43;
char *also_ub = ptr - 1;

Look like Rust allow this, but use raw pointer is often use with C binding, I just through this could be useful to know.

It could use wrapping_offset, though. Then it could be safe.

But then people would have a performance incentive to not use the clearer and more convenient method. And what for? If one uses this method they'll almost always go on to access memory through the resulting pointer, and then they'd almost certainly have UB anyway if the index is too large or goes otherwise out of bounds. While wrapping_offset does allow "allocation crossing" with a suitably crafted offset, this is extremely rare and so I don't see why we should prioritize it in a convenience method.

But then people would have a performance incentive to not use the clearer and more convenient method.

If they want the performance, wouldn't they use get_unchecked anyway? Seems like if you care about whatever LLVM gets from knowing there is no overflow, you'll also want no bounds check.

Put differently, what is even the point of having get if it is not safe?

Put differently, what is even the point of having get if it is not safe?

This is a good question. I feel like the answer might well be "there is no point to having get". Is there anything useful one can do with get that doesn't already require knowing the preconditions of get_unchecked are met? If the resulting pointer is not good to dereference (e.g., because the *[T] was dangling) one could still compare the resulting pointer safely, but is that ever useful by itself? (The aforementioned jumping between allocations in ways that still result in dereferenceable pointer is too niche to justify the existence of the method.) (edit: I just realized: the pointer wouldn't be derefenceable when crossing allocations, so this isn't even a different case.)

I can see some potential value in how an unsafe get could simplify proof obligations, though. Assume get does bounds checking against the claimed length of the slice, but uses offset if the bounds check passes and hence has UB if the ptr/length pair is "wrong" (e.g., dangling pointer, excessive length). Using this method soundly only requires only knowing that the raw slice pointer you have is "correct", you don't have to worry at all about the values you use as indices because all your accesses are bounds-checked against the length of the slice. A complete characterization of when the slice is "correct" for this purpose could get complicated, but many simple common cases are obviously fine, e.g.:

  • if you got the pointer from a safe slice reference (that's not been invalidated since)
  • if you called alloc(Layout::array<T>(n)) and turned the resulting pointer into a slice of length n (and haven't deallocated the memory since then)

I believe this distinction between trusting the slice and trusting the indices also answers this question:

If they want the performance, wouldn't they use get_unchecked anyway?

They might need the bounds-check anyway (because the indices are untrusted) and then they might as well leave that to the standard library rather than open-coding it.

To make the implementation of get simpler, I propose to additionally add a len method:

For that len method to be safe, it would need to panic if *mut [T] is null, and it would either need to panic if the pointer is not properly aligned, or do an unaligned ptr read of the len.

@gnzlbg why that? A *mut [T] is a (*mut T, usize), and to read the usize we don't care if the pointer is NULL or unaligned. And we can assume that this pair itself is well-aligned (we are taking it by-value, anyway).

And we can assume that this pair itself is well-aligned

I was missing this, makes sense.

to read the usize we don't care if the pointer is NULL

So does this mean that if the pointer is NULL, the len must be zero, or else the pointer is invalid ?

It's a raw pointer, I don't think we are making any such requirements. Both pointer and integer can be just about anything that a *mut T or usize can be.

So, in terms of implementing this as an unstable feature... the main problem is that we want these to be inherent methods on *const [T], right? AFAIK this needs some rustc magic and lang items as those types are not defined in any crate, and thus cannot usually get inherent methods?

Yea, we need to add the same kind of magic that the integer types have in

rust/src/libcore/num/mod.rs

Lines 2380 to 2381 in 1572c43

#[lang = "i8"]
impl i8 {

Though in this case it may be more difficult due to the involved generics, so you should look at the implementation of the magic for slices for inspiration:

#[lang = "slice"]
#[cfg(not(test))]
impl<T> [T] {

#71082 does all the hard parts here -- now adding more raw slice methods (like indexing) is a libcore-only patch, rustc does not need any further changes.

#73986 should resolve this.

The checked version,

impl<T> *const [T] {
    fn get(self, idx: usize) -> Option<*const T>;
}

wasn't implemented. Why?

As far as I am concerned, mostly due to lack of imagination -- I figured if you'd want checks you'd use references, not raw pointers. The return type Option<*const T> is also strange, note that this one cannot be enum-niche-optimized.