PyO3/rust-numpy

Safely access Numpy v2 StringDType array data

Opened this issue · 7 comments

NumPy v2 added a new variable-width string data type, StringDType. It is much more performant than variable-width string arrays in NumPy v1, which stored a Python str object for every element.

It would be great to have a safe way to access this data. It looks like currently the only APIs are in the unsafe npyffi module.

cc @ngoldbaum

Perhaps this is slightly more difficult than I expected because of the presence of the na_object, which can be any PyObject, and is not limited just to a &str.

I think we'd need a struct like this:

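// 'a ties the bytes to the array; PhantomData records T_NA without storing a value.
#[repr(transparent)]
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct PyVariableWidthString<'a, T_NA>(pub &'a [u8], std::marker::PhantomData<pyo3::Py<T_NA>>);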

T_NA would parameterize over the Python type of the na_object.
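To make the na_object problem concrete, here is a minimal sketch of one possible shape for an element type that is either string data or the array's NA sentinel. It uses only the standard library; `StringOrNa`, `MockNone` and `present_values` are hypothetical names invented for illustration and are not part of rust-numpy or PyO3. In a real binding the `Na` parameter would presumably be a Python type rather than a plain Rust struct.

```rust
/// Hypothetical stand-in for a Python NA sentinel such as `None`.
#[derive(Debug, PartialEq, Eq)]
pub struct MockNone;

/// One array element: either string data or a reference to the NA sentinel.
/// `Na` plays the role that T_NA plays in the struct above.
#[derive(Debug, PartialEq, Eq)]
pub enum StringOrNa<'a, Na> {
    Value(&'a str),
    Missing(&'a Na),
}

impl<'a, Na> StringOrNa<'a, Na> {
    /// Returns the string data, or `None` when the element is the NA sentinel.
    pub fn as_str(&self) -> Option<&'a str> {
        match self {
            StringOrNa::Value(s) => Some(*s),
            StringOrNa::Missing(_) => None,
        }
    }
}

/// Collects only the elements that hold string data, skipping NA entries.
pub fn present_values<'a, Na>(elements: &[StringOrNa<'a, Na>]) -> Vec<&'a str> {
    elements.iter().filter_map(|e| e.as_str()).collect()
}
```

The point of the enum is that callers have to handle the NA case explicitly instead of receiving a `&str` that might not exist.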

There might also need to be an RAII guard object we expose to make sure the allocator lock is always released, even if there's a panic.
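As a rough illustration of the RAII idea, the sketch below uses a mock lock built on an `AtomicBool` and shows that a guard's `Drop` runs even when the code holding it panics. `MockAllocatorLock`, `LockGuard` and `lock_released_after_panic` are hypothetical names for illustration only; the real allocator lock lives behind the NumPy C API and is not modelled here.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for the NumPy string allocator lock.
pub struct MockAllocatorLock {
    locked: AtomicBool,
}

impl MockAllocatorLock {
    pub fn new() -> Self {
        MockAllocatorLock { locked: AtomicBool::new(false) }
    }

    /// Acquire the lock and return a guard that releases it on drop.
    pub fn acquire(&self) -> LockGuard<'_> {
        let was_locked = self.locked.swap(true, Ordering::Acquire);
        assert!(!was_locked, "allocator is already locked");
        LockGuard { lock: self }
    }

    pub fn is_locked(&self) -> bool {
        self.locked.load(Ordering::Acquire)
    }
}

/// RAII guard: the lock is held exactly as long as this value is alive.
pub struct LockGuard<'a> {
    lock: &'a MockAllocatorLock,
}

impl Drop for LockGuard<'_> {
    fn drop(&mut self) {
        // Runs on normal scope exit and while unwinding from a panic.
        self.lock.locked.store(false, Ordering::Release);
    }
}

/// Returns true if the lock was released even though the closure panicked.
pub fn lock_released_after_panic() -> bool {
    let lock = MockAllocatorLock::new();
    let result = panic::catch_unwind(AssertUnwindSafe(|| {
        let _guard = lock.acquire();
        panic!("simulated failure while the lock is held");
    }));
    result.is_err() && !lock.is_locked()
}
```

This only holds when panics unwind (the default); with `panic = "abort"` no destructors run, but the process is gone at that point anyway.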

EDIT: deleted incorrect const parameter

Also, in the future there might be encodings other than UTF-8, so maybe &[u8] isn't the right type. But UTF-8 encoded bytes are all numpy supports right now, so going further is probably overkill.

How would the const generic work in practice? The size of the string could be anything, right? How would we know at compile time how large the string buffer will be?

> How would the const generic work in practice?

Oh wait, you're totally right. That only makes sense for the fixed-width DTypes. Sorry...

I edited the posts above.
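For readers following along, the distinction being discussed can be sketched like this: a const generic fits a fixed-width element because the byte width is part of the type, while a variable-width element has to carry its length at run time. Both type names below are hypothetical and use only the standard library; they are not rust-numpy types.

```rust
/// Fixed-width element: the byte width N is part of the type, known at compile time.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct FixedWidthBytes<const N: usize>(pub [u8; N]);

impl<const N: usize> FixedWidthBytes<N> {
    /// Width in bytes, available without inspecting any data.
    pub const WIDTH: usize = N;

    /// Contents with the trailing NUL padding stripped.
    pub fn trimmed(&self) -> &[u8] {
        let end = self.0.iter().rposition(|&b| b != 0).map_or(0, |i| i + 1);
        &self.0[..end]
    }
}

/// Variable-width element: the length travels with the value at run time,
/// so there is nothing for a const generic to describe.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct VariableWidthBytes<'a>(pub &'a [u8]);

impl VariableWidthBytes<'_> {
    /// Length in bytes of this particular element.
    pub fn len(&self) -> usize {
        self.0.len()
    }
}
```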

Maybe actually this is better, since we know it's valid UTF-8:

pub struct PyVariableWidthString<'a, T_NA>(pub &'a str, std::marker::PhantomData<pyo3::Py<T_NA>>);

I'm not sure if there are subtleties around a &str which is actually owned outside the Rust runtime. Might need another wrapper type...
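One way to think about a &str whose bytes are owned elsewhere is a small view type that stores a raw pointer and a length, and uses a lifetime to tie the borrow to whatever keeps the buffer alive. The sketch below is an assumption about how such a wrapper could look, using only the standard library; `ForeignStr` is a hypothetical name, and validating UTF-8 on access is a deliberately conservative choice.

```rust
use std::marker::PhantomData;
use std::slice;
use std::str::Utf8Error;

/// Hypothetical view over UTF-8 bytes owned by a foreign allocator.
/// The lifetime 'a ties the view to whatever keeps the buffer alive.
#[derive(Clone, Copy, Debug)]
pub struct ForeignStr<'a> {
    ptr: *const u8,
    len: usize,
    _owner: PhantomData<&'a [u8]>,
}

impl<'a> ForeignStr<'a> {
    /// # Safety
    /// `ptr` must be non-null and point to `len` initialized bytes that stay
    /// valid and unmodified for the whole lifetime 'a.
    pub unsafe fn from_raw_parts(ptr: *const u8, len: usize) -> Self {
        ForeignStr { ptr, len, _owner: PhantomData }
    }

    /// Validate the bytes and borrow them as a `&str`.
    pub fn as_str(&self) -> Result<&'a str, Utf8Error> {
        // SAFETY: guaranteed by the contract of `from_raw_parts`.
        let bytes: &'a [u8] = unsafe { slice::from_raw_parts(self.ptr, self.len) };
        std::str::from_utf8(bytes)
    }
}
```

As far as Rust is concerned, a &str does not care who allocated the bytes; what matters is that they stay valid, unmodified and UTF-8 for the whole borrow, which is what the guard idea is meant to enforce.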

I guess you could add e.g. a wrapper for the npy_static_string struct, which right now is just 16 opaque bytes (on 64-bit architectures), and then make the npy_static_string wrapper expose an acquire() method that returns an RAII guard wrapper around a &str. So something like this:

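// NBYTES is 16 on 64-bit architectures and 8 on 32-bit architectures
pub struct NpyPackedString([u8; NBYTES]);

// Borrows the string data; the allocator lock is held while the guard is alive
pub struct NpyStringGuard<'a>(&'a str);

impl Drop for NpyStringGuard<'_> {
    fn drop(&mut self) {
        // Runs on scope exit and during unwinding, so a panic still releases the lock
        unsafe { npy_ffi::NpyString_release_allocator(...) };
    }
}

impl NpyPackedString {
    fn acquire(&mut self) -> NpyStringGuard<'_> {
        // get_utf8_data() is a placeholder for acquiring the allocator and loading the bytes
        let data: &[u8] = unsafe { self.get_utf8_data() };
        // numpy only stores UTF-8 here, so a failure would be an invariant violation
        NpyStringGuard(std::str::from_utf8(data).expect("StringDType data should be valid UTF-8"))
    }
}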

Just a sketch, but that's probably the best way to do it, and it better reflects how the NpyString C API works. It's probably best to force users to acquire the allocator lock explicitly, and also to provide methods that acquire the lock only once when you have a collection of &NpyPackedString references that share the same array descriptor.
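The "acquire once for a whole collection" idea could look roughly like the following, sketched with a standard-library Mutex standing in for the allocator lock. `MockArena`, `PackedHandle`, `AcquiredArena` and `total_len` are hypothetical names; the real packed string layout and allocator are opaque and are not modelled here, only the locking pattern is.

```rust
use std::sync::{Mutex, MutexGuard};

/// Hypothetical handle: an offset and length into the mock allocator's arena.
#[derive(Clone, Copy, Debug)]
pub struct PackedHandle {
    offset: usize,
    len: usize,
}

/// Hypothetical stand-in for the allocator shared by one array descriptor.
pub struct MockArena {
    data: Mutex<String>,
}

impl MockArena {
    pub fn new() -> Self {
        MockArena { data: Mutex::new(String::new()) }
    }

    /// Store a string and return a handle to it.
    pub fn pack(&self, s: &str) -> PackedHandle {
        let mut data = self.data.lock().unwrap();
        let offset = data.len();
        data.push_str(s);
        PackedHandle { offset, len: s.len() }
    }

    /// Take the lock once; every lookup through the guard reuses it.
    pub fn acquire(&self) -> AcquiredArena<'_> {
        AcquiredArena { data: self.data.lock().unwrap() }
    }
}

/// Holds the lock until dropped.
pub struct AcquiredArena<'a> {
    data: MutexGuard<'a, String>,
}

impl AcquiredArena<'_> {
    /// Borrow the string behind a handle; the borrow cannot outlive the guard.
    pub fn load(&self, handle: PackedHandle) -> &str {
        &self.data[handle.offset..handle.offset + handle.len]
    }
}

/// Sum the byte lengths of many strings under a single lock acquisition.
pub fn total_len(arena: &MockArena, handles: &[PackedHandle]) -> usize {
    let guard = arena.acquire();
    handles.iter().map(|h| guard.load(*h).len()).sum()
}
```

Because `load` borrows from the guard, the compiler rejects any attempt to keep a returned &str after the lock has been released.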