zarrs/zarrs-python

Build Out Store Offerings

Opened this issue ยท 23 comments

We currently only do file system store: https://github.com/ilan-gold/zarrs-python/blob/main/src/codec_pipeline_store_filesystem.rs but should expand to, at the minimum, HTTP.

๐Ÿ‘‹ Long time no see!

I've been working on a library that may be helpful here: pyo3-object_store. It's designed to define builders once, so that many Python-Rust bindings can reuse the same ObjectStore integration.

It enables your functions that use object store to be as simple as:

#[pyfunction]
pub fn use_object_store(store: PyObjectStore) {
    let store: Arc<dyn ObjectStore> = store.into_inner();
}

I've been meaning to update and publish a new release of pyo3-object_store. It would be cool to prototype an integration here.

Hi @kylebarron funny how that happens.

I think (@LDeakin can correct me) that one would need to implement the necessary traits on the object_store object from https://docs.rs/zarrs_storage/0.3.0/zarrs_storage/index.html to get a compatible store for zarrs (to work with https://github.com/ilan-gold/zarrs-python/blob/main/src/store/manager.rs#L17).

But where that happens and how is probably more a matter of discussion. That would be very cool though!

Ah ok, @LDeakin is very comprehensive!

Cool stuff @kylebarron! Yeah, I'd be interested to see an integration

What is the priority to support S3? Is there any interest? Any rough outline of what would need to be done to enable?

Is there any interest?

Yea! At the Zarr summit, there was a lot of interest to support more stores in zarrs-python.

Any rough outline of what would need to be done to enable

Two options:

  1. It could be wrapped similarly to the HTTPStore
    • S3 has quite a few options compared to a HTTP store
    • This is very tedious to do for each store, and is brittle
  2. We integrate with pyo3-object_store in some way so we can just use any obstore store

The latter is preferable, and perhaps @kylebarron could give some advice if you (or anyone else) would be willing to attempt it.

pyo3-object_store provides helpers to give you an Arc<dyn ObjectStore> in your Python API.

So you can create a Python function that creates a Zarr store backend with

use zarrs_object_store::AsyncObjectStore;

#[pyfunction]
pub fn use_object_store(store: AnyObjectStore) {
    let store: Arc<dyn ObjectStore> = store.into_dyn();
    let zarr_backend = AsyncObjectStore::new(store);
}

I don't know the layout of zarrs-python well enough to know how to integrate it there

Sweet--thanks for the guidance! I am a total Rust newb, but looking for a project to help me learn. I might take a stab at this over the next few weeks, but do not let me hold anyone back from implementing.

@ljstrnadiii Something to note here - we do not have an async API at the moment. All io (i.e., from s3) will be parallel but not concurrent. So you may not see any grand performance improvement. The route to an async API is a little hairy but I've poked around at it a bit - I don't see any fundamental limitations here, though, but a first-time rust PR, it might not be.

What's the blocker for async? I don't understand zarr-python well enough to know how this integration works.

It looks like here:

pub enum StoreConfig {
Filesystem(FilesystemStoreConfig),
Http(HttpStoreConfig),
// TODO: Add support for more stores
}

Zarrs-python is storing configs rather than instances.

I think we may have discussed this before; but is there a central requirement to have the configs as you have them? Also does it change anything now that there's an obstore integration in zarr-python?

What's the blocker for async? I don't understand zarr-python well enough to know how this integration works.

More than anything, no one has had the time to do it. Lachlan, Phil, and I all work with local data.

In terms of code, https://github.com/zarrs/zarrs-python/blob/main/src/lib.rs#L298 was giving me problems because of the lifetime + tokio. Maybe someone who has done more async rust could find a way around it, maybe just Arcing it works, not sure. If there's no way around it, we might have to create our own "lazy" NDBuffer class, but I assume this is a me-limitation and not a rust-limitation.

Zarrs-python is storing configs rather than instances.

I think this is just a misdirection so that we can handle stores that can't be mapped directly to rust i.e., FSSpecStore or LocalStore. So yes, we aren't storing instances at the moment.

If we were to make it store the instances themselves behind the python wrappers and use the trick in #44 (comment) for every store i.e., using into_dyn gives the underlying rust object that python wraps (assuming I understand that correctly), I would ask for a wrapper around the current rust FilesystemStore because it has O_DIRECT reading (or a way to get that object back with its O_DIRECT reading configured).

What's the blocker for async? I don't understand zarr-python well enough to know how this integration works.

More than anything, no one has had the time to do it. Lachlan, Phil, and I all work with local data.

In terms of code, main/src/lib.rs#L298 was giving me problems because of the lifetime + tokio. Maybe someone who has done more async rust could find a way around it, maybe just Arcing it works, not sure. If there's no way around it, we might have to create our own "lazy" NDBuffer class, but I assume this is a me-limitation and not a rust-limitation.

The lifetime of the Bound? That's definitely not an issue. You shouldn't need to make a lazy class; it should just work with pyo3-async-runtimes.

I can point you to relevant code in obstore; e.g. this result stream from a get operation.

You might just need to implement IntoPyObject on a struct you return, or something like that

Zarrs-python is storing configs rather than instances.

I think this is just a misdirection so that we can handle stores that can't be mapped directly to rust i.e., FSSpecStore or LocalStore. So yes, we aren't storing instances at the moment.

If we were to make it store the instances themselves behind the python wrappers and use the trick in #44 (comment) for every store i.e., using into_dyn gives the underlying rust object that python wraps (assuming I understand that correctly), I would ask for a wrapper around the current rust FilesystemStore because it has O_DIRECT reading (or a way to get that object back with its O_DIRECT reading configured).

Yeah so obstore, for example, has Python classes that store an Arc<dyn ObjectStore>. Then it also exposes pickle support for serialization, which is commonly used in dask

The lifetime of the Bound? That's definitely not an issue. You shouldn't need to make a lazy class; it should just work with pyo3-async-runtimes.

Sorry I misremembered, just took a quick peak:

zarrs-python/src/lib.rs

Lines 189 to 206 in fbe8827

fn nparray_to_unsafe_cell_slice<'a>(
value: &'a Bound<'_, PyUntypedArray>,
) -> Result<UnsafeCellSlice<'a, u8>, PyErr> {
if !value.is_c_contiguous() {
return Err(PyErr::new::<PyValueError, _>(
"input array must be a C contiguous array".to_string(),
));
}
let array_object: &PyArrayObject = Self::py_untyped_array_to_array_object(value);
let array_data = array_object.data.cast::<u8>();
let array_len = value.len() * value.dtype().itemsize();
let output = unsafe {
// SAFETY: array_data is a valid pointer to a u8 array of length array_len
debug_assert!(!array_data.is_null());
std::slice::from_raw_parts_mut(array_data, array_len)
};
Ok(UnsafeCellSlice::new(output))
}
this is the lifetime that was giving me issues. Maybe this could be done inside tokio::spawn instead of before calling that (which is the opposite of what happens in our current implementation where it is created ahead of the threaded bits)? I don't know

I can't tell from just looking at this code what would be wrong with async. Do you have a branch that's giving compile errors?

I hesitate to share because I don't want to send you down some rabbit hole: https://github.com/zarrs/zarrs-python/tree/ig/async_lifetime while this compiles, I think it's pretty egregiously wrong for a variety of reasons. I think the static declarations are all basically wrong, but this would be the idea modulo that

(It also hasn't had main merged in a while so beware that too, but here are the lines I think are of the main interest: https://github.com/zarrs/zarrs-python/compare/ig/async_lifetime?expand=1#diff-b1a35a68f14e696205874893c07fd24fdb88882b47c23cc0e0c80a30c7d53759R477-R592)

I don't understand enough unsafe bits here about how you do numpy casting (I exclusively use safe APIs because I just don't understand enough about what I need to watch out for), so I don't know if there are interactions there about using unsafe with async

And I probably can't reliably help you with any unsafe bits.

That's fair, it's probably more work to unscrew what I have than start from scratch. The unsafe comes from https://docs.rs/unsafe_cell_slice/latest/unsafe_cell_slice/ and not the numpy casting.

And the use of UnsafeCellSlice is zarrs/zarrs#133, so avoiding that would mean refactoring zarrs to no longer use it.

I can give feedback on using pyo3-object_store and on connecting async rust to python with pyo3-async-runtimes, but I don't think I can give any feedback on the UnsafeCellSlice