Build Out Store Offerings
Opened this issue ยท 23 comments
We currently only do file system store: https://github.com/ilan-gold/zarrs-python/blob/main/src/codec_pipeline_store_filesystem.rs but should expand to, at the minimum, HTTP.
- Integrate HTTP Store #61
- Flag for using Opendal (better performance? default for cloud urls like s3:// and gcs:// ?)
- Cache store? Somehow enabling https://docs.rs/zarrs/latest/zarrs/array/trait.ChunkCache.html if we detect an LRUCacheStore on the python side?
- Consolidated metadata: zarrs/zarrs#55
- Investigate exposing the
zarr-pythonByteGetterByteSetterto Rust to potentially support any store (fallback only?)
๐ Long time no see!
I've been working on a library that may be helpful here: pyo3-object_store. It's designed to define builders once, so that many Python-Rust bindings can reuse the same ObjectStore integration.
It enables your functions that use object store to be as simple as:
#[pyfunction]
pub fn use_object_store(store: PyObjectStore) {
let store: Arc<dyn ObjectStore> = store.into_inner();
}I've been meaning to update and publish a new release of pyo3-object_store. It would be cool to prototype an integration here.
Hi @kylebarron funny how that happens.
I think (@LDeakin can correct me) that one would need to implement the necessary traits on the object_store object from https://docs.rs/zarrs_storage/0.3.0/zarrs_storage/index.html to get a compatible store for zarrs (to work with https://github.com/ilan-gold/zarrs-python/blob/main/src/store/manager.rs#L17).
But where that happens and how is probably more a matter of discussion. That would be very cool though!
That's this, right? https://docs.rs/zarrs_object_store/latest/zarrs_object_store/
Cool stuff @kylebarron! Yeah, I'd be interested to see an integration
What is the priority to support S3? Is there any interest? Any rough outline of what would need to be done to enable?
Is there any interest?
Yea! At the Zarr summit, there was a lot of interest to support more stores in zarrs-python.
Any rough outline of what would need to be done to enable
Two options:
- It could be wrapped similarly to the HTTPStore
- S3 has quite a few options compared to a HTTP store
- This is very tedious to do for each store, and is brittle
- We integrate with
pyo3-object_storein some way so we can just use anyobstorestore
The latter is preferable, and perhaps @kylebarron could give some advice if you (or anyone else) would be willing to attempt it.
pyo3-object_store provides helpers to give you an Arc<dyn ObjectStore> in your Python API.
So you can create a Python function that creates a Zarr store backend with
use zarrs_object_store::AsyncObjectStore;
#[pyfunction]
pub fn use_object_store(store: AnyObjectStore) {
let store: Arc<dyn ObjectStore> = store.into_dyn();
let zarr_backend = AsyncObjectStore::new(store);
}I don't know the layout of zarrs-python well enough to know how to integrate it there
Sweet--thanks for the guidance! I am a total Rust newb, but looking for a project to help me learn. I might take a stab at this over the next few weeks, but do not let me hold anyone back from implementing.
@ljstrnadiii Something to note here - we do not have an async API at the moment. All io (i.e., from s3) will be parallel but not concurrent. So you may not see any grand performance improvement. The route to an async API is a little hairy but I've poked around at it a bit - I don't see any fundamental limitations here, though, but a first-time rust PR, it might not be.
What's the blocker for async? I don't understand zarr-python well enough to know how this integration works.
It looks like here:
Lines 22 to 26 in fbe8827
Zarrs-python is storing configs rather than instances.
I think we may have discussed this before; but is there a central requirement to have the configs as you have them? Also does it change anything now that there's an obstore integration in zarr-python?
What's the blocker for async? I don't understand zarr-python well enough to know how this integration works.
More than anything, no one has had the time to do it. Lachlan, Phil, and I all work with local data.
In terms of code, https://github.com/zarrs/zarrs-python/blob/main/src/lib.rs#L298 was giving me problems because of the lifetime + tokio. Maybe someone who has done more async rust could find a way around it, maybe just Arcing it works, not sure. If there's no way around it, we might have to create our own "lazy" NDBuffer class, but I assume this is a me-limitation and not a rust-limitation.
Zarrs-python is storing configs rather than instances.
I think this is just a misdirection so that we can handle stores that can't be mapped directly to rust i.e., FSSpecStore or LocalStore. So yes, we aren't storing instances at the moment.
If we were to make it store the instances themselves behind the python wrappers and use the trick in #44 (comment) for every store i.e., using into_dyn gives the underlying rust object that python wraps (assuming I understand that correctly), I would ask for a wrapper around the current rust FilesystemStore because it has O_DIRECT reading (or a way to get that object back with its O_DIRECT reading configured).
What's the blocker for async? I don't understand zarr-python well enough to know how this integration works.
More than anything, no one has had the time to do it. Lachlan, Phil, and I all work with local data.
In terms of code,
main/src/lib.rs#L298 was giving me problems because of the lifetime + tokio. Maybe someone who has done more async rust could find a way around it, maybe justArcing it works, not sure. If there's no way around it, we might have to create our own "lazy"NDBufferclass, but I assume this is a me-limitation and not a rust-limitation.
The lifetime of the Bound? That's definitely not an issue. You shouldn't need to make a lazy class; it should just work with pyo3-async-runtimes.
I can point you to relevant code in obstore; e.g. this result stream from a get operation.
You might just need to implement IntoPyObject on a struct you return, or something like that
Zarrs-python is storing configs rather than instances.
I think this is just a misdirection so that we can handle stores that can't be mapped directly to rust i.e.,
FSSpecStoreorLocalStore. So yes, we aren't storing instances at the moment.If we were to make it store the instances themselves behind the python wrappers and use the trick in #44 (comment) for every store i.e., using
into_dyngives the underlying rust object that python wraps (assuming I understand that correctly), I would ask for a wrapper around the current rustFilesystemStorebecause it hasO_DIRECTreading (or a way to get that object back with itsO_DIRECTreading configured).
Yeah so obstore, for example, has Python classes that store an Arc<dyn ObjectStore>. Then it also exposes pickle support for serialization, which is commonly used in dask
The lifetime of the Bound? That's definitely not an issue. You shouldn't need to make a lazy class; it should just work with pyo3-async-runtimes.
Sorry I misremembered, just took a quick peak:
Lines 189 to 206 in fbe8827
tokio::spawn instead of before calling that (which is the opposite of what happens in our current implementation where it is created ahead of the threaded bits)? I don't knowI can't tell from just looking at this code what would be wrong with async. Do you have a branch that's giving compile errors?
I hesitate to share because I don't want to send you down some rabbit hole: https://github.com/zarrs/zarrs-python/tree/ig/async_lifetime while this compiles, I think it's pretty egregiously wrong for a variety of reasons. I think the static declarations are all basically wrong, but this would be the idea modulo that
(It also hasn't had main merged in a while so beware that too, but here are the lines I think are of the main interest: https://github.com/zarrs/zarrs-python/compare/ig/async_lifetime?expand=1#diff-b1a35a68f14e696205874893c07fd24fdb88882b47c23cc0e0c80a30c7d53759R477-R592)
I don't understand enough unsafe bits here about how you do numpy casting (I exclusively use safe APIs because I just don't understand enough about what I need to watch out for), so I don't know if there are interactions there about using unsafe with async
And I probably can't reliably help you with any unsafe bits.
That's fair, it's probably more work to unscrew what I have than start from scratch. The unsafe comes from https://docs.rs/unsafe_cell_slice/latest/unsafe_cell_slice/ and not the numpy casting.
And the use of UnsafeCellSlice is zarrs/zarrs#133, so avoiding that would mean refactoring zarrs to no longer use it.
I can give feedback on using pyo3-object_store and on connecting async rust to python with pyo3-async-runtimes, but I don't think I can give any feedback on the UnsafeCellSlice