zarr-developers/zarr-python

async zarr

A quick sketch of how we can couple zarr with async code. This is aimed slightly at pyscript, but could be useful in its own right: for instance, I asked a while ago what it would take to fetch chunks concurrently not just from one array but, say, one chunk each from multiple arrays in a dataset.
This sketch is only for reading...

Outline:

  • we subclass Group, so that __getitem__ produces an AsyncArray
  • we subclass Array as AsyncArray and override its methods from _get_selection (which does the IO) up to __getitem__ (which is the user-facing API)
  • we have three stores:
    • a synchronous HTTP one for the dataset metadata. This can be based on requests for standard Python or pyfetch under pyodide. Note that sync calls in pyodide are limited to text, which is perfect for this use case (see the snippet just after this list).
    • a fake synchronous store which merely records the paths that are attempted, but raises FileNotFound for all of them
    • a fake synchronous store in which we have prefilled all the keys it will ever need, i.e., this can be a simple dict
  • The flow goes as follows:
    • A zarr AsyncGroup is made by reading JSON files synchronously
    • When we attempt to get data, we make a coroutine in which we first use the fake store and zarr's existing machinery to record all the keys that will be needed (this temporarily produces an array of NaN); we then fetch all those keys concurrently, populate a dict, and let the existing zarr machinery read from that dict (a concrete sketch follows below)
  • For interest, this is an fsspec async filesystem for pyodide. We don't need it to be this verbose for zarr.
  • Note that in the browser, no fetches can ever be done without considering CORS, but any dataset known to work with zarr.js will work for this case too.
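
As a footnote to the pyodide bullet above, a rough illustration of the sync-text / async-binary split in pyodide's HTTP helpers (the URL is hypothetical, and this only runs inside pyodide):

```python
from pyodide.http import open_url, pyfetch

# Synchronous fetch: fine for small JSON metadata, but text-only.
meta = open_url("https://example.com/data.zarr/.zarray").read()

# Binary payloads (chunks) have to go through the async path.
async def get_chunk(url):
    resp = await pyfetch(url)
    return await resp.bytes()
```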

(I'm not sure whether this needs any spec discussion: it provides an alternative user API, but it doesn't actually change what zarr does or what the metadata looks like.)
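
To make the outline concrete, here is a minimal, self-contained sketch of the three stores and the two-pass read. Everything in it (SyncHTTPStore, RecordingStore, BASE_URL, the toy sync_read) is a hypothetical stand-in: in real zarr, the role of sync_read is played by the existing Array._get_selection machinery, and stores follow the mapping protocol.

```python
import asyncio

import requests

BASE_URL = "https://example.com/data.zarr"  # hypothetical dataset URL


class SyncHTTPStore:
    """Store 1: synchronous HTTP, used only for the small JSON metadata
    keys (.zgroup/.zarray/.zattrs), where blocking is acceptable."""

    def __getitem__(self, key):
        r = requests.get(f"{BASE_URL}/{key}")
        if r.status_code == 404:
            raise KeyError(key)  # zarr's convention for a missing key
        return r.content


class RecordingStore:
    """Store 2: pretends every key is missing, but records what was asked for."""

    def __init__(self):
        self.requested = []

    def __getitem__(self, key):
        self.requested.append(key)
        raise KeyError(key)  # zarr fills missing chunks with fill_value


# Store 3 is just a plain dict prefilled with key -> raw chunk bytes.


async def fetch(key):
    """Placeholder for a real concurrent fetch (aiohttp, pyodide's pyfetch, ...)."""
    await asyncio.sleep(0)
    return b"raw chunk bytes"


def sync_read(store, keys):
    """Toy stand-in for zarr's synchronous chunk machinery: a missing key
    behaves like fill_value, mirroring zarr's real behaviour."""
    out = {}
    for key in keys:
        try:
            out[key] = store[key]
        except KeyError:
            out[key] = None  # fill_value
    return out


async def async_getitem(keys):
    # Pass 1: dry run against the recording store to discover the chunk keys.
    recorder = RecordingStore()
    sync_read(recorder, keys)  # result is all fill_value; we only keep .requested
    # Fetch every needed key concurrently.
    blobs = await asyncio.gather(*(fetch(k) for k in recorder.requested))
    # Pass 2: replay the same read against the prefilled dict store.
    return sync_read(dict(zip(recorder.requested, blobs)), keys)


print(asyncio.run(async_getitem(["temp/0.0", "temp/0.1"])))
```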

jbms commented

I think this might be more appropriate to discuss under the zarr-python repository, since it is just about the zarr-python API.

You might find it interesting to look at the tensorstore python API for ideas, as tensorstore provides an async API.
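
For illustration, the shape of that API is roughly the following; the spec here (zarr driver over an http kvstore at a made-up URL) is just an example, not a proposal for zarr-python:

```python
import tensorstore as ts


async def demo():
    # ts.open returns a Future, which is awaitable from async code
    # (or resolved synchronously via .result()).
    dataset = await ts.open({
        "driver": "zarr",
        "kvstore": {"driver": "http", "base_url": "https://example.com/data.zarr"},
    })
    # Indexing is lazy; read() returns another awaitable Future.
    chunk = await dataset[:64, :64].read()
    return chunk
```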

An alternative to consider would be to just address the limitation in pyscript directly:

I think with the help of a separate webworker thread it is possible to emulate sync fetch requests.

I think there are also options for compiling C code to "async" WebAssembly, though that does hurt performance.

I am happy to have this in zarr-python instead.

How do you imagine using the tensorstore model? The problem we are facing is being forced to call zarr's synchronous code in an async context, so using another Futures abstraction sounds like even more complexity.

> I think with the help of a separate webworker thread it is possible to emulate sync fetch requests.

In general, getting IO to work well in pyscript is an unsolved problem, and webworkers-as-threads might be part of the solution. Certainly, that's the only way the browser allows synchronous binary connections. To be sure, though: we do not want sync requests, paying the latency cost for every single chunk.

> I think there are also options for compiling C code to "async" WebAssembly, though that does hurt performance.

We are stuck with the sync Python API being called from an async context, so this is a Python programming problem. Anything lower level will not help us.

As I said at the start though, pyscript is not the only reason to want this.

jbms commented

> I am happy to have this in zarr-python instead.

> How do you imagine using the tensorstore model? The problem we are facing is being forced to call zarr's synchronous code in an async context, so using another Futures abstraction sounds like even more complexity.

I think from an API perspective futures are the most natural choice.

You can always create an async API on top of a sync API using a thread pool. In general it might be best to gradually add async APIs to zarr-python from the top down, using thread pools as needed to convert lower-level components from sync to async. Ultimately we would want to add async store implementations so that there are no sync I/O components left. The codecs are pure computation and don't need to be converted; they would just always require a thread pool.
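
A minimal sketch of that top-down approach, wrapping zarr's existing synchronous reads in asyncio's thread-pool helper (the dataset path and array names here are made up):

```python
import asyncio

import zarr


async def read_async(array, selection):
    # Run the blocking read on a worker thread so the event loop stays free.
    return await asyncio.to_thread(array.__getitem__, selection)


async def main():
    group = zarr.open("data.zarr", mode="r")
    # One slice each from multiple arrays in a dataset, fetched concurrently.
    temp, pres = await asyncio.gather(
        read_async(group["temperature"], slice(0, 100)),
        read_async(group["pressure"], slice(0, 100)),
    )


asyncio.run(main())
```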

> > I think with the help of a separate webworker thread it is possible to emulate sync fetch requests.

> In general, getting IO to work well in pyscript is an unsolved problem, and webworkers-as-threads might be part of the solution. Certainly, that's the only way the browser allows synchronous binary connections. To be sure, though: we do not want sync requests, paying the latency cost for every single chunk.

> > I think there are also options for compiling C code to "async" WebAssembly, though that does hurt performance.

> We are stuck with the sync Python API being called from an async context, so this is a Python programming problem. Anything lower level will not help us.

My understanding is that pyscript is built by compiling CPython to WebAssembly, where sync Python corresponds to sync WebAssembly/JavaScript. I was proposing that instead it could be compiled such that sync Python corresponds to async JavaScript. Thinking about it more, though, I realize that in addition to being a major re-architecting of pyscript, it would also come with major restrictions on "re-entering" Python during other operations, and therefore wouldn't really be practical.

> As I said at the start though, pyscript is not the only reason to want this.

> You can always create an async API on top of a sync API using a thread pool. In general it might be best to gradually add async APIs to zarr-python from the top down, using thread pools as needed to convert lower-level components from sync to async.

The fsspec store, and indeed the JS HTTP fetch methods, are async, so we already have this at the bottom of the stack. Making the compute part "concurrent" isn't useful; it's the IO that matters. Are you advocating a completely async alternate codepath all the way through zarr? I am trying to make use of zarr's simplicity to implement something that works quickly without changing the core.
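
For example, fsspec's HTTP filesystem can already fetch a batch of chunk URLs concurrently from async code (the URLs here would be hypothetical chunk keys):

```python
import fsspec


async def grab(urls):
    fs = fsspec.filesystem("http", asynchronous=True)
    session = await fs.set_session()  # required when asynchronous=True
    try:
        # _cat fetches all URLs concurrently and returns {url: bytes}.
        return await fs._cat(urls)
    finally:
        await session.close()
```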

A question for everyone: what does the zarr.js API look like? Is there any async there? I would assume there must be.

AIUI, PR #534 was exploring the concurrent.futures.Executor approach.

I have started writing a blog post about my implementation; it might be out this afternoon. It won't say anything new to people who are already on this thread, but it might attract more general interest. Specifically for pyodide/pyscript, I think it's still fair to say that the IO story is very far from solved for typical pydata libraries.

This was a great discussion. Pointing folks to the continuation of this idea slated for v3: #1583