zarr-developers/geozarr-spec

Zarr Sprint Topics

Opened this issue · 13 comments

Per our discussions in the bi-weekly GeoZarr SWG meeting, we identified a few focus tracks for the Zarr sprint coming up on February 7/8th, 2024. In addition, I reviewed the original brainstorming ideas first discussed a year ago and documented at https://hackmd.io/t2DWpX1iQEWMKx1Fi4Px7A?both#Let’s-brainstorm. Many of these ideas are captured by the proposed list we discussed on January 24th. The topic of bidirectional interoperability with GDAL is another clear theme, although, as we discussed at the last SWG meeting, it would be very difficult to tackle in a single sprint and, more importantly, we may not have someone to lead it. Nevertheless, I am listing it as an option to see whether we can identify folks in the community to lead it.

Here are the topics I have narrowed down to:

  1. pyramiding @maxrjones, virtual
  2. HTTP-browsable Zarr @rabernat @kbgg
  3. Zarr v3 for zarr-python @jhamman, virtual
  4. GDAL bidirectional interoperability (???)

Here is the proposed template. I ask the folks tagged as leading the tracks above to complete it and share their responses below.

As a Zarr Sprint track...
Our focus is on <Outcome>
We believe it delivers <Impact> to <Whom>
This will be achieved when <some acceptance criteria>
The types of skills we need to complete this task are <some list>
We expect the level of difficulty to complete this to be <low, medium, high>

Topic leaders, if you can fill in the above template by Monday, January 29th, then as a community we will provide ranked responses by Wednesday, January 31st.

As a Zarr Sprint track focused on enabling support for V3 in Zarr-Python we are joining an ongoing effort working toward Zarr-Python version 3.0 (roadmap).
Our focus is on closing outstanding issues on the roadmap and testing the development branch in common geospatial applications.
Zarr-Python has traditionally been the canonical implementation of Zarr, so we believe this effort delivers immediate impact to the largest swath of users, including those who use Zarr through downstream libraries (e.g. Xarray).
This will be achieved when any of the roadmap issues are closed or some of the following objectives are completed:

  • prototype using Zarr-Python 3's dev branch in Xarray's Zarr-backend
  • prototype using Zarr-Python 3's dev branch in Dask's Zarr IO utilities
  • benchmark typical geospatial read/write operations on groups/arrays/metadata to understand performance (a rough sketch of such a micro-benchmark follows this list)
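
To make the benchmarking objective concrete, here is a minimal sketch of the kind of micro-benchmark we have in mind, written against the stable Zarr-Python 2.x API so that the same numbers can later be compared against the v3 development branch; the store path, array shape, and chunking are illustrative only.

```python
import time

import numpy as np
import zarr

# Illustrative local store; swap in an fsspec/S3-backed store to measure remote access.
store = zarr.DirectoryStore("benchmark.zarr")
root = zarr.group(store=store, overwrite=True)
arr = root.create_dataset(
    "tas", shape=(365, 1800, 3600), chunks=(1, 600, 600), dtype="f4"
)

t0 = time.perf_counter()
arr[0] = np.random.rand(1800, 3600).astype("f4")  # write one time slice (many chunks)
write_s = time.perf_counter() - t0

t0 = time.perf_counter()
_ = arr[0, :600, :600]                            # read back a single chunk
read_s = time.perf_counter() - t0

print(f"write: {write_s:.3f}s  read: {read_s:.3f}s")
```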

The types of skills we need to complete this task are moderate to advanced familiarity with Python and Zarr.
We expect the level of difficulty to complete this to be medium to high.

A Zarr Sprint track focused on geospatial multi-scales / pyramids.

Our focus is on identifying and addressing shortcomings of the ndpyramid utility, either through development in that library or by deciding where else development would need to happen.

ndpyramid is a utility for generating pyramids from Zarr datasets to enable performant visualization. The library was built specifically for use with the @carbonplan/maps toolkit and produces pyramids conforming to this schema. There has been persistent community interest in broader support and standards for pyramids in Zarr. The focus of this sprint is on establishing whether ndpyramid could provide the foundation for this support and, if so, developing towards that goal, or, if not, establishing where else development should happen.
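
For orientation, here is a minimal sketch of how ndpyramid is typically used today; the input path and level count are illustrative, and the input dataset is assumed to already carry rioxarray-style spatial metadata (x/y dimensions and a CRS).

```python
import xarray as xr
from ndpyramid import pyramid_reproject

# Any raster-like dataset with a CRS and x/y spatial dimensions (via rioxarray).
ds = xr.open_dataset("my_raster.zarr", engine="zarr")

# Reproject and downsample into a multi-level DataTree following the
# @carbonplan/maps pyramid schema, then persist the whole hierarchy as one Zarr store.
pyramid = pyramid_reproject(ds, levels=4)
pyramid.to_zarr("my_raster_pyramid.zarr", mode="w")
```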

We anticipate that this sprint will progress development towards geospatial pyramids in Zarr that can be used broadly for dynamic client visualization approaches, tiling servers, and multi-scale analysis. This will serve data providers, front-end developers, and researchers.

This will be achieved when:

  • We demonstrate the generation and usage of Zarr pyramids with @carbonplan/maps, titiler-xarray, QGIS, Datashader, and an SRCNN, all sharing the same schema.

Some potentially more attainable goals for the short sprint:

  • Bring your own data: Try out generating pyramids on your dataset and identify where current tools fall short
  • Define how different Tile Matrix Sets would be represented in the pyramid schema
  • Implement a proof of concept for producing a World CRS84 Quad pyramid using ndpyramid
  • Test a proof of concept for loading raw data and pyramids stored separately using a single datatree entrypoint (a rough sketch of this idea follows the list)
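
A hedged sketch of that last bullet, assuming the raw data and the pyramid live in separate Zarr stores; the paths and group layout are hypothetical.

```python
import xarray as xr
from datatree import DataTree

# Full-resolution data and its overview levels, stored separately.
raw = xr.open_dataset("data/raw.zarr", engine="zarr")
levels = {
    f"pyramid/{level}": xr.open_dataset("data/pyramid.zarr", engine="zarr", group=str(level))
    for level in range(4)
}

# One tree exposing the raw data alongside its pyramid levels through a single entrypoint.
tree = DataTree.from_dict({"raw": raw, **levels})
print(tree)
```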

The types of skills needed to complete this task are moderate familiarity with Python and Zarr. We would especially encourage participation from those familiar with geospatial projections, multiscale representations, and metadata conventions.

We expect the level of difficulty to complete this to be medium.

Zarr Linked Hierarchy for HTTP-enabled Browsing

Focus and Outcomes

Our focus is on achieving the ability to explore nested Zarr groups over HTTP or other stores that do not provide a LIST-style operation.

This will enable

  • organizations to more easily share Zarr data on vanilla HTTP servers
  • web-based viewers for Zarr data to traverse nested groups

More Context

Zarr is not a file format; it is a specification for how to organize a nested hierarchy of numerical arrays and metadata in storage. In order to explore the contents of a Zarr hierarchy, clients generally need the ability to list the contents of directories in the storage layer. For filesystems or S3-compatible object storage this is straightforward, but a vanilla HTTP server offers no LIST-style operation. Most cloud-native geospatial data formats nevertheless provide first-class read-only support via plain HTTP, and Zarr should be able to do the same.

To address this need, Zarr V2 implemented a somewhat hacky “consolidated metadata” approach, in which all the metadata from a hierarchy are condensed into a single JSON file. This approach does not scale to very large, deeply nested Zarr hierarchies.
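
For reference, this is how that existing V2 workaround looks in Zarr-Python; the store paths are illustrative, and reading over HTTP requires fsspec with an HTTP filesystem installed.

```python
import zarr

# Writer side: collapse every .zgroup/.zarray/.zattrs in the hierarchy into a
# single .zmetadata key at the root.
store = zarr.DirectoryStore("hierarchy.zarr")
zarr.consolidate_metadata(store)

# Reader side: a client on a plain HTTP server needs only one metadata request,
# but it must download *all* of the hierarchy's metadata up front.
group = zarr.open_consolidated("https://example.com/hierarchy.zarr")
```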

Now that Zarr V3 has been ratified, there is an opportunity to develop an extension that supports this HTTP-browsing use case in a more scalable and robust way. Specifically, we imagine developing a STAC-like mechanism for explicit links between parent and child groups that allow an HTTP client to quickly traverse a Zarr hierarchy.
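
Purely as an illustration of the idea (nothing here is part of a ratified extension), the attributes of one group could carry explicit links roughly like this; the "links" key and "rel" values are hypothetical.

```python
# Hypothetical attributes carried in a V3 group's zarr.json metadata document.
group_attributes = {
    "links": [
        {"rel": "root",   "href": "../../zarr.json"},
        {"rel": "parent", "href": "../zarr.json"},
        {"rel": "child",  "href": "temperature/zarr.json"},    # sub-group or array node
        {"rel": "child",  "href": "precipitation/zarr.json"},
    ]
}
```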

Requirements

  • Zarr hierarchies of any size should be browseable and readable via HTTP without having to download the entire nested metadata
  • Clients entering a Zarr hierarchy should be able to traverse either up, until they reach the root group, or down, until they reach an array node.
  • Links can be generated and written by a client with list capabilities on the store. For example:
    • A Zarr hierarchy on disk can be linked by traversing the directory store and adding the appropriate metadata for parent / child links; the entire hierarchy can then be served over an HTTP server.
    • A Zarr hierarchy on S3 can be linked by traversing the S3 store, adding the appropriate metadata for parent / child links; the entire hierarchy can then be accessed via standard HTTP GET requests (rather than S3 protocol).
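
A hedged sketch of that linking step, reusing the hypothetical "links" layout from the earlier snippet: a client with list access walks a local hierarchy once and records each group's parent and children in its attributes, after which the tree can be served by any plain HTTP server. The traversal below uses the stable Zarr-Python 2.x API purely for illustration, even though the href targets name the V3 zarr.json metadata document.

```python
import zarr

def add_links(group, is_root=True):
    """Record parent/child links in this group's attributes, then recurse."""
    links = [] if is_root else [{"rel": "parent", "href": "../zarr.json"}]
    for name, child in group.groups():        # list capability is only needed at link time
        links.append({"rel": "child", "href": f"{name}/zarr.json"})
        add_links(child, is_root=False)
    for name, _ in group.arrays():
        links.append({"rel": "child", "href": f"{name}/zarr.json"})
    group.attrs["links"] = links

root = zarr.open_group("hierarchy.zarr", mode="r+")
add_links(root)
```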

Non-goals:

  • We do not aim to support writing to Zarr hierarchies via HTTP; this use case is read-only.
  • We assume the Zarr hierarchy exposed in this way is quasi-static; once the links are generated, they will not change frequently.

Implementation Plan and Skills Needed

We will try to implement this capability in zarr-python on the V3 branch. Contributors should be intermediate Python programmers (understand best practices around Python objects, typing, and code structure). Familiarity with the Zarr code base is not required but helpful. Participants should review the V3 roadmap and design document.

I'm also open to implementing this first in a JavaScript library rather than Python; for example, in the source.coop viewers package.

Along the lines of my comment above, I have a concrete proposal that could be fun for someone (like @kylebarron 😉) to work on.

Zarr-Python's V3 store interface is being redesigned to provide an all-async interface. The idea we have been discussing is to write a store on top of the Rust object_store crate. There are already Python bindings for this project, but they are not async-ready. If this particular plan is successful, it is possible this could become the core store in the zarr-python project.
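
Very roughly, the shape of such a store might look like the sketch below. Everything here is hypothetical: the `object_store_py` bindings module, its `from_url`/`get`/`put`/`delete` calls, and the method names of the eventual V3 store ABC are all placeholders, since neither the bindings' async API nor the new store interface is finalized.

```python
from typing import Optional

import object_store_py  # hypothetical async Python bindings to the Rust object_store crate


class ObjectStore:
    """Async key/value store delegating to object_store (S3, GCS, Azure, local, HTTP)."""

    def __init__(self, url: str):
        # e.g. "s3://bucket/prefix" or "file:///tmp/data.zarr"
        self._store = object_store_py.from_url(url)

    async def get(self, key: str) -> Optional[bytes]:
        try:
            return await self._store.get(key)
        except FileNotFoundError:
            return None

    async def set(self, key: str, value: bytes) -> None:
        await self._store.put(key, value)

    async def delete(self, key: str) -> None:
        await self._store.delete(key)
```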

@jhamman, I have been reading https://docs.rs/pyo3-asyncio/latest/pyo3_asyncio/index.html very carefully. Given I already did this once in rfsspec, I am prepared to give it a go on top of object_store. rfsspec showed marginal benefits, so while it may be worthwhile, do not expect a big return for the probably substantial effort.

Note that using rust async (tokio) in python async (asyncio) requires two event loops on two threads; it isn't simple! We also want to enable dask-style access from multiple (python) threads, so... Also, python bytes objects are annoying in rust (numpy buffers would be better, even for bytes output).

(I would be interested in this, because a Rust-only Zarr and kerchunk solution is very generally interesting for those that need a C-level API; however, if we don't use NumPy as the storage and we don't have numcodecs directly, it may ask more questions than it answers; cf. https://github.com/sci-rs/zarr).

Note that using rust async (tokio) in python async (asyncio) requires two event loops on two threads; it isn't simple

FWIW the next version of pyo3 is likely to have big progress in async handling, and it sounds like it might no longer need two event loops? PyO3/pyo3#1632 (comment)

Also, python bytes objects are annoying in rust (numpy buffers would be better, even for bytes output).

Why are they annoying? Is it because the memory is Python-allocated instead of Rust-allocated?

the memory is Python-allocated instead of Rust-allocated?

Yes, but also the internal immutability guarantee makes zero-copy handing of the memory to/from Rust hard. In rfsspec, I already wrote code around the Python buffer protocol to cope with this, which appears to work but sidelines Rust's memory protections.
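
To illustrate the immutability point (my addition, not from the discussion above): Python `bytes` expose only a read-only buffer, so Rust cannot fill them in place without either copying or working around the usual guarantees, whereas a writable buffer such as a bytearray or NumPy array can be handed over for in-place writes.

```python
import numpy as np

# bytes: read-only buffer, so zero-copy output into it is off the table.
print(memoryview(b"\x00" * 8).readonly)               # True

# bytearray: writable buffer; Rust could fill this in place via the buffer protocol.
print(memoryview(bytearray(8)).readonly)              # False

# NumPy array: also writable, and a natural container for decoded chunk output.
print(memoryview(np.zeros(8, dtype="u1")).readonly)   # False
```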

An additional V3 sprint topic idea, this one aimed at @TomNicholas: a manifest storage transformer (a minimal manifest sketch follows the list below). Specific goals for this sprint could be to:

  1. Evaluate the proposal - zarr-developers/zarr-specs#287
  2. Hack together a small sample dataset using Kerchunk and bespoke translation code
  3. Break ground on Zarr-Python's first V3 storage transformer; the logic is actually quite easy, but the internal hooks are not complete.
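
For a flavor of what such a manifest might contain, in the spirit of kerchunk's reference format: each chunk key points at a byte range in some existing file. The exact document layout is precisely what the linked proposal (zarr-developers/zarr-specs#287) is meant to pin down, so the field names below are illustrative only.

```python
# Hypothetical chunk manifest: chunk keys mapped to (path, offset, length) references.
manifest = {
    "temperature/c/0/0": {"path": "s3://bucket/file1.nc", "offset": 4096,    "length": 1048576},
    "temperature/c/0/1": {"path": "s3://bucket/file1.nc", "offset": 1052672, "length": 1048576},
    "temperature/c/1/0": {"path": "s3://bucket/file2.nc", "offset": 4096,    "length": 1048576},
}
```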

In the Zarr pyramids breakout group, Thomas Maschler and I discussed the motivations for following the OGC TileMatrixSet 2.0 specification within the GeoZarr specification, which will be shared as a new issue to supersede #30. We also discussed reading those TMS into rio-tiler using Xarray and started a refactor of ndpyramid to support the TMS specification.
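
For anyone following along, a small illustration of what working with an OGC TMS 2.0 tile matrix set looks like in Python, using the morecantile library that rio-tiler builds on (my example, not something produced in the breakout); the tile indices are arbitrary.

```python
import morecantile

# Load the registered World CRS84 Quad tile matrix set.
tms = morecantile.tms.get("WorldCRS84Quad")

print(tms.crs)                                           # CRS shared by every pyramid level
print(tms.matrix(2))                                     # TileMatrix definition at zoom 2
print(tms.xy_bounds(morecantile.Tile(x=1, y=1, z=2)))    # bounds of a single tile
```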

Zarr-Python post-sprint update

  1. @rabernat and @maxrjones worked on Zarr-Python's test environment and CI setup (zarr-developers/zarr-python#1648)
  2. @d-v-b worked on removing the attrs dependency from Zarr-Python (zarr-developers/zarr-python#1624)
  3. @kylebarron worked on a prototype store using new async Python bindings to Rust's object-store project

Thanks all!

In the "chunk manifest / virtual concatenation" group our main outcome was a long technical discussion, which I've written up in ZEP-like form here zarr-developers/zarr-specs#288 (comment)