zarr-developers/geozarr-spec

Per-chunk metadata (e.g., bbox)

benbovy opened this issue · 8 comments

@rabernat mentioned in https://twitter.com/rabernat/status/1617209410702696449 the idea of attaching a "GeoBox" (i.e., bbox + CRS + grid metadata) to a dataset, which is implemented in odc-geo and which is indeed useful for indexing.
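For context, a minimal sketch of what such a GeoBox bundles together, using odc-geo (the GeoBox(shape, affine, crs) constructor and the property names below are assumptions based on my reading of odc-geo, not anything defined by GeoZarr):

```python
# Rough sketch of a "GeoBox" (bbox + CRS + grid metadata) with odc-geo.
# The GeoBox(shape, affine, crs) signature is assumed here.
from affine import Affine
from odc.geo.geobox import GeoBox

# 30 m pixels, top-left corner at x=500000, y=4200000 in EPSG:32633
transform = Affine(30.0, 0.0, 500_000.0,
                   0.0, -30.0, 4_200_000.0)
gbox = GeoBox((1024, 1024), transform, "EPSG:32633")  # shape is (ny, nx)

print(gbox.boundingbox)  # bbox of the full grid
print(gbox.crs)          # EPSG:32633
```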

Now I'm wondering if it would be possible to reconstruct such a GeoBox for each chunk of a Zarr array or dataset? This would require storing a bbox per chunk. I'm not very familiar with the Zarr specs, though. Is it possible/easy to store arbitrary metadata per chunk?

One potential use case would be scalable (e.g., dask-friendly) implementations of spatial regridding / resampling algorithms that work with non-trivial datasets (e.g., curvilinear grids).

There is an interesting, somewhat related discussion in the geoarrow-specs repository: geoarrow/geoarrow#19. As far as I understand, geospatial vector datasets are currently partitioned spatially using multiple Parquet files (dask-geopandas parquet IO, dask-geopandas.read_parquet). For GeoZarr, however, I guess we don't want one Zarr dataset per spatial partition.

It could also be defined at the shard level...

Maybe we can have @joshmoore's input here?

If images have different spatial extents, I think it would make more sense to store them as distinct arrays, rather than as chunks of the same array.

I want to close this as out of scope.

Zarr does not allow per-chunk metadata, and we are not making any Zarr extensions here, so we need to find a different solution to this use case. The obvious one to me is to just store images with different bboxes in separate arrays.
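As an illustration (not part of any spec), the "separate arrays" layout could look roughly like this with zarr-python (v2-style API); the array and attribute names here are made up for the example:

```python
# Hypothetical layout: one Zarr array per image, each carrying its own
# bbox/CRS as ordinary array attributes (attribute names are illustrative).
import zarr

root = zarr.open_group("collection.zarr", mode="w")

scene_a = root.zeros("scene_a", shape=(1024, 1024), chunks=(256, 256), dtype="uint16")
scene_a.attrs["bbox"] = [500_000.0, 4_169_280.0, 530_720.0, 4_200_000.0]
scene_a.attrs["crs"] = "EPSG:32633"

scene_b = root.zeros("scene_b", shape=(2048, 2048), chunks=(256, 256), dtype="uint16")
scene_b.attrs["bbox"] = [600_000.0, 4_138_560.0, 661_440.0, 4_200_000.0]
scene_b.attrs["crs"] = "EPSG:32633"
```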

I think that at this stage this is still a good place (or better, a new issue) for discussing whether/how we can facilitate spatial indexing and/or partitioning of large datasets in general, even if this would require multiple Zarr arrays (groups?) or some kind of Zarr extension.

I might be missing something, but this should be possible today without the need for per-chunk metadata. As long as you have something like the geotransform, so that you know where the "origin" pixel is and the spacing between pixels, plus the size of each chunk, you should be able to get the bbox of each chunk with a bit of math (see the sketch below).

This should be exactly the same as how GDAL / COG handles reading a single block out of a larger COG, just using multiple files / chunks.

Perhaps it isn't safe to assume that every chunk of this dataset is on the same grid / projection. But in that case, I'd recommend storing them in separate arrays.
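A minimal sketch of that "bit of math", assuming a GDAL-style affine geotransform (via the affine package) and regular chunking of a 2D (y, x) array; the function name and the example numbers are just for illustration:

```python
# Compute the bbox of each chunk from the array's affine geotransform and
# chunk sizes. Pure arithmetic, no per-chunk metadata needed.
from affine import Affine

def chunk_bboxes(transform, shape, chunks):
    """Yield (iy, ix, (xmin, ymin, xmax, ymax)) for each chunk.

    transform: affine.Affine mapping (col, row) pixel edges -> (x, y)
    shape:     (ny, nx) array shape
    chunks:    (cy, cx) chunk shape
    """
    ny, nx = shape
    cy, cx = chunks
    for iy in range(0, ny, cy):
        for ix in range(0, nx, cx):
            rows = (iy, min(iy + cy, ny))
            cols = (ix, min(ix + cx, nx))
            # Map the two opposite chunk corners to world coordinates.
            x0, y0 = transform * (cols[0], rows[0])
            x1, y1 = transform * (cols[1], rows[1])
            yield iy // cy, ix // cx, (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))

# Example: 30 m pixels, north-up grid, origin at the top-left corner.
gt = Affine(30.0, 0.0, 500_000.0, 0.0, -30.0, 4_200_000.0)
for iy, ix, bbox in chunk_bboxes(gt, shape=(1024, 1024), chunks=(512, 512)):
    print(iy, ix, bbox)
```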

As long as you have something like the geotransform, so that you know where the "origin" pixel is and the spacing between pixels, plus the size of each chunk, you should be able to get the bbox of each chunk with a bit of math

This is interesting. 🤔 So the idea is that you would have an array stack with dimensions

image[item, y, x, band]

and then coordinate variables like

x_origin[item]
y_origin[item]

Then you could construct the geotransforms on the fly for the entire collection, create a geodataframe, etc.
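A rough sketch of what that could look like, assuming constant pixel size and image shape across the collection and per-item x_origin / y_origin coordinates as above (the variable names and the use of geopandas/shapely here are illustrative, not anything the spec prescribes):

```python
# Build per-item bboxes (and a GeoDataFrame) on the fly from
# x_origin[item] / y_origin[item] plus a shared pixel size and image shape.
import geopandas as gpd
from shapely.geometry import box

pixel_size = 30.0            # assumed constant across the collection
ny, nx = 1024, 1024          # assumed constant y/x dimensions (see question below)

x_origin = [500_000.0, 530_720.0]      # top-left x per item
y_origin = [4_200_000.0, 4_200_000.0]  # top-left y per item

geoms = [
    box(x0, y0 - ny * pixel_size, x0 + nx * pixel_size, y0)
    for x0, y0 in zip(x_origin, y_origin)
]
gdf = gpd.GeoDataFrame({"item": range(len(geoms))}, geometry=geoms, crs="EPSG:32633")
print(gdf)
```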

For the image collections we are talking about, is it safe to assume that the images all have the same x and y dimensions? Or could they possibly be different sizes?

briannapagan commented 2 days ago
Maybe we can have @joshmoore's input here?

Sorry for the slow response. I don't have any definitive responses, but ...

rabernat commented 18 hours ago
I want to close this as out of scope. Zarr does not allow per-chunk metadata, and we are not making any Zarr extensions here, so we need to find a different solution to this use case.

Big 👍 for this strategy on this repo, with the caveat that the individual convention efforts (GeoZarr, NGFF, etc.) will likely identify things that need to make it into zarr-specs.

As long as you have something like the geotransform so that you know where the "origin" pixel

This reminds me somewhat of ome/ngff#138 (which also triggered a discussion in NGFF space about use of cfconventions...)