A convenient way to store a publicly accessible Zarr dataset that is versioned and optionally tied to a Zenodo DOI:
import xarray as xr
import fsspec
uri = 'https://scottyhq.github.io/zarrdata/air_temperature.zarr'
ds = xr.open_dataset(uri, engine="zarr", consolidated=True)
ds.air.isel(time=1).plot(x="lon")
The basic idea is to host a smallish, citable record (<1GB) on a static GitHub Pages website so that your tutorial, research code, benchmarking suite, etc. can run against a citable dataset.
The key limitation of this approach is that each Zarr chunk must be smaller than 100MB (per GitHub repository limits), and the total size of the repo, Zarr store included, should be less than 1GB (per GitHub Pages limits). If you're dealing with more than 1GB of data or want high performance, you probably want to store the data files on AWS S3, GCS, etc.
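Before committing a store, it can help to check it against those two limits. The following is a minimal sketch using only the standard library; the function name `check_store` and the decimal byte thresholds are assumptions made for illustration, not something taken from this repo.

```python
import os

# Assumed thresholds, taken from the limits quoted above (decimal units).
MAX_FILE_BYTES = 100 * 1000**2  # 100MB per file (each chunk is one file)
MAX_TOTAL_BYTES = 1000**3       # 1GB for the whole store


def check_store(store_path, max_file=MAX_FILE_BYTES, max_total=MAX_TOTAL_BYTES):
    """Return (total_bytes, oversized_files, fits) for a directory Zarr store."""
    total = 0
    oversized = []
    for root, _dirs, files in os.walk(store_path):
        for name in files:
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            total += size
            if size >= max_file:
                # Record the path relative to the store plus its size in bytes.
                oversized.append((os.path.relpath(path, store_path), size))
    fits = not oversized and total < max_total
    return total, oversized, fits
```

Calling `check_store("air_temperature.zarr")` on a local copy of the store returns the total size in bytes, a list of any files at or above the per-file threshold, and a single boolean saying whether both checks pass.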
-
Add Zarr data. In the create_zarr.py script I just create a Zarr store from the Xarray tutorial dataset, but if you already have a data.zarr you can just add it to your repo.
-
Add a Jekyll configuration file. GitHub Pages automatically deploys your repository and serves it as a static site over HTTP via Jekyll. Because Jekyll ignores hidden files (.zattrs, .zmetadata, etc.) by default, you need a _config.yml to ensure those files are included.
-
Enable GitHub Pages. To publish the site you just need to enable GitHub Pages for the repository. It's as simple as going to the repository's Settings->Pages->Source (select the 'main' branch and click 'Save')! Then you'll have a live HTTP website with the repo's README.md rendered! For this repo, https://github.com/scottyhq/zarrdata, the website is https://scottyhq.github.io/zarrdata .
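For the Jekyll configuration step, one way to build the _config.yml is to scan the store for hidden file names and list them under Jekyll's `include` option. This is a sketch under assumptions: the exact contents of this repo's _config.yml are not shown here, the helper names are hypothetical, and it relies on Jekyll matching `include` entries by file name inside nested directories.

```python
import os


def hidden_names(store_path):
    """Collect the distinct dot-prefixed file names found anywhere in the store."""
    names = set()
    for _root, _dirs, files in os.walk(store_path):
        names.update(n for n in files if n.startswith("."))
    return sorted(names)


def write_jekyll_config(store_path, config_path="_config.yml"):
    """Write a Jekyll config that force-includes the store's hidden files.

    Note: this overwrites any existing file at config_path.
    """
    names = hidden_names(store_path)
    if names:
        lines = ["include:"] + [f'  - "{n}"' for n in names]
    else:
        lines = ["include: []"]
    with open(config_path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return lines
```

For a Zarr v2 store this would typically list `.zarray`, `.zattrs`, `.zgroup`, and `.zmetadata`. If the repo already has a _config.yml with other settings, merge the `include` entries by hand instead of running this.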
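Once Pages is enabled, a quick sanity check is to request one of the hidden files directly, since a 404 there usually means Jekyll dropped it. The sketch below assumes a Zarr v2 store with consolidated metadata (a `.zmetadata` JSON file at the store root); the function names are made up for illustration.

```python
import json
import urllib.request


def metadata_url(store_url):
    """Build the URL of the consolidated metadata file for a store URL."""
    return store_url.rstrip("/") + "/.zmetadata"


def fetch_consolidated_keys(store_url, timeout=30):
    """Download .zmetadata and return the sorted metadata keys it lists.

    Raises urllib.error.HTTPError (e.g. 404) if the hidden file is not served.
    """
    with urllib.request.urlopen(metadata_url(store_url), timeout=timeout) as resp:
        meta = json.load(resp)
    # Consolidated v2 metadata keeps per-array and per-group entries
    # under the top-level "metadata" key.
    return sorted(meta["metadata"])
```

For this repo, `fetch_consolidated_keys("https://scottyhq.github.io/zarrdata/air_temperature.zarr")` should return the keys for the store's arrays and groups if the site is serving the hidden files.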