xarray-contrib/xbatcher

Integration with Hugging Face Datasets

Opened this issue · 5 comments

I've recently been learning about Hugging Face Datasets. It's a great data-sharing platform for ML. The datasets package is based on TensorFlow Datasets.

It would be great to think about how to best integrate Xarray and Xbatcher with huggingface datasets. Opening this issue just as a placeholder. Will update with more detail as I explore.

Not to hijack this thread, but just found out about xbatcher and was wondering how this fits into the ML ecosystem, and if there are ways we can share efforts and avoid reinventing the wheel. I've started a Pull Request recently at microsoft/torchgeo#509 to connect xarray datasets (technically via rioxarray) to torchgeo, and was pleasantly surprised to have found that xbatcher has implemented something similar a year ago at #25 already!

Will be happy to hear any thoughts on this, I might pop in for the Pangeo ML Working Group meeting to discuss this.

cc @adamjstewart

Hi @weiji14! @jhamman added this issue and your discussion points to the agenda for the next Pangeo ML Working Group meeting. I would be excited to discuss opportunities to share efforts. I'm just starting to work on xbatcher and also plan to attend next week's meeting.

Oh hi Meghan! It always surprises me how small the open source world is 😆 Will definitely see what others are up to next Monday. My initial impression was to think of it in terms of a Pytorch/Tensorflow split, or to have the two libraries specialize in terms of loading from a GeoTIFF/Zarr file vs in-memory xarray objects. But the lines aren't quite as clear cut, and given that Pytorch 1.11 recently introduced TorchData/DataPipes, it'll be good to put some smart people together and think about what's the best way forward.

or to have the two libraries specialize in terms of loading from a GeoTIFF/Zarr file vs in-memory xarray objects

There might not be such a difference between these two approaches, if you remove the "in-memory" part. When you open data with Xarray it is automatically "lazy" about loading it into memory. It just puts a light "lazy indexing" wrapper around the underlying array in a GeoTIFF / Zarr / NetCDF / GRIB file. A downstream library (xbatcher, pytorch, etc.) can use these arrays in a streaming fashion. The advantage of using Xarray as a loader is that it already speaks all the weird file formats. The disadvantage is that there is some overhead in creating a Dataset, particularly around eager loading of coordinates. There may be workarounds for that, particularly post-Xarray-indexes-refactor.
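To make the lazy-indexing idea concrete, here is a toy sketch in plain Python + NumPy (not Xarray's actual implementation, and the `LazyArray`, `read_window`, and `iter_batches` names are invented for illustration): the wrapper defers reading from the backing store until a window is actually sliced, so a downstream batcher only ever touches the data one batch at a time.

```python
import numpy as np

class LazyArray:
    """Toy stand-in for a lazy-indexing wrapper: data is only read
    from the backing store when a slice is actually requested."""

    def __init__(self, loader, shape):
        self._loader = loader  # callable that reads one window from "disk"
        self.shape = shape
        self.load_count = 0    # track how often the backend is touched

    def __getitem__(self, key):
        self.load_count += 1
        return self._loader(key)  # read only the requested window

def read_window(key):
    # Pretend this reads just the requested window out of a
    # GeoTIFF / Zarr / NetCDF store.
    full = np.arange(100).reshape(10, 10)
    return full[key]

arr = LazyArray(read_window, shape=(10, 10))
assert arr.load_count == 0  # nothing has been read yet

def iter_batches(lazy, batch_rows=2):
    # Stream fixed-size batches, loading each window only when needed,
    # roughly what a batch generator does on top of lazy arrays.
    for start in range(0, lazy.shape[0], batch_rows):
        yield lazy[start:start + batch_rows]

batches = list(iter_batches(arr))
assert arr.load_count == 5  # one backend read per batch, not up front
```

The eager-coordinates overhead mentioned above is precisely what this sketch avoids: nothing is materialized at wrap time, only at slice time.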

Twitter thread related to huggingface and Zarr: https://twitter.com/rabernat/status/1517182069943713792