Integration with Hugging Face Datasets
Opened this issue · 5 comments
I've recently been learning about Hugging Face Datasets. It's a great data-sharing platform for ML. The `datasets` package is modeled on TensorFlow Datasets.
It would be great to think about how to best integrate Xarray and Xbatcher with huggingface datasets. Opening this issue just as a placeholder. Will update with more detail as I explore.
Not to hijack this thread, but I just found out about xbatcher and was wondering how it fits into the ML ecosystem, and whether there are ways we can share efforts and avoid reinventing the wheel. I recently started a Pull Request at microsoft/torchgeo#509 to connect xarray datasets (technically via rioxarray) to torchgeo, and was pleasantly surprised to find that xbatcher had already implemented something similar a year ago at #25!
Will be happy to hear any thoughts on this, I might pop in for the Pangeo ML Working Group meeting to discuss this.
Oh hi Meghan! It always surprises me how small the open source world is. Will definitely see what others are up to next Monday. My initial impression was to think of it in terms of a PyTorch/TensorFlow split, or to have the two libraries specialize in loading from a GeoTIFF/Zarr file vs in-memory xarray objects. But the lines aren't quite as clear cut, and given that PyTorch 1.11 recently introduced TorchData/DataPipes, it'll be good to put some smart people together and think about the best way forward.
or to have the two libraries specialize in terms of loading from a GeoTIFF/Zarr file vs in-memory xarray objects
There might not be such a difference between these two approaches, if you remove the "in-memory" part. When you open data with Xarray it is automatically "lazy" about loading it into memory. It just puts a light "lazy indexing" wrapper around the underlying array in a GeoTIFF / Zarr / NetCDF / GRIB file. A downstream library (xbatcher, pytorch, etc.) can use these arrays in a streaming fashion. The advantage of using Xarray as a loader is that it already speaks all the weird file formats. The disadvantage is that there is some overhead in creating a Dataset, particularly around eager loading of coordinates. There may be workarounds for that, particularly post-Xarray-indexes-refactor.
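To make the streaming idea concrete, here's a rough sketch of how a downstream library could pull fixed-size batches out of an xarray Dataset on demand. This is not xbatcher's actual API; the `iter_batches` helper, the `t2m` variable, and the dimension names are all made up for illustration:

```python
# Sketch (hypothetical helper, not xbatcher's API): stream fixed-size
# windows from an xarray Dataset, one batch at a time.
import numpy as np
import xarray as xr

# Stand-in for a dataset opened lazily from Zarr/NetCDF/GeoTIFF.
ds = xr.Dataset(
    {"t2m": (("time", "lat", "lon"), np.random.rand(100, 4, 4))},
    coords={"time": np.arange(100)},
)

def iter_batches(ds, dim="time", size=10):
    """Yield successive slices of length `size` along `dim`."""
    for start in range(0, ds.sizes[dim], size):
        # .isel stays lazy on file-backed or dask-backed arrays, so
        # only the current window needs to be loaded into memory.
        yield ds.isel({dim: slice(start, start + size)})

batches = list(iter_batches(ds))
print(len(batches), batches[0].sizes["time"])  # 10 batches, 10 steps each
```

A PyTorch `IterableDataset` could wrap an iterator like this directly, which is roughly the pattern both xbatcher's `BatchGenerator` and a DataPipes-based loader would follow.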
Twitter thread related to huggingface and Zarr: https://twitter.com/rabernat/status/1517182069943713792