xarray-contrib/xbatcher

Add documentation/examples for new data loaders and help with use case


@jhamman just presented on some updates to xbatcher, including the new data loader interfaces from #25. I tried to find a documented way of using them and I don't see one. If some documentation could be added, that would be great: I've been helping some people at my work use Satpy to prepare data for their machine learning projects, and I think the data loader could be a nice optimization. Their preparation work has always ended with saving to NetCDF or zarr. My understanding of these interfaces in xbatcher is that the saving-to-disk step shouldn't be needed (except for future caching functionality). Is that correct?

The pseudo-code of the most recent project I helped with looks something like this:

dates_of_interest = [...]
geographic_bounds_of_interest = [...]

# Load one GOES-16 ABI scene per date, crop to each region of interest,
# and save the cropped patches to NetCDF for the ML pipeline to read back.
for dt in dates_of_interest:
    abi_filenames = get_goes16_abi_filenames(dt)  # project-specific helper
    scn = satpy.Scene(reader='abi_l1b', filenames=abi_filenames)
    scn.load(channels_of_interest)

    for bbox in geographic_bounds_of_interest:
        cropped_scn = scn.crop(xy_bbox=bbox)
        cropped_scn.save_datasets(filename="some_bbox_specific_file.nc")

And then they do their ML work based on those NetCDF files. Satpy is all xarray[dask]-based, and the actual code for the above does a lot of client.map work (with distributed's Client) to run the individual pieces. I can't speak for the researcher I'm helping, but if a data loader could hand these "patches" (that's what they call them) straight to pytorch/tensorflow without the save-to-NetCDF step, that optimization would make a really good example for a certain NASA project we're a part of.
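For illustration, here is a disk-free sketch of the loop above: the same date/bbox nesting, but yielding patch arrays to the ML framework instead of calling save_datasets(). Everything here is hypothetical stand-in code (crop_array() replaces Scene.crop(), plain numpy arrays replace scenes); the real version would presumably hand the generator to xbatcher's loader interfaces or a torch IterableDataset:

```python
import numpy as np

def crop_array(arr, bbox):
    """Stand-in for Scene.crop(): slice (y0, y1, x0, x1) from a 2D array."""
    y0, y1, x0, x1 = bbox
    return arr[y0:y1, x0:x1]

def iter_patches(scenes, bboxes):
    """Stream cropped patches one at a time, never touching disk.

    A torch IterableDataset (or the xbatcher loaders from #25) could
    wrap a generator like this directly.
    """
    for arr in scenes:
        for bbox in bboxes:
            yield crop_array(arr, bbox)

# Fake "scenes" and bounding boxes in pixel space.
scenes = [np.random.rand(128, 128) for _ in range(3)]
bboxes = [(0, 64, 0, 64), (64, 128, 64, 128)]

patches = list(iter_patches(scenes, bboxes))
print(len(patches), patches[0].shape)  # 3 scenes x 2 bboxes = 6 patches
```

Whether xbatcher's new interfaces already cover this streaming pattern is the part I'd love to see a documented example for.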