xarray-contrib/xarray-tutorial

Remote access patterns using xarray.

betolink opened this issue · 8 comments

I'm not sure whether this fits in the upcoming (potential) SciPy tutorial or somewhere else, but I think it would be helpful to include a mini-guide on access patterns for remote storage. One of xarray's key strengths is also, in a way, a weakness: the abstractions for opening multi-file datasets are so powerful that they can hide the nuances of different back-end storage types.

When a new user sees this and they get a data cube, it's like magic!

ds = xr.open_dataset(reference, engine="zarr")

and although this is the cloud-native way, a considerable amount of data is still in archival formats or available only through a service like OPeNDAP. In an ideal world, users shouldn't have to care about the format or location of their data, but I've run into multiple cases where the problem is not that xarray isn't doing its job, but that the data is in HDF on a slow server a continent away.

Sometimes there are workarounds, from using different sources (e.g. Planetary Computer, GEE) that serve the same data in a cloud-optimized format, to using Kerchunk or clever caching strategies. I feel that some of these topics are buried in GitHub threads and not necessarily exposed in the documentation.
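One of the caching workarounds mentioned above can be sketched with fsspec's URL chaining. This is a minimal sketch, not a recommendation for every case: the helper name `open_with_local_cache` and the `cache_dir` argument are made up for illustration, and it assumes the `simplecache` layer, which copies the whole file locally on first access.

```python
import fsspec


def open_with_local_cache(url, cache_dir):
    """Open `url` through fsspec's "simplecache" layer (hypothetical helper).

    The whole file is copied into `cache_dir` the first time it is opened;
    later opens of the same URL read the local copy instead of the remote one.
    """
    open_file = fsspec.open(
        f"simplecache::{url}",
        mode="rb",
        # Options for the caching layer are keyed by its protocol name.
        simplecache={"cache_storage": cache_dir},
    )
    # .open() turns the lazy OpenFile into a regular file-like object.
    return open_file.open()
```

The returned object can then be handed to xarray, for example `xr.open_dataset(f, engine="h5netcdf")` for NetCDF4/HDF5 data. Whole-file caching only pays off when the file is read more than once or when most of it is needed.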

The idea would be to quickly illustrate what xarray does if I have files of type X and this access pattern:

file_set = [fsspec.open(f).open() for f in files]
ds = xr.open_mfdataset(file_set)

What happens if my files are HDF4, NetCDF, or HDF5? What are steps 1, 2, 3...? Can we make it faster, and how?
What if the data is behind OPeNDAP? And so on.
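As a rough sketch of the two patterns in question, assuming NetCDF4/HDF5 files readable by the `h5netcdf` engine and a netCDF4 build with DAP support (the function names are hypothetical, and HDF4 is not covered here):

```python
import fsspec
import xarray as xr


def open_remote_files(urls):
    """Return open file-like handles for each URL (hypothetical helper)."""
    # fsspec.open_files gives lazy OpenFile objects; .open() yields the
    # file-like objects that a backend such as h5netcdf can read from.
    return [open_file.open() for open_file in fsspec.open_files(urls, mode="rb")]


def open_netcdf4_collection(urls):
    """Combine several NetCDF4/HDF5 files into one lazily loaded dataset."""
    handles = open_remote_files(urls)
    # Assumes h5netcdf and dask are installed; every file's metadata is read
    # up front, which can mean many small requests against a slow server.
    return xr.open_mfdataset(handles, engine="h5netcdf", combine="by_coords")


def open_opendap(url):
    """Open an OPeNDAP endpoint; subsetting is then done by the server."""
    # Assumes the netCDF4 library was built with DAP support.
    return xr.open_dataset(url, engine="netcdf4")
```

With file-like objects every byte range xarray asks for becomes a request to the storage back end, whereas with OPeNDAP the server slices the data before sending it, so the two patterns can behave very differently on the same dataset.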

I also wonder if this information is already out there in the docs and perhaps just needs to be compiled into a single notebook. I volunteer to start one if it's not.

I volunteer to start one if it's not.

Yes please! This would be a really really great notebook to add.

The docs are here: https://docs.xarray.dev/en/stable/user-guide/io.html but need some reorg.

I'm working on a draft on this topic and wanted to get some feedback from you all, @dcherian @scottyhq and @martindurant. Warning: it probably has many typos.

I'd like to explore our options when we need to access remote data with xarray. fsspec is probably the most common access pattern, so for now the notebook is entirely dedicated to it. My plan is to expand it to cover how chunking affects performance when reading remote data and what Dask can do for us when we need to scale our workflows.

https://notebooksharing.space/view/7f86bf333e4905d8bbe4c3c49b59035468e5bbe10cbb6f47124c0162cb6cfbd2

(I think we can add notes if we register to this site from Yuvi)

@betolink that notebook is looking really fantastic! I think it would be an excellent addition. Perhaps slot it into a 'section' for Remote Access Patterns, with the existing CMIP6 notebook and this one side by side:

xarray-tutorial/_toc.yml

Lines 48 to 51 in cbf6c1a

- file: intermediate/cmip6-cloud
- file: intermediate/data_cleaning/05.1_intro.md
  sections:
  - file: intermediate/data_cleaning/05.2_examples.md

I like that it starts with a local file. You could add a cell at the top to grab the data locally (or use fsspec):
!wget https://its-live-data.s3-us-west-2.amazonaws.com/test-space/sample-data/sst.mnmean.nc

The graphic at the bottom is really nice! What about moving it to the top? Once you open a PR we can iterate a bit there, but it already seems to be in great shape :)

Thanks @scottyhq! This file is part of the xarray tutorials, so I think it should be accessible on a local path and render fine (from what I see in other notebooks in the repo).

!wget https://its-...

Only if we are certain wget is present. We do know fsspec will be in the env.
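A download cell that relies only on fsspec might look roughly like this. It is a sketch: `fetch_to_local` is a made-up helper name, and for `https://` URLs it assumes fsspec's HTTP dependencies (aiohttp) are installed.

```python
import os
import shutil

import fsspec


def fetch_to_local(url, dest_dir="."):
    """Copy a remote file into dest_dir with fsspec and return the local path."""
    local_path = os.path.join(dest_dir, os.path.basename(url))
    # Stream the bytes instead of loading the whole file into memory.
    with fsspec.open(url, mode="rb") as remote, open(local_path, "wb") as local:
        shutil.copyfileobj(remote, local)
    return local_path
```

Calling `fetch_to_local("https://its-live-data.s3-us-west-2.amazonaws.com/test-space/sample-data/sst.mnmean.nc")` would then stand in for the `wget` cell.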

Now that I'm re-reading it, I see it has a thousand typos and sentences where I started writing one thing and ended up writing something else. The ideas hold (or so I think), ha! I will fix those and move the chart to one of the top cells.

@betolink should we close this for now, or would you like to leave it open to follow up on further additions (open_mfdataset, chunking)?

Hi @scottyhq, let's leave this open. I'm working on some minor edits and want to expand on chunking and the I/O signatures of the different caching schemes.
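One way to get a feel for those I/O signatures without touching the network is to replay byte ranges through fsspec's cache classes and count how often the back end is hit. This is only a sketch: `count_fetches`, the fake `fetcher`, and the two access patterns are invented for illustration, and `_fetch` is the private method fsspec's buffered files call internally, so it may change between versions.

```python
from fsspec.caching import BaseCache, BlockCache, ReadAheadCache


def count_fetches(cache_cls, reads, size=1_000_000, blocksize=64_000):
    """Replay (start, stop) byte ranges through a cache; count backend requests."""
    calls = []

    def fetcher(start, stop):
        # Stand-in for an HTTP range request: record it and return zero bytes.
        calls.append((start, stop))
        return bytes(stop - start)

    cache = cache_cls(blocksize, fetcher, size)
    for start, stop in reads:
        cache._fetch(start, stop)  # private API, used here only for illustration
    return len(calls)


# Reading front to back in small steps.
sequential = [(i, i + 1_000) for i in range(0, 100_000, 1_000)]
# Jumping back and forth between the header and a distant region.
scattered = [(0, 1_000), (900_000, 901_000)] * 10

for name, cls in [
    ("none", BaseCache),
    ("readahead", ReadAheadCache),
    ("blockcache", BlockCache),
]:
    print(name, count_fetches(cls, sequential), count_fetches(cls, scattered))
```

The printed counts give a crude signature per scheme: read-ahead helps sequential scans, while a block cache also helps when reads keep returning to the same regions, which is closer to how HDF5 metadata tends to be accessed.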