google/xarray-beam

Consider omitting unchunked dimensions from Key objects created with DatasetToChunks

shoyer opened this issue · 1 comment

Currently we have (from https://xarray-beam.readthedocs.io/en/latest/read-write.html):

with beam.Pipeline() as p:
    p | xbeam.DatasetToChunks(ds, chunks={'time': 1000}) | beam.MapTuple(print_summary)
Key(offsets={'lat': 0, 'lon': 0, 'time': 0}, vars=None)
  with <xarray.Dataset data_vars=['air'] dims={'lat': 25, 'time': 1000, 'lon': 53}>
Key(offsets={'lat': 0, 'lon': 0, 'time': 1000}, vars=None)
  with <xarray.Dataset data_vars=['air'] dims={'lat': 25, 'time': 1000, 'lon': 53}>
Key(offsets={'lat': 0, 'lon': 0, 'time': 2000}, vars=None)
  with <xarray.Dataset data_vars=['air'] dims={'lat': 25, 'time': 920, 'lon': 53}>

Should we instead omit lat and lon from these keys? This is less explicit but also more flexible: e.g., if a transform replaces these dimensions entirely with different dimensions, the keys don't need to be updated.
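
To make the proposal concrete, here is a minimal sketch of what the keys would look like with unchunked dimensions dropped. It uses plain dicts in place of Key.offsets, and chunked_offsets is a hypothetical helper for illustration, not an existing Xarray-Beam function.

```python
from typing import Dict, Mapping


def chunked_offsets(
    offsets: Mapping[str, int], chunks: Mapping[str, int]
) -> Dict[str, int]:
    """Keep only the offsets for dimensions that were explicitly chunked.

    Hypothetical helper: `offsets` stands in for Key.offsets and `chunks`
    for the `chunks` argument passed to DatasetToChunks.
    """
    return {dim: offset for dim, offset in offsets.items() if dim in chunks}


# Offsets as they appear in the current output above.
current = [
    {'lat': 0, 'lon': 0, 'time': 0},
    {'lat': 0, 'lon': 0, 'time': 1000},
    {'lat': 0, 'lon': 0, 'time': 2000},
]

# Offsets as they would appear if lat and lon were omitted.
proposed = [chunked_offsets(o, chunks={'time': 1000}) for o in current]
print(proposed)  # [{'time': 0}, {'time': 1000}, {'time': 2000}]
```

With keys like these, a transform that swaps lat/lon for other dimensions could pass the offsets through unchanged.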

One of my original motivations for this is obviated by #50, which now allows us to handle variables in DatasetToChunks even if they don't include "chunked" dimensions.

It's still an open question whether this change would make Xarray-Beam more usable or not.

If we do not make this change, we could instead enforce the invariant that key.offsets.keys() == dataset.dims.keys(). This might be convenient for writing new transforms.
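
A rough sketch of what checking that invariant could look like, again with plain dicts standing in for Key.offsets and the dataset's dimension sizes. offsets_match_dims is a hypothetical name, and where such a check would live in Xarray-Beam is left open.

```python
from typing import Mapping


def offsets_match_dims(offsets: Mapping[str, int], dims: Mapping[str, int]) -> bool:
    """Return True if the offsets name exactly the same dimensions as the chunk.

    Hypothetical helper: `offsets` stands in for Key.offsets and `dims` for
    the mapping of dimension names to sizes on the accompanying dataset.
    Only the dimension names are compared, not their order or values.
    """
    return set(offsets) == set(dims)


dims = {'lat': 25, 'time': 1000, 'lon': 53}

# Current behavior: every dataset dimension has an offset.
print(offsets_match_dims({'lat': 0, 'lon': 0, 'time': 1000}, dims))  # True

# Keys with unchunked dimensions omitted would violate the invariant.
print(offsets_match_dims({'time': 1000}, dims))  # False
```

The two options are mutually exclusive: keys that omit lat and lon fail this check by construction.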