google/xarray-beam

Indicate variables in xarray-beam keys

shoyer opened this issue · 0 comments

Currently, we identify chunks only by overall offsets along each dimension. This works OK, but hits scalability limits for some pipelines, such as the ERA5 rechunking example in #8.

It would be nice to have a SplitVariables() transform that applies a pipeline in parallel to each data variable in a Dataset.

To do so, we need a consistent way to identify a limited set of variables, not just chunk offsets. I propose to do so using a new Key class modeled on the existing ChunkKey:

  • Key(offsets={'x': 0, 'y': 1}, vars={'foo'}) indicates a chunk of a dataset at positional offset x=0, y=1 that contains only the variable foo.
  • Key(offsets={'x': 0, 'y': 1}, vars=None) indicates that variables are not split.
  • Key(offsets=None, vars={'foo'}) or Key(offsets={}, vars={'foo'}) indicates that dimensions are not split.
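A minimal sketch of how the three cases above could be represented, assuming the field is spelled offsets (to match key.offsets[dim]) and that None and {} are normalized to the same "dimensions not split" state. The class body, the `_as_tuple` helper and the normalization in `__post_init__` are illustrative assumptions, not the final xarray-beam implementation:

```python
import dataclasses
from types import MappingProxyType
from typing import AbstractSet, Mapping, Optional


@dataclasses.dataclass(frozen=True, eq=False)
class Key:
    """Hypothetical sketch of the proposed key (not the real xarray-beam class)."""

    offsets: Optional[Mapping[str, int]] = None
    vars: Optional[AbstractSet[str]] = None

    def __post_init__(self):
        # Assumption: offsets=None and offsets={} both mean "dimensions not split".
        offsets = MappingProxyType(dict(self.offsets or {}))
        # vars=None means "variables not split"; otherwise freeze the set.
        names = None if self.vars is None else frozenset(self.vars)
        # Frozen dataclasses block normal assignment, so bypass it here.
        object.__setattr__(self, "offsets", offsets)
        object.__setattr__(self, "vars", names)

    def _as_tuple(self):
        # Hashable, order-independent view of the key's contents.
        return (frozenset(self.offsets.items()), self.vars)

    def __eq__(self, other):
        if not isinstance(other, Key):
            return NotImplemented
        return self._as_tuple() == other._as_tuple()

    def __hash__(self):
        return hash(self._as_tuple())


chunk_of_foo = Key(offsets={"x": 0, "y": 1}, vars={"foo"})
all_vars = Key(offsets={"x": 0, "y": 1}, vars=None)
whole_foo = Key(offsets=None, vars={"foo"})
whole_foo_too = Key(offsets={}, vars={"foo"})
```

Because keys are used to group elements in a pipeline, equal keys need to hash equally; that is why the sketch hashes a frozenset of the offset items rather than the mapping itself.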

Key should support most of the user-facing API of ChunkKey, e.g., key | {'time': 0} should still work. However:

  • Key is now a frozen dataclass consisting of a frozen dict and a frozen set (rather than a mapping itself), so key[dim] will have to become key.offsets[dim].
  • Key.to_slices doesn't really make sense (it could apply only to some variables).
  • To support modification without mutation, we'll add a new replace() method, e.g., key.replace(vars=None).
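The API changes listed above could look roughly like the following. This is only a sketch: it assumes that `|` merges new offsets into the existing ones and that replace() can delegate to the standard library's `dataclasses.replace`; neither detail is settled by this proposal.

```python
import dataclasses
from types import MappingProxyType
from typing import AbstractSet, Mapping, Optional


@dataclasses.dataclass(frozen=True, eq=False)
class Key:
    """Hypothetical sketch; equality and hashing are omitted for brevity."""

    offsets: Optional[Mapping[str, int]] = None
    vars: Optional[AbstractSet[str]] = None

    def __post_init__(self):
        # Store read-only copies so a key cannot be mutated after creation.
        object.__setattr__(
            self, "offsets", MappingProxyType(dict(self.offsets or {}))
        )
        if self.vars is not None:
            object.__setattr__(self, "vars", frozenset(self.vars))

    def replace(self, **changes):
        # Build a new Key; fields not named in `changes` are carried over.
        return dataclasses.replace(self, **changes)

    def __or__(self, new_offsets):
        # Assumed semantics: key | {'time': 0} adds or overrides offsets.
        return self.replace(offsets={**self.offsets, **new_offsets})


key = Key(offsets={"x": 0}, vars={"foo"})
with_time = key | {"time": 0}        # offsets now cover x and time
unsplit = key.replace(vars=None)     # same offsets, variables no longer split
x_offset = key.offsets["x"]          # replaces the old key["x"] lookup
```

Both operations return new objects and leave the original key untouched, which is what allows keys to stay hashable and safe to share between pipeline stages.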