Indicate variables in xarray-beam keys
shoyer opened this issue · 0 comments
Currently, we identify chunks only by overall offsets along each dimension. This works OK, but hits scalability limits for some pipelines, such as the ERA5 rechunking example in #8.
It would be nice to have a `SplitVariables()` transform that allows applying a pipeline in parallel to each data variable in a Dataset.
To do so, we need some consistent way to identify a limited set of variables, not just chunk offsets. I propose to do so using a new `Key` class modeled off of the existing `ChunkKey`:
- `Key(offset={'x': 0, 'y': 1}, vars={'foo'})` indicates a chunk of a dataset at positional offset `x=0, y=1` and with only the variable `foo`.
- `Key(offset={'x': 0, 'y': 1}, vars=None)` indicates variables are not split.
- `Key(offset=None, vars={'foo'})` or `Key(offset={}, vars={'foo'})` indicates dimensions are not split.
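As a rough illustration of how such keys might label the output of a variable split, here is a sketch in plain Python. It is not xarray-beam code: `split_variables` is a hypothetical helper, a chunk is modeled as a dict from variable name to data, and a key is modeled as a bare `(offset, vars)` tuple standing in for the proposed `Key`.

```python
from typing import Any, Dict, FrozenSet, Iterator, Optional, Tuple

# Stand-in for the proposed Key: (offset items, variable names or None).
# This is a hypothetical simplification, not the proposed class itself.
SketchKey = Tuple[Tuple[Tuple[str, int], ...], Optional[FrozenSet[str]]]


def split_variables(
    key: SketchKey, chunk: Dict[str, Any]
) -> Iterator[Tuple[SketchKey, Dict[str, Any]]]:
    """Yield one (key, single-variable chunk) pair per variable in `chunk`."""
    offset, vars_ = key
    if vars_ is not None and vars_ != frozenset(chunk):
        raise ValueError("key variables do not match the chunk's variables")
    for name, data in chunk.items():
        # Same offsets as the input, but the key now names exactly one variable.
        yield (offset, frozenset({name})), {name: data}


# vars=None in the input key: variables have not been split yet.
unsplit_key: SketchKey = ((("x", 0), ("y", 1)), None)
chunk = {"foo": [1, 2], "bar": [3, 4]}
pieces = list(split_variables(unsplit_key, chunk))
```

Each element of `pieces` keeps the original offsets and carries a single variable, so downstream steps could group or process by variable independently.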
`Key` should support most of the user-facing API of `ChunkKey`, e.g., `key | {'time': 0}` should still work. However:
- `Key` now is a frozen dataclass consisting of a frozen dict and a frozen set (rather than a mapping itself), so `key[dim]` will have to become `key.offsets[dim]`.
- `Key.to_slices` doesn't really make sense (it could apply only to some variables).
- To support modification without mutation, we'll add a new `replace()` method, e.g., `key.replace(vars=None)`.
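To make the proposal concrete, here is a minimal sketch of what such a class could look like. It is an illustration under assumptions, not the eventual xarray-beam implementation: the field is spelled `offset` to match the constructor examples (the text above also writes `key.offsets[dim]`, so the final name is open), `MappingProxyType` stands in for a frozen dict since Python has no built-in one, and equality and hashing are defined by hand because a mapping proxy over a dict is not hashable.

```python
import dataclasses
from types import MappingProxyType
from typing import AbstractSet, Mapping, Optional


@dataclasses.dataclass(frozen=True, eq=False)
class Key:
    """Hypothetical sketch of the proposed key; not the xarray-beam class."""

    offset: Optional[Mapping[str, int]] = None
    vars: Optional[AbstractSet[str]] = None

    def __post_init__(self):
        # offset=None and offset={} both mean "dimensions are not split".
        frozen_offset = MappingProxyType(dict(self.offset or {}))
        frozen_vars = None if self.vars is None else frozenset(self.vars)
        # Frozen dataclasses block normal assignment, so bypass it here.
        object.__setattr__(self, "offset", frozen_offset)
        object.__setattr__(self, "vars", frozen_vars)

    def replace(self, **changes) -> "Key":
        """Return a modified copy, e.g. key.replace(vars=None)."""
        return dataclasses.replace(self, **changes)

    def __or__(self, other: Mapping[str, int]) -> "Key":
        """Return a copy with added or updated offsets, e.g. key | {'time': 0}."""
        if not isinstance(other, Mapping):
            return NotImplemented
        return self.replace(offset={**self.offset, **other})

    def _as_tuple(self):
        # Hashable view of the contents, used for equality and hashing.
        return (frozenset(self.offset.items()), self.vars)

    def __eq__(self, other):
        if not isinstance(other, Key):
            return NotImplemented
        return self._as_tuple() == other._as_tuple()

    def __hash__(self):
        return hash(self._as_tuple())


key = Key(offset={"x": 0, "y": 1}, vars={"foo"})
with_time = key | {"time": 0}     # new key; `key` itself is unchanged
unsplit = key.replace(vars=None)  # same offsets, variables not split
```

Because `replace()` forwards to `dataclasses.replace`, passing `vars=None` explicitly is distinguishable from not passing `vars` at all, which is what lets `key.replace(vars=None)` reset the variable set.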