Simultaneously read multiple Datasets into an Xarray-Beam pipeline

Question

shoyer opened this issue 2 years ago · 2 comments

It is relatively common to need to load multiple xarray.Dataset objects, e.g., to compare two different models.

This can currently be done by loading each dataset with a separate call to xbeam.DatasetToChunks and joining the results with beam.CoGroupByKey. This works but is rather inefficient, because it involves an extra write of the data to disk. Ideally we could load the data in a single Beam transform instead, e.g., xbeam.DatasetToChunks([ds1, ds2], chunks) would return a PCollection with elements of type tuple[xbeam.Key, tuple[xarray.Dataset, xarray.Dataset]].

CC @alxmrs

Answer 1 · 2022-12-15T02:45:01.000Z

I'm looking into this now.

I have a design question, though: what is the best PCollection interface, tuple[xbeam.Key, tuple[xarray.Dataset, xarray.Dataset]] or tuple[xbeam.Key, xarray.Dataset, xarray.Dataset]? I have a slight preference for the latter (and its implementation would not be too bad). This version also seems fairly natural for operations like beam.MapTuple() (which can handle n-ary tuples) and beam.GroupByKey() (which will just use the first value in the tuple). It feels more zen to me, too. :)

WDYT?

Answer 2 · 2022-12-15T03:29:03.000Z

A small update: the former does seem to offer a better typing story (python/typing#180), so I'm now leaning that way.
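One possible reading of the typing point, which is an assumption since the comment does not spell it out: with the nested shape, an arbitrary number of datasets can be annotated as a homogeneous variable-length tuple, whereas a flat tuple of a key followed by N datasets has no equivalent spelling in plain typing.Tuple. The aliases below are stand-ins, not Xarray-Beam types.

```python
from typing import Tuple

# Stand-ins for xbeam.Key and xarray.Dataset, for illustration only.
Key = str
Dataset = float

# Nested shape: one key plus any number of datasets.
NestedElement = Tuple[Key, Tuple[Dataset, ...]]


def count_datasets(element: NestedElement) -> int:
    """Return how many datasets travel with the key."""
    _, datasets = element
    return len(datasets)
```

The same annotation covers two, three, or more datasets, so a list-accepting xbeam.DatasetToChunks would not need one element type per input count.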