google/weather-tools

Parallelize PartitionConfig in weather_dl

shoyer opened this issue · 0 comments

I am noticing that some large-scale Weather-DL Beam pipelines are taking multiple days to complete the PartitionConfig stage (before downloading any data), which only runs on a single worker.

There appear to be two bottlenecks:

  1. "Prepare partitions" (1/3 of the time) produces a gigantic Cartesian product, and it seems to be somewhat slow just to create this many config objects. To speed this up, Beam needs some way of being able to parallelize this task, which it cannot do when there is only one input (the sole config file). A more scalable approach would be to use beam.Create() to directly create inputs corresponding to some of the partitioning (or even just a statisc number of partitions).
  2. "Skip existing downloads" (2/3 of the time) checks for existing files, which is limited by the speed of IO, and means that even sole worker is not CPU limited. This could be sped-up by using a thread-pool to batch IO operations. This might be important because if workers are not CPU limited, Beam runners may not decide to add more workers.

For reference, we do analogs of each of these steps in xarray-beam: