xarray-contrib/xarray-simlab

Run batches of simulations in parallel using dask

benbovy opened this issue · 1 comment

Running batches of simulations is being implemented in #115.

There are several things to address in order to run those batches in parallel (using dask):

  1. Ignore chunk as an encoding key set in model variables. This is too ambiguous, since we don't know yet the actual number of dimensions of the zarr dataset (e.g., if dims is a list, or whether the dataset will have clock and/or batch dimensions). Moreover, zarr auto-chunking is not yet as advanced as dask array's (zarr-developers/zarr-python#270). It's fine to keep chunk as a valid encoding key for Dataset.xsimlab.run(), since the dimensions (and the length of each of them) are a priori better known when calling it (see the first sketch after this list).

    • Allow synchronizer as a valid encoding key for Dataset.xsimlab.run()? This could be useful if users want to specify a chunk size > 1 along the batch dimension (cf. 2 below).
  2. The chunk size along the batch dimension should always be equal to 1, otherwise parallel writes won't work reliably (see the second sketch after this list).

    • I've tried zarr's thread and process synchronizers with dask distributed and chunks > 1, but I still obtained inconsistent results. EDIT: actually, the results look good and consistent! I forgot here that a dask distributed cluster may use both multiple processes and multiple threads; I think the synchronizers currently available in zarr don't support that case.
  3. Check the performance and implementation complexity of a single zarr group (i.e., a batch dimension in the zarr datasets) vs. one group per batch (i.e., open the groups as separate xarray Datasets, then xr.concat them). See the third sketch after this list.

    • A single zarr group seems more efficient.
    • With a single zarr group, each simulation will try to create the same zarr dataset. See 6 below.
  4. In both cases in 3, arrays whose shape differs from one simulation to another will cause issues (either because zarr resize doesn't seem to lock, or because xr.concat will not work). So it's probably better not to support this for now?

  5. Dask delayed vs. futures? The latter is more flexible, but it only works with distributed (see the fourth sketch after this list).

  6. Handle concurrent attempts to create the same new zarr dataset. This might be addressed by just checking whether the dataset already exists, but two cases must be distinguished: the dataset has been created by another simulation in the batch (pass), or the dataset existed before running the batches (raise). See the last sketch after this list.

    • Simply checking whether a zarr dataset already exists before creating it does not always work (there's a race between the check and the creation). I've tried using distributed.Lock for this, but it doesn't always work either. EDIT: try... except works better.
  7. Do zarr in-memory stores support dask schedulers other than threads? I guess not, since an in-memory store is not shared across processes?
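
To make 1 concrete, here is a minimal sketch of chunks specified as an encoding key at run() time, when the output dimensions are already known. It assumes the store/encoding interface being implemented in #115; the toy model, the variable names and the exact encoding key name are illustrative, not the final API:

```python
import xsimlab as xs
import zarr


@xs.process
class Counter:
    # toy process: accumulate a fixed increment at each step
    incr = xs.variable(description="increment per step")
    total = xs.variable(intent="out")

    def initialize(self):
        self.total = 0.0

    def run_step(self):
        self.total += self.incr


model = xs.Model({"counter": Counter})

in_ds = xs.create_setup(
    model=model,
    clocks={"clock": list(range(10))},
    input_vars={"counter__incr": 1.0},
    output_vars={"counter__total": "clock"},
)

# at run() time the output dimensions (clock and, if any, batch) and
# their lengths are known, so a chunks encoding is not ambiguous here
out_ds = in_ds.xsimlab.run(
    model=model,
    store=zarr.MemoryStore(),
    encoding={"counter__total": {"chunks": (5,)}},  # assumed key name
)
```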
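
A sketch for 2, illustrating why a chunk size of 1 along the batch dimension allows lock-free parallel writes. The store path, the array shape and the run_one stub are made up for the example:

```python
import numpy as np
import zarr
from dask.distributed import Client

n_batch, n_clock = 4, 100

# a single shared zarr dataset with a batch dimension (cf. 3),
# chunked with size 1 along that dimension: each simulation writes
# to its own set of chunks, so no synchronizer is needed
z = zarr.open(
    "batch_runs.zarr",
    mode="w",
    shape=(n_batch, n_clock),
    chunks=(1, n_clock),
    dtype="f8",
)


def run_one(ib):
    # stand-in for running one simulation of the batch
    z[ib, :] = np.full(n_clock, float(ib))
    return ib


if __name__ == "__main__":
    client = Client()  # may mix multiple processes and threads
    client.gather(client.map(run_one, range(n_batch)))
    print(zarr.open("batch_runs.zarr", mode="r")[:, 0])
```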
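
A sketch of the two layouts compared in 3. The store path, the group naming scheme and the toy data are hypothetical:

```python
import numpy as np
import xarray as xr

n_batch = 4

# toy per-batch outputs, written as one zarr group per batch
for ib in range(n_batch):
    ds_ib = xr.Dataset({"u": ("clock", np.full(10, float(ib)))})
    ds_ib.to_zarr("output.zarr", group=f"batch_{ib}", mode="w")

# (a) one group per batch: open the groups as separate Datasets,
#     then concatenate along a new batch dimension
ds = xr.concat(
    [xr.open_zarr("output.zarr", group=f"batch_{ib}") for ib in range(n_batch)],
    dim="batch",
)

# (b) with a single group holding a batch dimension, this would instead
#     be a single call: xr.open_zarr("output.zarr")
```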
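
For 5, a side-by-side sketch of both options, again with a made-up run_one stub:

```python
import dask
from dask.distributed import Client


def run_one(ib):
    # stand-in for running one simulation of the batch
    return ib ** 2


if __name__ == "__main__":
    n_batch = 4

    # dask.delayed: works with any scheduler
    # (threads, processes, distributed)
    tasks = [dask.delayed(run_one)(ib) for ib in range(n_batch)]
    results = dask.compute(*tasks, scheduler="processes")

    # futures: more flexible (e.g., submit tasks dynamically, react to
    # tasks as they complete), but requires a distributed client
    client = Client()
    futures = client.map(run_one, range(n_batch))
    results = client.gather(futures)
```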
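
For 6, a sketch of the try... except pattern mentioned above. The get_or_create helper is made up, and I'm assuming zarr raises zarr.errors.ContainsArrayError when the dataset already exists; whether the dataset existed before running the batches (which should raise instead) would have to be checked once, before submitting the simulations:

```python
import zarr
import zarr.errors


def get_or_create(group, name, **kwargs):
    # hypothetical helper: the first simulation of the batch creates
    # the dataset, the others reuse it instead of failing
    try:
        return group.create_dataset(name, **kwargs)
    except zarr.errors.ContainsArrayError:
        # already created by another simulation in the batch: pass
        return group[name]


root = zarr.group()  # in-memory group, just for the example
a = get_or_create(root, "u", shape=(4, 100), chunks=(1, 100), dtype="f8")
b = get_or_create(root, "u", shape=(4, 100), chunks=(1, 100), dtype="f8")
assert b.shape == (4, 100)
```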

1, 2 and 6 fixed in #117.