pangeo-data/rechunker

Add consolidated metadata to rechunked zarr

Metamess opened this issue · 2 comments

What is the issue?
When opening a zarr with xarray, it really helps to have what is called "consolidated metadata". This is a single file (.zmetadata) at the root of the zarr, which combines the information from all the various .zarray and .zattrs files inside the zarr. While this file is not required for a zarr to be valid (nor, in fact, for the zarr to be opened by xarray), it greatly speeds up the opening process, especially when the data is being read from a remote location.
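For context, this is what consolidated metadata buys you on the reading side (the store URL below is just a placeholder):

    import xarray as xr

    # With consolidated=True, xarray fetches the single .zmetadata key
    # instead of reading every .zarray/.zattrs object individually,
    # which matters a lot for remote object stores like S3 or GCS.
    ds = xr.open_zarr("s3://some-bucket/some-data.zarr", consolidated=True)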

Sadly, Rechunker currently does not create such a consolidated metadata file for the resulting rechunked zarr, and (as far as I have been able to find, at least) there is no parameter to enable such behavior.

What would solve the issue?
Any of the following features would resolve this issue:

  1. Add an optional boolean parameter to the rechunk() function which allows the user to specify that a consolidated metadata file should be created. For backwards compatibility, if that is desired, this parameter would default to False. Potential parameter names could be consolidated (mirroring the parameter name in xarray), write_consolidated (to be more explicit that it only impacts writing), or consolidate_metadata (to mirror the function in zarr).
  2. Automatically detect the existence of a consolidated metadata file (.zmetadata) in the source zarr, and create one (or not) in the result zarr accordingly.
  3. The combination of options 1 and 2: the parameter could default to a str value of "auto", resulting in the behavior described in (2), or be given a boolean value by the user to override this behavior (see the sketch below).
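To make option 3 concrete, a call might look something like the sketch below. Note that the consolidate_metadata parameter does not exist in rechunker today; its name, default, and accepted values are just the proposal above, and the stores and chunk sizes are placeholders.

    import zarr
    from rechunker import rechunk

    source = zarr.open("source.zarr")  # placeholder source store

    plan = rechunk(
        source,
        target_chunks=(100, 50, 50),
        max_mem="1GB",
        target_store="target.zarr",
        temp_store="temp.zarr",
        consolidate_metadata="auto",  # proposed parameter (does not exist yet): True / False / "auto"
    )
    plan.execute()
    # With "auto", a .zmetadata file would be written to target.zarr
    # whenever the source store contains one.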

I look forward to hearing what people think about this feature request, and to know if others would also like to see this feature added!

Thanks for this suggestion @Metamess!

Creating consolidated metadata after rechunking is done is a one-line operation, e.g.

zarr.consolidate_metadata(target_store)

(https://zarr.readthedocs.io/en/stable/tutorial.html#consolidating-metadata)

This can be run on the target store after the rechunking is complete. Would that meet your needs?
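Spelled out, the extra step after the plan has been executed would look something like this (the path is a placeholder):

    import zarr
    import xarray as xr

    target_store = "target.zarr"  # placeholder: the store that rechunk() wrote to

    # Writes the .zmetadata key at the root of the target store
    zarr.consolidate_metadata(target_store)

    # The rechunked data can then be opened efficiently
    ds = xr.open_zarr(target_store, consolidated=True)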

Hey @rabernat , thanks for the reply!

I was already aware that the consolidated metadata can be created manually like this afterwards, and it is in fact what I am currently using to work around this issue! But it is a step that I would expect to be possible as part of the rechunk operation. In advocating for this feature, I consider the following:

  • While consolidated metadata is not part of the zarr specification, and thus not strictly required for a "valid" zarr, it has seemingly become a cornerstone feature. I think it can be safely assumed that a vast number of zarr users use xarray to work with their data (in fact, even rechunker's tutorial relies on xarray!), and it is clear that xarray has a strong preference for the use of consolidated metadata files. Add to that the fact that Zarr-Python itself contains the functionality to create such files, and this signals to me that the use of consolidated metadata is sufficiently widespread that it would make sense for the rechunker package to support it.

  • Indeed, adding an extra zarr.consolidate_metadata(target_store) line at every location where the rechunk function is called is technically an option, as would be wrapping the two lines in a function of your own (see the sketch after this list). But since this use case (wanting consolidated metadata) will occur for so many users, perhaps even in the majority of cases where rechunker is used, it would make sense for the rechunker package to offer this functionality.

  • Additionally, consider the case where the source zarr has consolidated metadata. The lack of this file in the result is then an undocumented side effect: on top of rechunking the data, the file is also "lost", i.e. not recreated. This breaks the (in my opinion reasonable) assumption of equivalence between the source and rechunked zarr. (I consider the rechunked zarr to not be functionally equivalent, as opening it now requires many reads, and xarray prints the warnings associated with opening a zarr without consolidated metadata.)

  • Lastly, when zarr.consolidate_metadata(target_store) is called after the fact, it triggers multiple reads on the resulting zarr. When the zarr lives on a remote machine, this is slow and inefficient (which is, after all, the very reason consolidated metadata exists). It would seem far more efficient to gather the consolidated metadata from the various zarr arrays as they are created by the rechunk operation itself. This is an implementation detail, however, and is separate from the previous arguments in favor of the feature in general.
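For reference, the workaround I currently use amounts to a small wrapper along these lines (the function name is mine, not part of rechunker, and it still pays the extra reads described in the last point):

    import zarr
    from rechunker import rechunk

    def rechunk_consolidated(source, target_chunks, max_mem, target_store, temp_store=None):
        """Rechunk, then write consolidated metadata to the target store."""
        plan = rechunk(
            source,
            target_chunks=target_chunks,
            max_mem=max_mem,
            target_store=target_store,
            temp_store=temp_store,
        )
        plan.execute()
        # Re-reads every .zarray/.zattrs in the target to build .zmetadata,
        # which is the inefficiency on remote stores noted above.
        zarr.consolidate_metadata(target_store)
        return plan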

What do you think? Is it worthwhile as a feature in rechunker?