backup and restore apps can mangle metadata of zarr (and parquet?) datasets
Describe the bug
Restoring a column that was backed up from an MS with different metadata (e.g. FIELD_ID) changes the FIELD_ID in the zarr dataset being restored. This can happen, for example, if we back up a column from a measurement set before splitting and then try to restore it to an MS that has been split with CASA and subsequently converted to zarr. CASA sets the FIELD_ID in the split MS to zero, which then doesn't match the FIELD_ID in the backup. Trying to restore the backed-up column to an MS bails out with an error, but the restore happily runs through for the zarr dataset and ends up mutating the metadata (in this case also ROWID). This is a bit of an edge case but I thought I would report it for posterity.
To Reproduce
Back up a column (FLAG_ROW will suffice) in field 1 of an MS with multiple fields.
Split that field with CASA split.
Convert it to zarr with dask-ms convert.
Restore the backed-up column to the converted zarr dataset.
Expected behavior
The metadata of the dataset should not be mutated. This can be achieved by reading the metadata from the dataset that we are restoring to instead of taking it from the backed-up column. It would mean opening the existing dataset and assigning the column we are restoring to it, instead of what is currently done here.
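To make the proposed behaviour concrete, here is a minimal, library-free sketch. It is not the actual restore code: plain dicts stand in for the datasets, `restore_column` is a hypothetical helper, and placing FIELD_ID under `attrs` and ROWID under `coords` is an assumption made for illustration. The point is only that the column values come from the backup while all metadata comes from the dataset being restored to.

```python
from copy import deepcopy


def restore_column(target, backup, column):
    """Return a copy of `target` with `column` taken from `backup`.

    Only the column values are copied. Attributes and coordinates
    (FIELD_ID, ROWID, ...) always come from `target`.
    """
    values = backup["data_vars"][column]
    nrow = len(target["coords"]["ROWID"])
    # Refuse to restore if the backup cannot line up row-for-row.
    if len(values) != nrow:
        raise ValueError(
            f"{column}: backup has {len(values)} rows, target has {nrow}"
        )
    restored = deepcopy(target)  # leave the input dataset untouched
    restored["data_vars"][column] = list(values)
    return restored


# Backup taken from field 1 of the original multi-field MS.
backup = {
    "attrs": {"FIELD_ID": 1},
    "coords": {"ROWID": [10, 11, 12]},
    "data_vars": {"FLAG_ROW": [True, False, True]},
}

# Target: the same field after CASA split (FIELD_ID reset to 0) and conversion.
target = {
    "attrs": {"FIELD_ID": 0},
    "coords": {"ROWID": [0, 1, 2]},
    "data_vars": {"FLAG_ROW": [False, False, False]},
}

restored = restore_column(target, backup, "FLAG_ROW")
```

With this approach the restored dataset keeps the target's FIELD_ID and ROWID and only FLAG_ROW changes. Whether a FIELD_ID mismatch should additionally raise an error, as it does for the MS case, is a separate design choice that this sketch does not make.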
Version
main