bcdev/nc2zarr

List input files in history metadata attribute

Closed this issue · 6 comments

At present, Zarrs produced by nc2zarr don't contain any indication of the source files from which they were generated. nc2zarr should optionally include a list of source files in the value of the Zarr's history attribute on first generation, and update this value with the additional input files when appending to an existing Zarr.

Several metadata attributes may be updated in a CF-compliant way; see the section "Description of File Contents" in the CF conventions.

Should be resolved together with #20.

Implementing this will also help with implementation of a related feature request from CloudFerro: the ability to ignore an input file when appending if it has already been ingested into the target Zarr. nc2zarr could check the input pathname or filename against the list in the history attribute before appending.

@pont-us Please note:

  • Append an entry to the history attribute stating what was done and when: "\n${date}: converted to Zarr using nc2zarr ${version}".
  • Use the sources attribute to list the sources.
  • Resolve #20
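The two attribute updates above could be sketched roughly as follows. This is only an illustration, not nc2zarr's actual code: the function name `update_provenance`, the use of a plain dict to stand in for the Zarr's global attributes, the timestamp format, and the newline-separated layout of the sources attribute are all assumptions.

```python
from datetime import datetime, timezone


def update_provenance(attrs, input_paths, version, now=None):
    """Return a copy of ``attrs`` with updated provenance attributes.

    ``attrs`` is a plain dict standing in for the target Zarr's global
    attributes (hypothetical; real code would read them from the dataset).
    """
    # Normalise the timestamp to UTC so the trailing "Z" is accurate.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    updated = dict(attrs)

    # Append one line to 'history'; earlier entries are left untouched.
    entry = f"{stamp}: converted to Zarr using nc2zarr {version}"
    old_history = updated.get("history", "")
    updated["history"] = f"{old_history}\n{entry}" if old_history else entry

    # Extend the list of sources, skipping paths that are already recorded.
    # Assumption: sources are stored as one path per line.
    sources = [s for s in updated.get("sources", "").split("\n") if s]
    for path in input_paths:
        if path not in sources:
            sources.append(path)
    updated["sources"] = "\n".join(sources)
    return updated
```

Calling the function once on first generation and again on each append would leave one history line per run and a de-duplicated list of input files.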

> Implementing this will also help with implementation of a related feature request from CloudFerro: the ability to ignore an input file when appending if it has already been ingested into the target Zarr. nc2zarr could check the input pathname or filename against the list in the history attribute before appending.

We should not use metadata to make decisions about the data in the dataset. Whether a timeslice has already been processed or not should be detected by looking into the data: time coordinates. Once a duplicate is detected, there are two options: ignore the new data or replace the existing data. Replacing an existing timeslice with a more up-to-date one is a valid use case that we have in other scenarios. (Example: the same Sentinel-3 Level-2 data is processed both in a fast lane and in a second lane that takes much longer but yields higher data quality. When the second result arrives, it replaces the first.)
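As a rough sketch of that idea, the decision could be made purely from the time coordinates. The function name `plan_append`, the `mode` parameter, and the use of plain lists of `datetime` values are assumptions for illustration; real code would compare the time coordinate arrays of the target Zarr and the incoming dataset.

```python
from datetime import datetime


def plan_append(existing_times, new_times, mode="ignore"):
    """Classify incoming time steps by comparing time coordinates.

    Returns ``(to_append, to_replace)``:
      * ``to_append``: indices into ``new_times`` not yet in the target;
      * ``to_replace``: ``(existing_index, new_index)`` pairs, filled
        only when ``mode == "replace"``.
    """
    if mode not in ("ignore", "replace"):
        raise ValueError(f"unknown mode: {mode!r}")

    # Map each time already in the target to its position.
    position = {t: i for i, t in enumerate(existing_times)}

    to_append, to_replace = [], []
    for new_index, t in enumerate(new_times):
        if t not in position:
            to_append.append(new_index)
        elif mode == "replace":
            to_replace.append((position[t], new_index))
        # mode == "ignore": the duplicate time step is silently dropped.
    return to_append, to_replace
```

With `mode="replace"`, the fast-lane timeslice from the Sentinel-3 example would be overwritten in place when the higher-quality product for the same time arrives.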

> We should not use metadata to make decisions about the data in the dataset. Whether a timeslice has already been processed or not should be detected by looking into the data: time coordinates.

Agreed -- I've opened Issue #41 to discuss implementation of this functionality.