Versioning and encoding
We were just discussing that with @wolfv, and since it can take a lot of time to upload a dataset, we should think about what we can save in terms of bandwidth. In particular, "incremental datasets" (which only grow from version v1 to v2, like many datasets used for training machine learning algorithms) should not require uploading everything again for each version, since v2 can depend on v1 through the new data: `v2 = v1 + new_data`.
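For the append-only case, here is a minimal sketch of what this could look like with xarray and Zarr (the store path, the file names, and the `time` append dimension are all assumptions for illustration, not an actual Pangeo Forge API):

```python
import xarray as xr

# v1: the full dataset, uploaded once.
ds_v1 = xr.open_dataset("v1.nc")  # hypothetical source file
ds_v1.to_zarr("dataset.zarr", mode="w")

# v2 = v1 + new_data: only the new slice is uploaded, appended
# along the growing dimension (assumed here to be "time").
new_data = xr.open_dataset("new_slice.nc")  # hypothetical new records
new_data.to_zarr("dataset.zarr", append_dim="time")
```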
But even for datasets that are refined through versions, like many satellite-based estimates using improved models and new source data, the difference `v2 - v1` should be cheaper to encode than the absolute data values.
The new version could be stored in the cloud as either the absolute data or the differential data, depending on whether we want to save compute power or storage space (in the latter case, the latest version would be reconstructed by combining all the previous versions).
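To illustrate why the delta is cheaper to encode, here is a rough sketch in plain NumPy (the synthetic data, the `1e-4` precision, and the int16 quantization are all illustrative assumptions):

```python
import numpy as np

# Hypothetical refined estimates: v2 differs from v1 by small corrections.
rng = np.random.default_rng(0)
v1 = rng.normal(size=1_000_000).astype("float32")
v2 = v1 + rng.normal(scale=1e-3, size=v1.size).astype("float32")

# Differential storage: keep v1 absolute, store only the delta for v2.
delta = v2 - v1

# Small-magnitude deltas have a much narrower dynamic range than the
# absolute values, e.g. after quantization to a fixed precision:
scale = 1e-4                          # assumed acceptable precision
delta_q = np.round(delta / scale).astype("int16")

# Reconstruction trades compute for storage: read v1, apply the delta.
v2_restored = v1 + delta_q.astype("float32") * scale
assert np.allclose(v2, v2_restored, atol=scale)
```

Because the quantized delta has far less entropy than the absolute field, a generic compressor (e.g. the Blosc codecs Zarr already uses) should achieve much better ratios on it than on the full data.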
On a slightly different note (but still with regard to versioning): we just had a conversation in our lab meeting about this topic, and @rabernat suggested that older versions of the data should be represented by older versions of the recipe. The older data would be deleted (unless the update just appends data, as above), but could be rebuilt from the older recipe if needed (e.g. to recreate results from a published paper).
I really like the way Zenodo handles their DOIs: there are locked versions plus a 'latest' DOI that automatically resolves to the newest version when one is available. From a user perspective, something similar would be ideal within Pangeo Forge.
cc @cisaacstern