GoogleCloudPlatform/covid-19-open-data

"Snapshot ID" Field Implementation (dependency mitigation)

a27cheung opened this issue · 0 comments

As data sources wind down (& perhaps shut down?)...

"I haven't seen any winding down yet, but yes, this will be an issue in the future. In our data, we have a hack to work around data sources that stop updating: we flag them with 'skip', which prevents unit tests from failing and forces the pipeline to fetch the latest valid data. But if we make any other configuration changes, the data would be lost.

A good solution would be to add a 'snapshot ID' field. If it is populated, we wouldn't even try to go to the original data source; instead, we would fetch the intermediate file from the last successful processing of that data source (which we have saved). Unfortunately, the intermediate files are not externally available, which makes the fetch step of the pipelines not reproducible. I don't think there's a way around that: the original data source is gone, so of course you can't reproduce our work."
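
For discussion, here is a minimal sketch of how the proposed "snapshot ID" lookup might behave in the fetch step. It is not the repository's actual code: `DataSourceConfig`, `fetch_source`, the `snapshot_dir/<name>/<snapshot_id>` layout, and the injected `download` callable are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


@dataclass
class DataSourceConfig:
    """Hypothetical per-source configuration entry (names are assumptions)."""
    name: str
    url: str
    # When set, the original URL is never contacted.
    snapshot_id: Optional[str] = None


def fetch_source(
    config: DataSourceConfig,
    snapshot_dir: Path,
    download: Callable[[str], bytes],
) -> bytes:
    """Return raw bytes for a data source.

    If a snapshot ID is populated, read the saved intermediate file instead
    of going to the original data source. Otherwise download as usual.
    """
    if config.snapshot_id:
        # Assumed layout: <snapshot_dir>/<source name>/<snapshot id>
        snapshot_path = snapshot_dir / config.name / config.snapshot_id
        return snapshot_path.read_bytes()
    return download(config.url)
```

Unlike the "skip" flag, pinning a source this way would keep returning the same saved file even after other configuration changes, as long as the snapshot itself is retained. The reproducibility caveat above still applies: anyone without access to the snapshot directory cannot rerun the fetch step.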