GoogleCloudPlatform/covid-19-open-data

Automatic sanity check of data, flagging out-of-range or suspiciously large changes

geening opened this issue · 1 comment

Propose a system that, on each run of the pipeline, would sanity-check the data for values that are out of a reasonable range, or that show a suspiciously large change from one run to the next. Ideally the checks would apply to data at every stage of the pipeline -- input sources, intermediate data, and generated data (output cells indexed by table/variable/key/date) -- but we could start by implementing them wherever this is easiest. The results would be reported in a pipeline status report and/or stored (either appended to a log, or kept in a more structured format or database) for future reference; for instance, when suspicious data is discovered manually, one could look back to see when it was introduced.

Some errors that have come up that would likely be caught by such a system:

- Regions with confirmed cases > population
- Regions with area > area of Earth
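A minimal sketch of what such checks could look like, assuming the data is available as pandas DataFrames keyed by a `key` column; the column names (`total_confirmed`, `population`, `area_sq_km`) and the 10x change threshold are illustrative assumptions, not the project's actual schema:

```python
import pandas as pd

# Assumed upper bound: total surface area of Earth, in km^2.
EARTH_SURFACE_AREA_KM2 = 510_072_000


def flag_out_of_range(df: pd.DataFrame) -> list:
    """Flag rows whose values violate simple static bounds.

    Returns a list of (key, reason) tuples suitable for appending
    to a log or a pipeline status report.
    """
    flags = []
    if {"total_confirmed", "population"} <= set(df.columns):
        # Confirmed cases can never exceed the region's population.
        bad = df[df["total_confirmed"] > df["population"]]
        flags += [(k, "confirmed cases exceed population") for k in bad["key"]]
    if "area_sq_km" in df.columns:
        # No region can be larger than the planet itself.
        bad = df[df["area_sq_km"] > EARTH_SURFACE_AREA_KM2]
        flags += [(k, "area exceeds area of Earth") for k in bad["key"]]
    return flags


def flag_large_changes(prev: pd.DataFrame, curr: pd.DataFrame,
                       column: str, ratio: float = 10.0) -> list:
    """Flag keys whose value grew by more than `ratio`x between two runs."""
    merged = prev.merge(curr, on="key", suffixes=("_prev", "_curr"))
    prev_vals = merged[f"{column}_prev"]
    curr_vals = merged[f"{column}_curr"]
    suspicious = merged[(prev_vals > 0) & (curr_vals / prev_vals > ratio)]
    return [(k, f"{column} changed more than {ratio}x since last run")
            for k in suspicious["key"]]
```

The static checks need only the current run's output, while the change detector needs the previous run's output (or the stored log) to compare against, which is one reason to persist the results of each run.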

Related issue for epidemiology data: #186