GoogleCloudPlatform/covid-19-open-data

Validate numerical output produced by individual data sources

geening opened this issue · 0 comments

For data source processing, we currently test each component of a DataPipeline object.

We also have a dry run in https://github.com/GoogleCloudPlatform/covid-19-open-data/blob/main/src/test/test_source_run.py that checks, for each individual data source, that at least one output has a location key matching a defined regex.

But we do not validate that individual subclasses of DataSource (stored in src/pipelines//.py) actually produce the correct numerical output for particular inputs.

I propose unit testing the parse_dataframes method of each data source. To make such tests easier to write, perhaps we could have a framework that accepts the input and expected output dataframes as CSV files.
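
As a rough illustration of what such a framework could look like, here is a minimal sketch. Everything in it is an assumption made for the example: `ExampleDataSource` is a hypothetical stand-in rather than a real class from the repo, the `parse_dataframes(dataframes, aux, **parse_opts)` signature is assumed, and the column names and the `"raw"` input name are made up. The CSVs are inlined as strings to keep the snippet self-contained; in practice they would live in files next to the test.

```python
import io
from typing import Dict

import pandas as pd
from pandas.testing import assert_frame_equal


class ExampleDataSource:
    """Hypothetical stand-in for a DataSource subclass (not from the repo)."""

    def parse_dataframes(
        self,
        dataframes: Dict[str, pd.DataFrame],
        aux: Dict[str, pd.DataFrame],
        **parse_opts,
    ) -> pd.DataFrame:
        # Assumed signature; the real method may differ.
        data = dataframes["raw"].rename(columns={"cases": "new_confirmed"})
        data["key"] = "US_" + data["state"]
        return data[["date", "key", "new_confirmed"]]


# Made-up input and expected output, written as CSV so they are easy to specify.
INPUT_CSV = """date,state,cases
2020-04-01,CA,10
2020-04-02,CA,12
"""

EXPECTED_CSV = """date,key,new_confirmed
2020-04-01,US_CA,10
2020-04-02,US_CA,12
"""


def check_parse_output(source, input_csv: str, expected_csv: str) -> None:
    """Run parse_dataframes on the CSV input and compare against the CSV output."""
    inputs = {"raw": pd.read_csv(io.StringIO(input_csv))}
    expected = pd.read_csv(io.StringIO(expected_csv))
    actual = source.parse_dataframes(inputs, aux={})
    # Raises AssertionError with a readable diff if any value differs.
    assert_frame_equal(actual.reset_index(drop=True), expected, check_dtype=False)


check_parse_output(ExampleDataSource(), INPUT_CSV, EXPECTED_CSV)
```

A helper like `check_parse_output` could be called from one unittest case per data source, so adding coverage for a new source would mostly mean adding two CSV files.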