ccodwg/FAIRCovid19DataProject

SUSTAINABILITY: Maintaining the list of datasets for the Canadian COVID-19 Data Archive


The list of active and inactive datasets in the Canadian COVID-19 Data Archive, along with all associated metadata, is given in datasets.json, which has hundreds of entries.

This is also the data format used by archivist and Covid19CanadaArchive to produce the nightly automated data updates (#2). It is also used to keep the COVID-19 Canada Open Data Working Group datasets updated (see Covid19CanadaETL, Covid19Canada and CovidTimelineCanada). Each dataset is identified by a unique version 4 UUID.
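
For illustration, here is a minimal sketch of how a new entry could be assigned a version 4 UUID with Python's standard library. The field names shown are hypothetical; the actual schema of datasets.json is not reproduced in this issue:

```python
import json
import uuid

# Hypothetical minimal entry; the real datasets.json entries carry more metadata fields.
new_dataset = {
    "uuid": str(uuid.uuid4()),  # version 4 UUID used as the unique identifier
    "url": "https://example.org/covid-data.csv",
    "file_name": "covid-data",
    "file_ext": "csv",
}

print(json.dumps(new_dataset, indent=2))
```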

This list is maintained manually by the maintainer (me) based on personal knowledge of Canadian COVID-19 datasets, as well as tips from data users in the form of personal communications or GitHub issues. Naturally, this is work-intensive, and it is not always obvious when a new dataset has become available or an old dataset has been retired, leading to potential loss of the historical record.

Main areas of improvement

These are the main areas I see for improving the sustainability of dataset list maintenance:

  • Involve multiple users
    • Perhaps each region could have an assigned "steward" responsible for keeping relevant datasets up-to-date (e.g., one person for Ontario, one person for Quebec, one person for PHAC, one person for Atlantic Canada, etc.)
  • Expand on automation tools to assist with maintenance
    • utils.py currently contains two commonly used functions: retire_dataset, which moves a dataset from "active" to "inactive" in the list of datasets (datasets.json), and list_inactive_datasets, which lists datasets that have produced identical files for a certain number of days, suggesting the dataset may no longer be updated and can safely be moved to "inactive" status (a hedged sketch of this kind of staleness check appears after this list)
  • Create web-based interface for editing
    • Collaboration may be easier if a fool-proof web-based editing interface is created
    • Changes to the underlying file (e.g., datasets.json) must be validated before they are accepted so as not to disrupt the tools that rely on the list of datasets (see the validation sketch after this list)
  • Any changes to the underlying file format (e.g., datasets.json) would have to remain compatible with the existing tools that use this file, such as the nightly archive update process and Covid19CanadaETL
    • Is it possible that datasets.json could be converted to some existing format/standard for this sort of data?
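
To make the automation point concrete, here is a hedged sketch of the kind of staleness check list_inactive_datasets performs. This is not the actual implementation in utils.py; the directory layout, file naming, and hashing scheme below are assumptions:

```python
import hashlib
from pathlib import Path

def files_identical_for(dataset_dir: Path, n_days: int) -> bool:
    """Return True if the n_days most recent archived files for a dataset
    hash identically, suggesting the source is no longer being updated.

    Assumes one archived file per day, named so that lexicographic order
    matches chronological order (an assumption, not the archive's actual layout).
    """
    recent = sorted(dataset_dir.iterdir())[-n_days:]
    if len(recent) < n_days:
        return False  # not enough history to judge
    hashes = {hashlib.sha256(f.read_bytes()).hexdigest() for f in recent}
    return len(hashes) == 1  # all recent files are byte-identical

# Example usage: flag datasets whose last 30 archived files are identical
# archive_root = Path("archive")
# stale = [d.name for d in archive_root.iterdir() if files_identical_for(d, 30)]
```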
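
On the validation point, one option (an assumption, not an existing part of this repository) would be a JSON Schema check run in CI before any edit to datasets.json is merged, for example with the jsonschema package. The schema and the assumed top-level "active"/"inactive" structure here are deliberately minimal and would need to match the real file:

```python
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

# Hypothetical, deliberately minimal schema; the real datasets.json entries
# have many more metadata fields than are shown here.
DATASET_SCHEMA = {
    "type": "object",
    "required": ["uuid", "url"],
    "properties": {
        "uuid": {"type": "string"},
        "url": {"type": "string"},
    },
}

def validate_datasets(path: str = "datasets.json") -> None:
    """Fail loudly if any entry is malformed, so bad edits never reach
    the tools that consume the list (archivist, Covid19CanadaETL, etc.)."""
    with open(path) as f:
        data = json.load(f)
    for section in ("active", "inactive"):
        # Assumes each section is a list of entry dicts; adjust to the real structure.
        for i, entry in enumerate(data.get(section, [])):
            try:
                validate(instance=entry, schema=DATASET_SCHEMA)
            except ValidationError as e:
                raise SystemExit(f"Invalid entry {i} in '{section}': {e.message}")
```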

It would be helpful to find precedents for a community-maintained dataset archive/scraping list.

A precedent for community-maintained data scraping/scrapers: the Police Data Accessibility Project. They even have some kind of Python GUI for helping users write scrapers (note: I haven't checked this out yet).