ccodwg/FAIRCovid19DataProject

HIRING FOR FAIR DATA PROJECT: Job requirements


The tasks we need completed fall into three broad categories: back-end, front-end and metadata/data processing.

Back-end

  • The Archive's data need some place to live that isn't an open S3 bucket. Most likely, this means setting up a Dataverse instance, hopefully in collaboration with FRDR (#3).
  • Dataverse has its own built-in data access API, which will be a critically important part of making the data accessible (#9).
  • We may need a layer on top of the basic Dataverse/Dataverse API setup to handle the data processing pipeline (#15) from raw datasets to FAIRified datasets (e.g., making it easy to establish dataset provenance (#12)).
  • Various other sustainability tasks, such as improving automated data collection for the Archive via further development of the Python-based archivist package (#2), probably count as back-end tasks as well.
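
As a rough illustration of the data access point above, here is a minimal sketch of how a client might build download URLs against Dataverse's Data Access API. The `/api/access/datafile/...` paths follow the Dataverse API guide; the host name, the function names and the example DOI are placeholders, since no instance has been set up yet.

```python
from urllib.parse import urlencode

# Hypothetical host: the project has not yet chosen where Dataverse will run.
BASE_URL = "https://dataverse.example.org"


def datafile_url(file_id: int, base_url: str = BASE_URL) -> str:
    """Build the download URL for a file identified by its numeric database id."""
    return f"{base_url}/api/access/datafile/{file_id}"


def datafile_url_by_pid(persistent_id: str, base_url: str = BASE_URL) -> str:
    """Build the download URL for a file identified by a persistent id (e.g. a DOI)."""
    query = urlencode({"persistentId": persistent_id})  # percent-encodes ':' and '/'
    return f"{base_url}/api/access/datafile/:persistentId?{query}"


if __name__ == "__main__":
    print(datafile_url(42))
    print(datafile_url_by_pid("doi:10.5072/FK2/EXAMPLE"))  # placeholder DOI
```

Any layer we add on top of Dataverse could wrap calls like these, so that users of the Archive never need to know the underlying file ids.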

Front-end

  • The Archive's raw and FAIRified data need to be discoverable (#8). The easiest way to do this would be to set up an instance of a tool like Geodisy, which runs on top of a Dataverse instance and uses GeoServer and GeoBlacklight to facilitate geospatial data discovery. On this note, it may be wise to reach out to the Geodisy/UBC library team, as we may be able to give back to the project by developing additional functionality in the form of plugins, etc.
  • We may need an additional layer/plugins on top of the basic service in order to best present our FAIRified data (#8, #11, #12).
  • Any additional data visualization tasks that may be required.

Metadata/data processing

  • The existing data in the Archive need a metadata taxonomy (#7) and then must be fortified with extensive metadata (#13, #5), both to enhance the findability of the data and to serve as a basis for the dataset processing pipeline for FAIRifying the data (#10, #15).
  • We must create a data processing pipeline to FAIRify raw datasets into a common format (#10, #15). This may use R, Python, SQL or a combination of these languages.
  • Other sustainability tasks such as maintaining the list of datasets that are being actively archived (#4).
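
To make the pipeline idea a little more concrete, here is a minimal sketch of one FAIRification step: renaming a raw dataset's columns to a common schema and stamping each row with its source for provenance. Everything in it is an assumption made for illustration (the common field names, the `fairify` function, the column map and the sample data); the real schema would come out of the metadata taxonomy work (#7).

```python
import csv
import io


def fairify(raw_csv: str, column_map: dict, source: str) -> list:
    """Convert raw CSV text into rows that follow a common schema.

    column_map maps each common field name to the raw dataset's column name.
    source is recorded on every row so provenance is never lost.
    """
    rows = []
    for raw_row in csv.DictReader(io.StringIO(raw_csv)):
        # Keep only the mapped columns, renamed to the common field names.
        row = {common: raw_row[raw_name].strip()
               for common, raw_name in column_map.items()}
        row["source"] = source
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Hypothetical raw dataset and column mapping.
    raw = "Date,Prov,Cases\n2021-01-01,ON,5\n"
    mapping = {"date": "Date", "region": "Prov", "value": "Cases"}
    print(fairify(raw, mapping, source="example-source"))
```

In a setup like this, the per-dataset column maps could live in the metadata itself, which is one way the metadata could serve as the basis for the pipeline.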

One issue with the classification above is that the sub-tasks, and the skills required to complete them, don't divide cleanly into these three categories. For example, setting up Dataverse and setting up Geodisy require similar skillsets, and the entire stack must be integrated. Furthermore, someone with deep knowledge of the data and subject area may be the best person to develop a metadata taxonomy and add metadata, but not the best person to write the code that integrates each dataset into a data processing pipeline. As such, it may be best not to think of the three sections as three separate jobs of roughly equal size, and instead to develop job descriptions based on the general skillsets required.

Thoughts, @colliand?