force11/force11-scwg

use case: data repository wants to link data, software, and papers in provenance trace

Closed this issue · 8 comments

Domain and institutional data repositories have both data and software artifacts, and want to link these together in a provenance trace that can be cited. Sometimes the software is a separately identified artifact, but at other times software is included inside of data packages, and the researcher wants to cite the combined product. See example of mixed data and software package (containing R code) here: https://knb.ecoinformatics.org/#view/doi:10.5063/F1Z899CZ

Research Compendia (http://researchcompendia.org/) is another example of "mixed bags" of data/software. Most "generic" archives (Zenodo/FigShare/Dataverse) make no requirements and perform no atomic identification of "software" in their preserved research objects.

We (GigaScience) publish such "mixed bags" of software and data (e.g., this example of software http://dx.doi.org/10.5524/100046 and related data http://dx.doi.org/10.5524/100045, both linked to this paper http://dx.doi.org/10.1186/2047-217X-2-4), but use DataCite metadata to distinguish between software, datasets, and workflows (see 10.1/resourceTypeGeneral in the DataCite schema https://schema.datacite.org/meta/kernel-3/doc/DataCite-MetadataKernel_v3.1.pdf). The other repos use DataCite DOIs and can do the same (if they aren't already). Is this something to promote in a metadata-system-agnostic way?
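To make the resourceTypeGeneral idea concrete, here is a minimal sketch in Python. The dictionaries are simplified stand-ins for DataCite records, not what any repository actually stores: the key names mirror DataCite property names, but the `IsSupplementTo` relation and the helper functions `group_by_type` and `linked_to` are assumptions chosen for illustration.

```python
# Simplified, hypothetical records for the three DOIs mentioned above.
# Key names mirror DataCite properties; the relationType values are assumed.
PAPER_DOI = "10.1186/2047-217X-2-4"

records = [
    {
        "doi": "10.5524/100046",
        "resourceTypeGeneral": "Software",
        "relatedIdentifiers": [
            {
                "relatedIdentifier": PAPER_DOI,
                "relatedIdentifierType": "DOI",
                "relationType": "IsSupplementTo",
            }
        ],
    },
    {
        "doi": "10.5524/100045",
        "resourceTypeGeneral": "Dataset",
        "relatedIdentifiers": [
            {
                "relatedIdentifier": PAPER_DOI,
                "relatedIdentifierType": "DOI",
                "relationType": "IsSupplementTo",
            }
        ],
    },
]


def group_by_type(records):
    """Group DOIs by their resourceTypeGeneral value."""
    grouped = {}
    for record in records:
        grouped.setdefault(record["resourceTypeGeneral"], []).append(record["doi"])
    return grouped


def linked_to(records, target_doi):
    """Return the sorted DOIs of records that declare a relation to target_doi."""
    return sorted(
        record["doi"]
        for record in records
        if any(
            rel["relatedIdentifier"] == target_doi
            for rel in record["relatedIdentifiers"]
        )
    )
```

With records like these, a consumer can separate software from data by type alone and still find everything tied to the paper, without needing to know which repository minted the DOIs.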

Do you have suggestions for what basic or recommended metadata might support this use case?

We've been working on ProvONE as a metadata container for describing these provenance relationships.
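As a rough illustration of the kind of trace this implies, the sketch below stores provenance statements as plain (subject, predicate, object) tuples in Python. It is not the ProvONE tooling itself: the `ex:` identifiers, the file names, and the `trace` helper are hypothetical, and the prefixed names stand in for full IRIs. The pattern of linking an execution to its program through a qualified association with `prov:hadPlan` is my reading of how ProvONE builds on W3C PROV, so treat it as an assumption.

```python
# Hypothetical provenance statements for one script run inside a data package.
# "ex:" names are made up; "prov:" and "provone:" are used as shorthand prefixes.
triples = [
    ("ex:analysis.R", "rdf:type", "provone:Program"),
    ("ex:run1", "rdf:type", "provone:Execution"),
    ("ex:input.csv", "rdf:type", "provone:Data"),
    ("ex:figure.png", "rdf:type", "provone:Data"),
    ("ex:run1", "prov:used", "ex:input.csv"),
    ("ex:figure.png", "prov:wasGeneratedBy", "ex:run1"),
    ("ex:run1", "prov:qualifiedAssociation", "ex:assoc1"),
    ("ex:assoc1", "prov:hadPlan", "ex:analysis.R"),
]


def objects(triples, subject, predicate):
    """Return all objects of statements matching subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def trace(triples, output):
    """Collect the executions, inputs, and programs behind one output."""
    result = {"executions": [], "inputs": [], "programs": []}
    for run in objects(triples, output, "prov:wasGeneratedBy"):
        result["executions"].append(run)
        result["inputs"].extend(objects(triples, run, "prov:used"))
        for assoc in objects(triples, run, "prov:qualifiedAssociation"):
            result["programs"].extend(objects(triples, assoc, "prov:hadPlan"))
    return result
```

Calling `trace(triples, "ex:figure.png")` walks back from a derived product to the data and the code that produced it, which is the link a citation of the combined package would need to expose.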

Sorry, @mbjones, I meant: do you have a suggestion for what to add to the document and use case table? 😃

@kyleniemeyer I just created PR #136 with a suggested addition to the use case table. The use case could potentially be discussed in more detail, but I wasn't sure where to put it. We could, for example, describe how the software may be part of or embedded within a data package with other research artifacts, and that the software might therefore not be individually citable beyond the data package itself that contains it. If you are happy with this UC addition in the table, then feel free to close this issue. Or let me know where and we can develop a more thorough description in the text.

Thanks @mbjones, I merged the PR (and also added the use case to the Google Doc table).

I think adding this use case is good enough for now, but we should consider giving an example or more discussion in the follow-on implementation examples paper, where we can show how one should cite the software in such a case.