
code and data related to "The Role of Metadata in Reproducible Computational Research"

Primary language: TeX. License: Creative Commons Attribution 4.0 International (CC-BY-4.0)

The Role of Metadata in Reproducible Computational Research

This is a supplemental resource to Leipzig et al. "The Role of Metadata in Reproducible Computational Research" now published in Cell Patterns https://www.cell.com/patterns/fulltext/S2666-3899(21)00170-7

Contributions are welcome!

Organization

├───data/
│   ├───examples/                  Examples of metadata standards
│   ├───lens/                      Search exports for scimetric journal analysis
│   └───standards.tsv              Raw standards table
├───src/
│   ├───cwl/tools/                 CWL configuration to produce the timeline plot
│   ├───manuscript/                Manuscript revision document
│   ├───secrets/
│   │   └───api.template.py        Replace this with api.py using your NCBI/NCBO keys
│   ├───ontologies/                Scimetric ontology popularity analysis
│   ├───repotutils/                Scripts for automating management of this repository
│   ├───scimetric/                 Scimetric journal meta/rcr frequency analysis in a Jupyter Notebook
│   ├───timeline/                  R Markdown document to produce the RCR case study timeline in the paper, incl. helper files for execution with CWL (wrapper script, Dockerfile)
│   ├───wget2jsonld.py             Helper script to convert wget output to jsonld
│   └───wordcloud/                 R script to produce word cloud from cited abstracts
├───LICENSE                        The LICENSE file
├───README.md                      What you are looking at
├───environment.osx.yaml           OSX pinned Conda dependencies
├───environment.unpinned.yaml      Unpinned Conda dependencies
├───ro-crate-metadata.jsonld       RO-Crate config
└───.binder                        Environment configuration files for usage with Binder (mybinder.org)

Examples of RCR metadata standards

In this table we provide links to the authoritative publications and homepages for these metadata standards, as well as examples we have collected. Schema refers to the parent structure this standard conforms to, if any. Encoding refers to the markup format used. Note that for schemas such as OWL, which relies on RDF subject–predicate–object triples, the encoding could be any of at least seven serialization types (RDF/XML, RDF/JSON, JSON-LD, Turtle, N-Triples, N-Quads, N3), so the listed encoding is somewhat arbitrary. For other standards, such as DICOM, the encoding is a custom binary format, although there are numerous export formats and even attempts to serialize JSON within DICOM.
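To illustrate why the listed encoding is somewhat arbitrary for RDF-based standards, the following minimal sketch writes one and the same triple in two of the serializations named above, N-Triples and (expanded) JSON-LD. It uses only the Python standard library and covers a single triple with a plain string literal; the example IRIs and the helper names `to_ntriples` and `to_jsonld` are hypothetical and not part of this repository.

```python
import json


def to_ntriples(subject, predicate, literal):
    """Serialize one triple with a plain string literal as an N-Triples line."""
    escaped = (
        literal.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'<{subject}> <{predicate}> "{escaped}" .'


def to_jsonld(subject, predicate, literal):
    """Serialize the same triple as an expanded JSON-LD node object."""
    return {"@id": subject, predicate: [{"@value": literal}]}


# Hypothetical IRIs and value, chosen for illustration only.
subject = "https://example.org/dataset/1"
predicate = "http://purl.org/dc/terms/title"
title = "Raw standards table"

print(to_ntriples(subject, predicate, title))
print(json.dumps(to_jsonld(subject, predicate, title), indent=2))
```

Both outputs carry the same subject, predicate, and object; only the markup differs. For real data a dedicated RDF library is the safer choice, since it also handles datatypes, language tags, and blank nodes.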

📚 Publication 🏠 Homepage 📋 Example

| Standard | Layer | Domain | Encoding | Schema | Description |
|---|---|---|---|---|---|
| CellML 📚 🏠 📋 | Input | Biology | XML | RDF | mathematical models for biology |
| CIF2 📚 🏠 | Input | Crystallography | Custom | | atomic structure |
| DATS 📚 🏠 | Input | Biomedical | JSON | | desc metadata (people, org, repo) for data pubs |
| DICOM 📚 🏠 📋 | Input | Images | Custom | Key-Value | standard for all medical imaging |
| EML 📚 🏠 | Input | Ecology | XML | | eco support for geo, species, pubs used in KNB |
| FAANG 🏠 | Input | Specimens | Tabular | | |
| GBIF 📚 🏠 | Input | Biodiversity | JSON | | |
| GO 📚 🏠 | Input | Genes | XML | | |
| ISO/TC 276 🏠 | Input | Biotechnology | | | |
| MIAME 📚 🏠 | Input | Microarrays | XML | | |
| NetCDF 📚 🏠 | Input | Arrays | | | |
| OGC 🏠 | Input | Geospatial | | | |
| ThermoML 📚 🏠 | Input | Compounds | XML | | |
| CRAN 🏠 | Tools | R packages | | | |
| Conda 🏠 | Tools | Dependencies | | | |
| pip setup.cfg 🏠 | Tools | Python modules | CFG | Key-Value | Python cfg files have headers and key-value pairs similar to Windows INI files |
| EDAM 📚 🏠 | Tools | Bfx data | | | |
| CodeMeta 🏠 | Tools | Source code | | | |
| Biotoolsxsd 📚 🏠 | Tools | Bfx software | XML | | |
| DOAP 🏠 | Tools | Software | XML | | |
| ontosoft 🏠 | Tools | Geo software | | | |
| SWO 📚 🏠 | Tools | Bfx Software | | | |
| OBCS 📚 🏠 | Reports | Biostatistics | | | |
| STATO 🏠 | Reports | Statistics | | | |
| SDMX 🏠 | Reports | Statistics | JSON | | |
| DDI 🏠 | Reports | Studies | XML | | |
| MEX 📚 🏠 | Reports | ML | XML | | |
| MLSchema 🏠 | Reports | ML | | | |
| MLFlow 🏠 | Reports | ML | | | |
| Rmd 🏠 | Reports | Docs | YAML | Key-Value | |
| CWL 📚 🏠 | Tools, Pipelines | | YAML | Schema Salad | Common Workflow Language specifies how to invoke a command line tool or a pipeline of such tools |
| CWLProv 📚 🏠 | Pipelines | | YAML, JSON, XML | | BagIt of Research Object folder containing manifest (JSON-LD), CWL (YAML), PROV (JSON, XML, RDF) |
| RO-Crate 🏠 | Input, Pipelines, Publication | | JSON-LD | RDF, schema.org | RO-Crate is a profile of using schema.org to annotate any collections of research data and their real-life origins |
| RO 🏠 | Pipelines | | Turtle, JSON-LD, XML | OWL | |
| WICUS 🏠 | Pipelines | | | | |
| OPM 🏠 | Pipelines | | | | |
| PROV-O 🏠 | Pipelines | | | OWL | Several PROV serializations exist; PROV-O is in OWL, which again has many serializations including the RDF syntaxes |
| ReproZip 🏠 | Pipelines | | | | |
| ProvOne 🏠 | Pipelines | | | | |
| WES | Pipelines | | | | |
| BagIt 🏠 | Input, Pipelines | | Text | Key-Value | For long-term preservation and availability, BagIt specifies a fixed folder structure of payload files, their checksums and other metadata tag files. Bags can be archived as zip, tar, etc. or remain folders |
| BCO | Pipelines | | | | |
| ERC 📚 🏠 | Pipelines | Research Compendia | YAML | Key-Value | |
| BEL | Publication | | | | |
| DC | Publication | | | | |
| JATS 🏠 | Publication | Articles | XML | Tags | DTD |
| ONIX | Publication | | | | |
| MeSH | Publication | | | | |
| LCSH | Publication | | | | |
| MP 📚 | Publication | Micropublications | | OWL | |
| Open PHACTS 📚 🏠 | Publication | Drugs | | RDF | |
| SWAN 📚 | Publication | Neuromedicine | | | |
| SPAR 🏠 | Publication | Publishing | | OWL | |
| PWO 📚 | Publication | Publishing | | | |
| PAV 📚 | Publication | Authorship | | OWL | |
| Manubot 📋 | Publication | Publishing | YAML | | |
| ReScience 📋 | Publication | Publishing | YAML | | |
| PandocScholar 📋 | Publication | Publishing | YAML | | |
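The BagIt entry in the table describes a fixed folder structure of payload files, checksums, and tag files. As a rough illustration, the sketch below writes a minimal bag (a `bagit.txt` declaration, a `data/` payload directory, and a `manifest-sha256.txt`) and then re-checks the checksums. It uses only the Python standard library, handles flat file names only, and omits optional tag files such as `bag-info.txt`; the helper names `make_bag` and `verify_bag` and the example payload are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir, payload):
    """Write a minimal bag: bagit.txt, a data/ payload, and a SHA-256 manifest.

    `payload` maps flat file names to their byte content.
    """
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Manifest paths are relative to the bag's base directory.
        manifest_lines.append(f"{digest}  data/{name}")

    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag


def verify_bag(bag_dir):
    """Return True if every payload file still matches its manifest checksum."""
    bag = Path(bag_dir)
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True


# Illustrative payload only; not the actual content of data/standards.tsv.
with tempfile.TemporaryDirectory() as tmp:
    bag = make_bag(Path(tmp) / "example-bag", {"standards.tsv": b"Standard\tLayer\n"})
    print(sorted(p.relative_to(bag).as_posix() for p in bag.rglob("*")))
    print(verify_bag(bag))
```

Because the manifest sits next to the payload, a bag can be zipped, moved, and later validated without any external database.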

RDF vs OWL https://stackoverflow.com/questions/1740341/what-is-the-difference-between-rdf-and-owl

How to generate the timeline for this article

Install cwltool

pip install cwltool
cwltool src/cwl/tools/timeline.cwl --reportfile timeline.html

Note that the tool requires Docker to run the computing environment; see the file src/timeline/Dockerfile for the definition of the image used in the .cwl file.
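If you prefer to drive the timeline build from a script, a small wrapper along the following lines could check for the two prerequisites before invoking the same command shown above. This is a sketch only: the function names `timeline_command` and `run_timeline` are hypothetical and not part of this repository, and the check merely confirms that `cwltool` and `docker` are on the `PATH`, not that the Docker daemon is running.

```python
import shutil
import subprocess


def timeline_command(report="timeline.html", workflow="src/cwl/tools/timeline.cwl"):
    """Build the cwltool command line used to produce the timeline report."""
    return ["cwltool", workflow, "--reportfile", report]


def run_timeline(report="timeline.html"):
    """Check that cwltool and docker are installed, then run the workflow."""
    missing = [tool for tool in ("cwltool", "docker") if shutil.which(tool) is None]
    if missing:
        raise RuntimeError("Missing required tools: " + ", ".join(missing))
    # check=True raises CalledProcessError if cwltool exits with an error.
    subprocess.run(timeline_command(report), check=True)


# Print the command instead of running it; call run_timeline() from the
# repository root to actually build the report.
print(" ".join(timeline_command()))
```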

Run on Binder

MyBinder is a tool for creating executable computing environments based on standard and widely used dependency management files. You can easily run important parts of the analysis for the manuscript by clicking on the badges below. Binder will create a container using the environment configuration from the directory .binder/ and provide you with an interactive environment to execute notebooks or scripts.

  • Scimetric journal frequency analysis of RCR and metadata terms (opens a Jupyter Notebook) Binder
  • Create Figure 2 from the paper (R Markdown notebook, open the file src/timeline/timeline.Rmd manually in RStudio) Binder
  • Create word cloud from cited abstracts (run R script src/wordcloud/wordcloud.R) Binder

For development purposes, you can also run repo2docker locally in the directory of the repository.

repo2docker --editable .

License

CC0