Smart Containers to track Provenance in a Docker ecosystem

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

SmartContainers (sc) for Docker-enabled software and data preservation

SmartContainers is a Python wrapper for Docker that facilitates the recording and tracking of provenance information using the W3C PROV-O recommendation. SmartContainers is being developed as part of the Data and Software Preservation for Open Science (DASPOS) project.

SmartContainers provides a command line tool, sc, that acts as a surrogate for the docker command line tool.

sc --docker <docker command line>

This creates a Docker label containing provenance metadata, expressed in the W3C PROV-O vocabulary, that describes the computational environment created or provided by a particular Docker container.
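As a rough illustration of the labeling step described above, the following sketch builds a small PROV-O record as JSON-LD and turns it into a `--label` argument for the docker command line. It is not taken from the sc source: the label key `smartcontainer`, the `urn:example:` identifier scheme, the function names and the image name are all assumptions made for this example. Only the PROV-O terms (`prov:Entity`, `prov:Activity`, `prov:Agent`, `prov:wasAttributedTo`, `prov:wasGeneratedBy`) and docker's `--label key=value` option are standard.

```python
import json

PROV_NS = "http://www.w3.org/ns/prov#"
FOAF_NS = "http://xmlns.com/foaf/0.1/"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"
LABEL_KEY = "smartcontainer"  # hypothetical label key, not taken from sc


def build_provenance(container_id, agent_name, command):
    """Describe a container as a prov:Entity generated by a prov:Activity."""
    return {
        "@context": {"prov": PROV_NS, "foaf": FOAF_NS, "rdfs": RDFS_NS},
        # Hypothetical identifier scheme, used only for illustration.
        "@id": f"urn:example:container:{container_id}",
        "@type": "prov:Entity",
        "prov:wasAttributedTo": {"@type": "prov:Agent", "foaf:name": agent_name},
        "prov:wasGeneratedBy": {"@type": "prov:Activity", "rdfs:label": command},
    }


def label_args(doc):
    """Serialize the record compactly and wrap it as a docker --label argument."""
    payload = json.dumps(doc, separators=(",", ":"))
    return ["--label", f"{LABEL_KEY}={payload}"]


if __name__ == "__main__":
    doc = build_provenance("abc123", "Ada Lovelace", "docker run ubuntu:22.04")
    # The image name is a placeholder; the list is only printed, not executed.
    print(["docker", "run", *label_args(doc), "ubuntu:22.04"])
```

Passing the arguments as a list, rather than joining them into one shell string, avoids having to quote the JSON payload for the shell.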

A Python setup file is provided for installing the command line utility. Because the tool is under heavy development, it is recommended to install it in a Python virtual environment.

pip install .

This installs the tool and its dependencies into the active virtual environment.

User Identity setup

On first use after installation, the sc command guides the user through connecting the tool with the user's ORCID. It is recommended to set up an ORCID account to connect to the tool. If the user chooses not to create an ORCID account, the tool will prompt for a first and last name, email and organization to use as provenance information. A global configuration file containing this information is then created, so it only needs to be entered once. The configuration file is written to a .sc directory created in your home directory. In the future, the configuration file location will be a user option.
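The first-use flow above can be sketched as follows. This is a simplified illustration rather than the tool's actual code: the file name `config.json`, the field names and the function `load_or_create_identity` are assumptions, and the real ORCID connection step is reduced here to typing in an ORCID iD.

```python
import json
from pathlib import Path

CONFIG_DIR_NAME = ".sc"           # directory name given in this README
CONFIG_FILE_NAME = "config.json"  # hypothetical file name


def load_or_create_identity(home, prompt=input):
    """Return the stored identity, asking for it only if no config exists yet."""
    config_path = Path(home) / CONFIG_DIR_NAME / CONFIG_FILE_NAME
    if config_path.exists():
        return json.loads(config_path.read_text(encoding="utf-8"))

    identity = {"orcid": prompt("ORCID iD (leave blank to skip): ").strip()}
    if not identity["orcid"]:
        # Without an ORCID, fall back to the manually entered details.
        for field in ("first_name", "last_name", "email", "organization"):
            identity[field] = prompt(f"{field}: ").strip()

    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(identity, indent=2), encoding="utf-8")
    return identity


if __name__ == "__main__":
    print(load_or_create_identity(Path.home()))
```

Taking the home directory and the prompt function as parameters keeps the sketch easy to exercise against a temporary directory without touching a real ~/.sc.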

Purpose

For data to be useful to scientists, it must be accompanied by the context of how it was captured, processed and analyzed, along with other provenance information that identifies the people and tools involved in this process. In the computational sciences, some of this context is provided by the identity of the software, the workflows and the computational environment where these computational activities take place. Smart Containers is a tool that wraps the standard docker command line tool with the intent of capturing some of the context that is naturally associated with a Docker-based infrastructure. We capture this metadata using linked open data principles and web standard vocabularies such as the W3C PROV-O recommendation to facilitate interoperability and reuse. This provenance information is attached directly to a Docker container label using JSON-LD, thus "infecting" containers and images derived from the original container resource with the contextual information necessary to understand the identity of the contained computational environment and the activities that environment affords.
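One way to picture how provenance travels with derived resources is the sketch below. It assumes the parent's label value is a JSON-LD document with an `@context` and an `@id`, and it records each derived image as a new node linked to its predecessor with `prov:wasDerivedFrom`. The function name `derive_provenance` and the `urn:example:` identifiers are hypothetical, and the sc tool may structure its label differently.

```python
import json


def derive_provenance(parent_label, child_id):
    """Extend a parent's JSON-LD label with a node for a derived image."""
    doc = json.loads(parent_label)
    context = doc.pop("@context")
    # A label that was already derived once holds a graph of nodes;
    # a fresh label is a single node, so wrap it in a list.
    nodes = doc["@graph"] if "@graph" in doc else [doc]
    child = {
        "@id": child_id,
        "@type": "prov:Entity",
        # Link the new image back to the most recent node in the chain.
        "prov:wasDerivedFrom": {"@id": nodes[-1]["@id"]},
    }
    return json.dumps({"@context": context, "@graph": nodes + [child]})


if __name__ == "__main__":
    base = json.dumps({
        "@context": {"prov": "http://www.w3.org/ns/prov#"},
        "@id": "urn:example:image:base",  # hypothetical identifier
        "@type": "prov:Entity",
    })
    print(derive_provenance(base, "urn:example:image:child"))
```

Because the full chain stays inside the label, anyone inspecting a derived image can trace it back to the original resource without access to an external provenance store.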

Using linked data principles allows us to link to other vocabularies and incorporate other efforts, such as Mozilla Science's Code as a Research Object, Schema.org, DBpedia software vocabularies and ORCID, to provide broader context for how a Docker container may be provisioned, following the "Five Stars of Linked Data Vocabulary Use" recommendation. We have extended the PROV-O notion of Activity by creating the formal ontology pattern of Computational Activity and a taxonomy to capture Computational Environment. Lastly, we provide the ability for scientific data to be published and preserved, along with its provenance, using a Docker container as a "research bundle". We draw on ideas from the W3C Linked Data Platform recommendation and from W3C work on "Linked Data Fragments" using the Hydra Core Vocabulary, which is still under development, to provide metadata for data entry points inside the Docker container, as well as the ability to attach RDF metadata to non-RDF dataset resources, which is a common use case in the sciences.