Metadata management for the National Microbiome Data Collaborative

The purpose of this repository is to manage metadata for the National Microbiome Data Collaborative (NMDC). The NMDC is a multi-organizational effort to enable integrated microbiome data across diverse areas in medicine, agriculture, bioenergy, and the environment. This integrated platform facilitates comprehensive discovery of and access to multidisciplinary microbiome data in order to unlock new possibilities with microbiome data science.

Tasks managed by the repository are:

Generating the schema
Deploying the documentation
Integrating metadata from multiple environmental data repositories

Background

The NMDC Introduction to metadata and ontologies primer describes the context for this project.

Schema

See the slides describing the schema

The NMDC schema is used during the translation process to specify how metadata elements are related.

The schema is also available as:

Documentation

Documentation for the NMDC schema can be browsed here:

https://microbiomedata.github.io/nmdc-metadata/

NMDC data

A zipped file of the NMDC data can be downloaded here (JSON format).
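As a rough sketch of how the zipped JSON download could be read, the snippet below uses only the Python standard library to parse every `.json` member of an archive. The archive layout and the member name `example.json` are assumptions made for illustration; the actual file names in the NMDC download may differ.

```python
import io
import json
import zipfile


def read_json_members(archive):
    """Return {member_name: parsed_json} for every .json file in a zip archive.

    `archive` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return {
            name: json.loads(zf.read(name))
            for name in zf.namelist()
            if name.endswith(".json")
        }


# Build a tiny in-memory archive to stand in for the real download.
# The member name and record below are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("example.json", json.dumps([{"id": "example:1"}]))
buffer.seek(0)

data = read_json_members(buffer)
```

With a real download, the path to the zip file would be passed in place of the in-memory buffer.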

Mapping resource

We use SSSOM to map fields in primary data sources to standard terms. The mapping between the GOLD data and MIxS terms is provided in this SSSOM file.
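To illustrate how an SSSOM mapping can be applied, here is a minimal sketch that reads an SSSOM-style TSV and renames fields in a record. It relies only on the standard `subject_id`, `predicate_id`, and `object_id` columns. The sample rows, the `gold:` and `mixs:` identifiers, and the helper names `load_mappings` and `rename_fields` are hypothetical and are not taken from the actual GOLD to MIxS mapping file.

```python
import csv

# Illustrative SSSOM-style TSV. The identifiers are made up for this sketch.
SAMPLE_SSSOM = (
    "# mapping_set_id: example\n"
    "subject_id\tpredicate_id\tobject_id\n"
    "gold:depth\tskos:exactMatch\tmixs:depth\n"
    "gold:elevation\tskos:exactMatch\tmixs:elev\n"
    "gold:habitat\tskos:closeMatch\tmixs:env_local_scale\n"
)


def load_mappings(text, predicate="skos:exactMatch"):
    """Return {subject_id: object_id} for rows with the given predicate."""
    # SSSOM files may begin with '#'-prefixed metadata lines; skip them.
    lines = [
        line for line in text.splitlines()
        if line and not line.startswith("#")
    ]
    reader = csv.DictReader(lines, delimiter="\t")
    return {
        row["subject_id"]: row["object_id"]
        for row in reader
        if row["predicate_id"] == predicate
    }


def rename_fields(record, mappings):
    """Rename mapped keys in a record; unmapped keys are kept unchanged."""
    return {mappings.get(key, key): value for key, value in record.items()}


mappings = load_mappings(SAMPLE_SSSOM)
renamed = rename_fields({"gold:depth": "10 m", "gold:habitat": "soil"}, mappings)
```

Filtering on the predicate keeps only exact matches by default, so looser mappings such as `skos:closeMatch` are left for manual review rather than applied automatically. That is one possible design choice, not necessarily the one used in this project.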

Standardization of characteristics

Entities in the schema are annotated with characteristics. When possible, we use standard terminologies and ontologies to define these characteristics. These standards include:

We are actively involved in updating the MIxS standards (mixs-ng) and creating an RDF version of MIxS (mixs-rdf).

See also our analysis of MIxS descriptors

Metadata sources

At present, we ingest metadata from the Joint Genome Institute (JGI) and the Environmental Molecular Sciences Laboratory (EMSL).

The NMDC schema and translation process will be modified as more metadata sources become available.

Metadata integration

We use Jupyter notebooks to integrate the metadata sources. This allows us to iterate quickly in a transparent and interactive manner as new metadata sources become available.

Development of a more comprehensive ETL pipeline will progress as the metadata sources and schema become more concrete.
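The kind of integration step a notebook might perform can be sketched as follows: records from each source are translated into a common shape using a per-source field map and then combined. The field names, field maps, and sample records are all hypothetical; they do not reflect the real JGI or EMSL metadata or the NMDC schema.

```python
# Hypothetical field maps: source field name -> common field name.
JGI_FIELDS = {"gold_id": "id", "project_name": "name"}
EMSL_FIELDS = {"dataset_id": "id", "title": "name"}


def translate(record, field_map, source):
    """Copy mapped fields into a new record and tag it with its source."""
    translated = {
        target: record[field]
        for field, target in field_map.items()
        if field in record
    }
    translated["source"] = source
    return translated


def integrate(sources):
    """Combine records from several sources.

    `sources` is an iterable of (source_name, field_map, records) tuples.
    """
    merged = []
    for source_name, field_map, records in sources:
        merged.extend(translate(r, field_map, source_name) for r in records)
    return merged


# Made-up sample records standing in for real source exports.
jgi_records = [{"gold_id": "example:jgi-1", "project_name": "Soil sample A"}]
emsl_records = [{"dataset_id": "example:emsl-1", "title": "Soil sample A proteomics"}]

merged = integrate([
    ("JGI", JGI_FIELDS, jgi_records),
    ("EMSL", EMSL_FIELDS, emsl_records),
])
```

Keeping the field maps as plain data makes it straightforward to add a new source by supplying another map, which suits the iterative approach described above.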

Identifiers

See identifiers documentation
