The RCSB data models

Primary language: Java. License: MIT.

RCSB-MOJAVE-MODEL

The RCSB core data models describe the organisation of data available to RCSB services. We use JSON Schema as a declarative language to describe the structure of the data, the constraints that apply to it, and other metadata that can guide data transformations and facilitate data use.

Schema-First Development Practice

The design and implementation of RCSB services that rely on data definitions follow a schema-first development practice. To ensure a mutual understanding of data warehouse (DW) content, a centralized schema establishes a contract between:

In this context, JSON Schema offers a single source of truth that is used to validate data and to programmatically generate additional resources, e.g. Plain Old Java Objects (POJOs), which can be reused throughout the pipeline.

For example, in the data warehouse (DW) we use JSON schemas to automatically derive the validation constraints that are applied on DB inserts. The DW uses MongoDB as its storage solution, where data is persisted as documents (JSON-style objects).

Another example is text indexing: we use Elasticsearch, and the index configuration (mapping) is automatically derived from the JSON schemas.

Versioning

Versioning of schema files is handled by adding a tag to a specific commit in the repository's history that marks a release point. Version numbers should follow the Semantic Versioning Specification (SemVer). A release version takes the x.y.z form, where x is the major version, y is the minor version, and z is the patch version (e.g. 0.1.0). Pre-release versions are denoted by appending a hyphen and a series of dot-separated identifiers immediately following the patch version (e.g. 1.0.0-3.7, 1.0.0-alpha.3.7, 1.0.0-dev.3.7).

Given a version number major.minor.patch, increment the:

  • major version when changes are incompatible with the previous version (removing fields, changing a field's name or type),
  • minor version when changes are backwards-compatible (adding new fields), and
  • patch version when changes are backwards-compatible and related to metadata (changing field description, adding examples, etc.).
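The three rules above can be sketched as a small helper. This is only an illustration and not part of the project: the class name, the enum constants and the method are hypothetical. Resetting the lower components to zero follows the SemVer specification. Pre-release versions are rejected here to keep the sketch short.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: maps a kind of schema change to the next release version. */
final class SchemaVersionBump {

    /** Change kinds named after the three rules above (names are illustrative). */
    enum Change { INCOMPATIBLE, BACKWARDS_COMPATIBLE, METADATA_ONLY }

    // Plain x.y.z release versions only; pre-release suffixes do not match.
    private static final Pattern RELEASE = Pattern.compile("(\\d+)\\.(\\d+)\\.(\\d+)");

    static String next(String current, Change change) {
        Matcher m = RELEASE.matcher(current);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an x.y.z release version: " + current);
        }
        int major = Integer.parseInt(m.group(1));
        int minor = Integer.parseInt(m.group(2));
        int patch = Integer.parseInt(m.group(3));
        switch (change) {
            case INCOMPATIBLE:
                // Removing fields, changing a field's name or type.
                return (major + 1) + ".0.0";
            case BACKWARDS_COMPATIBLE:
                // Adding new fields.
                return major + "." + (minor + 1) + ".0";
            default:
                // Metadata only: descriptions, examples, etc.
                return major + "." + minor + "." + (patch + 1);
        }
    }
}
```

For instance, adding a new field to a schema released as 1.4.2 would call `SchemaVersionBump.next("1.4.2", SchemaVersionBump.Change.BACKWARDS_COMPATIBLE)`.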

Schema Sources

Schema source files are stored in the schemas directory.

This project has two types of schema sources:

  • schemas/exchange - automatically generated JSON Schema files that contain definitions for data items coming from the Exchange DB. These files are generated rather than hand-edited and should not be updated directly. To change a definition, update it in the py-rcsb_exdb_assets repository, regenerate the JSON Schema files, and push them to this project.
  • schemas/internal - manually curated JSON Schema files that contain definitions for data items that are NOT coming from the Exchange DB. If these definitions need to change, edit the files in this folder directly.

Product

This module contains:

  • Automatically generated JSON schemas for core collections: target/generated-sources/schema/core.
  • Automatically generated schemas for MongoDB validation: target/generated-sources/schema/validation.
  • Automatically generated POJOs from combined core JSON schemas: target/generated-sources/classes/org.rcsb.mojave.auto.
  • Automatically generated POJOs from UniProt KB schema: target/generated-sources/classes/org.rcsb.uniprot.auto.
  • Automatically generated Enum Types from combined core JSON schemas: target/generated-sources/classes/org.rcsb.mojave.enumeration.
  • Automatically generated Java class with schema fields defined as constants: target/generated-sources/classes/org.rcsb.mojave.CoreConstants.java.

Cardinal Identifier Containers

Cardinal identifier containers record how the data described by the core schemas are related. Each core has a container specific to that core:

  • rcsb_assembly_container_identifiers for assembly core
  • rcsb_entry_container_identifiers for entry core
  • rcsb_polymer_entity_container_identifiers for entity core
  • rcsb_uniprot_container_identifiers for uniprot core

The contents of these containers are pointers to entries in related cores.
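As a rough sketch of how such a pointer might be read, the helper below pulls one identifier out of a named container, with the document held as nested maps. The class and method are hypothetical, and the identifier field names inside each container are not listed in this README, so any field name passed in (for example `entry_id`) is an assumption to check against the schema.

```java
import java.util.Map;

/** Hypothetical helper: reads a pointer from a cardinal identifier container. */
final class ContainerLookup {

    /**
     * Returns the value of one identifier field from the named container,
     * or null when the container or the field is absent.
     */
    static String identifier(Map<String, ?> document, String container, String field) {
        Object node = document.get(container);
        if (!(node instanceof Map)) {
            return null; // this document does not carry the requested container
        }
        Object value = ((Map<?, ?>) node).get(field);
        return value == null ? null : value.toString();
    }
}
```

Given a polymer entity document, `ContainerLookup.identifier(doc, "rcsb_polymer_entity_container_identifiers", "entry_id")` would return the key to use when looking up the related document in the entry core, assuming the container exposes a field with that name.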

Schema Integration Rules

As we integrate multiple sources into a single schema, we want to preserve the original schema as closely as possible. However, it may be necessary to change the original schema. When this happens, a new or modified data item is appended to the original schema under the rcsb_ namespace to indicate provenance. The following rules govern schema integration:

  • Data mutation (e.g. changing a value or adding a new one) should be added as a new rcsb_ item.
  • Data aggregation (e.g. merging multiple original data items into a new object) should be added as a new rcsb_ item.
  • Data reduction (e.g. filtering an array) should be added as a new rcsb_ item.
  • Schema reduction (e.g. removing fields from an original data item) should be integrated with the original schema.
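
The rules above amount to a simple decision: only schema reduction modifies the original item in place, and everything else produces an item in the rcsb_ namespace. The sketch below encodes that decision. The class, enum and method names are hypothetical, and deriving the new name by prefixing the original one is an assumption made for illustration, not a documented project convention.

```java
/** Hypothetical encoding of the schema integration rules. */
final class IntegrationRules {

    /** The four kinds of change named in the rules (constant names are illustrative). */
    enum Change { DATA_MUTATION, DATA_AGGREGATION, DATA_REDUCTION, SCHEMA_REDUCTION }

    static final String NAMESPACE = "rcsb_";

    /** True when the rules ask for a new rcsb_ item instead of an edit to the original schema. */
    static boolean needsRcsbItem(Change change) {
        return change != Change.SCHEMA_REDUCTION;
    }

    /**
     * Name under which the changed item would be published.
     * Assumption: the new item is named by prefixing the original name.
     */
    static String targetName(String originalName, Change change) {
        if (!needsRcsbItem(change) || originalName.startsWith(NAMESPACE)) {
            return originalName;
        }
        return NAMESPACE + originalName;
    }
}
```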

Access Schemas

The Schema Resolver (org.rcsb.mojave.model.SchemaResolver) fetches a requested schema from the schema registry.

The Schema Registry (org.rcsb.mojave.model.SchemaRegistry) assigns a unique name to each schema 'flavour'. Each type maps to a given schema through a property name defined in the 'model.module.properties' resource file. These properties are taken from the pom.xml file, where the schema names are configured.

Validation with JSON schema

Data in MongoDB has a flexible schema: collections do not enforce document structure by default. We do want, however, to have a clear idea of what is going into the database. Since v3.6, MongoDB has provided schema validation during updates and insertions. Redwood uses JSON schemas as input to the MongoDB JSON Schema Validator to introduce validation checks at the database level, so that data integrity is ensured.

Because of the way MongoDB implements JSON Schema, some specification-compliant definitions are not supported. BSON schemas compatible with the validator are generated and stored in target/generated-sources/schema/validation.
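
To give a feel for the kind of rewriting this involves, here is a minimal sketch that removes keywords which the MongoDB documentation lists as omitted from its $jsonSchema support ($ref, $schema, default, definitions, format, id) and replaces the unsupported "integer" type with a bsonType. It is not the generator used by this project: the class is hypothetical, schemas are held as plain nested maps, $ref is dropped rather than resolved, and "integer" inside a type array is not handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: makes a JSON Schema tree acceptable to MongoDB's $jsonSchema. */
final class ValidationSchemaSketch {

    // Keywords documented by MongoDB as not supported in $jsonSchema.
    private static final Set<String> UNSUPPORTED =
            Set.of("$ref", "$schema", "default", "definitions", "format", "id");

    /** Returns a cleaned copy of one schema node; the input is not modified. */
    static Map<String, Object> strip(Map<String, Object> schema) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : schema.entrySet()) {
            String key = e.getKey();
            Object value = e.getValue();
            if (UNSUPPORTED.contains(key)) {
                continue;
            }
            boolean holdsFieldNames = "properties".equals(key) || "patternProperties".equals(key);
            if (holdsFieldNames && value instanceof Map) {
                // Keys here are field names, not keywords: keep them, clean their schemas.
                Map<String, Object> fields = new LinkedHashMap<>();
                for (Map.Entry<?, ?> f : ((Map<?, ?>) value).entrySet()) {
                    fields.put(String.valueOf(f.getKey()), stripValue(f.getValue()));
                }
                out.put(key, fields);
            } else if ("type".equals(key) && "integer".equals(value)) {
                // The "integer" type is not supported; bsonType int or long is used instead.
                out.put("bsonType", List.of("int", "long"));
            } else {
                out.put(key, stripValue(value));
            }
        }
        return out;
    }

    @SuppressWarnings("unchecked")
    private static Object stripValue(Object value) {
        if (value instanceof Map) {
            return strip((Map<String, Object>) value);
        }
        if (value instanceof List) {
            List<Object> cleaned = new ArrayList<>();
            for (Object item : (List<?>) value) {
                cleaned.add(stripValue(item));
            }
            return cleaned;
        }
        return value;
    }
}
```

The special handling of "properties" matters because a data field may legitimately be called "id" or "format"; only keywords at the schema level are removed.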