
Semantic prefix map registry


prefixmaps

A Python library for retrieving semantic prefix maps.

A semantic prefix map maps a prefix (e.g. skos) to a namespace (e.g. http://www.w3.org/2004/02/skos/core#).
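As a minimal illustration, independent of this library's API, a prefix map can be pictured as a dictionary from prefix to namespace. Expanding a CURIE then means swapping the prefix for its namespace. The `expand` helper and the `PREFIX_MAP` name below are hypothetical and exist only for this sketch.

```python
from typing import Dict, Optional

# A toy prefix map: prefix -> namespace (illustrative only)
PREFIX_MAP: Dict[str, str] = {
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "owl": "http://www.w3.org/2002/07/owl#",
}


def expand(curie: str, prefix_map: Dict[str, str]) -> Optional[str]:
    """Expand a CURIE such as 'skos:Concept' to a full URI.

    Returns None when the input has no colon or the prefix is unknown.
    """
    prefix, sep, local_id = curie.partition(":")
    if not sep or prefix not in prefix_map:
        return None
    return prefix_map[prefix] + local_id


print(expand("skos:Concept", PREFIX_MAP))
# http://www.w3.org/2004/02/skos/core#Concept
```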

This repository and the corresponding library are designed to satisfy the following requirements:

  • generation of prefix maps in headers of RDF documents
  • use in tools that expand CURIEs and short-form identifiers to URIs that can be used as subjects of RDF triples
  • coverage of prefixes from multiple different domains
  • no single authoritative source of either prefixes or prefix-namespace mappings (clash-resilient)
  • preferred semantic namespace is prioritized over web URLs
  • authority preferred prefix is prioritized where possible
  • each individual prefix map is case-insensitive bijective
  • prefix map composition and custom ordering of prefixmaps
  • lightweight / low footprint
  • fast (TODO)
  • network-independence / versioned prefix maps
  • optional ability to retrieve latest from external authority on network
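The "case-insensitive bijective" requirement can be made concrete with a small check. The sketch below assumes one reading of that requirement: within a single prefix map, no two prefixes and no two namespaces are equal once lowercased. The function name is hypothetical and is not part of the library.

```python
from typing import Dict


def is_case_insensitive_bijective(prefix_map: Dict[str, str]) -> bool:
    """Return True if prefixes and namespaces are each unique ignoring case."""
    prefixes = [prefix.lower() for prefix in prefix_map]
    namespaces = [namespace.lower() for namespace in prefix_map.values()]
    return (
        len(set(prefixes)) == len(prefixes)
        and len(set(namespaces)) == len(namespaces)
    )


# Two prefixes that differ only in case violate the property.
clashing = {
    "geo": "http://www.opengis.net/ont/geosparql#",
    "GEO": "http://purl.obolibrary.org/obo/GEO_",
}
clean = {
    "GEO": "http://purl.obolibrary.org/obo/GEO_",
    "owl": "http://www.w3.org/2002/07/owl#",
}

print(is_case_insensitive_bijective(clashing))  # False
print(is_case_insensitive_bijective(clean))     # True
```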

What this is NOT intended for:

  • a general source of metadata about either prefixes or namespaces
  • a mechanism for resolving identifiers to web URLs for humans to find information

Installation

pip install prefixmaps

Usage

To use in combination with the curies library:

from prefixmaps import load_converter
from curies import Converter

converter: Converter = load_converter(["obo", "bioregistry.upper", "linked_data", "prefixcc"])

>>> converter.expand("CHEBI:1")
'http://purl.obolibrary.org/obo/CHEBI_1'
>>> converter.expand("GEO:1")
'http://purl.obolibrary.org/obo/GEO_1'
>>> converter.expand("owl:Class")
'http://www.w3.org/2002/07/owl#Class'
>>> converter.expand("FlyBase:FBgn123")
'http://identifiers.org/fb/FBgn123'

Alternate orderings / clash resilience

  • prefix.cc uses the prefix geo for geosparql http://www.opengis.net/ont/geosparql#
  • OBO uses prefix GEO for the Geographical Entity Ontology, expanding to http://purl.obolibrary.org/obo/GEO_
  • the Bioregistry uses the prefix geo for NCBI GEO, and "re-mints" a geogeo prefix for the OBO ontology

If we prioritize prefix.cc, the OBO prefix is ignored:

converter = load_converter(["prefixcc", "obo"])

>>> converter.expand("GEO:1")
>>> converter.expand("geo:1")
'http://www.opengis.net/ont/geosparql#1'

Even though prefix expansion is case-sensitive, we intentionally block conflicts that differ only in case.
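The blocking behaviour described here can be sketched in plain Python. This is a simplified model, not the library's implementation: `merge_prefix_maps` is a hypothetical helper that walks the maps in priority order and drops any entry whose prefix or namespace has already been claimed, ignoring case. How the library itself records the blocked entries is not shown.

```python
from typing import Dict, List


def merge_prefix_maps(ordered_maps: List[Dict[str, str]]) -> Dict[str, str]:
    """Merge prefix maps; earlier maps win, case-only clashes are blocked."""
    merged: Dict[str, str] = {}
    seen_prefixes = set()
    seen_namespaces = set()
    for prefix_map in ordered_maps:
        for prefix, namespace in prefix_map.items():
            # Skip anything that clashes (ignoring case) with a
            # higher-priority entry.
            if prefix.lower() in seen_prefixes:
                continue
            if namespace.lower() in seen_namespaces:
                continue
            merged[prefix] = namespace
            seen_prefixes.add(prefix.lower())
            seen_namespaces.add(namespace.lower())
    return merged


prefixcc = {"geo": "http://www.opengis.net/ont/geosparql#"}
obo = {"GEO": "http://purl.obolibrary.org/obo/GEO_"}

# prefix.cc first: the OBO "GEO" entry is blocked by "geo".
print(merge_prefix_maps([prefixcc, obo]))
# OBO first: the prefix.cc "geo" entry is blocked by "GEO".
print(merge_prefix_maps([obo, prefixcc]))
```

Under this model the order of the list alone decides which of two clashing prefixes survives, which matches the orderings shown in this section.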

If we put bioregistry at the start of the list, then GEOGEO can be used as the prefix for the OBO ontology:

converter = load_converter(["bioregistry", "prefixcc", "obo"])

>>> converter.expand("geo:1")
'http://identifiers.org/geo/1'
>>> converter.expand("GEO:1")
>>> converter.expand("GEOGEO:1")
'http://purl.obolibrary.org/obo/GEO_1'

Note that from the OBO perspective, GEOGEO is non-canonical.

We get similar results using the upper-normalized variant of bioregistry:

converter = load_converter(["bioregistry.upper", "prefixcc", "obo"])

>>> converter.expand("GEO:1")
'http://identifiers.org/geo/1'
>>> converter.expand("geo:1")
>>> converter.expand("GEOGEO:1")
'http://purl.obolibrary.org/obo/GEO_1'

Users of OBO ontologies will want to place OBO at the start of the list:

converter = load_converter(["obo", "bioregistry.upper", "prefixcc"])

>>> converter.expand("geo:1")
>>> converter.expand("GEO:1")
'http://purl.obolibrary.org/obo/GEO_1'
>>> converter.expand("GEOGEO:1")

Note that under this ordering there is no prefix for NCBI GEO. This is not a major limitation, as there is no canonical semantic rendering of NCBI GEO. One could be added in the future with a unique OBO prefix.

You can use the ready-made "merged" prefix set, which prioritizes OBO:

converter = load_converter("merged")

>>> converter.expand("GEOGEO:1")
>>> converter.expand("GEO:1")
'http://purl.obolibrary.org/obo/GEO_1'
>>> converter.expand("geo:1")

Network independence and requesting latest versions

By default, load_converter uses the prefix map data distributed alongside the package. This helps reproducibility, but it means that if a new ontology or prefix is added to an upstream source, you won't see it.

To refresh and use the latest upstream:

converter = load_converter("obo", refresh=True)

This will perform a fetch from http://obofoundry.org/registry/obo_prefixes.ttl

Context Metadata

See contexts.curated.yaml

See the description fields

Repository organization

Data files containing pre-built prefix maps from sources like OBO and Bioregistry are distributed alongside the Python package.

Location:

CSV field descriptions

  1. context: a unique handle for this context. This MUST be the same as the basename of the file
  2. prefix: corresponds to http://www.w3.org/ns/shacl#prefix
  3. namespace: corresponds to http://www.w3.org/ns/shacl#namespace
  4. canonical: true if this satisfies bijectivity
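To show how these four fields fit together, here is a minimal sketch that reads such a file with the standard library and keeps only the canonical rows. The sample rows, the presence of a header row, and the `True`/`False` spelling of the canonical column are assumptions made for illustration; check an actual data file before relying on them. `load_canonical` is a hypothetical helper, not a library function.

```python
import csv
import io
from typing import Dict

# Illustrative rows only; "example" stands in for a file named example.csv.
SAMPLE_CSV = """context,prefix,namespace,canonical
example,geo,http://www.opengis.net/ont/geosparql#,True
example,GEO,http://purl.obolibrary.org/obo/GEO_,False
example,owl,http://www.w3.org/2002/07/owl#,True
"""


def load_canonical(csv_text: str) -> Dict[str, str]:
    """Build a prefix -> namespace dict from the rows marked canonical."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["prefix"]: row["namespace"]
        for row in reader
        if row["canonical"].strip().lower() == "true"
    }


for prefix, namespace in load_canonical(SAMPLE_CSV).items():
    print(prefix, namespace)
```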

Refreshing the Data

The data can be refreshed in several ways:

  1. Locally, you can use tox with:

    pip install tox tox-poetry
    tox -e refresh
  2. Manually running and automatically committing via this GitHub Actions workflow.

  3. Running the makefile (warning: this requires some pre-configuration):

    make etl

TODO: make a GitHub Action that auto-releases new versions

Note that PRs should not be made against the individual CSV files. These are generated from upstream sources.

We temporarily house a small number of curated prefixmaps such as linked_data.yaml, with the CSV generated from the YAML.

Our goal is to ultimately cede these to upstream sources.

Requesting new prefixes

This repo is NOT a prefix registry. Its job is simply to aggregate different prefix maps. Request changes upstream.