Hosting an official "OBO context" of all prefixes relevant to biological data integration
In our OBO universe, we mostly care about OBO PURLs and CURIEs, which are one of the most important achievements of our community. For example, the PURL http://purl.obolibrary.org/obo/CL_123 (and the corresponding CURIE CL:123) represents an entity in the CL ontology.
The reality is that OBO is merging more and more with a wider, interconnected world of biological and biomedical data efforts, whether they are scientific databases, clinical efforts or ontology standards, and it makes sense to try and organise our relationship with these a bit more.
One universal problem we face is the interpretation and representation of cross-references to biological databases (such as Reactome, UniProt and more) or medical terminologies (such as MeSH, SNOMED or MedDRA). We typically represent cross-references as "CURIE strings", such as OMIM:231200, and link them to the ontology concepts using the oboInOwl:hasDbXref relationship.
Now there are two things we may want to do with this cross reference:
- We may wish to provide a linkout to the related resource, so that a user can "look at additional information about this concept"
- We may want to offer the opportunity for data integration efforts to connect information related to both resources (the ontology and the referenced external resources)
Either use case requires the expansion of the CURIE (e.g. OMIM:231200) to a URL (e.g. https://omim.org/entry/231200 or https://identifiers.org/meddra:10015919). The problem, however, is that not only can the CURIE prefix have 20 alternatives (omim, mim, MIM), but even worse (for us), there can be dozens of valid URI expansions (https://omim.org/MIM:603903, https://www.omim.org/entry/603903 and many, many more), which means that the datasets we make publicly available need to be cumbersomely stitched together using custom ETL pipelines after the fact.
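To make the problem concrete, here is a minimal Python sketch of CURIE expansion against a plain prefix map. The `PREFIX_MAP` dictionary and the `expand` helper are hypothetical names used only for illustration; the URI prefixes are taken from the examples above. A map that knows only one spelling of a prefix silently fails for all the others:

```python
from typing import Optional

# Hypothetical prefix map with exactly one spelling per resource.
PREFIX_MAP = {
    "OMIM": "https://omim.org/entry/",
    "CL": "http://purl.obolibrary.org/obo/CL_",
}


def expand(curie: str, prefix_map: dict) -> Optional[str]:
    """Expand a CURIE to a URI, or return None if the prefix is unknown."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or prefix not in prefix_map:
        return None
    return prefix_map[prefix] + local_id


print(expand("OMIM:231200", PREFIX_MAP))  # https://omim.org/entry/231200
print(expand("CL:123", PREFIX_MAP))       # http://purl.obolibrary.org/obo/CL_123
print(expand("mim:231200", PREFIX_MAP))   # None, because this spelling is not in the map
```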
For these kinds of reasons (making integration easier), it makes sense to try and unify the use of CURIE prefixes ("when you provide an OMIM identifier, you use the prefix OMIM, not mim, not omim, not MIM") and URI prefixes ("when you expand an OMIM CURIE, you use https://omim.org/MIM:603903").
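As a rough sketch of what such unification could look like in code, the following Python snippet rewrites any known spelling of a CURIE prefix or URI prefix to the preferred one. The lookup tables and function names are hypothetical and hand-written for this example; only the OMIM spellings and URLs mentioned in this issue are included.

```python
from typing import Optional

# Hypothetical tables: every known spelling points at the preferred CURIE prefix.
PREFERRED_PREFIX = {"OMIM": "OMIM", "omim": "OMIM", "mim": "OMIM", "MIM": "OMIM"}
# Preferred URI prefix per preferred CURIE prefix.
PREFERRED_URI_PREFIX = {"OMIM": "https://omim.org/MIM:"}
# Known URI prefix spellings, each mapped to the preferred CURIE prefix.
URI_PREFIX_SYNONYMS = {
    "https://omim.org/MIM:": "OMIM",
    "https://omim.org/entry/": "OMIM",
    "https://www.omim.org/entry/": "OMIM",
}


def standardize_curie(curie: str) -> Optional[str]:
    """Rewrite a CURIE so that it uses the preferred prefix."""
    prefix, sep, local_id = curie.partition(":")
    preferred = PREFERRED_PREFIX.get(prefix) if sep else None
    return f"{preferred}:{local_id}" if preferred else None


def standardize_uri(uri: str) -> Optional[str]:
    """Rewrite a URI so that it uses the preferred URI prefix."""
    for uri_prefix, prefix in URI_PREFIX_SYNONYMS.items():
        if uri.startswith(uri_prefix):
            return PREFERRED_URI_PREFIX[prefix] + uri[len(uri_prefix):]
    return None


print(standardize_curie("mim:231200"))                        # OMIM:231200
print(standardize_uri("https://www.omim.org/entry/603903"))   # https://omim.org/MIM:603903
```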
@cthoyt has been at the forefront of an effort to bring some order into the current anarchy. One of the concepts he has developed, and which I personally find super awesome, is the idea of "organisational contexts" hosted as part of Bioregistry. These are powered by an enormous database of CURIE-prefix/URI-prefix combinations, but allow for the flexibility of hardcoding certain preferences such as "we know that PubMed would like us to use the pubmed prefix for CURIEs, but we really want to use PMID for historical reasons and more". To this end, @anitacaron and I are maintaining a context we wholly illegally called the "OBO context", which reflects some of our community's preferences: https://bioregistry.io/context/obo.
The second cool piece of the system @cthoyt developed is the so-called "Extended Prefix Map (EPM)". Here is an example. The cool thing is that the EPM not only contains a clean prefix map, it also contains all existing synonyms, for example:
    "pattern": "^C?\\d+$",
    "prefix": "Orphanet",
    "prefix_synonyms": [
        "ordo",
        "orphanet.ordo"
    ],
    "uri_prefix": "http://www.orpha.net/ORDO/Orphanet_",
    "uri_prefix_synonyms": [
        "http://bioregistry.io/ordo:",
        "http://bioregistry.io/orphanet.ordo:",
        "http://identifiers.org/orphanet.ordo/",
        ....
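To illustrate how a record like this could be put to work, here is a minimal Python sketch that compresses any of the listed URI spellings into a CURIE with the preferred prefix and checks the local identifier against the record's pattern. The helper names `build_uri_index` and `compress` are hypothetical, and the record contains only the three URI prefix synonyms visible in the truncated excerpt above.

```python
import re
from typing import Optional

# One EPM record copied from the excerpt above. The synonym list is truncated
# in the excerpt, so only the three entries shown there are included here.
EPM = [
    {
        "pattern": "^C?\\d+$",
        "prefix": "Orphanet",
        "prefix_synonyms": ["ordo", "orphanet.ordo"],
        "uri_prefix": "http://www.orpha.net/ORDO/Orphanet_",
        "uri_prefix_synonyms": [
            "http://bioregistry.io/ordo:",
            "http://bioregistry.io/orphanet.ordo:",
            "http://identifiers.org/orphanet.ordo/",
        ],
    }
]


def build_uri_index(records: list) -> dict:
    """Map every URI prefix (preferred or synonym) to its record."""
    index = {}
    for record in records:
        for uri_prefix in [record["uri_prefix"], *record.get("uri_prefix_synonyms", [])]:
            index[uri_prefix] = record
    return index


def compress(uri: str, records: list) -> Optional[str]:
    """Compress a URI to a CURIE with the preferred prefix, or return None."""
    index = build_uri_index(records)
    # Try longer URI prefixes first so that the most specific one wins.
    for uri_prefix in sorted(index, key=len, reverse=True):
        if uri.startswith(uri_prefix):
            record = index[uri_prefix]
            local_id = uri[len(uri_prefix):]
            pattern = record.get("pattern")
            if pattern and not re.match(pattern, local_id):
                return None  # the local identifier does not look valid
            return f"{record['prefix']}:{local_id}"
    return None


print(compress("http://identifiers.org/orphanet.ordo/558", EPM))  # Orphanet:558
print(compress("http://bioregistry.io/ordo:C001", EPM))           # Orphanet:C001
print(compress("http://identifiers.org/orphanet.ordo/abc", EPM))  # None
```

In practice, a dedicated library (for example the `curies` Python package) would likely replace this hand-rolled lookup; the sketch is only meant to show why carrying the synonyms alongside the clean prefix map is useful.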
TLDR:
We need a way to contract and expand CURIEs / URIs to facilitate data integration in the OBO domain beyond our OBO PURLs.
I am proposing to host the OBO context we have been using in some of our software packages on OBOFoundry.github.io. The idea is that once per month, a GitHub action will pull updates to the context from Bioregistry and make a PR here in the repo with the proposed changes. Then, a TWG member will review the changes, flag controversial ones and reflect them in the Bioregistry context. And then, we promote the use of the EPM hosted here, on OBOFoundry.github.io, universally as the source for prefix compression and expansion for the entire OBO community (not as a law, just as a "SHOULD" type of thing).
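To give reviewers an idea of what the monthly PR could surface, here is a small Python sketch that compares the currently hosted prefix map with a freshly pulled one and reports added, removed and changed prefixes. The function name `diff_prefix_maps` and both example maps are made up for illustration; the actual GitHub action and file layout are still to be decided.

```python
def diff_prefix_maps(hosted: dict, upstream: dict) -> dict:
    """Summarise how the upstream prefix map differs from the hosted one."""
    added = sorted(upstream.keys() - hosted.keys())
    removed = sorted(hosted.keys() - upstream.keys())
    changed = sorted(
        prefix
        for prefix in hosted.keys() & upstream.keys()
        if hosted[prefix] != upstream[prefix]
    )
    return {"added": added, "removed": removed, "changed": changed}


# Made-up snapshot of what is currently hosted.
hosted = {
    "OMIM": "https://omim.org/MIM:",
    "Orphanet": "http://www.orpha.net/ORDO/Orphanet_",
}
# Made-up snapshot of what was just pulled from upstream.
upstream = {
    "OMIM": "https://omim.org/entry/",
    "Orphanet": "http://www.orpha.net/ORDO/Orphanet_",
    "CL": "http://purl.obolibrary.org/obo/CL_",
}

report = diff_prefix_maps(hosted, upstream)
for kind, prefixes in report.items():
    print(kind, prefixes)
# added ['CL']
# removed []
# changed ['OMIM']
```

A changed URI prefix for an existing CURIE prefix (OMIM in this made-up example) is the kind of entry a TWG member would probably want to look at most closely.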
Let me know what you think!