w3c/hcls-fhir-rdf

Lack of concept URIs for CodeableConcepts -- Concept IRIs

Closed this issue · 30 comments

Individual concepts do not necessarily have canonical URIs to identify them. See example. Should we do something about that? Should we concatenate the fhir:Coding.system with the fhir:Coding.code in some way, to produce a canonical URI for the concept?

Note that fhir:Coding.system must be a URI, while fhir:Coding.code must be a code, which may include spaces (as per the spec). Looking through the examples, it looks like system URIs are not intended to be used as prefixes. For example (from https://www.hl7.org/fhir/datatypes-examples.html#coding, but please point me to other examples to add here):

| Vocabulary | fhir:Coding.system | fhir:Coding.code | Actual prefix | Actual URL |
| --- | --- | --- | --- | --- |
| ICD-10 | http://hl7.org/fhir/sid/icd-10 | G44.1 | http://purl.bioontology.org/ontology/ICD10/ | http://purl.bioontology.org/ontology/ICD10/G44.1 |
| SNOMED CT | http://snomed.info/sct | 128045006:{363698007=56459004} | http://purl.bioontology.org/ontology/SNOMEDCT/ | http://purl.bioontology.org/ontology/SNOMEDCT/128045006 |

(Note that I wasn't able to find an online service that includes the more complex SNOMED code used in the example above.)

Possible outcomes: AFAICT, this means that we can't use the fhir:Coding.system as a prefix, and have to either:

  1. Provide a web service that can be queried with a fhir:Coding.system and fhir:Coding.code combination in order to return a concept IRI, or
  2. Create a definitive mapping of fhir:Coding.system values to IRI prefixes, such that SNOMED CT (with a fhir:Coding.system of http://snomed.info/sct) will always be mapped to http://purl.bioontology.org/ontology/SNOMEDCT/. This could then be concatenated with the fhir:Coding.code (we would need to decide if spaces should be encoded as + or %20) to provide a full concept IRI. This could be stored in GitHub so that changes to it can be tracked, and hopefully could eventually be integrated into the FHIR specification.
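Option 2 could look something like the following minimal sketch. The mapping holds only the two BioPortal prefixes from the table above, the function name `concept_iri` is hypothetical, and encoding a space as %20 rather than + is an assumption, since that choice is still open.

```python
from urllib.parse import quote

# Hypothetical mapping of fhir:Coding.system values to IRI prefixes.
# Only the two BioPortal prefixes from the table above are included.
SYSTEM_TO_PREFIX = {
    "http://hl7.org/fhir/sid/icd-10": "http://purl.bioontology.org/ontology/ICD10/",
    "http://snomed.info/sct": "http://purl.bioontology.org/ontology/SNOMEDCT/",
}


def concept_iri(system, code):
    """Return a concept IRI for a system/code pair, or None if unmapped."""
    prefix = SYSTEM_TO_PREFIX.get(system)
    if prefix is None:
        return None
    # Assumption: percent-encode everything outside the unreserved set,
    # so a space becomes %20 rather than +.
    return prefix + quote(code, safe="")


print(concept_iri("http://hl7.org/fhir/sid/icd-10", "G44.1"))
# http://purl.bioontology.org/ontology/ICD10/G44.1
```

Returning None for an unmapped system keeps the caller in charge of what to do when no prefix is known, which is the case for any internal coding system.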

It might be useful to make a list of every Coding system used in the FHIR examples, although such a list would not be exhaustive.

Can we put this into http://registry.fhir.org/ somehow? @gaurav to investigate.

Some healthcare systems also have their own internal coding systems -- how do we handle that?

Harold and I had decided that we could put them in if we had a mapping for them (assuming the mappings were reasonable to code generically). This means we can map SNOMED-CT, LOINC, etc. using pretty official URLs. Others, we could "host" in an HL7 namespace until the org behind them saw the value and said "gimme!" At that point, you have a bit of a prob 'cause you don't want to maintain utterly enormous tables of OWL:sameIndividualAs links. I suspect the answer there would be writing custom code for the platform stuck with obsolete URLs.

UMLS could provide some basis for a hosted namespace for un-Web-ified vocabs.

It might be useful to make a list of every Coding system used in the FHIR examples, although such a list would not be exhaustive.

I haven't had time to extract these yet, however, a list of system URIs that can be used in FHIR Codings is available at https://build.fhir.org/terminologies-systems.html

Some additional code systems are listed on the FHIR Terminology Service at http://tx.fhir.org/r5/ and on the HL7 Terminology Service at https://terminology.hl7.org/codesystems.html

I have learned a few more things:

  • FHIR has two concepts for referring to code systems:
    • A NamingSystem is "A curated namespace that issues unique symbols within that namespace for the identification of concepts, people, devices, etc. Represents a "System" used within the Identifier and Coding data types."
      • Each naming system has a list of identifiers, which can be of four types: oid, uuid, uri, other. It would be pretty cool if this had a prefix type as well!
    • A CodeSystem is "used to declare the existence of and describe a code system or code system supplement and its key properties, and optionally define a part or all of its content." This differs from a NamingSystem as described below:

      The CodeSystem resource declares the existence of a code system and its key properties including its preferred identifier. The NamingSystem resource identifies the existence of a code or identifier system, and its possible and preferred identifiers. The key difference between the resources is who creates and manages them - CodeSystem resources are managed by the owner or publisher of the code system, who can properly define the code system features and content. NamingSystem resources, on the other hand, are frequently defined by 3rd parties that encounter the code system in use, and need to describe the use, but do not have the authority to define the features and content. Additionally, there may be multiple authoritative NamingSystem resources for a code system, but ideally there would be only one authoritative CodeSystem resource (identified by its canonical URL) that is provided by the code system publisher, with multiple copies distributed on additional FHIR servers or elsewhere and used where needed.

      I interpret this to mean that a CodeSystem should define all the codes in a way that can be used for validation and concept mapping, while a NamingSystem describes a system used to define and maintain a CodeSystem. For the purposes of this issue, I think we are more interested in naming systems than code systems.

  • The FHIR standard describes the process used to determine a system code for an identifier or coding, which can be summarized as:
    1. Terminology.hl7.org (THO) - If a code system is listed here, the canonical CodeSystem URL SHALL be used.
    2. Not listed in Terminology.hl7.org, and the code system is external to HL7: the CodeSystem identifier authorized by the HL7 Terminology Authority (HTA) SHALL be used. One can be requested at https://jira.hl7.org/projects/HTA/issues/ (account required).
    3. Not listed in Terminology.hl7.org, and the code system is internal to HL7, and is expected to be used in a production system: create a canonical URL, then start a UTG request to have it added as a CodeSystem (https://confluence.hl7.org/display/VMAH/How+To+Submit+a+UTG+Change+Proposal).
    4. Not listed in Terminology.hl7.org, and the code system is intended to never be used in a production system, and will be used to create a value set bound with Example binding strength: use [ig-base-canonical]/CodeSystem/example-xxxxx.
    5. In the unusual situation where a code system is not resolved by this list, create a temporary identifier following this pattern: terminology.hl7.org/temporary/CodeSystem/xxxx. Contact the HL7 Vocabulary co-chairs.
  • The FHIR standard lists 32 externally published code systems and a further 14 code systems for genetics at https://build.fhir.org/terminologies-systems.html. It also defines dozens (hundreds?) of "code systems defined as part of FHIR", which appear to be code systems defined in the FHIR specification.
    • Some of these externally published code systems have a page in the FHIR specification explaining how they are to be used, such as FHIR 4.3.12 Using NDC and NHRIC Codes with FHIR, which has links to the code system documentation, code specification ("The 10 digit NDC code, with "-" included. Note that different NDC codes have different positions for the "-": 1234-5678-90, 12345-6789-0, or 12345-678-90. The "-" must be correct for each NDC code"), and the system URI (in this case, http://hl7.org/fhir/sid/ndc). However, this does NOT have information on potential prefixes.
  • Machine-readable definitions of code systems are stored in the hl7-terminology NPM (Node.js) package, installable from registry.fhir.org. The most recent version of this package is v2.1.0, which corresponds to FHIR R4, not FHIR R5.
  • FHIR NamingSystem information might also be accessible from http://tx.fhir.org/r5/NamingSystem/, but I haven't been able to figure out how to use that yet.

So, I think there are a series of potential solutions we can implement:

  1. The ideal solution would be to add prefix as an identifier type to NamingSystem.identifier.type and fill in prefixes for the 255 naming systems currently published to terminology.hl7.org. We can then use the hl7-terminology NPM package to read this information and fill in prefixes when given a system and code pair.
  2. If this is not doable, or would take too much time, we can temporarily include a list of these 255 naming systems in our fhircat tool with mappings to prefixes or other information regarding how to construct a concept IRI for coding systems. We can develop tools to compare our list with the list in the hl7-terminology package to check for unmapped naming systems.
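The comparison tool mentioned in option 2 might be as simple as the sketch below. It assumes the NamingSystem resources have already been loaded from the package as Python dicts; the two resources here are hand-written stand-ins, and the function names are hypothetical.

```python
def uri_identifiers(naming_system):
    """Collect the uri-typed uniqueId values of one NamingSystem resource."""
    return [
        uid["value"]
        for uid in naming_system.get("uniqueId", [])
        if uid.get("type") == "uri"
    ]


def unmapped_systems(naming_systems, system_to_prefix):
    """Return names of NamingSystems with no uri identifier in our mapping."""
    missing = []
    for ns in naming_systems:
        if not any(uri in system_to_prefix for uri in uri_identifiers(ns)):
            missing.append(ns.get("name", "(unnamed)"))
    return sorted(missing)


# Hand-written stand-ins for resources loaded from the package (assumption).
naming_systems = [
    {"resourceType": "NamingSystem", "name": "SNOMED_CT",
     "uniqueId": [{"type": "oid", "value": "2.16.840.1.113883.6.96"},
                  {"type": "uri", "value": "http://snomed.info/sct"}]},
    {"resourceType": "NamingSystem", "name": "NDC",
     "uniqueId": [{"type": "uri", "value": "http://hl7.org/fhir/sid/ndc"}]},
]
our_mapping = {"http://snomed.info/sct": "http://purl.bioontology.org/ontology/SNOMEDCT/"}

print(unmapped_systems(naming_systems, our_mapping))  # ['NDC']
```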

Do you all think this would cover all our needs?

NamingSystem -> non-authoritative third-party annotation about a code system
CodeSystem -> authoritative annotation by the publisher of a code system

Might want to have the prefix in CodeSystem -- there should only be one authoritative prefix/format for each code system

CodeSystem URLs are based on hl7.org (e.g. http://hl7.org/fhir/sid/ndc), but the goal is probably to replace this with an authoritative URL when the resource wants to take over.

Gaurav to dig into CodeSystem to figure out where the prefix could go there.

The prefix could potentially go into the CodeSystem.identifier, which is an Identifier with both an IdentifierType (named type) and IdentifierUse (named use). We might consider prefix as a potential value for use. There is also a generic CodeSystem.property field that we could use, but I think Identifier would be more specific.

So I think the next step is to write all of this up somewhere and then submit it to the FHIR writers to see what they think?

type might be better to use here, since it is Extensible -- we can make up new types as needed.

  • Write up this proposed usage of the Identifier field to store prefixes.
  • Figure out where to submit this for discussion (initially to the hcls-fhir-rdf group, and then to some sort of FHIR group?).
  • Develop an example of how the JSON version of the terminology could be used to convert prefixes into RDF:
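As a starting point for that last item, here is a minimal sketch of reading a prefix out of a CodeSystem and applying it to a Coding. The "prefix" identifier type does not exist in FHIR; it is the proposal under discussion, and encoding it as `type.text` is just one possible shape. The function names are hypothetical.

```python
import json

# Hypothetical CodeSystem fragment: the "prefix" identifier type does not
# exist in FHIR today; it is the proposal being discussed here.
CODE_SYSTEM_JSON = """
{
  "resourceType": "CodeSystem",
  "url": "http://snomed.info/sct",
  "identifier": [
    {"type": {"text": "prefix"},
     "value": "http://purl.bioontology.org/ontology/SNOMEDCT/"}
  ]
}
"""


def prefix_table(code_systems):
    """Map each CodeSystem.url to its prefix identifier, where one exists."""
    table = {}
    for cs in code_systems:
        for ident in cs.get("identifier", []):
            if ident.get("type", {}).get("text") == "prefix":
                table[cs["url"]] = ident["value"]
    return table


def coding_to_iri(coding, table):
    """Return the concept IRI for a FHIR Coding, or None if no prefix is known."""
    prefix = table.get(coding.get("system"))
    if prefix is None or "code" not in coding:
        return None
    return prefix + coding["code"]


table = prefix_table([json.loads(CODE_SYSTEM_JSON)])
coding = {"system": "http://snomed.info/sct", "code": "128045006"}
print(coding_to_iri(coding, table))
# http://purl.bioontology.org/ontology/SNOMEDCT/128045006
```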

I downloaded and executed the code in https://github.com/HL7/UTG using Java 11. It generated the HTML documentation you see at https://terminology.hl7.org/. In doing so, it appears to use both tx.fhir.org (“Connect to Terminology Server at http://tx.fhir.org”, “-tx: Connect to http://tx.fhir.org/r4”) and hl7-terminology (“Installing hl7.terminology#3.0.0 to the package cache”; I haven’t figured out where that cache is). I'll open an issue at https://github.com/HL7/UTG to hopefully get to the bottom of this, and am hoping that other FHIRCat team members like @ericprud or @dksharma might know as well.

Once I figure out how to modify those CodeSystem/NamingSystem files, I'm planning to create a (forked?) repository with prefixes added to some of those files, and write a little demonstration tool that uses that information to convert FHIR codings into RDF concept URLs and vice versa.

In the meantime, I'm also writing up a more formal description of this issue and possible solutions. This might be useful later on if we do need to explain what we're doing to people outside our team. I'll set it to be view-only since I'm posting that URL publicly, but please do request editing rights to that document if you would like to help!

Current strategy:

  • Step 1. Make sure that modifying the UTG files works.
  • Step 2. Find 10-15 coding system/naming systems with authoritative URLs.
    • SNOMED
    • Pick ones from OBO Foundry (disease ontology, BFO)
    • Talk to Regenstrief about LOINC
    • NDC, CPT, FDB
  • Step 3. As per Grahame's comment, propose these changes in chat.fhir.org and discuss them.

Note that the fallback plan -- if HL7/FHIR refuse to put this into terminology.fhir.org -- would probably be to maintain this list separately.

Make sure that this works with US Core terminology: http://www.hl7.org/fhir/us/core/terminology.html -- they require specific URLs in that system, so we don't want to overwrite that or mess with it.

  • Would be useful to check with a CTS-2 expert to see how they would set up a server to map server/code to concept URIs. Would be redundant with a FHIR Terminology server, but might have some additional features that could be useful.
  • Multiple FHIR terminology servers exist (https://digital.nhs.uk/services/terminology-servers, https://www.healthterminologies.gov.au/access/), so:
    • Getting it into the FHIR spec might be enough for other Terminology servers to pick it up, but we need to keep an eye on that
    • We will need to differentiate between SNOMED-CT International/UK/Australia

Here are eight candidates for coding system/naming systems mentioned in the FHIR R5 examples that we can provide prefixes for:

| Resource | System URI | Prefix | Example |
| --- | --- | --- | --- |
| SNOMED CT | http://snomed.info/sct | http://snomed.info/id/ | 385221006 |
| LOINC | http://loinc.org | https://loinc.org/ | 10160-0 |
| ISO 3166 | urn:iso:std:iso:3166 | https://www.omg.org/spec/LCC/Countries/ISO3166-1-CountryCodes/ | CA (not resolvable, but RDF file at prefix) |
| DICOM | http://dicom.nema.org/resources/ontology/DCM | http://dicom.nema.org/resources/ontology/DCM/ | 110127 (not resolvable, but see BioPortal) |
| RxNorm | http://www.nlm.nih.gov/research/umls/rxnorm | http://purl.bioontology.org/ontology/RXNORM/ | 1160593 |
| MeSH | https://meshb.nlm.nih.gov/ | https://id.nlm.nih.gov/mesh/ | D000328 |
| PubMed | https://pubmed.ncbi.nlm.nih.gov | https://pubmed.ncbi.nlm.nih.gov/ | 32876694 |
| NCBI Nucleotide | http://www.ncbi.nlm.nih.gov/nuccore | https://www.ncbi.nlm.nih.gov/nuccore/ | NC_000009.11 |

All of these have ten or more mentions in the FHIR R5 examples, so we could further check on resolvability by (for example) looking up all the referenced codes to see if they work as expected.
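With system URIs and prefixes like those in the table above, converting in both directions might look like the sketch below. Only three rows are copied in, the function names are hypothetical, and the reverse lookup assumes that a concept IRI is always the prefix followed directly by the unmodified code.

```python
# System URI -> IRI prefix, copied from three rows of the table above.
PREFIXES = {
    "http://snomed.info/sct": "http://snomed.info/id/",
    "http://loinc.org": "https://loinc.org/",
    "http://www.nlm.nih.gov/research/umls/rxnorm": "http://purl.bioontology.org/ontology/RXNORM/",
}


def to_iri(system, code):
    """Concatenate the prefix for system with code; None if system is unmapped."""
    prefix = PREFIXES.get(system)
    return None if prefix is None else prefix + code


def from_iri(iri):
    """Recover (system, code) from a concept IRI; None if no prefix matches."""
    # Try longer prefixes first so that a prefix nested inside another
    # cannot shadow the more specific one.
    for system, prefix in sorted(PREFIXES.items(), key=lambda kv: -len(kv[1])):
        if iri.startswith(prefix):
            return system, iri[len(prefix):]
    return None


iri = to_iri("http://loinc.org", "10160-0")
print(iri)            # https://loinc.org/10160-0
print(from_iri(iri))  # ('http://loinc.org', '10160-0')
```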

@ericprud @balhoff You both have a lot more experience with RDF prefixes than I do, so if you see something I can do better here, please let me know!

  • @balhoff knows a better prefix for SNOMED CT; we should update this comment with that.

Weekly update:

  • I've modified a fork of https://github.com/HL7/UTG to insert prefixes into the SNOMED CT CodeSystem and the SNOMED CT NamingSystem. This didn't work out quite as planned, for two reasons:
    • CodeSystem.identifier.type isn't copied over into the output NPM package, so I had to note that this was a prefix in the CodeSystem.identifier.system field instead.
    • NamingSystem.uniqueId.type is only allowed to have one of four values, so I had to note that this was a prefix in the NamingSystem.uniqueId.comment field instead.
  • I've confirmed that this information ends up in the generated documentation as well as the generated NPM "package".

Next steps:

  • Write a JavaScript tool that can convert FHIR examples in JSON back-and-forth to RDF using this information.
  • Do this for all of the FHIR examples in the previous comment.
    • Done for all except MeSH, PubMed and NCBI Nucleotide, which aren't described in UTG's source of truth directory.
  • Complete write-up to FHIR chat explaining all this, with links to the worked examples, and see what they say.

Tasks further down the line:

Re: the SNOMED 128045006:{363698007=56459004} compositional syntax, just URL-encoding it for now seems fine. But note that this is unneeded in FHIR, since you can express this in other ways. Also: it's good to push people towards prefixes rather than trying to do this in a more complicated way.
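For reference, URL-encoding that expression with Python's standard library looks like this. Treating every reserved character as unsafe (`safe=""`) is an assumption on my part; a looser policy would leave some of these characters alone.

```python
from urllib.parse import quote, unquote

expression = "128045006:{363698007=56459004}"

# Percent-encode everything outside the unreserved set.
encoded = quote(expression, safe="")
print(encoded)  # 128045006%3A%7B363698007%3D56459004%7D

# The encoding is reversible, so the original code can be recovered.
print(unquote(encoded) == expression)  # True
```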

Do we need to canonicalize blank spaces/pipes/etc in the code value? Probably not -- we can leave them as is and leave it to downstream processing.

I've uploaded to Google Drive the lists of all system codes in R4/R5 (system-codes-r[45].tsv) and the unique system/code pairs (unique-codes-r[45].tsv). I'm trying to figure out some way to validate whether the IRIs being generated are correct -- for now, I'm trying to see whether those IRIs are resolvable (resolved-r[45].tsv). For the FHIR JSON examples for R5, I got 370 unique system values with a total of 1,968 unique system-code pairs, of which I could generate 789 concept IRIs using the five examples described above. Out of 789 IRIs I attempted to resolve, I got 671 successes (HTTP 200), 112 not found (HTTP 404), 3 server errors (HTTP 500) and 2 request timeouts. So it looks like this approach might be worth pursuing? Some of those 404s are IRIs that are not intended to resolve, so we might want to try resolving them against the OLS instead.
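A resolvability check along these lines could be sketched as follows. This is not the script that produced the numbers above: the function names are hypothetical, redirects are simply followed by `urllib`, and the demonstration uses a stub with made-up statuses instead of real HTTP requests.

```python
import urllib.error
import urllib.request
from collections import Counter


def http_status(iri, timeout=10):
    """Return the HTTP status for a GET on iri, or None if no response arrived."""
    try:
        with urllib.request.urlopen(iri, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as HTTPError; keep their code.
        return err.code
    except OSError:
        # Covers URLError, connection failures, and timeouts.
        return None


def tally(iris, fetch=http_status):
    """Count how many IRIs returned each status (None means no response)."""
    return Counter(fetch(iri) for iri in iris)


# Offline demonstration: a dict stands in for the network (made-up statuses).
stub = {"http://example.org/a": 200, "http://example.org/b": 404, "http://example.org/c": 200}
print(tally(stub, fetch=stub.get))  # Counter({200: 2, 404: 1})
```

Passing the fetch function in as a parameter makes it easy to swap plain HTTP resolution for a lookup against something like the OLS later.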

I'm going to pause the software development work here to finish writing up the problem discussion I was working on earlier so we can check to see if there's anything missing here.

I've updated the files (see Google Drive directory and resolved-r5 sheet) to include the display field from the FHIR Examples.

I've been writing up a brief summary of the problem and our proposed solution on Google Docs -- you can only comment on the document with that link, but please do request editor access if you'd like to help make it better and prepare it for submission to the FHIR chat! Before we submit it there, I'd love to link to it from HL7/UTG#7 and ask Chris Mungall to have a look at it, as he might be interested in this as well.

As per our discussion last Thursday, I've asked chat.fhir.org for suggestions on sources of Coding.system/code pairs that are in use "in the wild": https://chat.fhir.org/#narrow/stream/179202-terminology/topic/Getting.20lists.20of.20CodeSystem.2FNamingSystems.20currently.20in.20use

Grahame suggested checking system/code pairs from Synthea, which is available as software code (https://github.com/synthetichealth/synthea) or synthetic data sets (https://synthea.mitre.org/fhir-api).

  • look at BioPortal IDs
  • Can we look at the codes in UMLS?
  • use flatIRIStem and hierarchicalIRIStem as separate properties to indicate which algorithm we want people to use
  • look into how CodeSystem and NamingSystem would incorporate that

Putting IRI stems into the HL7 repo would only be adding identifiers to that repo, so it does not need to be R5 balloted. But we do need to change the spec for R5 to say that "if the concept IRI is known, then add it to the RDF".

On today's call we made two decisions:

  • AGREED: All agreed to confirm that we will provide commitment and support for managing the IRI stems: 1. Provide an initial set of IRI stems. 2. In the future, if we learn of an IRI stem for something that's in terminology.hl7.org, we'll add it. 3. If HTA adds a new code system, they can ask us for an IRI stem.
  • AGREED: Add RFC 3987 (IRIs) to the table at https://build.fhir.org/identifier-registry.html

Now that TSMG and the RDF subgroup have both voted on this, I think these are the next steps:

  1. Add IRIs as an identifier system. I thought this might require modifying the "Identifier Registry" page on FHIR (https://build.fhir.org/identifier-registry.html), but as per https://jira.hl7.org/browse/FHIR-17440 it looks like we need to submit a UTG ticket for this.
  2. I like the idea of submitting the change for a single CodeSystem (e.g. SNOMED) so we can make sure we're complying with UTG's change guidelines correctly.
  3. Once that's done, we can either make a single large change with all the IRI stems we can find for current external terminologies on terminology.hl7.org, or make separate changes for each IRI stem. We can use a Google spreadsheet to coordinate this work. Since the RDF subgroup is currently busy with R5 balloting changes, we'll probably start work on this in earnest once we hit the R5 ballot deadline in a few weeks.

Done, though addition of some more IRI stems continues.