pato-ontology/pato

Align or axiomatize PATO/BFO/UO with qudt

Opened this issue ยท 24 comments

This is emerging as the standard for units on the semantic web: http://qudt.org/

How does this affect PATO? Here are some possibilities:

  1. We ignore qudt, as it's not OBO-esque (e.g. units and 'attributes' are individuals)
  2. qudt:Unit replaces UO, and we have unit-of relationships connecting PATO and qudt; PATO is otherwise unaffected
  3. PATO engages more deeply with the qudt:Quantity model, leading to either an axiomatization of one in terms of another, or even a replacement of a chunk of PATO and PATO modeling paradigms with qudt.

I'll briefly explore the last option here. Note the motivation here is primarily pragmatic, not philosophical. Some familiarity with contents of PATO assumed, and in particular with the use cases it aims to solve (which we should really have documented elsewhere...)

Modeling differences

The QUDT schema is shown here:

model

A rough alignment to PATO/BFO/OBO Ontologies is:

  • qud:Quantity corresponds to the value subset of PATO
  • qud:QuantityKind corresponds to the attribute subset of PATO
  • qud:Unit corresponds to UO (using individuals for each unit)
  • qud:QuantityValue has some relationship to IAO measurement datum

Quantities vs qualities

The first difference that should be noted is that PATO's scope includes many inherently non-quantifiable or hard to quantify qualities. There is no 'shape' in qudt. However, it may still be possible to encompass complex multi-dimensional qualities like this using some morphometric formalization.

Another difference is the use of classes vs instances. PATO forms a simple subclass hierarchy. E.g.

 / PATO:0000001 ! quality
  is_a PATO:0001241 ! physical object quality
   is_a PATO:0001018 ! physical quality
    is_a PATO:0001906 ! movement quality
     is_a PATO:0002242 ! velocity
      is_a PATO:0001413 ! angular velocity *** 

In qudt, quantities Velocity and 'Angular Velocity' (I'll use the qudt upper class labels to distinguish from lowercase PATO) are individuals. They are linked by the qudt:generalization OP. The use of SubClass is used to place these individuals into broad categories. For example, Volume is of type 'Spaces and Time' (more will be said of the naming conventions later).

Quantity Kind
   <--[SubClassOf]-- Space and Time
                                      ^
                                      +----[rdf:type]--- Volume <----[generalization] Angular Velocity

While this is clearly has different semantics with different implications for reasoning, it's not clear it's fundamentally incompatible - in that the two systems could live side by side with axioms connecting them.

The use of an ObjectProperty in place of SubClass semantics could be pragmatically difficult for us as we are reliant on OWL reasoning. Some portion of this could be recapitulated via property chains, but this is would probably be insufficient for many PATO use cases, which include axiomatization of phenotype and trait ontologies where are inherently class-based.

One possibility here is to pun qudt and reconstitute the generalization assertions as SubClassOf axioms. The more pragmatic route will likely to be to maintain PATO classes, but to axiomatize them using some pattern such as PATO:Q EquivalentTo P Value qudt:Q. P can be an invented property here, the purpose is to maintain a correspondence (and potentially use for inference). If an interpretation is sought then this can be thought of as a kind of set-extent relation that projects from a class to an individual that is the mereological sum of all instances of that class, and the generalization relation can be interpreted as part-of.

The larger difference here is in the distinction between Quantity and Quantity Kind. The qudt notion of Quantity corresponds well to the value subset of PATO. It is even compatible with realism, for those who care. qudt Quantities can be thought of as mind-independent. However, it is not BFO compliant in that it forces a determinable-determinant distinction (recall in PATO values are subclasses of attributes). This is a topic larger than this ticket. From a pragmatic POV, we can note two things: (1) a shadowing strategy similar to the one above can be followed, and (2) many PATO values are ontological oddities like 'increased length' that are best represented relationally, for example as representing some kind of change between two quantities.

Naming Conventions

Of note there are naming convention differences. qudt uses captitalization, and sometimes arbitrarily CamelCases's.

These could easily be handled with an automatic rewrite (were we to decide to use qudt directly)

There is also the use of labels such as 'Biology' for qudt:Biology when the more appropriate label would be 'Biological Quality Kind'. Again, these could be ignored or rewritten - an annoying but fundamentally minor issue.

Biology-specific quantities in UO

qudt has a class 'Biology' (think of this as being labeled 'Biological Quantity Kind', see above). It has only 4 instances:

  • HeartRate
  • Microbial Formation
  • Respiratory Rate
  • Serum or Plasma Level

Leaving aside the fact these are clearly not instances of 'Biology', this seems a bit ad-hoc. There is no axiomatization of HeartRate, no connection to a generic rate (and thus no abstraction connecting HeartRate and 'Respiratory Rate')

It seems best to leave the biological part alone. We have coverage of this in OBA using EQ patterns.

Most bio users would therefore prefer to use OBA for biological quantities (our strength) and instead leverage qudt for physical, chemical non-bio quantities (their strength).

Measurements vs values

ud:QuantityValue has some relationship to IAO measurement datum, but the difference is that we could have multiple measurements of the same quantity or value. There is not a direct cognate of QuantityValue in IAO or OBI at the moment.

Steps forward

For practical reasons, PATO classes are likely to stick around. However, qudt with it's better support of physical/chemical qualities may form a better basis on which to axiomatize PATO.

Some users may wish to ditch UO altogether, but at the least there should be some kind of linkage between these.

  • loose lexical mapping between PATO classes and qudt:Quantity instance and classes
  • experiment with a logical form for these mappings - e.g. the = R Value pattern above
  • use this to detect inconsistencies between the two, and to provide automatic rewriting between the OBO world and the wider semweb world

I took some time to look into both QUDT and OM, which is produced by Hajo Rijgersperg at U of Wagenigen in the Netherlands: https://github.com/HajoRijgersberg/OM . OM has been developed as an offspring of food research needs (http://www.foodvoc.org/page/om-1.8) , but takes on a broad scope of units like QUDT. Hajo compares OM with QUDT in this paper: http://www.semantic-web-journal.net/sites/default/files/swj177_7.pdf . Another comparison paper puts OM out front in quantity of unit terms: http://www.semantic-web-journal.net/system/files/swj1825.pdf

Here's a comparison diagram I based on Hajo's diagraming, which focuses on specifying values of measurables/estimates etc. ('has unit label' ~= 'has measurement unit label').

OBI-UO and OM and QUDT

Just found this thread.

A few response and comments on the concerns raised above:

  1. Yes, the useful OM and QUDT resources are instances, not classes. So punning would be needed to support OBO-style reasoning. The underlying models are all the same of course.

  2. OM was basically a result of a PhD project, which finished about 3 years ago. It is maintained publicly, but now a bit sleepy - see https://github.com/HajoRijgersberg/OM/issues. I found a few areas where OM risked straying out of its lane, which in turn led me to worry that the scope and theoretical basis was a bit muddy. e.g. HajoRijgersberg/OM#6

  3. Meanwhile, I've been making some contributions to QUDT. QUDT was originally funded by NASA and had a more engineering focus, but NASA's funding ended about 6 years ago. QUDT got stuck in a maintenance-trough, and unfortunately it was not being done in public. That changed about a year ago - the maintenance repo is: https://github.com/qudt/qudt-public-repo and all the vocabularies are delivered as 'linked data' (i.e. per URI) - see http://www.qudt.org/doc/DOC_VOCAB-UNITS.html

  4. Yes, the localID and naming in QUDT was a mess, inconsistent and very US-biased (e.g. 'metric-ton' rather than 'tonne') and the upper-case convention leads to some noise (GM for gram, because G is for Gal). It is being cleaned up. But then again, I don't think OBO-people would be worrying too much about URI fragments ;-) Lets just treat them as opaque.

  5. there remain a few errors in the QUDT dataset, but there is now a small team working through those, https://github.com/qudt/qudt-public-repo/issues and growing engagement from external participants - e.g. look at the varied contributors of issues that have been resolved https://github.com/qudt/qudt-public-repo/issues?q=is%3Aissue+is%3Aclosed

  6. The papers cited above, that compare OM and QUDT, related to version 1 of QUDT. The coverage in v2 is much better, which I think would reverse most of the comparisons. QUDT currently has ~1500 units in its vocabulary. Some of these are strange combos of legacy systems (BTU, pounds-force!) so the useful count is much smaller. Nevertheless, it is the most comprehensive list that I'm aware of.

  7. Both QUDT and OM include currencies, which is clearly silly.

Thanks for the interest in OM! Just ran into this thread and I want to read all of it. I was pointed to the comment of dr. Short Hair above. Please allow me to state a few corrections: OM contains the same order of amount of units as QUDT, and - as far as we know - contains no errors whatsoever. Thought it would be good to let you know.
I hope to read further in this very interesting thread soon! I'll keep you up to date! :)

I have studied your great diagram above, Damion, https://user-images.githubusercontent.com/4000582/56243972-a72e7280-6050-11e9-8277-fcad93ad58e7.png. It seems that OBI+UO and OM resemble a lot, as far as the definition of classes is concerned. For example, 'scalar value specification' matches to om:Measure. What is more, similar to the approach in OBI+UO, the om:hasValue relation is restricted on class level, om: Quantity, to om:Measure (among some other classes). This makes OBI+UO and OM look even more alike.
Moreover, OM also has classes such as om:LengthUnit, like UO. These classes are subclasses of om:Unit. Question, I don't see the class 'unit' in the diagram for OBI + UO; doesn't UO have the class 'unit'?

Differences I see have to do with:

  • Relationships in opposite directions, e.g. 'has quality' compared to om:hasPhenomenon.
  • Is in UO a unit, e.g. kilogram, a class? In OM it's an instance.

UO (https://github.com/bio-ontology-research-group/unit-ontology) does have a top level "unit" class, and has a few versions - one with all units as instances and the other with units as classes. Looks like I could have given "length unit" a parent of "unit".

image

So at the class level, yes I think OM and UO are quite similar. I know OM has modelled out units in terms of SI base units, and so has more smarts about how to do unit conversions?!

P.s. for reference, the OM ontology home: https://github.com/HajoRijgersberg/OM

Thanks, Damion! Clear. With that 'unit' class indeed UO and OM look even more like each other.
Indeed. OM expresses units in terms of base units (SI base units are used for that). It also states the composition of compound units in a formal way, such as that m/s has a numerator 'metre' and a denominator 'second'. Hence indeed conversion of units is supported.

OM contains the same order of amount of units as QUDT

Perhaps I'm not looking in the right place?

However, I don't think this is a competition ;-). The different ontologies are all transformable (the underlying models are necessarily the same), and we have several parallel options in practice for the good reason of history and community expectations. So the coverage does not overlap completely for this reason. Our (larger) community goal should be to

(i) improve quality in all of them, in both the model/ontology and the catalogues of individuals or classes (cross comparisons such as @jmkeil's https://github.com/fusion-jena/abecto-unit-ontology-comparison is a big help here)

(ii) develop rules and scripts to allow us to align and harvest between them

(iii) advertise and educate the broader community that units is a solved problem, so future efforts would be better put into improving the quality of the existing offers, and not in creating yet more options that are essentially duplicative.

OM has 277 individuals of rdf:type om:Unit in https://github.com/HajoRijgersberg/OM/blob/master/om-2.0.rdf

No, you should also count the different compound units (unit multiplications, unit divisions, and unit exponentiations) and prefixed units. In total the same order of amount as QUDT. (See also the publication of Jan Martin.) As a matter of fact, huge ranges of prefixed units and compound units composed of different prefixed units still have to be defined. That must be the case with QUDT too.

And indeed, it's not a competition! :)
Fully agree with (i) and (ii). Not sure about (iii) - perhaps an even better solution is required, I'm not sure.

Good.

IMO the static representations would be better thought of as a 'cache'. Strictly, only the primitive ones need be built manually, and the general case should be to generate semantic descriptions for all the compound units and most of the derived units on-the-fly from these. This is essentially the UCUM approach - only the terminals are stored, and all combinations are generated by rules.

BUT - in semantic/linked-data applications, stable URIs are essential. So dynamically generated representations of units would need a rule for URIs, and thus non-opaque URIs. Which is not the normal OBO approach. Right @cmungall ?

Sounds interesting, but to understand you 100%, what do you mean with 'cache'? And with 'the primitive ones' and 'terminals' you mean the base units such as metre, second, etc.?
Derived units such as joule, newton, etc. can not be generated automatically.
Agreed that stable URIs are essential. For that reason I think we should explicitly define all prefixed units and compound units, since (as far as I know - correct me if I'm unaware) semantic languages do not support the formulation of rules for URI creation.

By 'cache' I mean the units that could be computed, but are actually instantiated as static members of a vocabulary such as the QUDT units catalogue. By referring to them as a 'cache' I'm emphasizing that they should be understood in that way - they are a pre-computed cache of individuals that are known to be used, but which could have been generated on-demand. By having them in a cache it allows some additional annotations and links to be included, and maybe provides some performance optimization.

UCUM is the most mature precedent for the set of 'primitive' ones that must be built manually. The UCUM 'essence' is its list of the things ('terminals') that can't be computed by combination of other terminals using the UCUM production rule. It is composed of

  • 20+4 prefixes
  • 7 base-units (matching SI)
  • 303 other units

This is not complete, but gives a sense of the scale of the required set. (QUDT adds a few more prefixes for the information quantities (Kibi, Mebi, Gibi, etc) )

Clear, thanks.
Still, derived units such as joule, newton, etc. can not be generated automatically. How do you see that?
And because stable URIs are essential, we should explicitly define all prefixed units and compound units, since semantic languages do not support the formulation of rules for URI creation. How do you see that issue?
The amount of primitive prefixes and units in UCUM is indeed roughly that of OM, where OM just like QUDT also includes the IT prefixes.

joule, newton, etc. can not be generated automatically

Indeed. They have to be part of the static set.

On the stable URIs issue - the thing I'm most concerned about is that OBO (which includes PATO/BFO/UO) has a strong preference for opaque URIs. However, because of the combinatorial requirement in units, and the fact that it is an unbounded set but in a predictable way, I suspect that non-opaque URIs are necessary. So I'd like this issue surfaced sooner rather than later.

(BTW - I now see that my count of OM units failed to account for sub-classes. The sizes of the corpuses are very similar, which is reassuring. QUDT has more derived units including some fairly random combos, while OM has a more systematic accounting of the prefixed-units, but a much smaller number of derived units.)

Hey Simon, just checking for my full understanding: an opaque IRI is an IRI with human-understandable terms in it? And a non-opaque IRI is like a code (which human beings can't understand, with no specific meaning)?
If I understand this correctly, I must say I think opaque IRIs are important.. Could there perhaps be a way that concepts in UO will both have a non-opaque and an opaque IRI? (Maybe this is just wild thinking, but I'm very interested in any thoughts about this.)

(BTW: sounds good! :) I think with Kai's algorithms he will create a lot more prefixed units and compound units!)

I think with Kai's algorithms he will create a lot more prefixed units and compound units!

Possibly. But I'm not sure that the algorithm should be run to just create (and cache) all the combos willy-nilly. You don't really want a catalogue of 50,000 units of which only 1,000 ever appear in actual datasets, do you?

Pulling @satra in here since we have poked at the issue of units a number of times when qudt was still in the trough. Creating a catalogue of precomposed units was very much a non-starter for us as well.

I agree. We need coverage of units devices record in, and units humans talk to each other about and share in data. But all other units would only arise as ephemeral intermediate constructs i.e. in intermediate math that doesn't need a name for such constructs.

satra commented

thanks for the ping @tgbugs - if it helps i provide a brief summary of where we went with units.

we went away from the route of enumerating units and unit combos to treating a unit as a special string literal. we decided on using this https://people.csail.mit.edu/jaffer/MIXF/CMIXF-12 and implemented a python parser https://github.com/sensein/cmixf . for us this became more about practicality than ontology as we can write validators and converters around this.

this is not going to satisfy every use-case since it focuses on SI units, but we also needed something that standardizes use in a specific context (the BIDS system). even in our case the units for weight, height, and age remain non-standardized for now. further, age itself has additional considerations of reference value, where the unit is insufficient to disambiguate information.

Personally, I do think we should define "all" possible compound (prefixed) units. If people have to define them themselves, we will encounter similar definitions, that just deviate a bit from each other, especially in their URI (especially if the URIs are opaque, but also with non-opaque URIs formed using rules people will make mistakes).

If all possible combinations are constructed then wouldn't there need to be a function to take the composition rule and the input units and turn it into an identifier to make it possible for people to find the right opaque identifier?

Yes, I think so. That could definitely be automated. Kai could incorporate it into his algorithms, although it anticipates the opaque/non-opaque discussion.

For those opening this old thread, latest work to provide non-opaque, normalized UCUM units, meant to bridge UO, OM, etc. : https://units-of-measurement.org/ and https://github.com/units-of-measurement