INCATools/dead_simple_owl_design_patterns

Extend DOS-DPs to support instance graph templates

Closed this issue · 32 comments

We need to co-ordinate the semantics of LEGO models with each other and with the semantics of the various ontologies used. To support this, curators need access to templates which they can insert into their models and which include variable slots with specified constraints. These templates should be specified in the same documents as design patterns, allowing them to share dictionaries and variables. As for design patterns, the aim here is to maximise ease of reading and editing while maintaining ease of parsing. To this end, the spec uses lookup dicts to allow quoted, human-readable names to be used to specify axioms.

The following draft specification extends DOS-DP schema core:

instance_graph:
    nodes: 
         $name : $type
         ...
    edges:
         [ [$subj_node, $rel, $obj_node], ... ]

Where:

  • $name is a readable name assigned to a node solely for the purpose of specifying this pattern
  • $rel is a quoted name that is a key in the pattern's relation dictionary
  • $type is either a quoted name that is a key in the pattern's class dictionary OR a var name. Anonymous class expressions are not supported.
  • $subj_node is the quoted name of the subject node
  • $obj_node is the quoted name of the object node.
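
A minimal sketch of how a consumer might check the constraints listed above, assuming the pattern has already been loaded into a plain Python dict. The function name `check_instance_graph` and the placeholder IDs are hypothetical and not part of the spec.

```python
def check_instance_graph(pattern):
    """Return a list of constraint violations for pattern['instance_graph']."""
    classes = pattern.get("classes", {})
    relations = pattern.get("relations", {})
    variables = pattern.get("vars", {})
    graph = pattern["instance_graph"]
    nodes = graph["nodes"]
    errors = []
    for name, node_type in nodes.items():
        # $type must be a key in the class dictionary OR a var name.
        if node_type not in classes and node_type not in variables:
            errors.append(f"node {name!r}: unknown type {node_type!r}")
    for subj, rel, obj in graph["edges"]:
        # $rel must be a key in the relation dictionary.
        if rel not in relations:
            errors.append(f"edge rel {rel!r} is not in the relation dictionary")
        # $subj_node and $obj_node must be declared under nodes.
        for node in (subj, obj):
            if node not in nodes:
                errors.append(f"edge node {node!r} is not declared in nodes")
    return errors


# Placeholder IDs (assumption): real IDs are not given here.
example = {
    "relations": {"has effector": "RO_x"},
    "classes": {"receptor activity": "GO_x"},
    "vars": {"effector": "'biochemical activity'"},
    "instance_graph": {
        "nodes": {"receptor1": "receptor activity", "effector1": "effector"},
        "edges": [["receptor1", "has effector", "effector1"]],
    },
}
print(check_instance_graph(example))
```

A check like this would also be the natural place to flag name clashes between class dictionary keys and var names.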

CC @cmungall @balhoff @thomaspd

Just to clarify, the actual yaml will not have $s. However, there will be %s interpolated in the standard way for dosdps?

So once interpolated we end up with a yaml representation of an abox, with the built in assumption that the keys in the nodes dictionary map to blank nodes

minor comment, we now have 3 semantically more or less equivalent ways to specify an abox fragment:

  • a standard RDF serialization (ttl, json-ld, ...)
  • minervaJSON
  • interpolated dosdp yaml

Just to clarify, the actual yaml will not have $s.

Correct. Just using this as a compact way to specify. (Also working on JSON-LD spec)

However, there will be %s interpolated in the standard way for dosdps?

No need for string interpolation as there are no strings to be interpolated. This is the key bit of spec:

$type is either a quoted name that is a key in the pattern's class dictionary OR a var name

It's up to maintainers of the Design pattern / template to avoid any name key clashes between owl entities and variable names.

So once interpolated we end up with a yaml representation of an abox, with the built in assumption that the keys in the nodes dictionary map to blank nodes.

Don't see any need for blank nodes. nodes in the dict are individuals in the LEGO model.

minor comment, we now have 3 semantically more or less equivalent ways to specify an abox fragment:

  • a standard RDF serialization (ttl, json-ld, ...)
  • minervaJSON
  • interpolated dosdp yaml

I did wonder about re-using an existing standard, but want to keep the easily human-readable pattern. I could make the edges look more like OBO graphs though - having explicit keys subj, pred & obj to represent triples rather than just using tuples. These might be slightly intimidating to non-geeks though.

So once interpolated we end up with a yaml representation of an abox, with the built in assumption that the keys in the nodes dictionary map to blank nodes.
Don't see any need for blank nodes. nodes in the dict are individuals in the LEGO model.

Sorry, I wasn't very clear. I think formalizing in terms of blank nodes/existentials has some advantages. There is no need to invent any new syntactic elements. The entire interpolated pattern is representable in any other RDF syntax. Of course, we choose to replace blank nodes with unique fresh IRIs on the server but this can be separated as an implementation decision.

MinervaJSON also ends up re-inventing blank nodes too. In my mind it would have been cleaner to map directly to RDF, not sure if @balhoff agrees.

Can you give an example - perhaps transducer from MF refactor?

Will post transducer example shortly. Still don't follow re: blank nodes. If everything in LEGO is an individual, we don't need to represent existentials.

Of course, another alternative is to continue to use Manchester Syntax strings + interpolation. It just seems more efficient to represent simple OPAs using a data structure.

Still don't follow re: blank nodes. If everything in LEGO is an individual, we don't need to represent existentials

You have to name your individuals (assuming you have co-references)

You can't name them using URIs (well you could, but we want the server to mint URIs)

blank nodes provide this ability

Did you also ask about the minerva JSON format? @kltm or @balhoff do you have the link?

You have to name your individuals (assuming you have co-references)

They are named internally for the purpose of specifying their use in the pattern:

$name is a readable name assigned a node solely for the purpose of specifying this pattern

You can't name them using URIs (well you could, but we want the server to mint URIs)
blank nodes provide this ability

Isn't this just an implementation issue? It should be up to the software using a template (e.g. noctua/minerva) to generate URIs for individuals when a template is inserted into a model.
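
To make the point concrete, here is a rough sketch of what a consuming tool could do when a template is inserted, assuming nodes and edges in the draft form above. The base IRI, the node names and the helper names are all made up for illustration; this is not how noctua/minerva actually mints identifiers.

```python
import uuid


def mint_iris(nodes, base="http://example.org/model/"):
    """Map each pattern-local node name to a freshly minted IRI."""
    return {name: f"{base}{uuid.uuid4()}" for name in nodes}


def instantiate_edges(edges, iri_for):
    """Rewrite subject and object names to minted IRIs; rel is left as-is."""
    return [[iri_for[subj], rel, iri_for[obj]] for subj, rel, obj in edges]


# Hypothetical template content.
nodes = {"n1": "some class", "n2": "some var"}
edges = [["n1", "part of", "n2"]]

iri_for = mint_iris(nodes)
triples = instantiate_edges(edges, iri_for)
print(triples)
```

In this sketch the pattern-local names only serve to wire up co-references inside the template; the model itself only ever sees the minted IRIs.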

Am I missing something?

kltm commented

@balhoff may have a better memory for what's in there, but I don't believe that the format had a "formal" specification for the graph parts; rather, it grew out of the boundaries we imposed on the nested response format. Copious tests and examples (at least for the client) are currently held at https://github.com/berkeleybop/bbop-graph-noctua/tree/master/tests. The library is a subclass of the graph library that we otherwise use.

kltm commented

@cmungall
MinervaJSON also ends up re-inventing blank nodes too. In my mind it would have been cleaner to map directly to RDF, not sure if @balhoff agrees.
What blank nodes are currently in use?
Also, we made the specific design decision to not model it on RDF.

This is what I had in mind:
https://github.com/berkeleybop/bbop-manager-minerva/wiki/MinervaRequestAPI

The request part of this specifies how a client constructs an ABox graph on the server.

kltm commented

Ah! I thought you were thinking of the response, rather than the request. Never mind.

From this:
https://github.com/berkeleybop/bbop-graph-noctua/blob/master/tests/minerva-01.json#L475

instance_graph:
    nodes: 
         $name : $type
         ...
    edges:
         [ [$subj_node, $rel, $obj_node], ... ]

In minerva json (as yaml) would be something like:

    individuals:
      -
        id: $name
        type: $type
      -
        ...
    facts:
      -
        subject: name1
        property: rel1
        object: name2
      -
        ...

Not too bad, but slightly more complicated than my suggestion.

I'll expand more on my concern, but maybe let's start with a concrete example.

I'm still not fully grokking:

No need for string interpolation as there are no strings to be interpolated. This is the key bit of spec:
$type is either a quoted name that is a key in the pattern's class dictionary OR a var name

Using a potential pattern for receptor activity as an example

pattern: receptor_activity

relations:
     'has sensor': RO_...
     'has effector': RO_...
     'internally regulates': RO_...

classes:
     'receptor activity': GO_...
     'biochemical activity' : GO_...
     'binding': GO_...

vars:
     ligand_binding : "'binding'"
     effector: "'biochemical activity'"

EquivalentTo:
    text: "'receptor activity' that 'has sensor' some %s and 'has effector' some %s"
    vars:
       - ligand_binding
       - effector

instance_graph:
    nodes:
       receptor1: 'receptor activity'
       effector1: effector
       sensor1: ligand_binding
    edges:
      - ['receptor1', 'has effector', 'effector1']
      - ['receptor1', 'has sensor', 'sensor1']
      - ['sensor1', 'internally regulates', 'effector1']

In the minerva version:

  • nodes would become individuals with value a list of objects with keys id and type.
  • edges would become facts with value a list of objects with keys subject, property, object
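
The mapping in the two bullets above is mechanical. A small sketch, assuming the instance graph is already a Python dict; the key names follow the YAML rendering earlier in this thread and have not been checked against the Minerva request API.

```python
def to_individuals_and_facts(instance_graph):
    """Convert nodes/edges into lists of individual and fact objects."""
    individuals = [
        {"id": name, "type": node_type}
        for name, node_type in instance_graph["nodes"].items()
    ]
    facts = [
        {"subject": subj, "property": rel, "object": obj}
        for subj, rel, obj in instance_graph["edges"]
    ]
    return {"individuals": individuals, "facts": facts}


receptor_graph = {
    "nodes": {
        "receptor1": "receptor activity",
        "effector1": "effector",
        "sensor1": "ligand_binding",
    },
    "edges": [
        ["receptor1", "has effector", "effector1"],
        ["receptor1", "has sensor", "sensor1"],
        ["sensor1", "internally regulates", "effector1"],
    ],
}
converted = to_individuals_and_facts(receptor_graph)
print(converted["facts"][0])
```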

OK, good.

Btw for anyone following, this is the structure:

image

What you have is pleasing in that it's quite yaml introspectable. And it does away with the need for geeky %ss. However, there may be some disadvantages.

so I was thinking something like:

pattern: receptor_activity

relations:
     'has sensor': RO_...
     'has effector': RO_...
     'internally regulates': RO_...

classes:
     'receptor activity': GO_
     'biochemical activity' : GO_....
     'binding': GO_...

vars:
     ligand_binding : "'binding'"
     effector: "'biochemical activity'"

EquivalentTo:
    text: "'receptor activity' that 'has sensor' some %s and 'has effector' some %s"
    vars:
       - ligand_binding
       - effector

instance_graph:
    text: |
       _:t a 'transducer activity' .
       _:e a %s .
       _:s a %s .
       _:t 'has effector' _:e .
       _:t 'has sensor' _:s .
       _:s 'internally regulates' _:e .
    vars:
       - effector
       - ligand_binding

Note that blank nodes are used. It's up to the generator to decide how to handle these. We'd mint IRIs.
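
A sketch of what a generator might do with a text block of this kind: fill the %s slots in var order, then swap each blank-node label for a minted IRI. The bindings, the base IRI and the function names are assumptions for illustration. The regex is naive (it would also rewrite `_:x` inside a quoted name), and the output is not valid Turtle until the quoted names are resolved to IRIs.

```python
import re
import uuid

# Shortened version of the text block, for illustration only.
TEMPLATE = """\
_:t a 'receptor activity' .
_:e a %s .
_:s a %s .
_:t 'has effector' _:e .
"""


def fill_template(text, var_names, bindings):
    """Interpolate %s slots with the values bound to the listed vars, in order."""
    return text % tuple(bindings[name] for name in var_names)


def mint_blank_nodes(text, base="http://example.org/model/"):
    """Replace each distinct blank-node label with one freshly minted IRI."""
    minted = {}

    def replace(match):
        label = match.group(1)
        if label not in minted:
            minted[label] = f"<{base}{uuid.uuid4()}>"
        return minted[label]

    return re.sub(r"_:(\w+)", replace, text), minted


# Hypothetical fillers for the two vars.
bindings = {"effector": "'kinase activity'", "ligand_binding": "'hormone binding'"}
filled = fill_template(TEMPLATE, ["effector", "ligand_binding"], bindings)
output, minted = mint_blank_nodes(filled)
print(output)
```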

This has the advantage of requiring virtually no extension to the dosdp spec. The only change is that ttl is the format (we could do omn but not really important).

OK, the interpolation with %s is a bit geeky, but it's the same for the rest of dosdps. E.g. if we do something for multiple slots as for #16 it will work equally well here (e.g. protein complexes).

And I think it will deal with edge cases better. What if we want to inject an existing IRI into the pattern (e.g. PMID)? What about evidence? Do you reinvent reification with your list model? What about annotations on individuals? Those with literals vs IRI annotations? Negative property assertions (OK I can't think of a use case for that in a pattern right now, but you never know).

It's tempting to use the more yamly format, but I worry it's overly specific to today's use case.

I feel we went this path with the create subset of minervaJSON, and now it's harder to extend for other things. Now we'll have two mappings to a subset of RDF with their own special features, it's just additional cognitive burden.

(sorry, change transducer in mine to receptor, that was an unintentional change)

And just to expand on the protein complex example. We already have this in noctua:

image

And I love it.. but we can imagine this being easily genericized and driven by a dp (@kltm's 'super grebe'). However, for that we need to solve the cardinality of >1 problem. Shouldn't be hard, we have a ticket #16 for that. If we treat aboxes the same as tboxes, we only have to solve this once..

I liked the original proposal but started feeling more convinced by the generality of @cmungall's turtle. But a consumer would need to write a custom turtle parser for it. Or do some careful string replacement. And is turtle really any different than the lists of triples that @dosumis had? In either case you have to go to the same effort to handle reification.

I guess the rdf:type declarations could be handled as additional triples in the list-of-list form. So I guess if there was an optional interpolation option for the LoL form they'd be interchangeable.
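
Roughly what I mean by handling the type declarations as extra triples, as a sketch over a dict-shaped instance graph. The `rdf:type` string is used here as a plain marker rather than a resolved IRI, and `to_triples` is a hypothetical helper name.

```python
RDF_TYPE = "rdf:type"


def to_triples(instance_graph):
    """Flatten nodes and edges into a single list of [subj, pred, obj] triples."""
    type_triples = [
        [name, RDF_TYPE, node_type]
        for name, node_type in instance_graph["nodes"].items()
    ]
    return type_triples + [list(edge) for edge in instance_graph["edges"]]


graph = {
    "nodes": {"receptor1": "receptor activity", "effector1": "effector"},
    "edges": [["receptor1", "has effector", "effector1"]],
}
triples = to_triples(graph)
print(triples)
```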

But I'm not really worried about the authoring. It'll be us geeky sorts doing most of it for the core patterns. And most patterns won't have co-references (ie graphs) which means you can nest the ttl to reflect the tree structure.

We will want to make it easier to author later. One thing I always wanted for TG was a "make me one like this" button, i.e. the prototyping approach. In an existing model, click on a node, and use the graph as a model (you can even generate the subsuming class expression provided it's a tree).

Reification - not nice in any format. But if dosdps can support different W3 syntaxes then you can take your pick of tradeoffs.

most patterns won't have co-references (ie graphs)

From last week's discussions, I think co-references may be the rule rather than the exception for compound functions.

The blank node business still feels like an implementation issue to me (the internal identifiers in my proposal work just as well as the turtle _t etc). In your proposal for the individual level, we'd specify chunks of turtle containing many axioms + annotations on them. This is appealing in that it doesn't require much more spec to be designed and written in order to be completely expressive. It's worth noting that this is quite different from the current DOSDP spec, in which each axiom is specified separately and there is an optional field for specifying annotations on an axiom (see below for example). This could potentially be re-used in the instance graph:

instance_graph:
    nodes:
       receptor1: 'receptor activity'
       effector1: effector
       sensor1: ligand_binding
    edges:
      - 
         edge: ['receptor1', 'has effector', 'effector1']
         annotations: 
            - 
              annotationProperty: database_crossreference
              text: "template:fu"  # Example lacks vars
      - 
          ...

This has the advantage of avoiding annoying axiom reification patterns. Add in an extra boolean for negation and I think we have full expressiveness.
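
A sketch of how a consumer could treat both edge forms uniformly, accepting either a bare triple or an object with an `edge` key. The `not` key is an assumed name for the negation boolean, and `normalise_edge` is a hypothetical helper.

```python
def normalise_edge(entry):
    """Accept a bare [subj, rel, obj] list or an object with an 'edge' key."""
    if isinstance(entry, list):
        entry = {"edge": entry}
    subj, rel, obj = entry["edge"]
    return {
        "subject": subj,
        "rel": rel,
        "object": obj,
        # Axiom annotations ride along with the edge, so no reification needed.
        "annotations": entry.get("annotations", []),
        # Assumed key name for the negation flag.
        "negated": entry.get("not", False),
    }


plain = normalise_edge(["receptor1", "has effector", "effector1"])
annotated = normalise_edge({
    "edge": ["receptor1", "has effector", "effector1"],
    "annotations": [
        {"annotationProperty": "database_crossreference", "text": "template:fu"}
    ],
    "not": True,
})
print(plain)
print(annotated["negated"])
```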

Annotation on an axiom - example in dosdp core (bit verbose - need to spec a more compact OBO pragma version).

data_var:  
    ref: xsd_string

annotation_axioms:
    - 
       annotation_property: 'definition'
       text: "Any %s that has a %s as a part"
       vars:
          - fu
          - bar
       annotations: 
          -
             annotation_property: database_crossreference
             text: '%s'
             vars:
                -  ref  # spec needs some more work here.  Better to allow data_vars to take lists.

@cmungall can you post an example of axiom annotation in turtle?

turtle owl-reification is ugly (see for example https://github.com/geneontology/noctua-models/blob/master/models/586fc17a00000961.ttl#L88-L94); the main point is we don't have to reinvent.

OK, you may have convinced me. We should still be sure to work out the edge cases (we are re-inventing rdf syntax...). I think we can probably just piggy back off of JSON-LD conventions. I'm thinking of cases such as where the AnnotationValue is something other than plain literal (rare for us, but you never know). And it has the advantage of being more easily introspectable.

The asymmetry between ABox and TBox seems a little inelegant. I don't want to block us on this, but I'm wondering whether, if we were to do the whole thing from scratch, we might have gone for a direct YAML representation of class axioms too. The asymmetry may stop us using the same approaches for aboxes and tboxes (e.g. protein complex multi-cardinality slot example).

I like @dosumis's example. I do think we should consider the case you mention where the annotation value is not text. We're already mashing resources into literals in LEGO, e.g. <http://purl.org/dc/elements/1.1/contributor> "http://orcid.org/0000-0003-2689-5511"^^<http://www.w3.org/2001/XMLSchema#string>. This makes it awkward to query people's contributions with SPARQL.

We could replace the text property with value. But we need to allow both literals and resources, so we can't use JSON-LD to say that the value of value is always a resource. So a resource value would always need to be an object like {"@id": "http://orcid.org/0000-0003-2689-5511"}. Kind of verbose.

Alternatively we could keep text and add another property like resource (is there a better term?). People could use one or the other.

draft json schema spec (in YAML)

  instance_graph:
    type: object
    additionalProperties: False
    required: [nodes, edges]
    properties:
      nodes:
         type: object
      edges:
        type: object
        additionalProperties: False
        required: [edge]
        properties:
          edge:
            type: array
            items: { type: string }
          annotations: 
            { $ref: '#/definitions/printf_annotation' }
          not: 
            type: boolean

$ref refers back to core field type def.

The one thing missing in this proposal is a way to specify types as anonymous classes. This is (I think) beyond the scope of LEGO, but might be useful expressiveness elsewhere. (I don't have a use for these in VFB yet, but we type using anonymous classes very extensively). Supporting this would require the nodes field to revert to Manchester syntax sprintf.

I do think we should consider the case you mention where the annotation value is not text.

Here's the current spec for the annotation field. It assumes printf + var sub. When used in regular annotations (e.g. for a label or a def) the assumption is that if an OWL entity is specified (by a var) then the readable identifier (label in our case) will be used in the sub. This field can also take strings specified in data_vars.

  printf_annotation:
    type: object
    additionalProperties: False
    required: [annotationProperty, text, vars]
    properties:
      annotationProperty:
        description: > 
         A string corresponding to the rdfs:label
         of an owl annotation property. If the annotation property has no label,
         the shortForm ID should be used. The annotation property must be listed
         in the annotation property dictionary.
        type: string
      annotations:
        items: {$ref: '#/definitions/printf_annotation'}
        type: array
      text:
        description: A print format string.
        type: string
      vars:
        description: >
         An ordered list of variables for substitution into the accompanying
         print format string. Each entry must correspond to the name of a variable
         specified in either the 'vars' field or the data_var field of the pattern.
         Where an OWL entity is specified, the label for the OWL entity should be
         used in the substitution.
        items: {type: string}
        type: array
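
The substitution rule in the vars description above could be sketched like this, assuming var bindings and a label lookup are available as dicts. The IDs, labels and function name are hypothetical.

```python
def render_annotation_text(annotation, bindings, labels):
    """Fill the print format string, using labels where a var names an OWL entity."""
    values = []
    for var in annotation["vars"]:
        value = bindings[var]
        # OWL entities are rendered by label; data_var strings pass through as-is.
        values.append(labels.get(value, value))
    return annotation["text"] % tuple(values)


definition = {
    "annotationProperty": "definition",
    "text": "Any %s that has a %s as a part",
    "vars": ["fu", "bar"],
}
bindings = {"fu": "EX:0000001", "bar": "EX:0000002"}   # hypothetical fillers
labels = {"EX:0000001": "cell", "EX:0000002": "nucleus"}  # hypothetical labels
rendered = render_annotation_text(definition, bindings, labels)
print(rendered)
```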

We'd need a new type of annotation field in order to support the value of an annotation being an OWL entity (e.g. as in subset declarations). Presumably this would be passed as a URI string?

  value_annotation:
    type: object
    additionalProperties: False
    required: [annotationProperty, value]
    properties:
        annotationProperty:
           type: string
        value:
           type: string  # a string in JSON, but taking a var specifying a URI...
        annotations:
           type: array
           items: { oneOf: [{ $ref: '#/definitions/printf_annotation' },
                            { $ref: '#/definitions/value_annotation' }] }

The annotations field on printf_annotation should be updated to take both types of annotation too.

I've implemented the following solution in spec/DOSDP_schema_full.yaml. (I'm planning to split this file and use JSON schema imports later. Need to switch to JSON for that to work.)

Instance graph:

  instance_graph:
    type: object
    additionalProperties: False
    required: [nodes, edges]
    properties:
      nodes:
        description: > 
                       Key = name of individual within this pattern doc
                       Value = Type of individual specified using either 
                       the quoted name of a class in the class dictionary of this pattern
                       or a var name.  This field does not support typing via 
                       anonymous class expressions
        type: object
      edges:
        type: object
        additionalProperties: False
        required: [edge]
        properties:
          edge:
            description: >
                          A triple specified as an ordered array with 3 elements
                          [subject, rel, object]
                            * rel must be the quoted name of a relation from the relations
                              (object property) dictionary.
                            * subject and object must be the name of an individual
                              specified in the nodes field.
            type: array
            items: { type: string }
            minItems: 3
            maxItems: 3
          annotations:
            type: array 
            items: { $ref: '#/definitions/annotation' }
          not:
            description: "Optional field for negated OPAs"
            type: boolean

This uses a generic solution for annotating axioms:

  printf_annotation:
    type: object
    additionalProperties: False
    required: [annotationProperty, text, vars]
    properties:
      annotationProperty:
        description: > 
         A string corresponding to the rdfs:label
         of an owl annotation property. If the annotation property has no label,
         the shortForm ID should be used. The annotation property must be listed
         in the annotation property dictionary.
        type: string
      annotations:
        items: { $ref: "#/definitions/annotation" } 
        type: array
      text:
        description: A print format string.
        type: string
      vars:
        description: >
         An ordered list of variables for substitution into the accompanying
         print format string. Each entry must correspond to the name of a variable
         specified in either the 'vars' field or the data_var field of the pattern.
         Where an OWL entity is specified, the label for the OWL entity should be
         used in the substitution.
        items: {type: string}
        type: array
        
  list_annotation:
    type: object
    additionalProperties: False
    required: [annotationProperty, value]
    properties:
      annotationProperty:
        description: > 
         A string corresponding to the rdfs:label
         of an owl annotation property. If the annotation property has no label,
         the shortForm ID should be used. The annotation property must be listed
         in the annotation property dictionary.
        type: string
      value:
        description: >
         A single list variable (list_var or data_list_var).  Each item in this list 
         should be used to generate a separate annotation axiom.
        type: string
  annotation:
   oneOf: 
     - { $ref: "#/definitions/printf_annotation" }
     - { $ref: "#/definitions/list_annotation" }
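
For the list_annotation case, the intended expansion (one annotation axiom per list item) could look something like this sketch. The var name `xrefs`, its values and the helper name are made up for illustration.

```python
def expand_list_annotation(annotation, bindings):
    """Generate one (property, value) annotation per item in the bound list."""
    prop = annotation["annotationProperty"]
    return [(prop, item) for item in bindings[annotation["value"]]]


xref_annotation = {
    "annotationProperty": "database_crossreference",
    "value": "xrefs",  # hypothetical data_list_var name
}
expanded = expand_list_annotation(xref_annotation, {"xrefs": ["EX:1", "EX:2"]})
print(expanded)
```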

var specification is getting a bit complicated:

  vars:
    type: object
    description: >
     A dictionary of variables ranging over OWL classes.  
     Key = variable name, value = variable range as manchester syntax string.
     
  list_vars:
    type: object
    description: >
     A dictionary of variables referring to lists of owl classes.
     Key = variable name, value = variable range of items in list specified as a valid OWL
     data-type.
  
  data_vars:
    type: object
    description: >
     A dictionary of variables ranging over OWL data-types.
     Key = variable name, value = variable range specified as a valid OWL
     data-type.
     
  data_list_vars:
    description: >
        A dictionary of variables referring to lists of some specified OWL data-types.
        Key = variable name, value = variable range of all items in list, 
        specified as a valid OWL data-type.

This could potentially be simplified to just vars and list with the specification of range for each variable working to distinguish types, but I think this is probably too much of a burden on development. I've also designed some OBO convenience fields for axiom annotation, but these are not (so far) permitted in the instance graph.

@dosumis shouldn't edges: be of type array? I think you need to define an Edge object type to go in the array.

@dosumis shouldn't edges: be of type array? I think you need to define an Edge object type to go in the array.

Ooops. You're right.
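
As a sanity check on what the corrected shape should accept, here is a hand-rolled sketch of the constraints on edges once it is an array of edge objects (required `edge`, exactly three strings, optional `annotations` and boolean `not`, no other keys). It is not a substitute for the schema itself, and `check_edges` is a hypothetical name.

```python
ALLOWED_KEYS = {"edge", "annotations", "not"}


def check_edges(edges):
    """Return a list of problems found in an 'edges' value."""
    if not isinstance(edges, list):
        return ["edges must be an array of edge objects"]
    errors = []
    for i, item in enumerate(edges):
        if not isinstance(item, dict) or "edge" not in item:
            errors.append(f"edges[{i}]: missing required key 'edge'")
            continue
        extra = set(item) - ALLOWED_KEYS
        if extra:
            errors.append(f"edges[{i}]: unexpected keys {sorted(extra)}")
        triple = item["edge"]
        is_triple = (
            isinstance(triple, list)
            and len(triple) == 3
            and all(isinstance(part, str) for part in triple)
        )
        if not is_triple:
            errors.append(f"edges[{i}]: 'edge' must be an array of 3 strings")
        if "not" in item and not isinstance(item["not"], bool):
            errors.append(f"edges[{i}]: 'not' must be a boolean")
    return errors


print(check_edges([{"edge": ["mf", "enabled by", "gp"]}]))
```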

I would like to give @DoctorBud something to work with for implementing a generic annoton pattern

pattern: basic_annoton

relations:
     enabled by: RO:0002333
     occurs in: BFO:0000066
     part of: BFO:0000050

classes:
     gene product or complex: TODO
     molecular function : GO:0003674
     biological process: GO:0008150
     cellular component: GO:0005575

vars:
     gene product: "'gene product'"
     molecular function : "'molecular function'"
     biological process: "'biological process'"
     cellular component: "'cellular component'"

instance_graph:
    nodes:
       gp: gene product
       mf: molecular function
       bp: biological process
       cc: cellular component
    edges:
      - edge: [mf, 'enabled by', gp]
        annotations: ?
      - edge: [mf, 'occurs in', cc]
        annotations: ?
      - edge: [mf, 'part of', bp]
        annotations: ?

This is for: geneontology/noctua#461

I wonder if it's necessary to specify the full evidence model every time. Can we just have a generic placeholder for 'insert evidence here'?