ontodev/valve.py

Implement VALVE basics from example Google Sheet

I've created a Google Sheet with examples and tests for what I want VALVE to be:

https://docs.google.com/spreadsheets/d/1DDKk8vf6IWyOt-4iVUOMif-0yIFk7NwzN1UvaWb1RLk/edit?usp=sharing

  • configuration tables
    • datatype: The rows specify a tree of datatypes, so far just different strings; the
    • field: Each row specifies a table, column, and the datatype that applies to the values in that column
    • rule: Each row specifies a "When" condition and a "Then" condition; if "When" is true then "Then" must be true
  • data tables
    • prefix: Just a prefix and base, for CURIEs
    • external: A tree of ontology terms
    • exposure: An example based on the IEDB immune exposure model
  • testing tables
    • problems: TODO a standard problems table

You can ignore any other sheets. These sheets should be saved in something like tests/example/*.tsv.

The core of the system is a tree of basic datatypes, defined by regular expressions, and then a few functions for combining these: in, from, under, tree, lookup, CURIE.

The tool should read the three configuration tables in order (datatype, field, rule), checking them as it goes, and printing a standard problem table. If there are problems after checking these three, I think the tool should just quit.
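
A minimal sketch of that load order, assuming each sheet has already been exported as TSV text. The helper names (`read_tsv`, `check_config`, `load_config`), the required-column lists, and the shape of a problem row are all hypothetical; the issue does not define them yet.

```python
import csv
import io

# Order in which the configuration tables are read (taken from the issue text).
CONFIG_ORDER = ["datatype", "field", "rule"]

# Hypothetical required columns per table; the real sheets may use other names.
REQUIRED_COLUMNS = {
    "datatype": ["datatype", "parent", "match"],
    "field": ["table", "column", "type"],
    "rule": ["table", "when column", "when condition",
             "then column", "then condition"],
}


def read_tsv(text):
    """Parse TSV text into a list of row dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def check_config(name, rows):
    """Return problem rows for one configuration table.

    Only checks that every required column is present in each row;
    blank cells (empty strings) are allowed.
    """
    problems = []
    for row_number, row in enumerate(rows, start=2):  # row 1 is the header
        for column in REQUIRED_COLUMNS[name]:
            if row.get(column) is None:
                problems.append({
                    "table": name,
                    "row": row_number,
                    "column": column,
                    "message": "missing required column or value",
                })
    return problems


def load_config(sources):
    """Read and check the configuration tables in CONFIG_ORDER."""
    tables, problems = {}, []
    for name in CONFIG_ORDER:
        tables[name] = read_tsv(sources[name])
        problems.extend(check_config(name, tables[name]))
    return tables, problems
```

A caller could print the returned problems as a table and exit with a non-zero status whenever the list is non-empty, before touching any data table.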

Then the tool should read the three data tables in order (prefix, external, exposure). For each table, for each row:

  • field: use the table and column name to find the matching field (and therefore datatype)
    • if there's no matching field, keep going
    • if the datatype for this field is basic (i.e. in the datatype table), then start from the root of the datatype tree and check each datatype from general to specific
      • if a check fails, generate a row for a standard problem table; I'm not sure whether we should continue checking the datatypes or move on to the next cell
    • if the datatype is complex then things get fancier...
      • most of the functions generate sets that we can save and reuse for future checks: in, from, under, tree, lookup
      • some are string matching rules: CURIE
      • the datatype could also include boolean operators: not, or, and
  • rule: use the table and column name to find any matching rules; for each rule
    • if the When condition applies, check the Then condition (which will be another table.column)
      • if the Then condition fails, generate a standard problems table row
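
The basic-datatype step above could be sketched roughly as follows. The datatype names echo the parser examples later in this thread, but the regular expressions, the `ancestry` and `check_cell` helpers, and the problem-row fields are assumptions made for illustration. Since it is still open whether checking should continue after the first failure, the sketch exposes both behaviours through a `stop_at_first` flag.

```python
import re

# Hypothetical datatype table: name -> (parent, regular expression).
# The patterns are illustrative guesses, not the ones in the sheet.
DATATYPES = {
    "line": (None, r"[^\n]*"),
    "trimmed line": ("line", r"\S(.*\S)?"),
    "label": ("trimmed line", r"[A-Za-z][A-Za-z0-9 ]*"),
}


def ancestry(datatype):
    """Return the chain of datatypes from the root down to `datatype`."""
    chain = []
    while datatype is not None:
        chain.append(datatype)
        datatype = DATATYPES[datatype][0]
    return list(reversed(chain))


def check_cell(table, column, row, value, datatype, stop_at_first=True):
    """Check `value` against each datatype, from general to specific."""
    problems = []
    for name in ancestry(datatype):
        pattern = DATATYPES[name][1]
        if re.fullmatch(pattern, value) is None:
            problems.append({
                "table": table,
                "row": row,
                "column": column,
                "datatype": name,
                "value": value,
            })
            if stop_at_first:
                break  # move on to the next cell
    return problems
```

For example, `check_cell("external", "Label", 2, " padded ", "label")` passes `line` but fails `trimmed line`, so it reports one problem; with `stop_at_first=False` it also reports the `label` failure.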

The tool should run through all the tables like this, maybe generating a long list of standard problem table rows.
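
The rule step could look something like the sketch below. It assumes the When and Then conditions have already been turned into Python callables (standing in for parsed datatype expressions); the `check_rules` helper, the rule dictionary keys, and the "Disease Reported" column are hypothetical.

```python
def is_blank(value):
    """True when the cell holds only whitespace or nothing at all."""
    return value.strip() == ""


def not_blank(value):
    return not is_blank(value)


# Hypothetical rule: when an exposure process is reported,
# a second (made-up) column must also be filled in.
RULES = [
    {
        "table": "exposure",
        "when column": "Exposure Process Reported",
        "when": not_blank,
        "then column": "Disease Reported",  # hypothetical column name
        "then": not_blank,
    },
]


def check_rules(table, rows, rules):
    """Return a problem row for every rule whose When holds but Then fails."""
    problems = []
    for row_number, row in enumerate(rows, start=2):  # row 1 is the header
        for rule in rules:
            if rule["table"] != table:
                continue
            when_value = row[rule["when column"]]
            then_value = row[rule["then column"]]
            if rule["when"](when_value) and not rule["then"](then_value):
                problems.append({
                    "table": table,
                    "row": row_number,
                    "column": rule["then column"],
                    "value": then_value,
                })
    return problems
```

Rows where the When condition does not apply are skipped, so a rule never produces a problem for them.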

I want the data tables to include invalid data. The 'problems' table should eventually contain the output we expect from the tool, so that it can serve as an integration test.

I'd recommend using a proper parser for handling the fancy conditions, maybe lark.

Here's a first pass at a parser using lark:

from lark import Lark

parser = Lark('''
    start: expression
    expression: negation | disjunction | type
    negation: "not" expression
    disjunction: type "or" type
    type: function | datatype
    function: function_name "(" arguments ")"
    function_name: WORD
    arguments: argument ("," argument)*
    argument: field | label
    field: label "." label
    datatype: label
    label: WORD | ESCAPED_STRING

    %import common.WORD
    %import common.ESCAPED_STRING
    %ignore " "           // Disregard spaces in text
''')

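# Example expressions covering each grammar rule
examples = [
    'line',                                # bare datatype
    '"trimmed line"',                      # quoted datatype
    'blank or label',                      # disjunction
    'not blank',                           # negation
    'in(foo, "bar")',                      # function with label arguments
    'under("Table 1"."Column 2", "bar")',  # function with a field argument
]

# Print each expression next to its parse tree
for example in examples:
    print(example, parser.parse(example))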

Should under have three args?

under(<parent-column>, <value-column>, <top-level-value>)

Because we want to check for the descendants of the top-level value, correct? But if we only provide the parent column, how will we know what all those values are?

e.g.,

under(external.Parent, external.Label, "occurrence of disease")

This tells us that external.Parent contains the parent of each external.Label value, and we put all matching external.Label values in the list of descendants. This would return (based on the VALVE examples):

  • occurrence of disease
  • occurrence of infectious disease

If we don't include the label column, how will we know to allow "occurrence of infectious disease"?

I'm hoping to keep under() to two arguments: under(some-tree, some-term). The some-tree argument should point to a field whose type is tree(some-label), which in turn points to the label column.

I'm sorry if I'm missing something here, but if we're only passing some-label to tree (e.g., tree(external.Label)), how does it know what other column (e.g., the parent column) to use to build the tree?

This is how I'm thinking about it. Looking at the field table:

  1. the row for external.Label has type label, so each cell in external.Label should conform to the label datatype
  2. the row for external.Parent has type tree(external.Label), so we know:
    • the external.Parent column has to have the same type as external.Label, which is label, so each cell in external.Parent should conform to label (actually we should probably allow blank parents when building trees)
    • for each row in external we'll use the pairs of external.Label and external.Parent to form a tree, which we'll associate with the external.Parent field
  3. the row for exposure."Exposure Process Reported" has type under(external.Parent, "exposure process"), so we know:
    • the exposure."Exposure Process Reported" column has the same datatype as external.Parent, which is label
    • each cell in exposure."Exposure Process Reported" should be either equal to "exposure process" or a descendant of "exposure process" in the tree defined for external.Parent
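
To make steps 2 and 3 concrete, here is a rough sketch of building the tree from (Label, Parent) pairs and then collecting everything under a term. The `build_tree` and `under` helpers are hypothetical names, and the sample rows are made up for illustration, apart from the "occurrence of disease" pair mentioned earlier in this thread.

```python
def build_tree(rows, label_column, parent_column):
    """Map each label to the set of its direct children."""
    children = {}
    for row in rows:
        label, parent = row[label_column], row[parent_column]
        children.setdefault(label, set())
        if parent:  # a blank parent marks a root node
            children.setdefault(parent, set()).add(label)
    return children


def under(children, top):
    """Return `top` together with all of its descendants in the tree."""
    found, stack = set(), [top]
    while stack:
        node = stack.pop()
        if node not in found:  # also guards against accidental cycles
            found.add(node)
            stack.extend(children.get(node, ()))
    return found


# Made-up rows for the external table; only the last pair comes from the thread.
external = [
    {"Label": "exposure process", "Parent": ""},
    {"Label": "occurrence of disease", "Parent": "exposure process"},
    {"Label": "occurrence of infectious disease",
     "Parent": "occurrence of disease"},
]

tree = build_tree(external, "Label", "Parent")
allowed = under(tree, "occurrence of disease")
```

Here `allowed` holds "occurrence of disease" and "occurrence of infectious disease". Because the tree is stored once against the external.Parent field, under() only needs the tree and the top-level term, and the resulting set can be cached for later cells.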

Does that mean in order to use under(external.Parent, ...), you must have already defined tree(external.Label) as the field type for external.Parent?

Yes, that's what I'm thinking.

Thanks! This helps a lot.

Closed by #2