Implement VALVE basics from example Google Sheet
I've created a Google Sheet with examples and tests for what I want VALVE to be:
https://docs.google.com/spreadsheets/d/1DDKk8vf6IWyOt-4iVUOMif-0yIFk7NwzN1UvaWb1RLk/edit?usp=sharing
- configuration tables
- datatype: The rows specify a tree of datatypes, so far just different strings; the
- field: Each row specifies a table, column, and the datatype that applies to the values in that column
- rule: Each row specifies a "When" condition and a "Then" condition; if "When" is true then "Then" must be true
- data tables
- prefix: Just a prefix and base, for CURIEs
- external: A tree of ontology terms
- exposure: An example based on the IEDB immune exposure model
- testing tables
- problems: TODO a standard problems table
You can ignore any other sheets. These sheets should be saved in something like tests/example/*.tsv.
The core of the system is a tree of basic datatypes, defined by regular expressions, and then a few functions for combining these: in, from, under, tree, lookup, CURIE.
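To make the "tree of basic datatypes, defined by regular expressions" concrete, here is a minimal sketch. The datatype names echo the parser examples in this thread, but the parent links, the regular expressions, and the helper names `ancestors` and `first_failure` are all assumptions for illustration, not the contents of the sheet.

```python
import re

# Hypothetical datatype rows: name -> (parent name, regular expression).
# The patterns are invented for illustration only.
DATATYPES = {
    "string": (None, r"[\s\S]*"),
    "line": ("string", r"[^\n]*"),
    "trimmed line": ("line", r"\S(.*\S)?"),
    "label": ("trimmed line", r"[^|]+"),
}


def ancestors(name):
    """Return the chain of datatypes from the root down to `name`."""
    chain = []
    while name is not None:
        chain.append(name)
        name = DATATYPES[name][0]
    return list(reversed(chain))


def first_failure(name, value):
    """Check `value` from general to specific; return the first failing datatype, or None."""
    for datatype in ancestors(name):
        pattern = DATATYPES[datatype][1]
        if re.fullmatch(pattern, value) is None:
            return datatype
    return None
```

This sketch stops at the first failing datatype, which is only one of the two options; collecting every failure along the chain would be a small change to `first_failure`.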
The tool should read the three configuration tables in order (datatype, field, rule), checking them as it goes, and printing a standard problem table. If there are problems after checking these three, I think the tool should just quit.
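As a rough sketch of checking the configuration tables in order, the code below reads TSV text and verifies that later tables only refer to datatypes defined earlier. The column headers (`datatype`, `parent`, `table`, `column`, `type`), the problem tuple layout, and the function names are assumptions; the rule table is left out, and function types such as tree(...) are skipped rather than parsed.

```python
import csv
import io


def read_tsv(text):
    """Parse TSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def check_config(datatype_rows, field_rows):
    """Return problems as (table, row number, message) tuples."""
    problems = []
    names = {row["datatype"] for row in datatype_rows}

    # datatype table: every non-blank parent must itself be a datatype
    for number, row in enumerate(datatype_rows, start=1):
        parent = row["parent"]
        if parent and parent not in names:
            problems.append(("datatype", number, f"unknown parent {parent!r}"))

    # field table: plain types must be known datatypes;
    # function expressions (anything containing a parenthesis) are not checked here
    for number, row in enumerate(field_rows, start=1):
        kind = row["type"]
        if "(" not in kind and kind not in names:
            problems.append(("field", number, f"unknown datatype {kind!r}"))

    return problems


datatype_rows = read_tsv("datatype\tparent\nstring\t\nline\tstring\nlabel\ttrimmed line\n")
field_rows = read_tsv(
    "table\tcolumn\ttype\n"
    "external\tLabel\tlabel\n"
    "external\tParent\ttree(external.Label)\n"
    "exposure\tSome Column\tcurie\n"
)
config_problems = check_config(datatype_rows, field_rows)
```

A caller could print `config_problems` as the problem table and exit before touching the data tables whenever the list is non-empty.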
Then the tool should read the three data tables in order (prefix, external, exposure). For each table, for each row:
- field: use the table and column name to find the matching field (and therefore datatype)
- if there's no matching field, keep going
- if the datatype for this field is basic (i.e. in the datatype table), then start from the root of the datatype tree and check each datatype from general to specific
- if a check fails, generate a row for a standard problem table; I'm not sure whether we should continue checking the datatypes or move on to the next cell
- if the datatype is complex then things get fancier...
- most of the functions generate sets that we can save and reuse for future checks: in, from, under, tree, lookup
- some are string matching rules: CURIE
- the datatype could also include boolean operators: not, or, and
- rule: use the table and column name to find any matching rules; for each rule
- if the When condition applies, check the Then condition (which will be another table.column)
- if the Then condition fails, generate a standard problems table row
The tool should run through all the tables like this, maybe generating a long list of standard problem table rows.
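The per-cell loop described above might look roughly like the sketch below. Everything concrete in it is an assumption: the in-memory `FIELDS`, `CHECKS`, and `RULES` structures, the "Table 1" / "Column 1" / "Column 2" names, and the shape of a problem row (the standard problems table is still a TODO). Conditions are stand-in Python callables rather than parsed datatype expressions.

```python
# Hypothetical field table: (table, column) -> datatype name
FIELDS = {("Table 1", "Column 1"): "label"}

# Hypothetical stand-ins for datatype conditions
CHECKS = {
    "label": lambda value: value != "" and value.strip() == value,
    "not blank": lambda value: value != "",
}

# Hypothetical rule: when Column 1 is not blank, then Column 2 must not be blank
RULES = [
    {"table": "Table 1", "when column": "Column 1", "when": "not blank",
     "then column": "Column 2", "then": "not blank"},
]


def check_rows(table, rows):
    """Return a list of problem rows (dicts) for one data table."""
    problems = []
    for row_number, row in enumerate(rows, start=1):
        for column, value in row.items():
            # field: skip the datatype check when no field matches
            datatype = FIELDS.get((table, column))
            if datatype is not None and not CHECKS[datatype](value):
                problems.append({"table": table, "row": row_number, "column": column,
                                 "message": f"value should be {datatype}"})
            # rule: only rules whose When side names this table.column
            for rule in RULES:
                if (rule["table"], rule["when column"]) != (table, column):
                    continue
                if not CHECKS[rule["when"]](value):
                    continue  # the When condition does not apply
                then_value = row.get(rule["then column"], "")
                if not CHECKS[rule["then"]](then_value):
                    problems.append({"table": table, "row": row_number,
                                     "column": rule["then column"],
                                     "message": f"value should be {rule['then']}"})
    return problems


rows = [
    {"Column 1": "a", "Column 2": "b"},
    {"Column 1": "a", "Column 2": ""},
    {"Column 1": " x", "Column 2": ""},
]
problems = check_rows("Table 1", rows)
```

Here the second row breaks only the rule, while the third row breaks both the datatype check on Column 1 and the rule, so deliberately invalid data rows yield one problem row per failed check.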
I want the data tables to include invalid data. The 'problem' table should eventually be what we want the output to be, as an integration test.
I'd recommend using a proper parser for handling the fancy conditions, maybe lark.
Here's a first pass at a parser using lark:
from lark import Lark

# Grammar for datatype expressions: negation, disjunction, functions, plain datatypes
parser = Lark('''
start: expression
expression: negation | disjunction | type
negation: "not" expression
disjunction: type "or" type
type: function | datatype
function: function_name "(" arguments ")"
function_name: WORD
arguments: argument ("," argument)*
argument: field | label
field: label "." label
datatype: label
label: WORD | ESCAPED_STRING
%import common.WORD
%import common.ESCAPED_STRING
%ignore " " // Disregard spaces in text
''')

# Example expressions to parse
examples = [
    'line',
    '"trimmed line"',
    'blank or label',
    'not blank',
    'in(foo, "bar")',
    'under("Table 1"."Column 2", "bar")',
]

for example in examples:
    print(example, parser.parse(example))
Should under have three args?
under(<parent-column>, <value-column>, <top-level-value>)
Because we want to check for the descendants of the top-level value, correct? But if we only provide the parent column, how will we know what all those values are?
e.g.,
under(external.Parent, external.Label, "occurrence of disease")
tells us that external.Parent contains the parent of external.Label and we put all external.Label values in the list of descendants. This would return (based on VALVE examples):
- occurrence of disease
- occurrence of infectious disease
If we don't include the label column, how will we know to allow "occurrence of infectious disease"?
I'm hoping to keep under() to two arguments: under(some-tree, some-term). The some-tree should point to a tree(some-label) which should point to the label column.
I'm sorry if I'm missing something here, but if we're only passing some-label to tree (e.g., tree(external.Label)), how does it know what other column (e.g., parent column) to use to build the tree?
This is how I'm thinking about it. Looking at the field table:
- the row for external.Label has type label, so each cell in external.Label should conform to the label datatype
- the row for external.Parent has type tree(external.Label), so we know:
  - the external.Parent column has to have the same type as external.Label, which is label, so each cell in external.Parent should conform to label (actually we should probably allow blank parents when building trees)
  - for each row in external we'll use the pairs of external.Label and external.Parent to form a tree, which we'll associate with the external.Parent field
- the row for exposure."Exposure Process Reported" has type under(external.Parent, "exposure process"), so we know:
  - the exposure."Exposure Process Reported" column has the same datatype as external.Parent, which is label
  - each cell in exposure."Exposure Process Reported" should be either equal to "exposure process" or a descendant of "exposure process" in the tree defined for external.Parent
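A minimal sketch of that reading of tree() and under(), assuming the external rows are already loaded as dicts. The function names `build_tree` and `descendants_under`, the column keys, and the placement of "occurrence of disease" under "exposure process" are illustrative assumptions, not taken from the sheet.

```python
from collections import defaultdict


def build_tree(rows, label_column, parent_column):
    """Map each parent label to the set of its direct children."""
    children = defaultdict(set)
    for row in rows:
        parent = row.get(parent_column, "")
        if parent:  # a blank parent marks a root and adds no edge
            children[parent].add(row[label_column])
    return children


def descendants_under(children, top):
    """Return `top` together with all of its descendants in the tree."""
    found = set()
    stack = [top]
    while stack:
        node = stack.pop()
        if node in found:
            continue  # guards against cycles in bad data
        found.add(node)
        stack.extend(children.get(node, ()))
    return found


# Hypothetical rows standing in for the external table
external = [
    {"Label": "exposure process", "Parent": ""},
    {"Label": "occurrence of disease", "Parent": "exposure process"},
    {"Label": "occurrence of infectious disease", "Parent": "occurrence of disease"},
]

tree = build_tree(external, "Label", "Parent")
allowed = descendants_under(tree, "occurrence of disease")
```

With these rows, `allowed` holds "occurrence of disease" and "occurrence of infectious disease". The set is computed once per under(...) type and can be reused for every cell in the column, so a cell is valid exactly when its value is in `allowed`.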
Does that mean in order to use under(external.Parent, ...), you must have already defined tree(external.Label) as the field type for external.Parent?
Yes, that's what I'm thinking.
Thanks! This helps a lot.
Closed by #2