WIP: General improvements
We have a good first implementation in ontodev/valve.py. Now that I've had a chance to use it, I'm considering some revisions and clarifications. As always, I want the user to be able to form a simple mental model of how VALVE works, making it easy to learn and use, and to avoid edge cases and surprises.
- add regular expression matches to the grammar /foo/ and interpret this as a match function
- add regular expression substitutions to the grammar s/foo/bar/
- ideally enforce that these are implemented as PCREs
- generalize the datatype table to be reusable conditions in a hierarchy -- maybe rename to "condition" table
- generalize datatypes from a tree to a DAG by allowing multiple parents
- maybe enforce that datatype names are single words
- maybe rework split(pattern, count, expression, ...) as concat(slot, slot, slot), e.g. concat(cell.label, " & ", gates)
- drop CURIE and replace with more general concat(prefix.prefix, ":", local_name)
- a tree with a split is not a tree, it's a directed acyclic graph -- I'd like to distinguish tree from dag (or maybe hierarchy)
- I'm worried that the current grammar has a lot of ambiguity: double quoted strings vs double quoted datatypes or column names or table names -- maybe this doesn't matter
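To make the first two bullets concrete, here is a rough sketch of how /foo/ and s/foo/bar/ literals could be recognized and applied. The helper names are hypothetical and not part of valve.py. It uses Python's built-in re module, which is not a PCRE engine, and it assumes a match means "found anywhere in the cell"; both choices are assumptions for illustration.

```python
import re

# Hypothetical helpers, not part of valve.py.
# A body is any run of non-slash characters or backslash escapes.
BODY = r"((?:[^/\\]|\\.)*)"
MATCH_LITERAL = re.compile(r"/" + BODY + r"/")
SUB_LITERAL = re.compile(r"s/" + BODY + r"/" + BODY + r"/")


def parse_regex_literal(text):
    """Return ("match", pattern) or ("sub", pattern, replacement)."""
    m = SUB_LITERAL.fullmatch(text)
    if m:
        return ("sub", m.group(1), m.group(2))
    m = MATCH_LITERAL.fullmatch(text)
    if m:
        return ("match", m.group(1))
    raise ValueError(f"not a regex literal: {text!r}")


def apply_literal(literal, cell):
    """Apply a literal to a cell: a bool for a match, a new string for a substitution."""
    parsed = parse_regex_literal(literal)
    if parsed[0] == "match":
        # Assumption: a match succeeds if the pattern occurs anywhere in the cell.
        return re.search(parsed[1], cell) is not None
    return re.sub(parsed[1], parsed[2], cell)
```

For example, apply_literal("/foo/", "a foo b") checks the cell, while apply_literal("s/foo/bar/", "foo foo") rewrites it.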
A condition defines a list of checks. Each check defines a predicate (function) that takes a string and returns a boolean, as well as a bunch of information about the check: name, parents, level, message, etc. For each cell, we go through the list of checks in order, and ensure that the cell satisfies each predicate.
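A minimal sketch of that mental model, assuming a condition is just an ordered list of checks. The Check class and validate_cell function are hypothetical names; the fields follow the ones listed above, and the default level and example messages are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Check:
    # Hypothetical structure; fields follow the prose (name, parents, level, message).
    name: str
    predicate: Callable[[str], bool]
    parents: List[str] = field(default_factory=list)
    level: str = "ERROR"
    message: str = ""


def validate_cell(cell, checks):
    """Run every check in order and return (level, name, message) for each failure."""
    return [(c.level, c.name, c.message) for c in checks if not c.predicate(cell)]


# An example condition: two checks applied in order.
condition = [
    Check("nonempty", lambda s: s != "", message="cell must not be empty"),
    Check(
        "trimmed",
        lambda s: s == s.strip(),
        level="WARN",
        message="cell has surrounding whitespace",
    ),
]
```

Calling validate_cell(" x", condition) returns only the failure for the trimmed check, and an empty list means the cell satisfies the whole condition.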
A predicate can also be thought of as a set of strings for which the predicate is true. A set of strings can be defined extensionally or intensionally. For an extensionally defined set we have a list of all the strings, so we just look up the string in the set -- this is how in and under work. For an intensionally defined set we have a rule for determining if the string is in the set -- this is how regex matches and list work. Even distinct can be thought of as: this cell is not in the set of other cells in this column.
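The two kinds of set can be sketched as predicate factories. The function names below are hypothetical and only loosely mirror the VALVE functions; the regex case assumes the whole cell must match.

```python
import re


def in_set(values):
    """Extensional: the set is listed out, so membership is a lookup."""
    allowed = set(values)
    return lambda cell: cell in allowed


def matches(pattern):
    """Intensional: a rule (here a regex over the whole cell) decides membership."""
    compiled = re.compile(pattern)
    return lambda cell: compiled.fullmatch(cell) is not None


def distinct(column, index):
    """The cell at `index` must not be in the set of the other cells in the column."""
    others = set(column[:index] + column[index + 1:])
    return lambda cell: cell not in others
```

All three return a function from a string to a boolean, so they fit the same slot in a list of checks.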
tree and lookup are a bit different. lookup takes a pair of strings to a boolean. tree does validate a cell but also defines a structure that under can use.
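A sketch of how these might fit together, assuming the tree is stored as a child-to-parent mapping and that under counts the ancestor itself as a member. Both are assumptions, and the function names are illustrative rather than the valve.py implementation.

```python
def build_tree(rows):
    """Map each child to its parent from (child, parent) rows; a root has parent ""."""
    return {child: parent for child, parent in rows}


def under(tree, ancestor):
    """Predicate: the cell is `ancestor` or one of its descendants in `tree`."""
    def predicate(cell):
        seen = set()
        node = cell
        # Walk up the parent links, stopping at a root or on a repeated node.
        while node and node not in seen:
            if node == ancestor:
                return True
            seen.add(node)
            node = tree.get(node, "")
        return False
    return predicate


def lookup(pairs):
    """Predicate over a pair of strings: does (key, value) appear in the table?"""
    table = set(pairs)
    return lambda key, value: (key, value) in table
```

Note that lookup has a different shape from the other predicates because it needs two strings, which is the difference described above.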
add regular expression matches to the grammar /foo/ and interpret this as a match function
How is this different from defining a datatype with a regex pattern?
add regular expression substitutions to the grammar s/foo/bar/
generalize the datatype table to be reusable conditions in a hierarchy -- maybe rename to "condition" table
I think it would make more sense to rename field to condition because that's where you're defining the actual conditions for the data. I actually like the datatype table name.
generalize datatypes from a tree to a DAG by allowing multiple parents
If we enforce one word (below), we can allow comma-separated lists of parents
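A rough sketch of what comma-separated parents could look like, assuming the datatype table is reduced to a mapping from a datatype name to its raw parent field. The function names and example datatypes are hypothetical.

```python
def parse_parents(field):
    """Split a comma-separated parent field into a list of single-word names."""
    return [name.strip() for name in field.split(",") if name.strip()]


def ancestors(datatypes, name):
    """Collect all ancestors of `name`, raising ValueError if a cycle is found."""
    found = set()

    def visit(current, path):
        for parent in parse_parents(datatypes.get(current, "")):
            if parent in path:
                raise ValueError(f"cycle through {parent!r}")
            found.add(parent)
            visit(parent, path | {parent})

    visit(name, {name})
    return found


# Example: short_label has two parents, so the structure is a DAG, not a tree.
datatypes = {
    "text": "",
    "word": "text",
    "label": "text",
    "short_label": "word, label",
}
```

Here ancestors(datatypes, "short_label") collects word, label, and text. Splitting on commas is only safe because names are single words.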
maybe enforce that datatype names are single words
I think this would be great!
maybe rework split(pattern, count, expression, ...) as concat(slot, slot, slot), e.g. concat(cell.label, " & ", gates)
drop CURIE and replace with more general concat(prefix.prefix, ":", local_name)
I can see this - and I think it would be good to limit the number of functions more. I think in split, the count is a bit redundant anyway.
A slight change I want to propose: you listed a table-column pair for the prefix example, but I think the values should be expressions (datatypes & functions) or strings. So instead of concat(prefix.prefix, ...) it would be concat(in(prefix.prefix), ...)
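One way to picture concat with expression and string slots: string slots must appear literally, and every other slot is a predicate that has to accept its part of the cell. This is only a sketch under those assumptions; the prefixes set stands in for the prefix.prefix column, and the local_name rule (digits only) is made up.

```python
def concat(*slots):
    """Predicate for a cell built from slots: str slots are literals, callables are checks."""
    def match(cell, remaining):
        if not remaining:
            return cell == ""
        head, rest = remaining[0], remaining[1:]
        if isinstance(head, str):
            return cell.startswith(head) and match(cell[len(head):], rest)
        # Try every split point; the predicate validates the leading part.
        return any(
            head(cell[:i]) and match(cell[i:], rest)
            for i in range(len(cell) + 1)
        )
    return lambda cell: match(cell, slots)


prefixes = {"GO", "OBI"}  # stands in for the values of prefix.prefix


def in_prefix(s):
    return s in prefixes


def local_name(s):
    # Hypothetical local_name datatype: one or more digits.
    return s.isdigit()


curie = concat(in_prefix, ":", local_name)
```

With this shape, curie("GO:0001") passes and curie("XX:1") fails, so CURIE becomes one instance of a more general function.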
a tree with a split is not a tree, it's a directed acyclic graph -- I'd like to distinguish tree from dag (or maybe hierarchy)
Why don't we just call it hierarchy? I don't think we need two separate tree and dag functions - that might be confusing?
I'm worried that the current grammar has a lot of ambiguity: double quoted strings vs double quoted datatypes or column names or table names -- maybe this doesn't matter
I agree, but I haven't run into a problem with this yet... I guess if you name a datatype "a.b" it could be interpreted as a table-column pair, but if we restrict the datatypes to single words with no special characters (except maybe dash/underscore), that wouldn't be a problem. Also, if we do that, then we do not ever need to surround datatypes with double quotes. And we can specify that string values should always be quoted, even if they're one word.
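Under those rules, telling the three kinds of token apart becomes mechanical. A sketch, assuming names may contain letters, digits, dashes, and underscores; the classify function and its return shape are hypothetical.

```python
import re

# Assumption: a single-word name is letters, digits, dash, or underscore.
WORD = re.compile(r"[A-Za-z0-9_-]+")


def classify(token):
    """Classify one grammar token under the proposed rules (illustrative only)."""
    # Strings are always double quoted, even if they are one word.
    if len(token) >= 2 and token.startswith('"') and token.endswith('"'):
        return ("string", token[1:-1])
    # An unquoted dot can only separate a table from a column.
    if "." in token:
        table, _, column = token.partition(".")
        if WORD.fullmatch(table) and WORD.fullmatch(column):
            return ("table_column", table, column)
    # Anything else must be a single-word datatype name.
    if WORD.fullmatch(token):
        return ("datatype", token)
    raise ValueError(f"invalid token: {token!r}")
```

So "a.b" in quotes is always a string, a.b without quotes is always a table-column pair, and a bare word is always a datatype.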