ontodev/valve

WIP: General improvements

We have a good first implementation in ontodev/valve.py. Now that I've had a chance to use it, I'm considering some revisions and clarifications. As always, I want the user to be able to form a simple mental model of how VALVE works, so that it is easy to learn and use and avoids edge cases and surprises.

  • add regular expression matches to the grammar /foo/ and interpret this as a match function
  • add regular expression substitutions to the grammar s/foo/bar/
  • ideally enforce that these are implemented as PCREs
  • generalize the datatype table to be reusable conditions in a hierarchy -- maybe rename to "condition" table
  • generalize datatypes from a tree to a DAG by allowing multiple parents
  • maybe enforce that datatype names are single words
  • maybe rework split(pattern, count, expression, ...) as concat(slot, slot, slot), e.g. concat(cell.label, " & ", gates)
  • drop CURIE and replace with more general concat(prefix.prefix, ":", local_name)
  • a tree with a split is not a tree, it's a directed acyclic graph -- I'd like to distinguish tree from dag (or maybe hierarchy)
  • I'm worried that the current grammar has a lot of ambiguity: double quoted strings vs double quoted datatypes or column names or table names -- maybe this doesn't matter
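To make the first two bullets concrete, here is a rough sketch of how `/foo/` and `s/foo/bar/` could be turned into functions. The name `parse_regex_expression` is hypothetical and not part of valve.py. The sketch uses Python's built-in `re` module, which is not PCRE, and it assumes a match means "the pattern occurs somewhere in the cell"; both of those are assumptions, not decisions.

```python
import re

# A body is any run of characters that are not "/" or "\", or an escaped character.
_BODY = r"((?:[^/\\]|\\.)*)"
MATCH = re.compile("/" + _BODY + "/")
SUBSTITUTE = re.compile("s/" + _BODY + "/" + _BODY + "/")


def parse_regex_expression(text):
    """Return a function for a /pattern/ or s/pattern/replacement/ expression.

    Hypothetical helper: a match expression becomes a predicate (str -> bool),
    a substitution becomes a transformation (str -> str).
    """
    found = SUBSTITUTE.fullmatch(text)
    if found:
        pattern = re.compile(found.group(1))
        replacement = found.group(2)
        return lambda cell: pattern.sub(replacement, cell)
    found = MATCH.fullmatch(text)
    if found:
        pattern = re.compile(found.group(1))
        # Assumption: "match" means the pattern occurs anywhere in the cell.
        return lambda cell: pattern.search(cell) is not None
    raise ValueError(f"not a regex expression: {text!r}")
```

If we decide that a match has to cover the whole cell, `search` would become `fullmatch`.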

A condition defines a list of checks. Each check defines a predicate (a function that takes a string and returns a boolean), along with metadata about the check: name, parents, level, message, etc. For each cell, we go through the list of checks in order and ensure that the cell satisfies each predicate.
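A minimal sketch of that model, assuming a hypothetical `Check` record and `validate_cell` function (neither name comes from valve.py, and the message format is made up for illustration):

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Check:
    """One check: a predicate plus the metadata listed above (hypothetical shape)."""
    name: str
    predicate: Callable[[str], bool]
    level: str = "error"
    message: str = ""
    parents: List[str] = field(default_factory=list)


def validate_cell(value: str, checks: List[Check]) -> List[str]:
    """Run the checks in order and collect one message per failed predicate."""
    problems = []
    for check in checks:
        if not check.predicate(value):
            problems.append(f"{check.level}: {check.message or check.name}")
    return problems


checks = [
    Check("nonempty", lambda s: s != "", message="value must not be empty"),
    Check("trimmed", lambda s: s == s.strip(), message="no surrounding whitespace"),
]
```

With these two checks, `validate_cell("ok", checks)` returns an empty list, and `validate_cell(" x", checks)` reports only the whitespace problem.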

A predicate can also be thought of as a set of strings for which the predicate is true. A set of strings can be defined extensionally or intensionally. For an extensionally defined set we have a list of all the strings, so we just look up the string in the set -- this is how in and under work. For an intensionally defined set we have a rule for determining if the string is in the set -- this is how regex matches and list work. Even distinct can be thought of as: this cell is not in the set of other cells in this column.
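The two kinds of set can be sketched as two ways of building the same kind of predicate. The function names below are hypothetical, and the regex case again uses Python's `re` rather than PCRE, with a whole-cell match assumed:

```python
import re
from typing import Callable, Iterable

Predicate = Callable[[str], bool]


def extensional(members: Iterable[str]) -> Predicate:
    """Set given as an explicit list: membership is a lookup (like in / under)."""
    allowed = frozenset(members)
    return lambda cell: cell in allowed


def intensional(pattern: str) -> Predicate:
    """Set given by a rule: membership is computed (like a regex match)."""
    compiled = re.compile(pattern)
    return lambda cell: compiled.fullmatch(cell) is not None


def distinct(other_cells: Iterable[str]) -> Predicate:
    """The cell must NOT be in the set of other cells in the column."""
    seen = frozenset(other_cells)
    return lambda cell: cell not in seen
```

All three return a plain string-to-boolean function, so they can sit in the same list of checks.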

tree and lookup are a bit different. lookup maps a pair of strings to a boolean. tree validates a cell, but it also defines a structure that under can use.
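A sketch of how these two differ from the single-string predicates, using hypothetical helpers. It assumes that `under` treats a node as being under itself, which is a guess rather than documented behaviour:

```python
from typing import Callable, Dict, Iterable, Tuple


def lookup(pairs: Iterable[Tuple[str, str]]) -> Callable[[str, str], bool]:
    """A predicate over a PAIR of strings: is (key, value) a known pair?"""
    table = set(pairs)
    return lambda key, value: (key, value) in table


def build_tree(child_parent_pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """The structure a tree defines: each child maps to its single parent."""
    return {child: parent for child, parent in child_parent_pairs}


def under(tree: Dict[str, str], ancestor: str) -> Callable[[str], bool]:
    """Predicate: is the cell equal to `ancestor` or below it in the tree?"""
    def predicate(cell: str) -> bool:
        node, visited = cell, set()
        # Walk upward; `visited` guards against accidental cycles in the data.
        while node is not None and node not in visited:
            if node == ancestor:
                return True
            visited.add(node)
            node = tree.get(node)
        return False
    return predicate


tree = build_tree([("dog", "mammal"), ("mammal", "animal"), ("oak", "plant")])
```

Here `under(tree, "animal")("dog")` is true and `under(tree, "animal")("oak")` is false.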

add regular expression matches to the grammar /foo/ and interpret this as a match function

How is this different from defining a datatype with a regex pattern?

add regular expression substitutions to the grammar s/foo/bar/

ontodev/valve.py#28

generalize the datatype table to be reusable conditions in a hierarchy -- maybe rename to "condition" table

I think it would make more sense to rename field to condition because that's where you're defining the actual conditions for the data. I actually like the datatype table name.

generalize datatypes from a tree to a DAG by allowing multiple parents

If we enforce one word (below), we can allow comma-separated lists of parents.
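Roughly what I have in mind, as a sketch: the parent cell is split on commas, and ancestors are collected by walking every parent. The function names and the example datatype names are made up for illustration.

```python
from typing import Dict, List, Set


def parse_parents(cell: str) -> List[str]:
    """Split a comma-separated parent cell; safe once names are single words."""
    return [name.strip() for name in cell.split(",") if name.strip()]


def ancestors(name: str, parents_of: Dict[str, List[str]]) -> Set[str]:
    """Collect every ancestor of a datatype in a DAG with multiple parents."""
    found: Set[str] = set()
    stack = list(parents_of.get(name, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(parents_of.get(parent, []))
    return found


# Hypothetical datatype rows: name -> raw parent cell.
rows = {"curie": "word, trimmed", "word": "text", "trimmed": "text", "text": ""}
parents_of = {name: parse_parents(cell) for name, cell in rows.items()}
```

For these rows, `ancestors("curie", parents_of)` is the set containing `word`, `trimmed`, and `text`; `text` is reached through both parents but only counted once.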

maybe enforce that datatype names are single words

I think this would be great!

maybe rework split(pattern, count, expression, ...) as concat(slot, slot, slot), e.g. concat(cell.label, " & ", gates)
drop CURIE and replace with more general concat(prefix.prefix, ":", local_name)

I can see this, and I think it would be good to limit the number of functions further. In split, I think the count is a bit redundant anyway.

A slight change I want to propose: you listed a table-column pair for the prefix example, but I think the values should be expressions (datatypes & functions) or strings. So instead of concat(prefix.prefix, ...) it would be concat(in(prefix.prefix), ...)
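To illustrate, here is a sketch of a `concat` where every slot is either a literal string or a predicate (standing in for an expression such as `in(prefix.prefix)`). This is one possible reading, not the valve.py implementation, and the backtracking strategy and the example predicates are assumptions:

```python
from typing import Callable, Tuple, Union

Slot = Union[str, Callable[[str], bool]]


def concat(*slots: Slot) -> Callable[[str], bool]:
    """Predicate: can the cell be cut into pieces that satisfy the slots in order?"""
    def matches(value: str, remaining: Tuple[Slot, ...]) -> bool:
        if not remaining:
            return value == ""
        head, rest = remaining[0], remaining[1:]
        if isinstance(head, str):
            # Literal slot: the cell must continue with exactly this text.
            return value.startswith(head) and matches(value[len(head):], rest)
        # Expression slot: try every possible cut point (simple backtracking).
        return any(
            head(value[:i]) and matches(value[i:], rest)
            for i in range(len(value) + 1)
        )
    return lambda cell: matches(cell, slots)


# Stand-ins for in(prefix.prefix) and a local-name datatype (both hypothetical).
in_prefix = lambda s: s in {"GO", "CL"}
local_name = lambda s: s.isdigit()
curie = concat(in_prefix, ":", local_name)
```

With this, `curie("GO:0001")` passes while `curie("XX:1")` and `curie("GO:")` fail, which covers what CURIE does today without a dedicated function.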

a tree with a split is not a tree, it's a directed acyclic graph -- I'd like to distinguish tree from dag (or maybe hierarchy)

Why don't we just call it hierarchy? I don't think we need two separate tree and dag functions - that might be confusing?

I'm worried that the current grammar has a lot of ambiguity: double quoted strings vs double quoted datatypes or column names or table names -- maybe this doesn't matter

I agree, but I haven't run into a problem with this yet... I guess if you name a datatype "a.b" it could be interpreted as a table-column pair, but if we restrict the datatypes to single words with no special characters (except maybe dash/underscore), that wouldn't be a problem. Also, if we do that, then we do not ever need to surround datatypes with double quotes. And we can specify that string values should always be quoted, even if they're one word.
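A small sketch of how unambiguous the tokens become under those rules. The character set for names (letters, digits, dash, underscore) and the `classify` function are assumptions for illustration only:

```python
import re

WORD = r"[A-Za-z0-9_-]+"
STRING = re.compile(r'"(?:[^"\\]|\\.)*"')       # always quoted
TABLE_COLUMN = re.compile(WORD + r"\." + WORD)  # table.column
DATATYPE = re.compile(WORD)                     # single bare word


def classify(token: str) -> str:
    """Decide what a token is from its shape alone."""
    if STRING.fullmatch(token):
        return "string"
    if TABLE_COLUMN.fullmatch(token):
        return "table-column"
    if DATATYPE.fullmatch(token):
        return "datatype"
    raise ValueError(f"unrecognized token: {token!r}")
```

Under these rules `"a.b"` in double quotes is always a string, `prefix.prefix` is always a table-column pair, and `label` is always a datatype.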