Scope limitation: increase schematron efficiency, reduce SVRL noise, integrate better, enhance modeling of phase to cross-cut by region and role

Question

Opened this issue 2 years ago · 0 comments

Problem

Using Schematron to make limited assertions on large documents can involve unnecessary traversal of the document.
For example, asserting that the top-level element is of the correct kind may involve iterating over the whole document.

The current workaround in this case tends to be to use termination, which is a bit brutish.

Furthermore, SVRL reports can contain excessive and otiose reports that can swamp the user, and be regarded as noise and a source of inefficiency.

Furthermore, the only cross-cutting mechanism is the phase, which cross-cuts by pattern. There is no way to cross-cut by document region or by assertion role (e.g. severity.)

Outcome Scenarios

We have a document with a zillion elements. We have a large and complex Schematron schema that takes a lot of processing, and we have efficiency, latency, congestion and timeout constraints. We really want to just check the metadata elements, but we cannot use the phase mechanism for it for bureaucratic reasons. The ideal solution would be to only validate the metadata and not traverse the rest of the document looking for contexts.

We have the same scenario. But we are only interested in testing assertions about severe errors in the metadata. The ideal solution would let us bypass testing any assertion that has no @role="ERROR".

We have the same scenario. But we want to reduce the noise in the SVRL presented to humans. So we want to prioritize testing so that we reject documents with serious errors first, and only test other assertions after testing those, and only if there are no severe errors. The ideal solution would allow some kind of re-arrangement of validation into two passes: one which tests the assertions with @role="ERROR" and another which tests the other assertions.

Proposed Solution

Schema-level scoping

Add an attribute sch:schema/@scope which limits the nodes that the Schematron schema needs to look at in the main document, apart from the document node.

It is intended as a practical parameter, constraining the area of interest in a document, not as a modeling feature. In other words, even if some node gets excluded from validation, there is no implication that the schema's rules do not also apply to that node: merely that we are not interested in looking or knowing at the moment.
It is like a parameter, in that it can be overridden on invocation, if the implementation supports it.

The value is made with the following syntax:
priority? ( "from" | "to" | "only" ) ws+ role-clause? location
where ws is whitespace, priority and role-clause are optional (both are described below), and location is an absolute XPath pattern in the query language binding (QLB).

For example:

<sch:schema ... scope="to /*/*/*" ..>

will limit validation to only the document node and the first three levels of nodes. E.g. /law/part/section/clause will not be validated.

"from " uses an initial absolute XPath, which is where validation starts from.
-- e.g. from /book/appendix means do not validate /book, nor /book/node()[not(self::appendix)], nor /book/node()[not(self::appendix)]//node()
"to " uses an initial absolute XPath, which is used to select the nodes that will be validated.
-- e.g. to /book/appendix means do validate / and /book and /book/node()[not(self::appendix)] and /book/node()[not(self::appendix)]//node()
"only " supplies an XPath, and only elements that match it are validated.
-- e.g. only /criminal/metadata/* means validate the document root and the children of metadata.
-- e.g. only //html:* means validate only elements in the HTML namespace.
In all cases, the document node is validated.
The default is "from /", meaning all nodes in the document including the document node.
The scope does not apply to sch:pattern[@document]. No provision is made for dynamic override of them.
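The three keywords can be compared on a toy document with a rough Python sketch. It is not an implementation proposal: the helper names (step_match, in_scope_paths) are hypothetical, the location is restricted to simple /name/* steps rather than full XPath patterns, and "to" is read as "everything except the descendants of matching nodes", which is only one way of reconciling the two "to" examples above. The document node itself is always validated and is not listed.

```python
import xml.etree.ElementTree as ET


def step_match(path, pattern):
    """True if a list of element names matches a pattern of names or '*'."""
    return len(path) == len(pattern) and all(
        p == "*" or p == name for name, p in zip(path, pattern)
    )


def in_scope_paths(xml_text, axis, location):
    """List the element paths that would be validated, in document order."""
    pattern = [step for step in location.split("/") if step]
    selected = []

    def walk(elem, ancestors, under_match):
        path = ancestors + [elem.tag]
        matched = step_match(path, pattern)
        if axis == "from":
            keep = matched or under_match   # matching nodes and their subtrees
        elif axis == "to":
            keep = not under_match          # stop below the matching nodes (assumed reading)
        elif axis == "only":
            keep = matched                  # just the matching nodes
        else:
            raise ValueError(f"unknown scope keyword: {axis!r}")
        if keep:
            selected.append("/" + "/".join(path))
        for child in elem:
            walk(child, path, under_match or matched)

    # A location of "/" has no steps: the document node itself is the match,
    # so every element counts as being under a match.
    walk(ET.fromstring(xml_text), [], not pattern)
    return selected


doc = "<book><chapter><para/></chapter><appendix><para/></appendix></book>"
print(in_scope_paths(doc, "from", "/book/appendix"))
print(in_scope_paths(doc, "to", "/book/appendix"))
print(in_scope_paths(doc, "only", "/book/*"))
```

Under these assumptions "from /" selects every element, which agrees with the stated default.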

Prioritize

The priority is a hint to perform the in-scope validations before other validations, not instead of them. It uses the optional keyword "prioritize". So prioritize from is a hint to validate the nodes at and under some path first; prioritize to is a hint to validate the nodes from the top down to the path first (i.e. more like a top-down breadth-first traversal); and prioritize only means to validate (the document node and) the nodes specified by that XPath first, then the others.

Where and Until

The role-clause is ("until" | "where") ws+ ("role" | "flag") "=" "{" ws* role (ws+ role)* ws* "}" and is a hinting mechanism that ties into the @role and @flag attributes. (Below, when @role is used, @flag is implied as well.)

"where" is a hint to only test assertions where there is an in-scope @role attribute with a matching token. "In-scope" means @role on the sch:assert, linked-to sch:diagnostic, linked-to sch:property, parent sch:rule, parent sch:pattern (and it should include the current sch:phase and sch:active too).

For example

<sch:schema ... scope="from when role={fatal error}  /regulations/regulation[@jurisdiction='AU']" ...>

means that at this time we are only interested in severe errors in Australian regulations. So we only look at and under Australian regulations and, as a hint, the implementation needn't test assertions that have no in-scope role of fatal or error.

"Until" says that once an assertion has failed (or a report has succeeded) which has an in-scope @role attribute with a token that matches a token on the list, the implementation can opt out of processing more. This needs to be implementation-dependent, so as not to create a burden for implementers. But the action could be that, if an assertion with the role fails,

we don't test any more assertions on that node,
or we don't traverse to its children,
or we don't test any more of that pattern,
or, we just terminate.

For example

<sch:schema ... scope="from until role={fatal error}  /regulations/regulation[@jurisdiction='AU']" ...>

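says to validate the document node plus Australian regulations (the element and its descendants) but provides the hint that as soon as an assertion with an in-scope role of fatal or error fails, the implementation can opt out of further processing.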

So the point of "when" and "until" is to reduce noise in the SVRL, and to do so in a way that (kind of) enhances the @role markup.
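To make the value syntax concrete, here is a minimal Python sketch of a parser for it. Everything named here (SCOPE_RE, Scope, parse_scope) is hypothetical. It also makes assumptions that go beyond the text: it accepts both "when" and "where" because both spellings appear in this issue, it allows optional whitespace around "=", and it falls back to "/" when no location is given.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical pattern for an @scope value (see the assumptions listed above).
SCOPE_RE = re.compile(
    r"""^\s*
    (?:(?P<priority>prioritize)\s+)?
    (?P<axis>from|to|only)
    (?:\s+(?P<hint>until|when|where)\s+(?P<attr>role|flag)\s*=\s*\{(?P<tokens>[^}]*)\})?
    (?:\s+(?P<location>/.*?))?
    \s*$""",
    re.VERBOSE,
)


@dataclass(frozen=True)
class Scope:
    prioritize: bool
    axis: str                # "from", "to" or "only"
    hint: Optional[str]      # "until", "when"/"where", or None
    attr: Optional[str]      # "role", "flag", or None
    tokens: Tuple[str, ...]  # role or flag tokens, possibly empty
    location: str            # absolute XPath pattern, kept as an opaque string


def parse_scope(value: str) -> Scope:
    """Split a scope value into its parts; the XPath itself is not parsed."""
    m = SCOPE_RE.match(value)
    if m is None:
        raise ValueError(f"not a valid scope value: {value!r}")
    return Scope(
        prioritize=m.group("priority") is not None,
        axis=m.group("axis"),
        hint=m.group("hint"),
        attr=m.group("attr"),
        tokens=tuple((m.group("tokens") or "").split()),
        location=m.group("location") or "/",  # assumed default
    )


print(parse_scope("from until role={fatal error}  /regulations/regulation[@jurisdiction='AU']"))
```

Keeping the location as an opaque string means the sketch needs no XPath parser: the string can be handed to whatever query language binding the implementation already uses.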

Phase-level scoping

Also, add the same attribute to sch:phase and sch:active. It limits the scope of the patterns in the phase, in addition to any scope specified on the schema. If the pattern activated uses @document, this scoping applies to that pattern. It allows phases to "cross-cut" based on region and role.

Pattern-level scoping

Also, add the same attribute to sch:pattern. It limits the scope of contexts in the pattern, in addition to any sch:schema/@scope or sch:phase/@scope or sch:active/@scope. If sch:pattern/@document is specified, it limits the scope of the pattern in that document.

Alternatives Considered

I developed experimental parsers (using PEG and REx) for parsing XPaths, to allow an implementation to determine what kind of nodes it needed to look at, and potentially to know whether there were other limits that could be known from static analysis. It is possible, but a lot of code.

A schema implementation could certainly provide this as a parameter when running a schema validation.

Other Benefits

Furthermore, allowing it to be stated in the sch:schema element makes things more explicit and easier to implement. Furthermore, it would be a useful general feature for users to be able to select the scope of elements.

Furthermore, it could provide a way to enhance phases.

Schematron engines that provide this would be better targeted for integration into IDEs: the IDE could limit interactive validation to the current node by passing the relevant "only" XPath, for example.

Implementation Considerations

The "from " case is trivially implemented. E.g. in the skeleton code, for each mode, it would involve first validating the document node only, then finding all the nodes that match the scope, then priming the validation with those.

The "to " case can be implemented, e.g. in the skeleton implementation, by creating a variable with all the nodes that match the scope XPath, then, for each mode, first validating the document node only, then validating the nodes that match the scope.

We are not particularly concerned about efficiency in cases like only //html:* because the aim is to provide efficiency in cases where we don't want to process the entire document and there is an obvious fast way to get to the information needed.

The when role = { x y z } can be ignored if the implementer desires. It is a hint. It could be faked up by post-processing the generated SVRL to remove all failed-asserts etc whose in-scope roles do not match: this would reduce noise if not efficiency.
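The post-processing fallback could look roughly like the following Python sketch. The function name filter_svrl_by_role is hypothetical, and the sketch simplifies "in-scope roles" to the @role found on the svrl:failed-assert or svrl:successful-report element itself; roles inherited from the rule, pattern or phase are not considered.

```python
import xml.etree.ElementTree as ET

SVRL_NS = "http://purl.oclc.org/dsdl/svrl"
RESULT_TAGS = {
    f"{{{SVRL_NS}}}failed-assert",
    f"{{{SVRL_NS}}}successful-report",
}


def filter_svrl_by_role(svrl_text, wanted_roles):
    """Drop result elements whose own @role has no token in wanted_roles."""
    root = ET.fromstring(svrl_text)
    wanted = set(wanted_roles)
    for child in list(root):  # copy the child list, since we remove while looping
        if child.tag in RESULT_TAGS:
            roles = set(child.get("role", "").split())
            if not (roles & wanted):
                root.remove(child)
    return root


svrl = """<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
  <svrl:fired-rule context="regulation"/>
  <svrl:failed-assert test="@id" location="/r[1]" role="error"/>
  <svrl:failed-assert test="title" location="/r[1]" role="info"/>
</svrl:schematron-output>"""
filtered = filter_svrl_by_role(svrl, ["fatal", "error"])
print([c.get("test") for c in filtered if c.tag in RESULT_TAGS])
```

This reduces noise in the report but saves no validation time, since every assertion has already been tested.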

The until role = { x y z } can be ignored if the implementer desires. It is a hint. It could be faked up by post-processing the SVRL to remove all following elements after the first successful-report or failed-assert with a matching token in its @role. The best would be to terminate gracefully without testing further or looking at more contexts.
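A rough Python sketch of that fallback follows. The function name truncate_svrl_at_role is hypothetical, and only the @role on the result element itself is inspected, which is a simplification of the in-scope role idea.

```python
import xml.etree.ElementTree as ET

SVRL_NS = "http://purl.oclc.org/dsdl/svrl"
RESULT_TAGS = {
    f"{{{SVRL_NS}}}failed-assert",
    f"{{{SVRL_NS}}}successful-report",
}


def truncate_svrl_at_role(svrl_text, stop_roles):
    """Keep the report up to and including the first result with a stop role."""
    root = ET.fromstring(svrl_text)
    stop = set(stop_roles)
    for index, child in enumerate(root):
        if child.tag in RESULT_TAGS and stop & set(child.get("role", "").split()):
            del root[index + 1:]  # drop every following sibling
            break
    return root


svrl = """<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
  <svrl:fired-rule context="regulation"/>
  <svrl:failed-assert test="@id" location="/r[1]" role="fatal"/>
  <svrl:fired-rule context="clause"/>
  <svrl:failed-assert test="title" location="/r[1]/c[1]" role="info"/>
</svrl:schematron-output>"""
print(len(truncate_svrl_at_role(svrl, ["fatal", "error"])))
```

If no result carries one of the stop roles, the report is returned unchanged.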

An implementation may decide how much overriding of scope to support. For example, the implementer may decide that the command-line/invocation parameter only supports "only" with no role testing, and only when the sch:schema/@scope is the default or uses "only" (i.e. it is just a matter of swapping the XPath string, not generating different code). Or the implementation may decide that it only supports certain overrides of sch:schema/@scope as a compile-time option, not a run-time option.

In other words, an implementation

must parse all sch:*/@scope and implement from, to and only, as language features
is free to support the role hint as much as is convenient and useful
is free to implement overriding sch:schema/@scope at run-time or compile-time as much as is convenient
is free to implement prioritize or not

prioritize would be handled by two passes. This is not inefficient, as the desired outcome is to get show-stopper assertions tested ahead of other assertions. There could be some extra inefficiency if pattern and rule variables need to be re-calculated in both passes. (However, this can be coded around.)
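The two-pass idea can be illustrated with a small Python sketch. It is deliberately abstract: Assertion and validate_two_pass are hypothetical names, assertions are modeled as plain callables rather than XPath tests, and skipping the second pass when the first finds failures is taken from the third outcome scenario rather than being required by the hint.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Assertion:
    ident: str
    role: str
    test: Callable[[dict], bool]  # returns True when the assertion holds


def validate_two_pass(node: dict, assertions: List[Assertion],
                      priority_roles: Set[str]) -> List[str]:
    """Return ids of failed assertions, testing the priority roles first."""
    first = [a for a in assertions if a.role in priority_roles]
    rest = [a for a in assertions if a.role not in priority_roles]
    failures = [a.ident for a in first if not a.test(node)]
    if failures:
        return failures  # show-stoppers found: the second pass is skipped
    return [a.ident for a in rest if not a.test(node)]


checks = [
    Assertion("has-title", "warning", lambda n: bool(n["title"])),
    Assertion("has-id", "error", lambda n: n["id"] is not None),
]
print(validate_two_pass({"id": None, "title": ""}, checks, {"error"}))
print(validate_two_pass({"id": 7, "title": ""}, checks, {"error"}))
```

In the first call only the show-stopper is reported; in the second, the document passes the first pass and the lower-priority failure surfaces.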

Note that if an sch:rule has a role attribute that does not match, or it contains no asserts or reports (or their diagnostics or properties) that match, it does not mean that the rule context is not applicable: the pattern does not change depending on the scope attributes; all that happens is that some nodes will not be tested to see if they match any context in a pattern, and some assertions may not be tested. This is a matter of what is interesting to the invoker, not a matter of modeling. The scope is not a way to switch rules on or off.

There is an exception to this: consecutive rules at the end of a pattern that have an @role that does not have a matching token, or which have no assertions with matching roles, have no effect, therefore can be switched off. E.g.

<sch:schema ... scope="from when role={PET}">
...
<sch:pattern>
   <sch:rule context="dog" role="PET" id="r1">
   ...
   </sch:rule>
   <sch:rule context="lion" role="WILD" id="r2">
      <sch:assert test="@exit='go'">Linus and his friends must go</sch:assert>
   </sch:rule>
   <sch:rule context="*" id="r3">
      <sch:report test="true()">Unknown animals are not regarded as wild or pets</sch:report>
   </sch:rule>
</sch:pattern>

In this case, rule r3 can be switched off as it has no children with a role of PET. And the previous one, r2, which is now the last, can also be switched off, as its role is not PET and it has no reachable children with a role of PET.
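The trailing-rule exception can be sketched in Python as follows. The data layout is hypothetical: each rule is a dict with an id, an optional role, and a child_roles list holding the roles of its asserts and reports. Only rules at the end of the pattern are dropped, because an earlier non-matching rule still has to claim its context nodes so that later rules do not fire on them.

```python
def active_rules(rules, wanted_roles, inherited_roles=()):
    """Return ids of the rules that must stay active, in pattern order."""
    wanted = set(wanted_roles)

    def has_match(rule):
        # Roles in scope for this rule: inherited ones, its own, its children's.
        roles = set(inherited_roles) | set(rule.get("role", "").split())
        for child_role in rule.get("child_roles", []):
            roles |= set(child_role.split())
        return bool(roles & wanted)

    kept = list(rules)
    while kept and not has_match(kept[-1]):
        kept.pop()  # a trailing rule with no matching role has no effect
    return [rule["id"] for rule in kept]


pattern = [
    {"id": "r1", "role": "PET"},
    {"id": "r2", "role": "WILD", "child_roles": [""]},
    {"id": "r3", "child_roles": [""]},
]
print(active_rules(pattern, {"PET"}))
```

If r2 came before r1 in the pattern, it would have to stay active even though its role does not match, because it shadows the "lion" context for the rules after it.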