pkiraly/metadata-qa-api

Measuring syntactical correctness of values

Opened this issue · 12 comments

Syntactical correctness of values could be measured by matching them to a REGEX that is defined in the schema. Would this be a useful feature for the framework? I can have a try implementing it and issue a PR.

A question related to that: how do other measures deal with missing values? They should not be counted when measuring incorrect values.

Yes, it would be helpful. I already have a feature called problem catalogue (de.gwdg.metadataqa.api.problemcatalog.ProblemCatalog). It is not perfect now, but here is the idea of it:

It can register individual classes, each representing a problem. These classes implement an "Observer" interface (de.gwdg.metadataqa.api.interfaces.Observer), which has two methods:

  • void update(PathCache cache, FieldCounter<Double> results). The cache represents the input, the results represent the output object. This method is actually the measurement, so the observers should measure something here.
  • String getHeader() returns a label explaining the measurement which will be used as a column header in the CSV file.

This is an implementation of the Observer pattern (https://en.wikipedia.org/wiki/Observer_pattern), that's why the important method is called "update" and not something else (such as "measure").
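To make the two methods concrete, here is a minimal, self-contained sketch of this interface. PathCache and FieldCounter are reduced to hypothetical stand-ins for the real de.gwdg.metadataqa.api types, and LengthObserver is an invented example measurement, not part of the library:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// stand-in for the real PathCache: gives access to a field's value by path
interface PathCache {
  String get(String path);
}

// stand-in for the real FieldCounter: an ordered result collector
class FieldCounter<T> {
  private final Map<String, T> map = new LinkedHashMap<>();
  public void put(String key, T value) { map.put(key, value); }
  public T get(String key) { return map.get(key); }
}

interface Observer {
  // the actual measurement: read from the cache, write into the results
  void update(PathCache cache, FieldCounter<Double> results);
  // label explaining the measurement, used as a CSV column header
  String getHeader();
}

// hypothetical example observer: records the length of a field's value
class LengthObserver implements Observer {
  private final String path;
  private final String header;
  LengthObserver(String path, String header) {
    this.path = path;
    this.header = header;
  }
  public void update(PathCache cache, FieldCounter<Double> results) {
    String value = cache.get(path);
    results.put(header, value == null ? 0.0 : (double) value.length());
  }
  public String getHeader() { return header; }
}
```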

So my suggestion for the implementation: there could be a RegexChecker object implementing this Observer interface. Minimally it should be created with a field and a regex string, e.g.

// it might be Field or JsonPath, I have to investigate it a bit later to decide which one
private Field field;
private Pattern pattern; 
private String header;
public RegexChecker(Field field, String pattern, String header) {
  this.field = field;
  this.pattern = Pattern.compile(pattern);
  this.header = header;
}

public void update(PathCache cache, FieldCounter<Double> results) {
  // run regex and change results accordingly 
}

public String getHeader() {
  return header;
}
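A simplified, self-contained variant of this checker could show the intended matching logic. This sketch takes the field's values directly as a list instead of a PathCache, and the 1.0/0.0 output convention is an assumption here:

```java
import java.util.List;
import java.util.regex.Pattern;

// hypothetical simplified checker: 1.0 if every value matches, 0.0 otherwise
class RegexChecker {
  private final Pattern pattern;
  private final String header;

  RegexChecker(String regex, String header) {
    this.pattern = Pattern.compile(regex);
    this.header = header;
  }

  double check(List<String> values) {
    for (String value : values)
      if (!pattern.matcher(value).matches())
        return 0.0;   // one non-matching instance fails the whole field
    return 1.0;
  }

  String getHeader() { return header; }
}
```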

What do you think?

For a larger composition: we can think about how to implement the SHACL dictionary. Regex would be one step towards that.

Nice, so this pattern can be used to implement new measures? I will have a try at this.
With respect to SHACL: don't you need RDF data support first?

Not yet. It is just a suggestion for the implementation; I wrote this code as a comment because you said you would like to work on it. I can implement it later.

SHACL: I am talking about the dictionary, the "Core Constraint Components" part of SHACL (e.g. maxCount, datatype, pattern; see more at https://www.w3.org/TR/shacl/#core-components), which I think is general to any format (a further selection might be needed to filter out those which are closely bound to RDF). I will tell you more later (I am in a hurry now, but I wanted to make this point clear).

I have implemented this idea, but I had to change the Schema interface, so I started working on 0.7-SNAPSHOT version.

From the user's perspective:

  1. I added a property called rules, which will have several subproperties; right now the only subproperty is pattern. It accepts a regex pattern:
  - name: url
    categories: [MANDATORY]
    extractable: true
    rules:
      pattern: ^https?://.*$
  2. If you want to use it, you have to call CalculatorFacade's .enableRuleCatalogMeasurement() method, e.g.
CalculatorFacade facade = new CalculatorFacade()
  .setSchema(schema)
  .setCsvReader(
    new CsvReader()
      .setHeader(((CsvAwareSchema) schema).getHeader()))
  .enableCompletenessMeasurement()
  .enableFieldCardinalityMeasurement()
  .enableRuleCatalogMeasurement();

It will add to the header a pattern:<label> part, where the label is the JsonBranch's label property.

To the result of the individual record calculation it will return 1 if all instances of the data element match, 0 otherwise.

From developers' perspective:

Since the Observer interface I talked about earlier and its environment were quite specific to Europeana, I copied the main structure into a new package. Here the central interface is called RuleChecker. It has the same methods (update and getHeader), but it collects Boolean values instead of Double. Its only implementation so far is PatternChecker, but I expect that all relevant SHACL core constraints will have their implementations here (and they will also be available as configuration properties). The class which governs the calls of the individual RuleCheckers is RuleCatalog, an implementation of the Calculator interface, so it can interact with the CalculatorFacade if its enableRuleCatalogMeasurement() has been called.
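The relationship between RuleCatalog and the RuleCheckers it governs could be sketched like this. It is a simplified, hypothetical stand-alone version: the real classes work with a PathCache and a FieldCounter, while here a plain Map stands in for both:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// stand-in interface: same update/getHeader shape, collecting Boolean values
interface RuleChecker {
  void update(Map<String, List<String>> record, Map<String, Boolean> results);
  String getHeader();
}

// governs the call of the individual RuleCheckers, once per record
class RuleCatalog {
  private final List<RuleChecker> checkers = new ArrayList<>();

  void addChecker(RuleChecker checker) { checkers.add(checker); }

  Map<String, Boolean> measure(Map<String, List<String>> record) {
    Map<String, Boolean> results = new LinkedHashMap<>();
    for (RuleChecker checker : checkers)
      checker.update(record, results);
    return results;
  }
}
```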

So the workflow:

  1. CalculatorFacade.enableRuleCatalogMeasurement()
  2. for each record CalculatorFacade -> RuleCatalog
  3. RuleCatalog -> list of RuleCheckers

The RuleCatalog reads information from the Schema's new getRuleCheckers() method, which returns a List<RuleChecker>.

As it is available in 0.7-SNAPSHOT, the client library should update the reference in pom.xml.

@mielvds I've created a distinct issue for the SHACL vocabulary implementation, see #53.

Very nice!

To the result of the individual record calculation it will return 1 if all instances of the data element match, 0 otherwise.

What do you mean exactly? Shouldn't the outcome be a range between 0 and 1, equivalent to completeness? So if my record has 4 columns and only 2 validate, pattern is 0.5?

Also, see my earlier question about missing values... we should probably normalize these scores somehow? They can only be interpreted correctly and compared if you also know the completeness of the record.

Each individual Rule has an output of 1 (fits) or 0 (doesn't fit).

An example (only pattern is implemented yet; the rest are coming):

- name: url
  rules:
    pattern: ^https?://.*$
- name: count
  rules:
    datatype: integer
- name: tags
  rules:
    pattern: ^\W+$
    minCount: 1
    maxCount: 3

These fields define altogether 5 rules, so you will get 5 outputs. We could add a 6th output which summarizes the values, so if all fit, it is 1; if only one fits, it is 0.2, etc.
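The proposed summary output is just the fraction of rules that passed. A minimal sketch (RuleScore is a hypothetical name; each rule output is 0 or 1 as described above):

```java
import java.util.List;

// hypothetical helper: summarizes individual 0/1 rule outputs into one score
class RuleScore {
  static double summary(List<Integer> ruleOutputs) {
    if (ruleOutputs.isEmpty())
      return 0.0;
    double passed = 0;
    for (int output : ruleOutputs)
      passed += output;               // each output is 0 (fails) or 1 (fits)
    return passed / ruleOutputs.size();
  }
}
```

With the 5 rules from the example above, all passing gives 5/5 = 1.0 and a single pass gives 1/5 = 0.2.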

These scores will be calculated only if the field is available and not empty.

Here is an extract from the pattern implementation:

for (XmlFieldInstance instance : (List<XmlFieldInstance>) cache.get(field.getJsonPath())) {
  if (instance.hasValue() && !pattern.matcher(instance.getValue()).matches()) {
    allPassed = false;
    break;
  }
}

Thinking about it a bit, I think this should return a special value representing NA (no data, not available etc.). So if data is not available it should return NA (unless it is a mandatory field, because in that case it should be 0).

I think either way it could return NA, although I would rather recommend a separate counter for missing values.

The RuleCatalog returns a RuleCheckingOutput value, which is either NA or 0 or 1. I added CSV and JSON meemoo schemas to your repository (https://github.com/viaacode/metadata-quality-assessment/tree/master/src/test/resources/schema) and also a test method (https://github.com/viaacode/metadata-quality-assessment/blob/master/src/test/java/be/meemoo/CalculatorTest.java#L124). Please check them.
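A self-contained sketch of such a three-valued outcome might look like the following. The constant names besides NA, and the PatternRule helper, are hypothetical, not the library's actual API:

```java
import java.util.List;
import java.util.regex.Pattern;

// hypothetical three-valued outcome: NA when the field is missing or empty
enum RuleCheckingOutput { NA, FAILED, PASSED }

class PatternRule {
  static RuleCheckingOutput check(List<String> values, Pattern pattern) {
    if (values == null || values.isEmpty())
      return RuleCheckingOutput.NA;     // no data: not counted as a failure
    for (String value : values)
      if (!pattern.matcher(value).matches())
        return RuleCheckingOutput.FAILED;
    return RuleCheckingOutput.PASSED;
  }
}
```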

Looking good!