jpmml/jpmml-evaluator

JPMML is enforcing the definition of target fields while the spec says it is optional

karllessard opened this issue · 6 comments

In JPMML, some logic enforces that target mining fields should have been defined in the data dictionary, like at this location.

However, the latest PMML spec (4.4) indicates that the definition of target fields is optional:

The definition of target fields in the MiningSchema is not required and, in most cases, it does not have an impact on the scoring results.

That ends up having JPMML failing to load models that are, according to the spec, valid, and therefore defining target fields becomes mandatory for evaluation in Java.

Is this the desired behaviour? I would expect the JPMML implementation to follow more closely the spec as it is "de facto the reference implementation of the PMML specification", as its README file says.

In PMML there are two kinds of target fields:

  1. Public/Externally oriented. Think "the final aggregate prediction" made by a random forest (RF) model.
  2. Private/Internally oriented. Think "member predictions" made by member decision tree (DT) models of a random forest model.

A target field name can be omitted if there's logically only one target field in scope. For example, while evaluating the member DTs of an RF, each DT forms a separate scope, and since DT models are typically single-target models, there is no need to invent a unique target field name for each DT.

In the top-level RF model scope there's also only one target field in scope, so technically it would be possible to avoid giving it a name. But stylistically, it should be named.

However, when there are multiple target fields in scope, then they must be named, in order to make target field references non-ambiguous - this is probably what the PMML spec intends to say.

.. like at this location.

This piece of code deals with the top-level model.

In my opinion, it's a good style to give every top-level model field a name. Different PMML engines may treat "anonymous" fields differently, which would be confusing. For example, in JPMML-Evaluator result maps they are mapped to the null map key, but some other PMML engine might use a different convention (eg. "_", "target", "_y" or whatever).
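To illustrate why a null map key is awkward for callers, here is a minimal sketch using only the Java standard library. It does not use the JPMML-Evaluator API; the class `AnonymousTargetDemo` and the helper `resultsFor` are hypothetical names that only mimic the convention described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class AnonymousTargetDemo {

    // Hypothetical helper (not part of JPMML-Evaluator): builds a result map
    // in which an unnamed target is stored under the null key.
    static Map<String, Object> resultsFor(String targetName, double prediction) {
        Map<String, Object> results = new LinkedHashMap<>();
        results.put(targetName, prediction); // targetName may be null
        return results;
    }

    public static void main(String[] args) {
        Map<String, Object> anonymous = resultsFor(null, 7.0);
        Map<String, Object> named = resultsFor("output", 7.0);

        // The anonymous prediction can only be found via the null key.
        System.out.println(anonymous.get(null));
        // A caller that looks the prediction up by name finds nothing.
        System.out.println(anonymous.containsKey("output"));
        // With a named target the lookup is self-explanatory.
        System.out.println(named.get("output"));
    }
}
```

A caller written against a named target keeps working when the model is moved to an engine with a different convention for anonymous fields, whereas a lookup by null key would not.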

That ends up having JPMML failing to load models that are, according to the spec, valid

Any examples?

I would expect the JPMML implementation to follow more closely the spec as it is "de facto the reference implementation of the PMML specification",

The quality of the PMML spec has been going down with each new version.

The JPMML-Evaluator is fairly liberal in what it accepts PMML markup-wise, but it hates ambiguity. In the current case, it probably asks you to be more specific about what you're trying to accomplish.

I'm also of the opinion that it is good practice to define the target fields at the top-level, as it provides more details on the output of a prediction such as the type of the returned value.

But right now, I have a bunch of models that have been written a while ago by some team who did not follow this idea, and we need to update each of them to define the field in the data dictionary to load these models with more recent releases of JPMML.

For example:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
    <Header copyright="Me" description="Some model"/>
    <DataDictionary numberOfFields="1">
        <DataField name="input" optype="continuous" dataType="double"/>
        <!-- DataField name="output" optype="continuous" dataType="double" /-->
    </DataDictionary>
    <TransformationDictionary/>
    <RegressionModel modelName="test" functionName="regression" modelType="linearRegression" targetFieldName="output">
        <MiningSchema>
            <MiningField name="input"/>
            <MiningField name="output" usageType="target"/>
        </MiningSchema>
        <RegressionTable intercept="0.00">
            <NumericPredictor name="input" exponent="1" coefficient="1.00"/>
        </RegressionTable>
    </RegressionModel>
</PMML>

var evaluator = new LoadingModelEvaluatorBuilder().load(pmmlFile).build();

Loading the model fails unless I uncomment the output field declaration in the data dictionary.

I took your example PMML, commented out both DataField@name="output" and MiningField@name="output" elements, and removed the RegressionModel@targetFieldName="output" attribute:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
    <Header copyright="Me" description="Some model"/>
    <DataDictionary numberOfFields="1">
        <DataField name="input" optype="continuous" dataType="double"/>
        <!-- DataField name="output" optype="continuous" dataType="double" /-->
    </DataDictionary>
    <TransformationDictionary/>
    <RegressionModel modelName="test" functionName="regression" modelType="linearRegression">
        <MiningSchema>
            <MiningField name="input"/>
            <!--<MiningField name="output" usageType="target"/>-->
        </MiningSchema>
        <RegressionTable intercept="0.00">
            <NumericPredictor name="input" exponent="1" coefficient="1.00"/>
        </RegressionTable>
    </RegressionModel>
</PMML> 

Running with JPMML-Evaluator version 1.6.4 executable JAR:

$ java -jar pmml-evaluator-example/target/pmml-evaluator-example-executable-1.6-SNAPSHOT.jar --model issue_251.pmml --input input.csv --output output.csv --separator ","

Loads just fine, makes predictions just fine.

The contents of the newly produced output.csv file:

input,(null)
1,1.0
7,7.0
13,13.0

TLDR: Was expecting your model to load and execute just fine, even though the top-level model does not specify an explicit target field name. Seems to be the case.

Perhaps your model is structurally a bit more complex, and has two anonymous target fields in some scope?

But in the PMML files I'm trying to load, the target mining fields are declared, and previous versions of JPMML (around the release for 4.2) were able to load them even if those fields were not found in the data dictionary, which seems to be valid according to the spec (or my understanding of it).

That being said, I'm also fine with keeping things the way they are, just wanted to point out that some models cannot be loaded in the recent versions of JPMML without modifying them. The way I understand it is that JPMML now has its own "spec" on top of the standard and you seem to confirm it as something intentional here:

The JPMML-Evaluator is fairly liberal in what it accepts PMML markup-wise, but it hates ambiguity. In the current case, it probably asks you to be more specific about what you're trying to accomplish.

If that's the case, then we can close this issue.

the target mining fields are declared and previous
versions of JPMML were able to load it even if they
were not found in the data dictionary

A MiningField element (ie. a field reference) that does not have a backing DataField element (ie. field definition that can be referenced)? Sounds like a deficient PMML file.
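For anyone who needs to find affected files before upgrading, the condition described above can be approximated with a small standalone check. This is a rough sketch using only the JDK's DOM parser, not the validation that JPMML-Evaluator itself performs. The class `MiningFieldCheck` and its methods are hypothetical names, and the check is deliberately simplified: it compares names across the whole document and ignores DerivedField elements and nested model scopes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class MiningFieldCheck {

    // Collects the "name" attribute of every element with the given local name,
    // regardless of which PMML namespace version the document uses.
    static Set<String> names(Document doc, String localName) {
        Set<String> result = new LinkedHashSet<>();
        NodeList nodes = doc.getElementsByTagNameNS("*", localName);
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(((Element) nodes.item(i)).getAttribute("name"));
        }
        return result;
    }

    // Returns the names of MiningField elements that have no DataField of the same name.
    // Simplification: document-wide comparison, no DerivedField or scope handling.
    static Set<String> unbackedMiningFields(String pmmlXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(pmmlXml.getBytes(StandardCharsets.UTF_8)));
            Set<String> missing = names(doc, "MiningField");
            missing.removeAll(names(doc, "DataField"));
            return missing;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input could not be parsed as XML", e);
        }
    }

    public static void main(String[] args) {
        // Condensed version of the model from this thread, with the "output" DataField missing.
        String pmml =
            "<PMML xmlns=\"http://www.dmg.org/PMML-4_4\" version=\"4.4\">"
            + "<DataDictionary numberOfFields=\"1\">"
            + "<DataField name=\"input\" optype=\"continuous\" dataType=\"double\"/>"
            + "</DataDictionary>"
            + "<RegressionModel functionName=\"regression\">"
            + "<MiningSchema>"
            + "<MiningField name=\"input\"/>"
            + "<MiningField name=\"output\" usageType=\"target\"/>"
            + "</MiningSchema>"
            + "</RegressionModel>"
            + "</PMML>";
        System.out.println(unbackedMiningFields(pmml));
    }
}
```

For the condensed model in `main`, the check reports `output` as the only MiningField without a backing DataField, which matches the field that had to be added to the data dictionary.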

The JPMML-Evaluator library has become more strict over time, checking more and more things when the PMML XML file is first loaded into memory.

just wanted to point out that some models cannot
be loaded in the recent versions of JPMML without modifying them.

The old version didn't perform enough checks, and let deficient PMML XML files reach the evaluation phase...

The way I understand it is that JPMML now has its
own "spec" on top of the standard and you seem to
confirm it as something intentional here:

The official PMML spec seems completely unmaintained at this point.

The previous development team has been sacked, and there's somebody new making changes (actually, breaking stuff, such as messing up the XML namespace identifier as detailed in jpmml/jpmml-model#38). However, this somebody does not interact with new issues.

I have a request for clarification open with DMG.org regarding the scoping of target fields:
http://mantis.dmg.org/view.php?id=205

It was marked as "acknowledged" in June 2019, but no technical comments have been added since.

At this point, I feel that it's my duty to go on with the development of (J)PMML on my own, especially adding a couple of extension attributes and elements here and there, in order to improve interoperability with popular Tabular ML frameworks. I have customers depending on this stuff, and I can't wait for DMG.org indefinitely.