Default value handling for constant evaluation
Closed this issue · 7 comments
I used to have a model (built with 1.4.x version of JPMML library) with the following decision tree function:
<DerivedField name="ReplaceRegexVar" optype="categorical" dataType="string">
<Apply function="replace" >
<FieldRef field="inputA"/>
<Constant>Z+</Constant>
<Constant></Constant>
</Apply>
</DerivedField>
Currently when I do version upgrade for the PMML library from 1.4.x to 1.5.x, I found that the function behavior was changed:
-
In old version, the empty constant will be treated as an empty string: https://github.com/jpmml/jpmml-evaluator/blob/1.4.2/pmml-evaluator/src/main/java/org/jpmml/evaluator/ExpressionUtil.java#L146-L157, which is the same as what written in the doc: http://dmg.org/pmml/v4-4/Transformations.html
-
However, in the 1.5.x version of the library, when dataType is missing and value not specified, the Constant will be return as
FieldValues.MISSING_VALUE
(which isnull
): https://github.com/jpmml/jpmml-evaluator/blob/master/pmml-evaluator/src/main/java/org/jpmml/evaluator/ExpressionUtil.java#L228-L230
As I understanding this seems to be a breakable change for the old model compatibility? Can someone help to confirm which should be the correct behavior?
As I understanding this seems to be a breakable change for the old model compatibility?
The PMML schema version 4.4 extended the Constant
element with the optional missing
attribute:
http://mantis.dmg.org/view.php?id=216
http://dmg.org/pmml/v4-4-1/Transformations.html#xsdElement_Constant
The interpretation of <Constant></Constant>
now varies depending on the absence/presence of dataType
and missing
attributes.
Can someone help to confirm which should be the correct behavior?
The 1.4.X version is generating PMML schema version 4.3 documents, where <Constant></Constant>
means an empty string.
The 1.5.X version is generating PMML schema version 4.4 documents, where <Constant></Constant>
means a missing string. However, I believe that there's an extra dataType
attribute available now like this <Constant dataType="string"></Constant>
, which would change the interpretation back to an empty string.
The org.jpmml.model.filters.ImportFilter
SAX filter class should be extended with the following schema version upgrade logic: "If schema version == 4.3 and element matches 'empty Constant element', then add Constant@dataType='string'".
If you have legacy PMML documents then you'd need to perform this change manually.
The org.jpmml.model.filters.ImportFilter SAX filter class should be extended with the following schema version upgrade logic: "If schema version == 4.3 and element matches 'empty Constant element', then add Constant@dataType='string'".
Thanks for the reply. For the logic can I expect that this will be added to the next release version of the library? I feel that it's not good to maintain a private branch for the code which could be extremely hard for the later version upgrade.
I feel that it's not good to maintain a private branch for the code ...
I was suggesting that you should modify your existing PMML schema version 4.3 documents (not my library or your application code).
Right now you have:
<Constant></Constant>
Change it to the following, and your rogue missing value problem will be solved:
<Constant dataType="string"></Constant>
It's quite surprising that your PMML documents don't have the Constant@dataType
attribute available. Must be a very old file, because if I remember correctly, all JPMML family conversion libraries have been defining it for the past 3+ years (mostly for evaluation performance reasons, because inferring constant's data type from its string representation is rather expensive).
FWIW, here's the relevant JPMML-Evaluator code snippet:
https://github.com/jpmml/jpmml-evaluator/blob/1.5.13/pmml-evaluator/src/main/java/org/jpmml/evaluator/ExpressionUtil.java#L202-L239
The comment on line 217 states clearly that you can evade the "" -> missing
conversion by adding Constant@dataType="string".
<Constant dataType="string"></Constant>
Adding dataType
for empty string constant does works to prevent interpreting as missing value during evaluation time. Thanks for suggestion.
It's quite surprising that your PMML documents don't have the Constant@dataType attribute available. Must be a very old file, because if I remember correctly, all JPMML family conversion libraries have been defining it for the past 3+ years (mostly for evaluation performance reasons, because inferring constant's data type from its string representation is rather expensive).
The model is actually one of our unit test model, and yes it's introduced in 2017 (3 years ago). However the version of the PMML file is 4.3 (first line: <PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
). For now I can't tell whether this document is directly generated from the library or manually created on purpose for edge-case test. Can I assume that all library generated PMML documents (e.g., jpmml-xgboost) are always having dataType
attribute assigned to the Constant
?
Restored this issue with its original location and title.
My previous analysis/explanation was incorrect - this is a real bug in the JPMML-Evaluator library (not a missing feature in the JPMML-Model library).