Ignoring duplicate `OutputField` elements (instead of raising a `DuplicateValueException`)

Question

Closed this issue 3 years ago · 12 comments

I'm exporting an LGBM model from Python, using https://github.com/SoftwareAG/nyoka, to be picked up in Java.

While this generally works just fine, for LGBM I get the following exception:
org.jpmml.evaluator.DuplicateValueException: The value for field "probability_0" has already been defined

Here are the relevant bits from the model that fails to load:

<MiningModel modelName="LightGBModel" functionName="classification">
         ...
        <MiningField name="fraud" usageType="target" optype="categorical"/>
    </MiningSchema>
    <Output>
      <OutputField name="probability_0" optype="continuous" dataType="double" feature="probability" value="0"/>
      <OutputField name="probability_1" optype="continuous" dataType="double" feature="probability" value="1"/>
      <OutputField name="predicted_fraud" optype="categorical" dataType="integer" feature="predictedValue"/>
    </Output>
    <Segmentation multipleModelMethod="modelChain" missingThreshold="1">
    ...
        <Segment id="2" weight="1">
            <True/>
            <RegressionModel modelName="LGBMClassifier" functionName="classification" normalizationMethod="logit">
                ...
                <Output>
                    <OutputField name="probability_0" optype="continuous" dataType="double" feature="probability" value="0"/>
                    <OutputField name="probability_1" optype="continuous" dataType="double" feature="probability" value="1"/>
                    <OutputField name="predicted_fraud" optype="categorical" dataType="integer" feature="predictedValue"/>
                </Output>
                ...
            </RegressionModel>

Comparing this to the canonical example 4 at http://dmg.org/pmml/v4-4-1/MultipleModels.html

<MiningModel functionName="regression">
 <MiningSchema>
  ...
  <MiningField name="Class" usageType="target"/>
  <MiningField name="PollenIndex" usageType="target"/>
 </MiningSchema>
 <Output>
  <OutputField dataType="string" feature="predictedValue" name="PredictedClass" optype="categorical" targetField="Class" segmentId="1"/>
  <OutputField dataType="double" feature="probability" name="Probability_setosa" optype="continuous" targetField="Class" value="Iris-setosa" segmentId="1"/>
  <OutputField dataType="double" feature="probability" name="Probability_versicolor" optype="continuous" targetField="Class" value="Iris-versicolor" segmentId="1"/>
  <OutputField dataType="double" feature="probability" name="Probability_virginica" optype="continuous" targetField="Class" value="Iris-virginica" segmentId="1"/>
  <OutputField dataType="double" feature="predictedValue" name="Pollen Index" optype="continuous" targetField="PollenIndex"/>
 </Output>
 <Segmentation multipleModelMethod="modelChain">
  <Segment id="1">
   ...
    <Output>
     <OutputField dataType="string" feature="predictedValue" name="PredictedClass" optype="categorical"/>
     <OutputField dataType="double" feature="probability" name="Probability_setosa" optype="continuous" value="Iris-setosa"/>
     <OutputField dataType="double" feature="probability" name="Probability_versicolor" optype="continuous" value="Iris-versicolor"/>
     <OutputField dataType="double" feature="probability" name="Probability_virginica" optype="continuous" value="Iris-virginica"/>
    </Output>

As far as I can tell, the duplication of the dataType, name, optype, feature, and value specifications is identical to that in the example, which one would regard as canonical since it comes from the official PMML 4.4 site. And yet the error suggests that this repetition is at fault.

As for the other fields, those appear compliant too (though I am no expert).
segmentId
http://dmg.org/pmml/v4-4-1/MultipleModels.html

Since the Segment id attribute is optional, if it is not specified, Segments are identified by an implicit 1-based index, indicating the position in which each segment appears in the model.
...
OutputFields contained at top level MiningModel element apply to the winning Segment selected by the multipleModelMethod attribute (selectFirst, selectAll, majorityVote, modelChain, etc.) and the RESULT-FEATURE entityId returns the ID of the winning segment, but output fields from other segments may always be included by specifying the segmentId attribute.

targetField
http://dmg.org/pmml/v4-4/Output.html

If present, the attribute targetField must refer either to a MiningField of usage type target or a field described in Targets element. targetField is a required attribute in case the model has multiple target fields.
The model in question does indeed have a single target field, so targetField should not be required.

For the time being, we patched in a workaround: erasing the top-level <Output>. In this particular case, the implicit behaviour of modelChain is identical to the specified one, as the predicate of the last segment is a bare True.
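
For anyone who wants to script that workaround instead of editing the generated files by hand, here is a minimal sketch that uses only the JDK's DOM and transformer APIs. The class name TopLevelOutputStripper and its strip method are hypothetical helpers, not part of Nyoka or JPMML-Evaluator, and the sketch assumes that the MiningModel is a direct child of the PMML root element.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Nyoka or JPMML-Evaluator.
final class TopLevelOutputStripper {

    private TopLevelOutputStripper() {
    }

    /** Removes every <Output> that is a direct child of a top-level <MiningModel>. */
    static String strip(String pmmlXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // PMML documents normally declare a default namespace
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(pmmlXml)));
            Element root = doc.getDocumentElement();
            for (Node model = root.getFirstChild(); model != null; model = model.getNextSibling()) {
                if (!isElement(model, "MiningModel")) {
                    continue;
                }
                Node child = model.getFirstChild();
                while (child != null) {
                    Node next = child.getNextSibling(); // remember the sibling before removal
                    if (isElement(child, "Output")) {
                        model.removeChild(child);
                    }
                    child = next;
                }
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not rewrite the PMML document", e);
        }
    }

    private static boolean isElement(Node node, String localName) {
        return node.getNodeType() == Node.ELEMENT_NODE && localName.equals(node.getLocalName());
    }
}
```

Only direct children of the top-level MiningModel are touched, so the Output elements nested inside the segments are left in place.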

Calls involved:

nyoka
- version 5.0.1
- nyoka.lgbm.lgb_to_pmml.lgb_to_pmml (nyoka.lgb_to_pmml)

lightgbm
- version 3.2.1
- lightgbm.sklearn.LGBMClassifier (lightgbm.LGBMClassifier)

jpmml
- version 1.5.16

Load

Evaluator evaluator = new LoadingModelEvaluatorBuilder()
      .setLocatable(false)
      .setVisitors(new DefaultModelEvaluatorBattery())
      .load(modelFileStream)
      .build()
      .verify();

Evaluate
- Here is where the exception occurs

Map<FieldName, ?> results = evaluator.evaluate(fields);

As far as I was able to dig into it, the generated PMML appears compliant. Do feel free to let me know if I am missing something, or reach out to me with any further questions. I'd just rather not be manually changing the generated files. :)

Thank you for all the good work!

Answer 1 · 2021-11-01T16:03:36.000Z

I'm exporting an LGBM model from Python, using https://github.com/SoftwareAG/nyoka

Consider using the SkLearn2PMML package instead.

Comparing this to the canonical example 4 at http://dmg.org/pmml/v4-4-1/MultipleModels.html

The DMG site is known to contain invalid examples. If an example is in conflict with the specification itself, what to believe?

As for the other fields, those appear compliant too (though I am no expert).

Define "compliance"?

I'd argue that Nyoka-produced PMML files violate the Field Scoping rules:
http://dmg.org/pmml/v4-3/FieldScope.html

There are two fields with name probability_0 (and probability_1) defined in the same scope.

Calls involved:

To save your and my time, please combine Nyoka with Software AG's PMML engine, not JPMML-Evaluator.

Answer 2 · 2021-11-01T20:35:24.000Z

I can see the line of reasoning based on the 'Scoping' section that results in the conclusion that said PMML violates the Field Scoping rules:

In a model chain, all Outputs from the segment models become part of the enclosing mining model scope. As a result, the Outputs of one segment model may be used as an input to any of the subsequent segment models.
The names of OutputFields must be unique from any other names in their scope, i.e. within a model or model segment and across all model segments in a model chain.

Ergo: subscope -> scope, scope -> unique.
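
The uniqueness rule quoted above can be checked mechanically. The following is a rough sketch that uses only the JDK's DOM parser; the class OutputFieldNameCheck is a hypothetical helper, and it makes the simplifying assumption that the document holds a single modelChain, so that every OutputField in it shares one scope.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; assumes one modelChain, hence one shared scope.
final class OutputFieldNameCheck {

    private OutputFieldNameCheck() {
    }

    /** Returns the OutputField names that are declared more than once in the document. */
    static Set<String> duplicateNames(String pmmlXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(pmmlXml)));
            NodeList fields = doc.getElementsByTagName("OutputField");
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                counts.merge(field.getAttribute("name"), 1, Integer::sum);
            }
            Set<String> duplicates = new TreeSet<>();
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                if (entry.getValue() > 1) {
                    duplicates.add(entry.getKey());
                }
            }
            return duplicates;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the PMML document", e);
        }
    }
}
```

Under that assumption, the snippet from the original post would report probability_0, probability_1, and predicted_fraud, since each name is declared once at the top level and once inside the segment.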

On the other hand, I also see the following line of reasoning, based on the 'MultipleModels' section:

The results provided from a modelChain MiningModel are the results from the last Segment executed in the chain (i.e., the last Segment whose predicate evaluates to true). Note that the models combined by modelChain must have OutputField elements.
OutputFields contained at top level MiningModel element apply to the winning Segment selected by the multipleModelMethod attribute (selectFirst, selectAll, majorityVote, modelChain, etc.) and the RESULT-FEATURE entityId returns the ID of the winning segment, but output fields from other segments may always be included by specifying the segmentId attribute. OutputFields within Segments allow for results specific to that segment to be returned. <...> In the event of conflict between output fields specified in a higher level model and one or more of its subsidiary models, the highest level specification prevails.

By default, modelChain returns the OutputFields of its last segment.
For multiple models, including modelChain, one can specify OutputFields at the top level, above the segmentation.
The purpose of specifying OutputFields at the top level is to make it possible to include Outputs from segments other than the final one in the modelChain output.
There ought to be no conflict between the higher-level model and its subsidiary models; that is, their declarations should match, but it is permissible even if they do not.

My conclusion is: yes, there is ambiguity, but the latter appears to be clearly stated new functionality that builds on the previous concepts, and the 'Scoping' section has simply not been reworded to account for it. Indeed, the segmentId part appears to be new in 4.4, whereas the scoping wording is carried over, word for word, from 4.3. Finally, example 4 provides clarification and supports the latter interpretation. It is not outdated; quite the contrary.

I know you do recommend sklearn2pmml, but I've run into some major performance and debugging issues with the exported models, which were solved by exporting them with nyoka. I am also required to export the model specifically to Java; I don't think there is a 'Software AG PMML' engine for that?

Well, that's my two cents of due diligence; of course, if you still disagree, I have to acquiesce. I'm just trying to get two packages bound by the same standard to work together, and to be knowledgeable and of help within my ability while doing so.

Answer 3 · 2021-11-01T20:58:28.000Z

OK, I took another look at your sample, and noticed that your main point is centered around the use of the OutputField@segmentId attribute on the top level.

Indeed, the top-level OutputField element is referencing the nested OutputField element, so JPMML-Evaluator should be able to figure out that "hey, it's exactly the same field, so it doesn't count as a duplicate field declaration"?

One of those two declarations is definitely redundant. For maximal clarity, I'd drop the top-level one.

I know you do recommend sklearn2pmml, but I've run into some major performance and debugging issues with the exported models, which were solved by exporting them with nyoka.

That's such an intriguing statement.

Care to share some examples on SkLearn2PMML issue tracker?
https://github.com/jpmml/sklearn2pmml/issues

I am also required to export the model specifically to Java; I don't think there is a 'Software AG PMML' engine for that?

Of course there is. Why would Software AG bother with Nyoka anyway, if they didn't have their own PMML engine?

Answer 4 · 2021-11-01T20:59:03.000Z

so JPMML-Evaluator should be able to figure out that "hey, it's exactly the same field, so it doesn't count as a duplicate field declaration"?

Re-opening this issue based on this realization.

Answer 5 · 2021-11-02T06:09:57.000Z

so JPMML-Evaluator should be able to figure out that "hey, it's exactly the same field, so it doesn't count as a duplicate field declaration"?

Re-opening this issue based on this realization.

My brain has thought about this issue overnight, and come to the conclusion that the current behaviour (throwing DuplicateValueException) is correct, and should not be changed.

The explanation is that the top-level OutputField@segmentId='1' is a different entity than the low-level OutputField. It is not a (cross-)reference, but a new standalone field declaration. And according to PMML field scoping rules, it is not allowed to declare two fields with the same name in one scope.

but I've run into some major performance and debugging issues with the exported models, which were solved by exporting them with nyoka.

Nyoka generates PMML documents that are syntactically valid, but semantically/functionally invalid.

Answer 6 · 2021-11-03T18:16:40.000Z

I have had a look around, and the first 3rd party package I found, pmml4s (https://github.com/autodeployai/pmml4s), understands the nyoka-generated pmml. I have also mentioned the issue to the nyoka team, and they believe their interpretation of the standard is correct.

So, while I do agree that both interpretations can be construed from the current wording of the standard, I would argue that the DMG has been sufficiently clear about its intent to have this work, and that not supporting it puts both interoperability and future compliance at risk.

I'll deal with it one way or another now. Just wanted to point out that it does not appear, as you imply, that nyoka would be the odd one out with their interpretation.

Answer 7 · 2021-11-03T18:41:10.000Z

I have had a look around, and the first 3rd party package I found, pmml4s (https://github.com/autodeployai/pmml4s), understands the nyoka-generated pmml

Out of curiosity, what do you get with PMML4S? Do you get two results, both named probability_0?

Answer 8 · 2021-11-04T16:11:00.000Z

I get probability_0, probability_1, and predicted_fraud. That's it.

Answer 9 · 2021-11-04T16:56:01.000Z

I get probability_0, probability_1, and predicted_fraud.

Well, according to your definition of "compliance" you should be receiving six result fields (not three), because there are six active output field declarations.

The JPMML-Evaluator library refuses to overwrite an already defined value. PMML4S overwrites it silently.
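
The difference between the two behaviours can be pictured with a plain map of results. This is only an illustrative sketch: the class ResultPolicies is hypothetical, neither method mirrors the actual code of JPMML-Evaluator or PMML4S, and IllegalStateException stands in for DuplicateValueException.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only; not the real implementation of either library.
final class ResultPolicies {

    private ResultPolicies() {
    }

    /** Strict policy: a second value under an already used name is rejected. */
    static void putStrict(Map<String, Object> results, String name, Object value) {
        if (results.containsKey(name)) {
            throw new IllegalStateException(
                "The value for field \"" + name + "\" has already been defined");
        }
        results.put(name, value);
    }

    /** Lenient policy: the later declaration silently replaces the earlier one. */
    static void putLenient(Map<String, Object> results, String name, Object value) {
        results.put(name, value);
    }

    public static void main(String[] args) {
        // Six declarations, three distinct names, as in the model from the original post.
        String[] names = {
            "probability_0", "probability_1", "predicted_fraud",
            "probability_0", "probability_1", "predicted_fraud"
        };
        Map<String, Object> lenient = new LinkedHashMap<>();
        for (String name : names) {
            putLenient(lenient, name, 0.5); // placeholder value
        }
        System.out.println(lenient.keySet());
    }
}
```

With six declarations that share three names, the lenient policy ends up with three entries, while the strict policy fails on the fourth declaration.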

Answer 10 · 2021-11-04T16:57:39.000Z

The fact of the matter is that Nyoka is generating duplicate OutputField declarations. Sure, they can do it (if six output fields is better than three output fields), but they should at least be specifying UNIQUE FIELD NAMES, FFS.

Answer 11 · 2021-11-04T17:21:08.000Z

'My' definition states that the output is what is specified at top level, and only otherwise falls back to the default behaviour of fetching the last submodel's Output. So I am getting exactly what I would expect.

Answer 12 · 2021-11-04T17:32:33.000Z

'My' definition states that the output is what is specified at top level,

Your definition is not compliant with the PMML specification, because you have a model chain, and the model chain returns the output of the last "true" segment, plus whatever outputs there may be at higher levels.

So, there must be six columns in your result, not three.

See http://dmg.org/pmml/v4-3/MultipleModels.html