DuplicateFieldValueException after loading PMML in Java generated by Nyoka
mbicanic opened this issue · 8 comments
Training and exporting in Python
I am training a LightGBM model via its scikit-learn interface lightgbm.LGBMClassifier and then trying to export the model into PMML using sklearn2pmml. This is the code used to train the PMMLPipeline:
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

def train_model(model: LGBMClassifier, params: dict, X_train: pd.DataFrame, Y_train: np.ndarray):
    # Note: the `model` argument is unused; a fresh classifier is built from `params`.
    pipe = PMMLPipeline([('classifier', LGBMClassifier(**params))])
    pipe.fit(X_train, Y_train)
    # Embed 100 sample rows as verification data in the PMML document.
    X_sample = X_train.sample(n=100, random_state=42)
    pipe.verify(X_sample)
    sklearn2pmml(pipe, "model.pmml", with_repr=True)
Versions:
- scikit-learn: 1.2.0
- sklearn2pmml: 0.91.0
The generated PMML file
Due to confidentiality, I cannot share the whole PMML file here, but I can describe its general structure:
<MiningModel functionName="classification" algorithmName="LightGBM">
  <MiningSchema>
    <MiningField name="y" usageType="target"/>
    <MiningField name="feature1" importance="517.0"/>
    ...
    <MiningField name="feature47" importance="258.0"/>
  </MiningSchema>
  <Segmentation multipleModelMethod="modelChain" missingPredictionTreatment="returnMissing">
    <Segment id="1">
      <True/>
      <MiningModel functionName="regression">
        <MiningSchema> SAME AS ABOVE, BUT WITHOUT FEATURE IMPORTANCES </MiningSchema>
        <Output>
          <OutputField name="lgbmValue" optype="continuous" dataType="double" isFinalResult="false"/>
        </Output>
        <Segmentation multipleModelMethod="sum" missingPredictionTreatment="returnLastPrediction">
          <Segment id="1">
            <True/>
            <TreeModel functionName="regression" noTrueChildStrategy="returnLastPrediction">
              DEFINITION OF TreeModel WITH A BUNCH OF <Node> TAGS
            </TreeModel>
          </Segment>
          ...
          <Segment id="364">
            <True/>
            <TreeModel functionName="regression" noTrueChildStrategy="returnLastPrediction">
              DEFINITION OF TreeModel WITH A BUNCH OF <Node> TAGS
            </TreeModel>
          </Segment>
        </Segmentation>
      </MiningModel>
    </Segment>
    <Segment id="2">
      <True/>
      <RegressionModel functionName="classification" normalizationMethod="logit">
        <MiningSchema>
          <MiningField name="y" usageType="target"/>
          <MiningField name="lgbmValue"/>
        </MiningSchema>
        <Output>
          <OutputField name="probability(0)" optype="continuous" dataType="double" feature="probability" value="0"/>
          <OutputField name="probability(1)" optype="continuous" dataType="double" feature="probability" value="1"/>
        </Output>
        <RegressionTable intercept="0.0" targetCategory="1">
          <NumericPredictor name="lgbmValue" coefficient="1.0"/>
        </RegressionTable>
        <RegressionTable intercept="0.0" targetCategory="0"/>
      </RegressionModel>
    </Segment>
  </Segmentation>
  <ModelVerification recordCount="100">...</ModelVerification>
</MiningModel>
As you can see, there are three <OutputField> tags in total in the whole file:
- The internal LGBM output:
<OutputField name="lgbmValue" optype="continuous" dataType="double" isFinalResult="false"/>
- The external classifier output regarding the probability the target is 0:
<OutputField name="probability(0)" optype="continuous" dataType="double" feature="probability" value="0"/>
- The external classifier output regarding the probability the target is 1:
<OutputField name="probability(1)" optype="continuous" dataType="double" feature="probability" value="1"/>
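A count like this is easy to get wrong by eye in a file with hundreds of segments, so it can help to enumerate the elements programmatically. The following is a minimal sketch, assuming only the Python standard library; the helper names (`local_name`, `output_field_names`, `duplicated_names`) and the tiny `SAMPLE` document are made up for illustration and are not part of sklearn2pmml or JPMML.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def output_field_names(source) -> list[str]:
    """Return the name of every <OutputField> element, at any nesting depth.

    `source` is a file path or a binary file-like object.
    """
    root = ET.parse(source).getroot()
    return [
        elem.get("name")
        for elem in root.iter()
        if local_name(elem.tag) == "OutputField"
    ]


def duplicated_names(names: list[str]) -> list[str]:
    """Return the names that occur more than once, sorted."""
    return sorted(name for name, count in Counter(names).items() if count > 1)


# Hypothetical, heavily shortened stand-in for the structure described above.
SAMPLE = b"""<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <MiningModel functionName="classification">
    <Segmentation multipleModelMethod="modelChain">
      <Segment id="1">
        <MiningModel functionName="regression">
          <Output>
            <OutputField name="lgbmValue" isFinalResult="false"/>
          </Output>
        </MiningModel>
      </Segment>
      <Segment id="2">
        <RegressionModel functionName="classification">
          <Output>
            <OutputField name="probability(0)" feature="probability" value="0"/>
            <OutputField name="probability(1)" feature="probability" value="1"/>
          </Output>
        </RegressionModel>
      </Segment>
    </Segmentation>
  </MiningModel>
</PMML>"""

names = output_field_names(io.BytesIO(SAMPLE))
print(names)
print(duplicated_names(names))
```

To check a real file, pass its path instead of the in-memory sample, for example `output_field_names("model.pmml")`. It should be the exact file that the Java side loads.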
Loading into Java
I use the org.jpmml:jpmml-evaluator-metro:1.6.4 library to load and use PMML models in Java. This is the code I'm using:
Evaluator evaluator = new LoadingModelEvaluatorBuilder().load(new File("/path/to/model.pmml")).build();
evaluator.verify();
System.out.println("Output fields: " + evaluator.getOutputFields());
System.out.println("Target field(s): " + evaluator.getTargetFields());
With this snippet, I get the following output:
Output fields: [
OutputField{name=probability_0, fieldName=probability_0, displayName=null, opType=continuous, dataType=double, finalResult=true, depth=1},
OutputField{name=probability_1, fieldName=probability_1, displayName=null, opType=continuous, dataType=double, finalResult=true, depth=1},
OutputField{name=predicted_y, fieldName=predicted_y, displayName=null, opType=categorical, dataType=integer, finalResult=true, depth=1},
OutputField{name=probability_0, fieldName=probability_0, displayName=null, opType=continuous, dataType=double, finalResult=true, depth=0},
OutputField{name=probability_1, fieldName=probability_1, displayName=null, opType=continuous, dataType=double, finalResult=true, depth=0},
OutputField{name=predicted_y, fieldName=predicted_y, displayName=null, opType=categorical, dataType=integer, finalResult=true, depth=0}
]
Target field(s): [TargetField{name=y, fieldName=y, displayName=null, opType=categorical, dataType=integer}]
As you can see, all the output fields are duplicated - once for depth=1 and once for depth=0. These OutputField definitions are obviously not present in the PMML file itself, so I am wondering where they are coming from and how to get rid of them?
Problem: Cannot evaluate due to DuplicateFieldValueException
The problem with this is that I cannot call evaluator.evaluate(features), because I get the following error:
org.jpmml.evaluator.DuplicateFieldValueException: The value for field "probability_0" has already been defined
at org.jpmml.evaluator.EvaluationContext.declare(EvaluationContext.java:130)
at org.jpmml.evaluator.OutputUtil.evaluate(OutputUtil.java:438)
at org.jpmml.evaluator.ModelEvaluator.evaluateInternal(ModelEvaluator.java:467)
at org.jpmml.evaluator.mining.MiningModelEvaluator.evaluateInternal(MiningModelEvaluator.java:224)
at org.jpmml.evaluator.ModelEvaluator.evaluate(ModelEvaluator.java:300)
I tried everything I could find in previous related issues (jpmml/jpmml-sparkml#92, jpmml/jpmml-sparkml-xgboost#13, jpmml/jpmml-sparkml-xgboost#15) and the documentation, but it didn't help, so I am beginning to think this is an issue with the library, since phantom OutputFields are being created.
I apologize in advance if this is due to me using the library in a wrong way, but I would appreciate any and all help you could provide.
Just look at your own data!
SkLearn2PMML/JPMML-SkLearn produces a LightGBM model that has the following schema:
- Sole target field y
- Two probability-type output fields probability(0) and probability(1)
Does your misbehaving model.pmml look like the above?
Closing as invalid - the user is attempting to evaluate invalid PMML documents (generated by N****, not SkLearn2PMML).
@vruusmann I used to use Nyoka, but I had other issues with it. I guarantee that this particular model.pmml was generated with sklearn2pmml. I really don't understand why the hostility and the certainty that I used nyoka? Here are the first few lines of the generated PMML, directly copy-pasted:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_4" xmlns:data="http://jpmml.org/jpmml-model/InlineTable" version="4.4">
<Header>
<Application name="SkLearn2PMML package" version="0.91.0"/>
<Timestamp>2023-03-31T09:57:19Z</Timestamp>
</Header>
<MiningBuildTask>
<Extension name="repr">PMMLPipeline(steps=[('classifier', LGBMClassifier(class_weight={0: 0.05, 1: 0.95},
learning_rate=0.07168998753077896, max_depth=18,
min_data_in_leaf=418, n_estimators=364, num_leaves=58,
objective='binary', reg_alpha=0.07558439164814572,
reg_lambda=0.05483594313753313))])</Extension>
</MiningBuildTask>
It explicitly says SkLearn2PMML package, so I am very confused how you got to the conclusion I used nyoka? Please, undo the change of the issue title, because it is dishonest. I wouldn't come here with this question if I generated the PMML with nyoka, as I am well aware there could be incompatibilities between them.
And once we agree that the PMML file has indeed been generated with sklearn2pmml, I would greatly appreciate an explanation or at least a helping hint regarding the duplication of OutputFields in the loaded model.
I guarantee that this particular model.pmml was generated with sklearn2pmml.
Your DuplicateFieldValueException was raised when scoring a Nyoka-produced PMML document. The JPMML-Evaluator library is not renaming existing OutputField elements, and is not inventing new ones.
That's a hard fact. No point in arguing - open your model.pmml in a text editor, and take a look into it.
I really don't understand why the hostility and the certainty I used nyoka?
Because Nyoka is generating invalid/irreproducible PMML documents, and then it is me who has to prove over and over again that JPMML software is correct.
I wouldn't come here with this question if I generated the PMML with nyoka
Please attach your model.pmml here (or send it to my e-mail), so that we can resolve this issue based on factual matters.
org.jpmml.evaluator.DuplicateFieldValueException: The value for field "probability_0" has already been defined
All JPMML conversion libraries name probability fields using a probability(<category>) pattern. Nyoka (and related stuff) uses a probability_<category> pattern.
Now, seeing that the duplicate output field is called probability_0, which statement is more likely to be the correct one?
- The PMML document was generated by SkLearn2PMML (based on JPMML-SkLearn)
- The PMML document was generated by Nyoka.
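The same reasoning can be applied mechanically to any PMML file: read the producing application from the Header element and look at which naming pattern the probability output fields follow. Below is a minimal sketch using only the Python standard library. The function name `describe_pmml`, the regular expressions, and the shortened `SAMPLE` document are illustrative assumptions; the two naming patterns themselves are the ones quoted above.

```python
import io
import re
import xml.etree.ElementTree as ET

# Naming patterns as described in this thread.
PARENTHESES_STYLE = re.compile(r"^probability\(.+\)$")  # probability(<category>)
UNDERSCORE_STYLE = re.compile(r"^probability_.+$")      # probability_<category>


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def describe_pmml(source) -> dict:
    """Report the Header application name and the probability naming styles.

    `source` is a file path or a binary file-like object.
    """
    root = ET.parse(source).getroot()
    application = None
    styles = set()
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "Application" and application is None:
            application = elem.get("name")
        elif name == "OutputField":
            field = elem.get("name", "")
            if PARENTHESES_STYLE.match(field):
                styles.add("probability(<category>)")
            elif UNDERSCORE_STYLE.match(field):
                styles.add("probability_<category>")
    return {"application": application, "probability_styles": sorted(styles)}


# Hypothetical, shortened document using the header values quoted in this thread.
SAMPLE = b"""<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header>
    <Application name="SkLearn2PMML package" version="0.91.0"/>
  </Header>
  <RegressionModel functionName="classification">
    <Output>
      <OutputField name="probability(0)" feature="probability" value="0"/>
      <OutputField name="probability(1)" feature="probability" value="1"/>
    </Output>
  </RegressionModel>
</PMML>"""

print(describe_pmml(io.BytesIO(SAMPLE)))
```

Running this against the file that the evaluator actually loads, rather than the file produced by the training script, is the point of the check: if the two differ, the report will differ as well.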
@vruusmann
I apologize, it was indeed my mistake. As I said, I used nyoka before, and had to migrate to sklearn2pmml and pmml-evaluator due to other issues.
The problem was that I am using MLflow to register models. I have a script that trains a model, saves it to PMML, and then registers the model together with the model.pmml artifact. Ever since I modified the script to use sklearn2pmml instead of nyoka, the connection to MLflow wasn't working properly, so even though the local PMML file created by the training script was indeed generated by sklearn2pmml, the "latest" MLflow model I was fetching in Java was still the one relying on a nyoka PMML.
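A stale artifact of this kind can be caught by comparing a checksum of the local file with a checksum of the file fetched from the registry. The sketch below assumes both files are available on disk; the helper names `sha256_of` and `same_artifact` are made up for illustration, and nothing here calls MLflow itself.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(Path(path), "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_artifact(local_path, downloaded_path) -> bool:
    """True if the two files are byte-for-byte identical."""
    return sha256_of(local_path) == sha256_of(downloaded_path)
```

For example, `same_artifact("model.pmml", "/path/to/downloaded/model.pmml")` (both paths hypothetical) returning False would indicate that the registry is serving a different file than the one the training script just wrote.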
Once again, I apologize for wasting your time and insisting I was correct, I was completely unaware of this problem. Nevertheless, I appreciate that you in the end explained why and how you know the file was Nyoka-generated - it was very helpful. Thank you for your time and effort!
Apology accepted!
The problem was that I am using MLflow to register models.
Do you have this MLflow integration project available somewhere? Pure Java, or Java-wrapped-into-Python?
I've meant to provide such integration myself, but haven't started yet.
Unfortunately, the project is not available publicly as it's a company project. However, it's not really an integration in the strict sense of the word, it's more of a bypass. I am normally registering the model as a Python sklearn model, and then additionally logging the PMML file as an artifact:
import mlflow
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

def train_model(model: LGBMClassifier, params: dict, X_train: pd.DataFrame, Y_train: np.ndarray):
    pipe = PMMLPipeline([('classifier', LGBMClassifier(**params))])
    pipe.fit(X_train, Y_train)
    X_sample = X_train.sample(n=100, random_state=42)
    pipe.verify(X_sample)
    sklearn2pmml(pipe, "model.pmml", with_repr=True)
    return pipe['classifier']

def log_model(model: LGBMClassifier, X_data: pd.DataFrame):
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="",
        registered_model_name=MODEL_NAME,
        signature=mlflow.models.signature.infer_signature(X_data)
    )
    mlflow.log_artifact("model.pmml")  # referring to the local file generated in train_model

X_train, Y_train = load_dataset(...)
# `params` (the hyperparameter dict) is assumed to be defined elsewhere.
model = train_model(LGBMClassifier(), params, X_train, Y_train)
log_model(model, X_train)
And then in Java, instead of loading the model, I just load the PMML artifact as a file and initialize the Evaluator class with it:
private Evaluator loadModel(String modelName) throws Exception {
    try (MlflowClient client = new MlflowClient(MLFLOW_URI)) {
        ModelRegistry.ModelVersion version = client.getRegisteredModel(modelName).getLatestVersions(0);
        File artifactDir = client.downloadArtifacts(version.getRunId());
        File[] files = artifactDir.listFiles(f -> f.getName().equals("model.pmml"));
        Evaluator evaluator = new LoadingModelEvaluatorBuilder().load(files[0]).build();
        evaluator.verify();
        return evaluator;
    }
}
It's a pretty simple process, all things considered, and surprisingly easy to use Python models in Java this way, while also leveraging MLflow.