Problem with parsing gbdt models
I tried to construct a GBDT model in Spark from a PMML file which describes a GBDT model, and I got the following error:
"caused by:org.shaded.jpmml.evaluator.MissingFieldException: Field 'decisionFunction(1.0)' is not defined". What are the possible things I need to check?
Thanks!
"Field 'decisionFunction(1.0)' is not defined"
You should cast the data type of the label column from double (0.0/1.0) to either integer (0/1) or string ("0"/"1"), and re-train the GBDT model.
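For illustration, a minimal pandas sketch of that cast; the file name "training.csv" and the "label" column name are placeholders, not from the original report:

import pandas

# Hypothetical training frame; "training.csv" and the "label" column name are placeholders
df = pandas.read_csv("training.csv")

y_int = df["label"].astype(int)               # double 0.0/1.0 -> integer 0/1
y_str = df["label"].astype(int).astype(str)   # double 0.0/1.0 -> string "0"/"1"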
It looks like a target category formatting problem in some JPMML conversion library. Which ML framework/library were you using - JPMML-SparkML, JPMML-SkLearn, or something else?
TL;DR: The correct name of the field should be decisionFunction(1) here.
I used sklearn2pmml to generate gbdt models
The label definition in the data dictionary part of my pmml is the following:
<Field name="typ2" optype="categorical" dataType="double"> <Value value="0.0" /> <Value value="1.0" /> </Field>
so is the "typ2" column in double format, as is specified in the snippet? ( I asked because "1.0" and "0.0" seem to be represented in string format.)
I used sklearn2pmml to generate gbdt models
In that case, cast the data type of the y variable from double to integer:
from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline(...)
pipeline.fit(X, y.astype(int))
Does it fix your problem? If the re-training is not an option, then you may try replacing all the occurrences of decisionFunction(1.0) with decisionFunction(1).
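If manual editing is the only option, a minimal sketch of that text replacement; the file name "Model.pmml" is a placeholder:

# Rewrite every occurrence of the old field name in the PMML document.
# "Model.pmml" is a placeholder for your actual file name.
with open("Model.pmml", "r", encoding="utf-8") as pmml_file:
    pmml = pmml_file.read()

pmml = pmml.replace("decisionFunction(1.0)", "decisionFunction(1)")

with open("Model.pmml", "w", encoding="utf-8") as pmml_file:
    pmml_file.write(pmml)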
However, I'm quite surprised that the SkLearn2PMML/JPMML-SkLearn stack has produced such an invalid PMML document. It should be performing a full field name/scope resolution during conversion. Or have you perhaps changed anything about this particular PMML document manually?
No manual change before that.
Just now I tried changing the decision function input. After I changed 1.0 to 1, I also made two more changes to fix errors that came up:
(1) change the "targetCategory" of the RegressionTable element from '1.0' to '1', and change '0.0' to '0'
(2) change the dataType of the label column in the DataDictionary from double to integer
Now I got this error: "Field 'decisionFunction(1)' is not defined"
Are there any other related parts I need to fix?
(1)change the "targetCategory" of the RegressionTable element from '1.0' to '1', and change '0.0' to '0'
The values of the RegressionTable@targetCategory attribute must match exactly the values of the DataField/Value@value attribute.
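As a rough consistency check, here is a sketch using Python's standard xml.etree module; the file name "Model.pmml" and the target field name "typ2" are taken from the snippets above and may need adjusting:

import xml.etree.ElementTree as ET

def local_name(tag):
    # Drop the "{namespace}" prefix that ElementTree prepends to PMML tags
    return tag.split("}")[-1]

root = ET.parse("Model.pmml").getroot()

# Collect the declared categories of the target field ("typ2" here)
declared = set()
for data_field in root.iter():
    if local_name(data_field.tag) == "DataField" and data_field.get("name") == "typ2":
        declared = {v.get("value") for v in data_field if local_name(v.tag) == "Value"}

# Every RegressionTable@targetCategory must be one of the declared values
for elem in root.iter():
    if local_name(elem.tag) == "RegressionTable":
        category = elem.get("targetCategory")
        if category is not None and category not in declared:
            print("Mismatch:", category, "not among", declared)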
Now I got this error: "Field 'decisionFunction(1)' is not defined"
This field is originally declared as some OutputField element. Assuming a binary classification GBDT model, there should be exactly one such field in that PMML document.
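To verify that the field is indeed declared, one can simply list the OutputField names in the document; a sketch, again with a placeholder file name:

import xml.etree.ElementTree as ET

root = ET.parse("Model.pmml").getroot()
output_fields = [e.get("name") for e in root.iter() if e.tag.split("}")[-1] == "OutputField"]
print(output_fields)  # for a binary GBDT, expect a single "decisionFunction(...)" entry here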
As a general comment, I can recall a similar "field not found" exception reported against one of the SkLearn2PMML or JPMML-SkLearn projects in the past. Moreover, I can even remember fixing it.
What's your SkLearn2PMML package version? Maybe you're running some outdated version?
Any chance you can provide a reproducible example (Python script plus a CSV input file) that generates such broken PMML documents?
My sklearn2pmml version is 0.39.0. Python version is 3.5.2 and sklearn version is 0.20.1
I am afraid I cannot offer a full example because of my client's regulations. I will try to work on a toy example to reproduce it.
Strange thing: when I switch to Python 2.7, I can generate a PMML file with a GBDT that contains a double-type label, and this file can be correctly parsed by the Spark JPMML-Evaluator to build GBDT models. Attached is my PMML file.
So what is the problem here?
When I switch to Python 2.7, I can generate a PMML file with a GBDT that contains a double-type label.
Must be that Python 2.X uses different Pandas/Numpy/Pickle package versions than Python 3.X. And these different package versions take care of the double to integer conversion automatically.
In your public result.txt file the decisionFunction(1.0) field is defined on line 84, and then it is referenced exactly once on line 94. In your failing file, does the field resolution exception ("decisionFunction(1.0) is not defined") also happen in the same place (i.e. inside the transformedDecisionFunction(1.0) output field declaration)?
The error message did not point out which line of my file is wrong. The actual exception is "Failed to execute user defined function(blablabla...)". In the stack trace of the exception I found that it was caused by a "Field 'decisionFunction(1)' is not defined" exception.
By the way, my Spark JPMML-Evaluator version is 1.2.0.
@vruusmann
Sent you an email describing the error in detail.
I will summarize the stack trace as follows:
org.apache.spark.SparkException: Failed to execute user defined function ...... (with mismatched data types)
caused by:org.shaded.jpmml.evaluator.MissingFieldException: Field 'decisionFunction(1.0)' is not defined
at org.shaded.jpmml.evaluator.EvaluationContext.lookup(EvaluationContext.java:64)
Do you have any idea what could trigger this kind of exception?
Got your e-mail with the screenshot of the stack trace.
On that image, the field was called decisionFunction(1) (note the missing .0 suffix), which suggests that the exception happens with integer labels too? Or had you already modified this PMML document in some way?
In any case, I would really need to have access to a reproducible test case. There is full integration test coverage for sklearn.ensemble.GradientBoostingClassifier available here:
https://github.com/jpmml/jpmml-sklearn/blob/master/src/test/resources/main.py#L188
https://github.com/jpmml/jpmml-sklearn/blob/master/src/test/resources/main.py#L394
Both integration test cases convert and evaluate correctly. What are you doing differently?
Attached is a demo archive, which trains a GradientBoostingClassifier for a binary classification problem where the label is encoded as double (0.0/1.0).
Training:
$ python main.py
Scoring:
$ java -jar ~/Workspace/jpmml-evaluator/pmml-evaluator-example/target/pmml-evaluator-example-executable-1.4-SNAPSHOT.jar --model Audit.pmml --input Audit.csv --output Audit-results.csv --copy-columns false
Everything works as advertised. Can you "break" this demo archive (changing something about GradientBoostingClassifier parameterization) so that it would start throwing this "field not found" exception?
I trained the model with an integer label and the problem is still there. The definition of "decisionFunction(1)" is right there in the output field, and the evaluator cannot parse it, complaining about a missing field. Again, I cannot reproduce it in my own environment. Is it possible that it's related to an environment issue, like an outdated pmml-model dependency in the classpath?
In my demo archive, I can change the data type of the label column from double to integer, and everything still works correctly:
df["Adjusted"] = df["Adjusted"].astype(int)
I have demonstrated to you two times that everything is OK. If you claim otherwise, then you need to back up your claims with hard evidence.
@vruusmann Now I am able to reproduce the issue with a tiny program and a small GBDT PMML file. The following are the program and the PMML file. Remove the .txt suffix and you can run them.
In the program file PMMLUnitTest, I constructed two small data frames, inputDF1 and inputDF2. The error happens when the evaluator is evaluating inputDF2. The error message is as follows:
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$evaluationFunction$1$1: (struct<alcohol:double,typea:double,tobacco:double,age:double>) => struct<chd:int,probability(0):double,probability(1):double>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.shaded.jpmml.evaluator.MissingFieldException: Field "decisionFunction(1)" is not defined
at org.shaded.jpmml.evaluator.EvaluationContext.lookup(EvaluationContext.java:64)
at org.shaded.jpmml.evaluator.mining.MiningModelEvaluator.evaluateSegmentation(MiningModelEvaluator.java:589)
at org.shaded.jpmml.evaluator.mining.MiningModelEvaluator.evaluateClassification(MiningModelEvaluator.java:315)
at org.shaded.jpmml.evaluator.mining.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:240)
at org.shaded.jpmml.evaluator.mining.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:209)
at org.shaded.jpmml.evaluator.spark.PMMLTransformer$$anonfun$evaluationFunction$1$1.apply(PMMLTransformer.scala:78)
at org.shaded.jpmml.evaluator.spark.PMMLTransformer$$anonfun$evaluationFunction$1$1.apply(PMMLTransformer.scala:66)
... 16 more
Thanks for the update - it's an important piece of information that the field lookup exception happens selectively (it does not happen with inputDF1, but it does happen with inputDF2).
The most likely explanation is that the evaluation of inputDF2 produces a so-called "missing value" in the first stage. The second stage expects to find a non-missing value; from its perspective, a "missing value" is the same as an "undefined value".
Will take some time to think about an appropriate solution. One thing is that the exception message should be more explicit about this distinction between "missing value" and "undefined value" - at the moment it seems to suggest that perhaps some JPMML converter library is producing incorrect PMML documents (whereas in reality all JPMML converters and evaluators are correct, and the problem is related to the input data record).
Another thing is that it's possible to customize the "missing prediction handling" at the PMML language level:
http://mantis.dmg.org/view.php?id=178
In the current case, the model evaluation process should probably throw org.jpmml.evaluator.InvalidResultException instead - the input data record is incomplete, and it's impossible to perform the requested computation on it.
The most likely explanation is that the evaluation of inputDF2 produces a so-called "missing value" in the first stage.
To elaborate - the first data record has Some(1.0), but the second data record has None.
Another solution is that the MiningSchema element of this model should simply state that all input fields must have non-missing values defined (i.e. MiningField@missingValueTreatment="x-returnInvalid").
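A sketch of patching that attribute into an existing document with ElementTree; the file names and the PMML namespace URI are placeholders that should be matched to your document:

import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_3"  # adjust to the namespace declared in your document
ET.register_namespace("", PMML_NS)

tree = ET.parse("Model.pmml")
for mining_field in tree.getroot().iter("{%s}MiningField" % PMML_NS):
    # Only touch input fields; the target field keeps its default treatment
    if mining_field.get("usageType") in (None, "active"):
        mining_field.set("missingValueTreatment", "x-returnInvalid")

tree.write("Model-strict.pmml", encoding="utf-8", xml_declaration=True)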
Interestingly: If I change the content of inputDF1, as follows:
val inputRDD1 = spark.sparkContext.parallelize(Seq(
  TestEntry(
    a_date = "2018-11-01", adiposity = 38.03, alcohol = Some(24.26), b_date = "2018-10-02",
    chd = 1.0, dst_sp = 114.0, famhist = "Present", from_key = "node/bb", ldl = 6.41,
    new_diff = 30.0, obesity = 31.99, sbp = 170.0, src_sp = 170.0, to_key = "node/cc",
    tobacco = None, typea = 51.0, vfeature = 170.0, age = Some(58.0)
  )
))
val inputDF1 = spark.sqlContext.createDataFrame(inputRDD1)
The evaluation of inputDF1 will NOT crash, even though the "tobacco" feature has a null value.
My assumption is that this particular combination of feature values happens to bypass the branch in the model which triggers the evaluation of the "tobacco" feature, and which would otherwise have triggered the "MissingFieldException".
Am I correct?
My assumption is that this particular combination of feature values happens to bypass the branch which triggers the evaluation of the "tobacco" feature.
Exactly. If you want to trigger this exception on purpose with inputDF1, then you need to set the value of some top-level input field to a missing value.
For example, the "age" input field appears to be a popular first splitting criterion. If you set the value of the "age" input field to a missing value, then the prediction should always fail.
This exception was changed from MissingFieldException to MissingValueException in JPMML-Evaluator version 1.4.5:
jpmml/jpmml-evaluator@60a836e
The base version of the JPMML-Evaluator-Spark project is currently 1.4.4:
https://github.com/jpmml/jpmml-evaluator-spark/blob/master/pom.xml#L8
So, a simple base version update (scheduled to happen later this week) should solve most of the confusion.