Error: py4j.protocol.Py4JJavaError: An error occurred while calling o204.fit.
This error occurs during the fitting step with sparkxgb.
I am using Spark 3.0.2, pre-built for Hadoop 3.2.
I tried both Java 11 and Java 8, but the error persists.
The complete error log is below:
Traceback (most recent call last):
File "/tmp/spark-2ad69262-722e-457b-bacb-31338c62f081/train_xgb.py", line 274, in
train(dfs, input_cols, target_col)
File "/tmp/spark-2ad69262-722e-457b-bacb-31338c62f081/train_xgb.py", line 171, in train
model = cv.fit(train_sdf)
File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 129, in fit
File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/tuning.py", line 436, in _fit
File "/usr/lib/python3.7/multiprocessing/pool.py", line 748, in next
raise value
File "/usr/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/tuning.py", line 436, in
File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/tuning.py", line 53, in singleTask
File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 62, in next
File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 103, in fitSingleModel
File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 127, in fit
File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 321, in _fit
File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 318, in _fit_java
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in call
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 128, in deco
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o204.fit.
: org.apache.spark.SparkException: ML algorithm was given empty dataset.
at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:147)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:174)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:40)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:150)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:114)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.base/java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Unknown Source)
Any help is appreciated.
I'd start by looking at the shape and structure of your dataset.
org.apache.spark.SparkException: ML algorithm was given empty dataset.
For xgboost you'll need to have your target variable encoded as a numeric* index (same goes for any categorical predictors).
(* I think it's a DoubleType, but can't remember off the top of my head.)
If you're stuck, check out the example notebook, which uses pyspark.ml.feature.StringIndexer to do exactly that.
For the predictors, you'll need to run them all through pyspark.ml.feature.VectorAssembler. Again, that's in the example notebook.
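For anyone landing here later, the two steps above can be sketched roughly as follows. This is a minimal sketch, not the example notebook's code: the function names and the output column names `label` and `features` are assumptions made for illustration, and the predictors are assumed to be numeric already (categorical predictors would each need their own StringIndexer first).

```python
def feature_columns(all_columns, target_col):
    """Return every column except the target, preserving order."""
    return [c for c in all_columns if c != target_col]


def build_preprocessing(input_cols, target_col):
    """Build a Pipeline that indexes the target and assembles the predictors.

    The output column names "label" and "features" are illustrative choices.
    """
    # Imported lazily so the helper above can be used without a Spark install.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    # StringIndexer maps each distinct target value to a double-valued index.
    indexer = StringIndexer(inputCol=target_col, outputCol="label")
    # VectorAssembler packs the numeric predictors into a single vector column.
    assembler = VectorAssembler(inputCols=input_cols, outputCol="features")
    return Pipeline(stages=[indexer, assembler])
```

With a DataFrame such as `train_sdf` from the traceback, usage would look something like `build_preprocessing(input_cols, target_col).fit(train_sdf).transform(train_sdf)`, with the classifier then pointed at the `label` and `features` columns.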
Sorry for the late reply.
There was no problem with the example notebook; it ran smoothly. The problem was with my dataset: I had to convert the numerical columns to DoubleType, and then it worked fine.
Thank you.
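The conversion described above might look something like this. It is a sketch under assumptions, not the code actually used in this thread: the helper names are hypothetical, column names are assumed to contain no dots, and the type strings are the ones PySpark reports in `DataFrame.dtypes`.

```python
# Type names as they appear in DataFrame.dtypes for non-double numeric columns.
NON_DOUBLE_NUMERIC = ("tinyint", "smallint", "int", "bigint", "float")


def columns_to_cast(dtypes):
    """Given (name, type_string) pairs, return the names needing a cast to double."""
    return [
        name
        for name, type_name in dtypes
        if type_name in NON_DOUBLE_NUMERIC or type_name.startswith("decimal")
    ]


def cast_numeric_to_double(df):
    """Return a new DataFrame with every non-double numeric column cast to double."""
    # Imported lazily so columns_to_cast can be used without a Spark install.
    from pyspark.sql.functions import col

    to_cast = set(columns_to_cast(df.dtypes))
    # Cast only the selected columns; all others pass through unchanged.
    return df.select(
        [col(c).cast("double").alias(c) if c in to_cast else col(c) for c in df.columns]
    )
```

Calling `cast_numeric_to_double(train_sdf)` before fitting would leave string and double columns untouched and convert the rest.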
@akshayparanjape when you say the notebook runs smoothly, could you please tell me:
- what Spark version are you using?
- what xgboost4j and xgboost4j-spark JARs are you using?
- what Python version are you on?
- what command/env/vars are you using?
With almost every version of Spark and XGBoost I have tried, and with Python set to python3, I just run into the following:
java.io.IOException: Cannot run program "python": error=2, No such file or directory
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
at java.base/java.lang.Runtime.exec(Runtime.java:592)
at java.base/java.lang.Runtime.exec(Runtime.java:416)
at java.base/java.lang.Runtime.exec(Runtime.java:313)
at ml.dmlc.xgboost4j.java.RabitTracker.startTrackerProcess(RabitTracker.java:136)
at ml.dmlc.xgboost4j.java.RabitTracker.start(RabitTracker.java:170)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.startTracker(XGBoost.scala:393)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:549)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:191)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:40)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:150)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: error=2, No such file or directory
at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:340)
at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:271)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1107)
... 22 more
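The trace shows RabitTracker.startTrackerProcess trying to launch a program literally named "python" and failing with error=2, which suggests that no executable by that name is on the PATH seen by the JVM. One way to check is the small script below. It is only a diagnostic sketch, not a confirmed fix from this thread; the function name is made up for illustration, and it assumes the script runs in the same environment as the Spark driver.

```python
import shutil


def locate_interpreters(names=("python", "python3")):
    """Map each executable name to its full path on PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_interpreters().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If `python` comes back as not found while `python3` resolves, that would be consistent with the error above.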