sllynn/spark-xgboost

Cannot Load model using PySpark xgboost4j

WillSmisi opened this issue · 4 comments

Background

I have a small PySpark program that uses xgboost4j and xgboost4j-spark to train a model on a dataset held in a Spark DataFrame.

Training and saving the model work, but it seems I cannot load it back.

Current library versions:

  • PySpark 2.4.5

  • xgboost4j 0.91

  • xgboost4j-spark 0.91

The main process is as follows:

trainingData, testData = data.randomSplit([0.7, 0.3])

vectorAssembler = (
    VectorAssembler()
        .setInputCols(numeric_features_new)
        .setOutputCol(FEATURES)
)
scaler = MinMaxScaler(inputCol=FEATURES,
                      outputCol=FEATURES + '_scaler')
assemblerInputCols = FEATURES + '_scaler'  # name of the scaled feature column

xgb_params = dict(
        eta=0.1,
        maxDepth=2,
        missing=0.0,
        objective="binary:logistic",
        numRound=5,
        numWorkers=1
    )

xgb = (
      XGBoostClassifier(**xgb_params)
          .setFeaturesCol(assemblerInputCols)
          .setLabelCol(LABEL)
  )

pipeline = Pipeline(stages=[
             vectorAssembler,
             scaler,
             xgb
           ])
print "training model"
pipline_model = pipeline.fit(trainingData)
print "saving model to S3"
pipline_model.write().overwrite().save(modelOssDir)
print "saved model to S3"
print "Loading model..."
pipline_model = PipelineModel.load(modelOssDir)

The error I get:

Traceback (most recent call last):
  File "xgboost.py", line 95, in <module>
    pipline_model = PipelineModel.load(modelOssDir)
  File "/home/admin/1610603211241401722_0/pyspark.zip/pyspark/ml/util.py", line 362, in load
  File "/home/admin/1610603211241401722_0/pyspark.zip/pyspark/ml/pipeline.py", line 242, in load
  File "/home/admin/1610603211241401722_0/pyspark.zip/pyspark/ml/util.py", line 304, in load
  File "/home/admin/1610603211241401722_0/pyspark.zip/pyspark/ml/pipeline.py", line 299, in _from_java
  File "/home/admin/1610603211241401722_0/pyspark.zip/pyspark/ml/wrapper.py", line 227, in _from_java
  File "/home/admin/1610603211241401722_0/pyspark.zip/pyspark/ml/wrapper.py", line 221, in __get_class
ImportError: No module named ml.dmlc.xgboost4j.scala.spark
at com.aliyun.odps.cupid.CupidUtil.errMsg2SparkException(CupidUtil.java:50)
    at com.aliyun.odps.cupid.CupidUtil.getResult(CupidUtil.java:131)
    at com.aliyun.odps.cupid.requestcupid.YarnClientImplUtil.pollAMStatus(YarnClientImplUtil.java:108)
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.applicationReportTransform(YarnClientImpl.java:377)
    ... 12 more
21/01/22 11:39:21 ERROR Client: Application diagnostics message: Failed to contact YARN for application application_1611286494541_745555769.
Exception in thread "main" org.apache.spark.SparkException: Application application_1611286494541_745555769 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1166)
    at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1543)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I have been searching the net for a long time, but with no luck. Please help, or give me some ideas on how to achieve this.

Thanks in advance.
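(For context on where this ImportError comes from: when PipelineModel.load deserializes a pipeline, PySpark takes each stage's Java class name, here ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel, and tries to import a Python module with the same dotted path to find the matching wrapper class. No such Python module exists, so the load fails. Note also that the xgboost4j / xgboost4j-spark jars must be on the classpath of the job doing the loading, not only the training job.

A commonly suggested workaround is to register stub modules under that dotted path before calling load. The sketch below assumes the sparkxgb wrapper from this repo is importable on the driver and exposes an XGBoostClassificationModel class; adjust the names to your install:

import sys
import types

import sparkxgb  # this repo's PySpark wrapper (assumed importable on the driver)

# PipelineModel.load imports the module part of the Java class name and then
# walks attributes, so register a stub module chain for that path and expose
# the wrapper's model class on the leaf module.
parent, prefix = None, ""
for part in "ml.dmlc.xgboost4j.scala.spark".split("."):
    prefix = part if not prefix else prefix + "." + part
    mod = sys.modules.setdefault(prefix, types.ModuleType(prefix))
    if parent is not None:
        setattr(parent, part, mod)
    parent = mod

# Assumed class name; check what your sparkxgb build actually exports.
parent.XGBoostClassificationModel = sparkxgb.XGBoostClassificationModel

from pyspark.ml import PipelineModel
pipline_model = PipelineModel.load(modelOssDir))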

Do you get an error with v0.9 of xgboost4j / xgboost4j-spark?

Do you get an error with v0.9 of xgboost4j / xgboost4j-spark?

I guess so. I succeeded in training the xgboost model and uploading it, but failed to load it.

Do you get an error with v0.9 of xgboost4j / xgboost4j-spark?

Thanks for your reply. Have you tried to save the model and then load it?

You can save the model as below:

pipe = Pipeline(stages=stages + [xgb])
model = pipe.fit(data)
model.write().overwrite().save(modelpath)

and load it later as:

from pyspark.ml import PipelineModel
model = PipelineModel.load(modelpath)

This worked for me.

You can also directly save and load XGBoostClassifier or XGBRegressor, since they have JavaWriter as a parent class.
One point to note here: if you are training on a distributed system, then you will have to save the model to a distributed storage system such as HDFS or Amazon S3.
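To illustrate both points, here is a minimal sketch of saving a fitted model directly to distributed storage and loading it back. It assumes the sparkxgb wrapper exposes XGBoostClassifier and XGBoostClassificationModel (class names may differ in your build) and uses placeholder column names and paths:

from sparkxgb import XGBoostClassifier, XGBoostClassificationModel  # assumed names

xgb = (
    XGBoostClassifier(objective="binary:logistic", numRound=5, numWorkers=1)
        .setFeaturesCol("features")
        .setLabelCol("label")
)
model = xgb.fit(trainingData)

# On a cluster, save to distributed storage (HDFS / S3), not a node-local path.
model.write().overwrite().save("hdfs:///models/xgb_model")

# Load the model class directly instead of going through PipelineModel.
model = XGBoostClassificationModel.load("hdfs:///models/xgb_model")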