JPMML-SparkML-XGBoost

JPMML-SparkML plugin for converting XGBoost4J-Spark models to PMML.

Prerequisites

Apache Spark 2.0.X or 2.1.X.
XGBoost4J-Spark 0.7.

Installation

Enter the project root directory and build using Apache Maven:

mvn clean install

The build installs the JPMML-SparkML-XGBoost library into the local Maven repository under the coordinates org.jpmml:jpmml-sparkml-xgboost:1.0-SNAPSHOT.
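Once installed, the library can be referenced from another build by those coordinates. A minimal sketch, assuming an sbt project (sbt is not mentioned by this README; a Maven project would declare the same coordinates in its pom.xml):

```scala
// build.sbt fragment (assumption: the consuming project uses sbt)

// Let sbt resolve artifacts from the local Maven repository (~/.m2/repository),
// which is where "mvn clean install" places the snapshot build.
resolvers += Resolver.mavenLocal

// Coordinates as reported by the build step above.
libraryDependencies += "org.jpmml" % "jpmml-sparkml-xgboost" % "1.0-SNAPSHOT"
```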

Usage

The JPMML-SparkML-XGBoost library extends the JPMML-SparkML library with support for ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel and ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel prediction model classes.

Launch the Spark shell with the XGBoost-extended JPMML-SparkML-Package JAR on the classpath (--jars), and use --packages to pull in the XGBoost4J-Spark runtime dependency:

spark-shell --packages ml.dmlc:xgboost4j-spark:0.7 --jars jpmml-sparkml-package-1.1-SNAPSHOT.jar

Fitting and exporting an example pipeline model:

import ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.RFormula
import org.jpmml.sparkml.ConverterUtil

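// Load the Iris dataset, inferring column types from the CSV contents
val df = spark.read.option("header", "true").option("inferSchema", "true").csv("Iris.csv")

// Species is the label; all remaining columns are used as features
val formula = new RFormula().setFormula("Species ~ .")
// Three-class softmax objective, trained for 11 boosting rounds
var estimator = new XGBoostEstimator(Map("objective" -> "multi:softmax", "num_class" -> 3))
estimator = estimator.set(estimator.round, 11)

val pipeline = new Pipeline().setStages(Array(formula, estimator))
val pipelineModel = pipeline.fit(df)

// Convert the fitted pipeline to PMML and print the resulting XML document
val pmmlBytes = ConverterUtil.toPMMLByteArray(df.schema, pipelineModel)
println(new String(pmmlBytes, "UTF-8"))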
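
In practice the PMML document is usually saved to a file rather than printed. The following is a minimal sketch using only the JDK's java.nio.file API. The helper name savePmml and the placeholder bytes are hypothetical and not part of JPMML-SparkML; in the Spark shell session above, the pmmlBytes value would be passed instead.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Hypothetical helper (not part of JPMML-SparkML): writes PMML bytes to the
// given path, creating or overwriting the file, and returns that path.
def savePmml(pmmlBytes: Array[Byte], target: Path): Path =
  Files.write(target, pmmlBytes)

// Placeholder standing in for the output of ConverterUtil.toPMMLByteArray
val exampleBytes = "<PMML/>".getBytes(StandardCharsets.UTF_8)

// Write to a temporary file and read it back to confirm the round trip
val written = savePmml(exampleBytes, Files.createTempFile("xgboost-iris", ".pmml"))
val roundTrip = new String(Files.readAllBytes(written), StandardCharsets.UTF_8)
```

Writing the raw byte array avoids re-encoding the document, so the file keeps the same character encoding the converter produced.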

License

JPMML-SparkML-XGBoost is licensed under the GNU Affero General Public License (AGPL) version 3.0. Other licenses are available on request.

Additional information