PySpark2PMML

Python package for converting Apache Spark ML pipelines to PMML.

Features

This package is a thin PySpark wrapper for the JPMML-SparkML library.

Prerequisites

  • Apache Spark 3.0.X, 3.1.X, 3.2.X, 3.3.X, 3.4.X or 3.5.X.
  • Python 2.7, 3.4 or newer.

Installation

Install a release version from PyPI:

pip install pyspark2pmml

Alternatively, install the latest snapshot version from GitHub:

pip install --upgrade git+https://github.com/jpmml/pyspark2pmml.git

Configuration and usage

PySpark2PMML must be paired with JPMML-SparkML based on the following compatibility matrix:

Apache Spark version   JPMML-SparkML branch   Latest JPMML-SparkML version
3.0.X                  2.0.X                  2.0.3
3.1.X                  2.1.X                  2.1.3
3.2.X                  2.2.X                  2.2.3
3.3.X                  2.3.X                  2.3.2
3.4.X                  2.4.X                  2.4.1
3.5.X                  master                 2.5.0

Launch PySpark; use the --packages command-line option to specify the coordinates of relevant JPMML-SparkML modules:

  • org.jpmml:pmml-sparkml:${version} - Core module.
  • org.jpmml:pmml-sparkml-lightgbm:${version} - LightGBM via SynapseML extension module.
  • org.jpmml:pmml-sparkml-xgboost:${version} - XGBoost via XGBoost4J-Spark extension module.

Launching PySpark with the core module only:

pyspark --packages org.jpmml:pmml-sparkml:${version}
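Picking the right ${version} by hand is easy to get wrong, so the lookup can be scripted. The following is a minimal sketch, not part of PySpark2PMML: the helper name jpmml_sparkml_packages and the JPMML_SPARKML_VERSIONS dictionary are hypothetical, and the version numbers are copied from the compatibility matrix above, so they may lag behind newer JPMML-SparkML releases.

```python
# Latest JPMML-SparkML version per Apache Spark minor version,
# copied from the compatibility matrix in this README (may be outdated).
JPMML_SPARKML_VERSIONS = {
    "3.0": "2.0.3",
    "3.1": "2.1.3",
    "3.2": "2.2.3",
    "3.3": "2.3.2",
    "3.4": "2.4.1",
    "3.5": "2.5.0",
}


def jpmml_sparkml_packages(spark_version, extensions=()):
    """Build the comma-separated value for the --packages option.

    spark_version: full Apache Spark version string, such as "3.5.1".
    extensions: extension module suffixes, such as ("lightgbm", "xgboost").
    """
    # Reduce "3.5.1" to "3.5" to match the keys of the lookup table.
    major_minor = ".".join(spark_version.split(".")[:2])
    if major_minor not in JPMML_SPARKML_VERSIONS:
        raise ValueError("Unsupported Apache Spark version: " + spark_version)
    version = JPMML_SPARKML_VERSIONS[major_minor]
    # The core module always comes first, followed by the requested extensions.
    modules = ["pmml-sparkml"] + ["pmml-sparkml-" + ext for ext in extensions]
    return ",".join("org.jpmml:%s:%s" % (module, version) for module in modules)
```

In a running session, pyspark.__version__ can be passed as spark_version; the returned string is what goes after --packages on the command line.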

Fitting a Spark ML pipeline (the PySpark shell predefines the spark session and the sc context used below):

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import RFormula

df = spark.read.csv("Iris.csv", header = True, inferSchema = True)

formula = RFormula(formula = "Species ~ .")
classifier = DecisionTreeClassifier()
pipeline = Pipeline(stages = [formula, classifier])
pipelineModel = pipeline.fit(df)

Exporting the fitted Spark ML pipeline to a PMML file:

from pyspark2pmml import PMMLBuilder

pmmlBuilder = PMMLBuilder(sc, df, pipelineModel)

pmmlBuilder.buildFile("DecisionTreeIris.pmml")
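After exporting, it can be useful to confirm that the file really is a PMML document and to see which fields and model elements it declares. The following is a minimal sketch using only the Python standard library; the helper name summarize_pmml is hypothetical and not part of PySpark2PMML, and it assumes the usual PMML layout of a PMML root element holding a DataDictionary and one or more model elements.

```python
import xml.etree.ElementTree as ET

# Top-level PMML elements that are not models.
NON_MODEL_ELEMENTS = {
    "Header",
    "MiningBuildTask",
    "DataDictionary",
    "TransformationDictionary",
    "Extension",
}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def summarize_pmml(path):
    """Return the PMML version, model element names and data field names."""
    root = ET.parse(path).getroot()
    if local_name(root.tag) != "PMML":
        raise ValueError("Not a PMML document, root element is " + root.tag)
    # Any direct child that is not a bookkeeping element is treated as a model.
    models = [
        local_name(child.tag)
        for child in root
        if local_name(child.tag) not in NON_MODEL_ELEMENTS
    ]
    # DataField elements live inside the DataDictionary element.
    data_fields = [
        element.get("name")
        for element in root.iter()
        if local_name(element.tag) == "DataField"
    ]
    return {
        "version": root.get("version"),
        "models": models,
        "data_fields": data_fields,
    }
```

Calling summarize_pmml("DecisionTreeIris.pmml") on the file written above returns a dictionary that can be printed or checked in a test.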

The representation of individual Spark ML pipeline stages can be customized via conversion options:

from pyspark2pmml import PMMLBuilder

classifierModel = pipelineModel.stages[1]

pmmlBuilder = PMMLBuilder(sc, df, pipelineModel) \
	.putOption(classifierModel, "compact", False) \
	.putOption(classifierModel, "estimate_featureImportances", True)

pmmlBuilder.buildFile("DecisionTreeIris.pmml")
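When several options apply to the same stage, the chained putOption calls can be driven from a dictionary. The following is a minimal sketch: the helper name put_options is hypothetical, and it relies only on the behaviour visible in the example above, namely that putOption(stage, key, value) returns an object on which the next call can be chained.

```python
def put_options(pmml_builder, stage, options):
    """Apply every key/value pair in options to one pipeline stage.

    pmml_builder: an object exposing putOption(stage, key, value),
        such as the PMMLBuilder instance created above.
    stage: the fitted pipeline stage the options refer to.
    options: a dict mapping option names to option values.
    """
    for key, value in options.items():
        # Keep the returned object so the calls chain as in the example above.
        pmml_builder = pmml_builder.putOption(stage, key, value)
    return pmml_builder


# Usage, mirroring the two options shown above:
# pmmlBuilder = put_options(
#     PMMLBuilder(sc, df, pipelineModel),
#     classifierModel,
#     {"compact": False, "estimate_featureImportances": True},
# )
```

Keeping the options in a dictionary makes it straightforward to load them from a configuration file or to reuse the same set across several exports.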

License

PySpark2PMML is licensed under the terms and conditions of the GNU Affero General Public License, Version 3.0.

If you would like to use PySpark2PMML in a proprietary software project, then it is possible to enter into a licensing agreement which makes PySpark2PMML available under the terms and conditions of the BSD 3-Clause License instead.

Additional information

PySpark2PMML is developed and maintained by Openscoring Ltd, Estonia.

Interested in using Java PMML API software in your company? Please contact info@openscoring.io