JPMML-SkLearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML.

Features

Overview

  • Functionality:
    • Three times more supported Python packages, transformers and estimators than all the competitors combined!
    • Thorough collection, analysis and encoding of feature information:
      • Names.
      • Data and operational types.
      • Valid, invalid and missing value spaces.
      • Descriptive statistics.
    • Pipeline extensions:
      • Pruning.
      • Decision engineering (prediction post-processing).
      • Model verification.
    • Conversion options.
  • Extensibility:
    • Rich Java APIs for developing custom converters.
    • Automatic discovery and registration of custom converters based on META-INF/sklearn2pmml.properties resource files.
    • Direct interfacing with other JPMML conversion libraries such as JPMML-H2O, JPMML-LightGBM and JPMML-XGBoost.
  • Production quality:
    • Complete test coverage.
    • Fully compliant with the JPMML-Evaluator library.

Supported packages

Scikit-Learn

Examples: main.py

Category Encoders

Examples: extensions/category_encoders.py

H2O.ai

Examples: main-h2o.py

Imbalanced-Learn

Examples: extensions/imblearn.py

LightGBM

Examples: main-lightgbm.py

Mlxtend

Examples: N/A

Scikit-Lego

Examples: extensions/sklego.py

  • sklego.meta.EstimatorTransformer
    • Predict functions apply, decision_function, predict.
  • sklego.preprocessing.IdentityTransformer

SkLearn2PMML

Examples: main.py and extensions/sklearn2pmml.py

  • Helpers:
    • sklearn2pmml.EstimatorProxy
    • sklearn2pmml.SelectorProxy
  • Feature specification and decoration:
    • sklearn2pmml.decoration.Alias
    • sklearn2pmml.decoration.CategoricalDomain
    • sklearn2pmml.decoration.ContinuousDomain
    • sklearn2pmml.decoration.ContinuousDomainEraser
    • sklearn2pmml.decoration.DateDomain
    • sklearn2pmml.decoration.DateTimeDomain
    • sklearn2pmml.decoration.DiscreteDomainEraser
    • sklearn2pmml.decoration.MultiDomain
    • sklearn2pmml.decoration.OrdinalDomain
  • Ensemble methods:
    • sklearn2pmml.ensemble.GBDTLMRegressor
      • The GBDT side: All Scikit-Learn decision tree ensemble regressors, LGBMRegressor, XGBRegressor, XGBRFRegressor.
      • The LM side: A Scikit-Learn linear regressor (e.g. ElasticNet, LinearRegression, SGDRegressor).
    • sklearn2pmml.ensemble.GBDTLRClassifier
      • The GBDT side: All Scikit-Learn decision tree ensemble classifiers, LGBMClassifier, XGBClassifier, XGBRFClassifier.
      • The LR side: A Scikit-Learn binary linear classifier (e.g. LinearSVC, LogisticRegression, SGDClassifier).
    • sklearn2pmml.ensemble.SelectFirstClassifier
    • sklearn2pmml.ensemble.SelectFirstRegressor
  • Feature selection:
    • sklearn2pmml.feature_selection.SelectUnique
  • Neural networks:
    • sklearn2pmml.neural_network.MLPTransformer
  • Pipeline:
    • sklearn2pmml.pipeline.PMMLPipeline
  • Postprocessing:
    • sklearn2pmml.postprocessing.BusinessDecisionTransformer
  • Preprocessing:
    • sklearn2pmml.preprocessing.Aggregator
    • sklearn2pmml.preprocessing.CastTransformer
    • sklearn2pmml.preprocessing.ConcatTransformer
    • sklearn2pmml.preprocessing.CutTransformer
    • sklearn2pmml.preprocessing.DaysSinceYearTransformer
    • sklearn2pmml.preprocessing.ExpressionTransformer
      • Ternary conditional expression <expression_true> if <condition> else <expression_false>.
      • Array indexing expressions X[<column index>] and X[<column name>].
      • String concatenation expressions.
      • String slicing expressions <str>[<start>:<stop>].
      • Arithmetic operators +, -, *, / and %.
      • Identity comparison operators is None and is not None.
      • Comparison operators in <list>, not in <list>, <=, <, ==, !=, > and >=.
      • Logical operators and, or and not.
      • Value missingness check functions numpy.isnan, pandas.isnull and pandas.notnull.
      • NumPy universal functions.
      • String functions startswith(<prefix>), endswith(<suffix>), lower, upper and strip.
      • String length function len(<str>).
    • sklearn2pmml.preprocessing.FilterLookupTransformer
    • sklearn2pmml.preprocessing.LookupTransformer
    • sklearn2pmml.preprocessing.MatchesTransformer
    • sklearn2pmml.preprocessing.MultiLookupTransformer
    • sklearn2pmml.preprocessing.PMMLLabelBinarizer
    • sklearn2pmml.preprocessing.PMMLLabelEncoder
    • sklearn2pmml.preprocessing.PowerFunctionTransformer
    • sklearn2pmml.preprocessing.ReplaceTransformer
    • sklearn2pmml.preprocessing.SecondsSinceMidnightTransformer
    • sklearn2pmml.preprocessing.SecondsSinceYearTransformer
    • sklearn2pmml.preprocessing.StringNormalizer
    • sklearn2pmml.preprocessing.SubstringTransformer
    • sklearn2pmml.preprocessing.WordCountTransformer
    • sklearn2pmml.preprocessing.h2o.H2OFrameCreator
    • sklearn2pmml.preprocessing.scipy.BSplineTransformer
    • sklearn2pmml.util.Reshaper
  • Rule sets:
    • sklearn2pmml.ruleset.RuleSetClassifier
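
The expression forms listed for sklearn2pmml.preprocessing.ExpressionTransformer above are ordinary Python expressions over a row X. The sketch below is plain Python for illustration, not the library's implementation: it assumes that an expression is evaluated once per row with X bound to that row. The helper name evaluate_expression and the sample row are hypothetical.

```python
# Illustration only: evaluates an expression string against one row, with the
# row bound to the name X. The helper name evaluate_expression is hypothetical.
def evaluate_expression(expr, row):
    # Builtins are reduced to len(), the only builtin these examples need.
    return eval(expr, {"__builtins__": {"len": len}}, {"X": row})

# Made-up sample row, indexed by column name.
row = {"Sepal.Length": 5.1, "Sepal.Width": 0.0, "Species": " Setosa "}

# Ternary conditional combined with column-name indexing and arithmetic.
ratio = evaluate_expression(
    "X['Sepal.Length'] / X['Sepal.Width'] if X['Sepal.Width'] != 0 else -1", row)

# String functions, slicing and length.
code = evaluate_expression("X['Species'].strip().lower()[0:3]", row)
length = evaluate_expression("len(X['Species'].strip())", row)

# Identity comparison and list membership joined by a logical operator.
flag = evaluate_expression(
    "X['Species'] is not None and X['Species'].strip() in ['Setosa', 'Virginica']", row)

print(ratio, code, length, flag)
```

In a real pipeline the same expression string would be passed to ExpressionTransformer, which the converter then translates into the corresponding PMML expression.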

Sklearn-Pandas

Examples: main.py

  • sklearn_pandas.CategoricalImputer
  • sklearn_pandas.DataFrameMapper

TPOT

Examples: extensions/tpot.py

  • tpot.builtins.stacking_estimator.StackingEstimator

XGBoost

Examples: main-xgboost.py

Prerequisites

The Python side of operations

Validating Python installation:

# Recent Scikit-Learn versions no longer bundle joblib as sklearn.externals.joblib
import joblib, sklearn, sklearn_pandas, sklearn2pmml

print(sklearn.__version__)
print(joblib.__version__)
print(sklearn_pandas.__version__)
print(sklearn2pmml.__version__)
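
If one of the packages is missing, the import statement above fails before anything is printed. A more forgiving check, sketched here with the standard-library importlib.metadata module (Python 3.8 or newer), reports every package in one pass. The distribution names in REQUIRED are assumptions based on the usual PyPI names, and installed_versions is a hypothetical helper.

```python
from importlib import metadata

# Assumed PyPI distribution names for the packages imported above.
REQUIRED = ("scikit-learn", "joblib", "sklearn-pandas", "sklearn2pmml")

def installed_versions(names=REQUIRED):
    """Map each distribution name to its version string, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions().items():
    print(name, version if version is not None else "NOT INSTALLED")
```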

The JPMML-SkLearn side of operations

  • Java 1.8 or newer.
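
The Java requirement can be checked from Python before attempting a conversion. The sketch below assumes a java executable on the PATH. It relies on java -version writing its report to standard error with the version number in double quotes. The helper names parse_java_major and java_major_version are hypothetical.

```python
import re
import shutil
import subprocess

def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy numbering scheme: "1.8.0_292" means Java 8.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major

def java_major_version():
    """Return the major version of the `java` on the PATH, or None if absent."""
    if shutil.which("java") is None:
        return None
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr)

major = java_major_version()
print("Detected Java major version:", major)
if major is None or major < 8:
    print("Java 1.8 or newer is required")
```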

Installation

Enter the project root directory and build using Apache Maven:

mvn clean install

The build produces a library JAR file pmml-sklearn/target/pmml-sklearn-1.7-SNAPSHOT.jar, and an executable uber-JAR file pmml-sklearn-example/target/pmml-sklearn-example-executable-1.7-SNAPSHOT.jar.

Usage

A typical workflow can be summarized as follows:

  1. Use Python to train a model.
  2. Serialize the model in pickle data format to a file in a local filesystem.
  3. Use the JPMML-SkLearn command-line converter application to turn the pickle file into a PMML file.

The Python side of operations

Loading data to a pandas.DataFrame object:

import pandas

df = pandas.read_csv("Iris.csv")

iris_X = df[df.columns.difference(["Species"])]
iris_y = df["Species"]

First, creating a sklearn_pandas.DataFrameMapper object, which performs column-oriented feature engineering and selection work:

from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler
from sklearn2pmml.decoration import ContinuousDomain

column_preprocessor = DataFrameMapper([
    (["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], [ContinuousDomain(), StandardScaler()])
])
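
For each of the four columns, StandardScaler subtracts the column mean and divides by the column's population standard deviation, both learned during fitting. The plain-Python sketch below shows that arithmetic for illustration only; the function name standardize and the sample values are made up.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Scale values to zero mean and unit variance (population std, ddof = 0)."""
    mean = fmean(values)
    std = pstdev(values)
    # Constant columns are only centered, to avoid dividing by zero.
    scale = std if std != 0 else 1.0
    return [(v - mean) / scale for v in values]

sepal_length = [5.1, 4.9, 6.3, 5.8]  # made-up sample values
scaled = standardize(sepal_length)
print(scaled)
```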

Second, creating Transformer and Selector objects, which perform table-oriented feature engineering and selection work:

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn2pmml import SelectorProxy

table_preprocessor = Pipeline([
    ("pca", PCA(n_components = 3)),
    ("selector", SelectorProxy(SelectKBest(k = 2)))
])

Please note that stateless Scikit-Learn selector objects need to be wrapped into an sklearn2pmml.SelectorProxy object.

Third, creating an Estimator object:

from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(min_samples_leaf = 5)

Combining the above objects into a sklearn2pmml.pipeline.PMMLPipeline object, and running the experiment:

from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline([
    ("columns", column_preprocessor),
    ("table", table_preprocessor),
    ("classifier", classifier)
])
pipeline.fit(iris_X, iris_y)

Recording feature importance information in a pickle data format-compatible manner:

classifier.pmml_feature_importances_ = classifier.feature_importances_

Embedding model verification data:

pipeline.verify(iris_X.sample(n = 15))
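
Verification data pairs sample inputs with the predictions that the fitted pipeline makes for them, so that a PMML engine can later confirm that it reproduces those predictions. The sketch below illustrates the kind of tolerance check involved, following the precision and zeroThreshold semantics of the PMML ModelVerification element as I understand them. The function name matches_expected and the default tolerances are assumptions chosen for illustration.

```python
def matches_expected(actual, expected, precision=1e-6, zero_threshold=1e-16):
    """Check a numeric prediction against its recorded reference value."""
    if abs(expected) <= zero_threshold:
        # Near-zero references are compared on an absolute scale.
        return abs(actual) <= zero_threshold
    # Otherwise the tolerance is proportional to the reference value.
    return abs(actual - expected) <= precision * abs(expected)

# With precision = 0.05, a reference of 0.95 accepts values within 0.0475 of it.
print(matches_expected(0.93, 0.95, precision=0.05))
print(matches_expected(0.80, 0.95, precision=0.05))
```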

Storing the fitted PMMLPipeline object in pickle data format:

# Recent Scikit-Learn versions no longer bundle joblib as sklearn.externals.joblib
import joblib

joblib.dump(pipeline, "pipeline.pkl.z", compress = 9)

Please see the test script file main.py for more classification (binary and multi-class) and regression workflows.

The JPMML-SkLearn side of operations

Converting the pipeline pickle file pipeline.pkl.z to a PMML file pipeline.pmml:

java -jar pmml-sklearn-example/target/pmml-sklearn-example-executable-1.7-SNAPSHOT.jar --pkl-input pipeline.pkl.z --pmml-output pipeline.pmml
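
When the conversion has to run from a Python script, the same command line can be assembled and launched with the standard-library subprocess module. This is a minimal sketch, assuming the JAR path from the build step. The names build_convert_command and convert are hypothetical helpers, and only the --pkl-input and --pmml-output options shown above are used.

```python
import subprocess

# Path of the executable uber-JAR produced by the Maven build.
CONVERTER_JAR = "pmml-sklearn-example/target/pmml-sklearn-example-executable-1.7-SNAPSHOT.jar"

def build_convert_command(pkl_path, pmml_path, jar_path=CONVERTER_JAR):
    """Assemble the converter command line as an argument list."""
    return ["java", "-jar", jar_path, "--pkl-input", pkl_path, "--pmml-output", pmml_path]

def convert(pkl_path, pmml_path, jar_path=CONVERTER_JAR):
    """Run the converter; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_convert_command(pkl_path, pmml_path, jar_path), check=True)

print(" ".join(build_convert_command("pipeline.pkl.z", "pipeline.pmml")))
```

Calling convert("pipeline.pkl.z", "pipeline.pmml") requires both Java and the built JAR file to be present. Passing the arguments as a list avoids shell quoting problems with unusual file names.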

Getting help:

java -jar pmml-sklearn-example/target/pmml-sklearn-example-executable-1.7-SNAPSHOT.jar --help

License

JPMML-SkLearn is licensed under the terms and conditions of the GNU Affero General Public License, Version 3.0.

If you would like to use JPMML-SkLearn in a proprietary software project, then it is possible to enter into a licensing agreement which makes JPMML-SkLearn available under the terms and conditions of the BSD 3-Clause License instead.

Additional information

JPMML-SkLearn is developed and maintained by Openscoring Ltd, Estonia.

Interested in using Java PMML API software in your company? Please contact info@openscoring.io.