JPMML-SkLearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML.

Features

Overview

  • Functionality:
    • Three times more supported Python packages, transformers and estimators than all the competitors combined!
    • Thorough collection, analysis and encoding of feature information:
      • Names.
      • Data and operational types.
      • Valid, invalid and missing value spaces.
      • Descriptive statistics.
    • Pipeline extensions:
      • Pruning.
      • Decision engineering (prediction post-processing).
      • Model verification.
    • Conversion options.
  • Extensibility:
    • Rich Java APIs for developing custom converters.
    • Automatic discovery and registration of custom converters based on META-INF/sklearn2pmml.properties resource files.
    • Direct interfacing with other JPMML conversion libraries such as JPMML-H2O, JPMML-LightGBM, JPMML-StatsModels and JPMML-XGBoost.
  • Production quality:
    • Complete test coverage.
    • Fully compliant with the JPMML-Evaluator library.

Supported packages

Scikit-Learn

Examples: main.py

BorutaPy

Examples: extensions/boruta.py

  • boruta.BorutaPy

Category Encoders

Examples: extensions/category_encoders.py and extensions/category_encoders-xgboost.py

H2O.ai

Examples: main-h2o.py

Hyperopt-sklearn

Examples: extensions/hpsklearn.py

  • hpsklearn.HyperoptEstimator

Imbalanced-Learn

Examples: extensions/imblearn.py

InterpretML

Examples: extensions/interpret.py

LightGBM

Examples: main-lightgbm.py

Mlxtend

Examples: N/A

OptBinning

Examples: extensions/optbinning.py

PyCaret

Examples: extensions/pycaret.py

  • pycaret.internal.pipeline.Pipeline
  • pycaret.internal.preprocess.transformers.CleanColumnNames
  • pycaret.internal.preprocess.transformers.FixImbalancer
  • pycaret.internal.preprocess.transformers.RareCategoryGrouping
  • pycaret.internal.preprocess.transformers.RemoveMulticollinearity
  • pycaret.internal.preprocess.transformers.RemoveOutliers
  • pycaret.internal.preprocess.transformers.TransformerWrapper
  • pycaret.internal.preprocess.transformers.TransformerWrapperWithInverse

Scikit-Lego

Examples: extensions/sklego.py

  • sklego.meta.EstimatorTransformer
    • Predict functions apply, decision_function, predict and predict_proba.
  • sklego.meta.OrdinalClassifier
  • sklego.pipeline.DebugPipeline
  • sklego.preprocessing.IdentityTransformer

Scikit-Tree

Examples: extensions/sktree.py

SkLearn2PMML

Examples: main.py and extensions/sklearn2pmml.py

  • Helpers:
    • sklearn2pmml.EstimatorProxy
    • sklearn2pmml.SelectorProxy
    • sklearn2pmml.h2o.H2OEstimatorProxy
  • Feature cross-references:
    • sklearn2pmml.cross_reference.Memorizer
    • sklearn2pmml.cross_reference.Recaller
  • Feature specification and decoration:
    • sklearn2pmml.decoration.Alias
    • sklearn2pmml.decoration.CategoricalDomain
    • sklearn2pmml.decoration.ContinuousDomain
    • sklearn2pmml.decoration.ContinuousDomainEraser
    • sklearn2pmml.decoration.DateDomain
    • sklearn2pmml.decoration.DateTimeDomain
    • sklearn2pmml.decoration.DiscreteDomainEraser
    • sklearn2pmml.decoration.MultiAlias
    • sklearn2pmml.decoration.MultiDomain
    • sklearn2pmml.decoration.OrdinalDomain
  • Ensemble methods:
    • sklearn2pmml.ensemble.EstimatorChain
    • sklearn2pmml.ensemble.GBDTLMRegressor
      • The GBDT side: All Scikit-Learn decision tree ensemble regressors, LGBMRegressor, XGBRegressor, XGBRFRegressor.
      • The LM side: A Scikit-Learn linear regressor (e.g. ElasticNet, LinearRegression, SGDRegressor).
    • sklearn2pmml.ensemble.GBDTLRClassifier
      • The GBDT side: All Scikit-Learn decision tree ensemble classifiers, LGBMClassifier, XGBClassifier, XGBRFClassifier.
      • The LR side: A Scikit-Learn binary linear classifier (e.g. LinearSVC, LogisticRegression, SGDClassifier).
    • sklearn2pmml.ensemble.SelectFirstClassifier
    • sklearn2pmml.ensemble.SelectFirstRegressor
  • UDF models:
    • sklearn2pmml.expression.ExpressionClassifier
    • sklearn2pmml.expression.ExpressionRegressor
  • Feature selection:
    • sklearn2pmml.feature_selection.SelectUnique
  • Linear models:
    • sklearn2pmml.statsmodels.StatsModelsClassifier
    • sklearn2pmml.statsmodels.StatsModelsOrdinalClassifier
    • sklearn2pmml.statsmodels.StatsModelsRegressor
  • Neural networks:
    • sklearn2pmml.neural_network.MLPTransformer
  • Pipeline:
    • sklearn2pmml.pipeline.PMMLPipeline
  • Postprocessing:
    • sklearn2pmml.postprocessing.BusinessDecisionTransformer
  • Preprocessing:
    • sklearn2pmml.preprocessing.Aggregator
    • sklearn2pmml.preprocessing.BSplineTransformer
    • sklearn2pmml.preprocessing.CastTransformer
    • sklearn2pmml.preprocessing.ConcatTransformer
    • sklearn2pmml.preprocessing.CutTransformer
    • sklearn2pmml.preprocessing.DataFrameConstructor
    • sklearn2pmml.preprocessing.DateTimeFormatter
    • sklearn2pmml.preprocessing.DaysSinceYearTransformer
    • sklearn2pmml.preprocessing.ExpressionTransformer
      • Ternary conditional expression <expression_true> if <condition> else <expression_false>.
      • Array indexing expressions X[<column index>] and X[<column name>].
      • String concatenation expressions.
      • String slicing expressions <str>[<start>:<stop>].
      • Arithmetic operators +, -, *, / and %.
      • The power operator **.
      • Identity comparison operators is None and is not None.
      • Comparison operators in <list>, not in <list>, <=, <, ==, !=, > and >=.
      • Logical operators and, or and not.
      • Math constants math.e, math.nan, math.pi and math.tau.
      • Math functions (too numerous to list).
      • Numpy constants numpy.e, numpy.NaN, numpy.NZERO, numpy.pi and numpy.PZERO.
      • Numpy function numpy.where.
      • Numpy universal functions (too numerous to list).
      • Pandas constants pandas.NA and pandas.NaT.
      • Pandas functions pandas.isna, pandas.isnull, pandas.notna and pandas.notnull.
      • Scipy functions scipy.special.expit and scipy.special.logit.
      • String functions startswith(<prefix>), endswith(<suffix>), lower, upper and strip.
      • String length function len(<str>).
      • Perl Compatible Regular Expression (PCRE) functions pcre.search and pcre.sub.
      • Regular Expression (RE) functions re.search and re.sub.
      • User-defined functions.
    • sklearn2pmml.preprocessing.FilterLookupTransformer
    • sklearn2pmml.preprocessing.IdentityTransformer
    • sklearn2pmml.preprocessing.LookupTransformer
    • sklearn2pmml.preprocessing.MatchesTransformer
    • sklearn2pmml.preprocessing.MultiLookupTransformer
    • sklearn2pmml.preprocessing.NumberFormatter
    • sklearn2pmml.preprocessing.PMMLLabelBinarizer
    • sklearn2pmml.preprocessing.PMMLLabelEncoder
    • sklearn2pmml.preprocessing.PowerFunctionTransformer
    • sklearn2pmml.preprocessing.ReplaceTransformer
    • sklearn2pmml.preprocessing.SecondsSinceMidnightTransformer
    • sklearn2pmml.preprocessing.SecondsSinceYearTransformer
    • sklearn2pmml.preprocessing.SelectFirstTransformer
    • sklearn2pmml.preprocessing.SeriesConstructor
    • sklearn2pmml.preprocessing.StringNormalizer
    • sklearn2pmml.preprocessing.SubstringTransformer
    • sklearn2pmml.preprocessing.WordCountTransformer
    • sklearn2pmml.preprocessing.h2o.H2OFrameConstructor
    • sklearn2pmml.util.Reshaper
    • sklearn2pmml.util.Slicer
  • Rule sets:
    • sklearn2pmml.ruleset.RuleSetClassifier
  • Decision trees:
    • sklearn2pmml.tree.chaid.CHAIDClassifier
    • sklearn2pmml.tree.chaid.CHAIDRegressor
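
The expression syntax listed for sklearn2pmml.preprocessing.ExpressionTransformer above appears to be a subset of ordinary Python expression syntax. The sketch below uses only the standard library to show what two such expressions compute for a single row. It is an illustration of the Python semantics, not the way SkLearn2PMML or JPMML-SkLearn evaluates or translates expressions; the helper name evaluate_expression and the sample row are hypothetical.

```python
import math


def evaluate_expression(expr, X):
    """Evaluate an expression string against one row (illustration only)."""
    # Expose nothing but the row `X` and the `math` module to the expression
    return eval(expr, {"__builtins__": {}, "math": math}, {"X": X})


# Hypothetical row, keyed by column name
row = {"Sepal.Length": 5.1, "Sepal.Width": None}

# Array indexing, the power operator and a math constant
ratio = evaluate_expression('X["Sepal.Length"] ** 2 / math.pi', row)

# Ternary conditional combined with an identity comparison
filled = evaluate_expression(
    'X["Sepal.Width"] if X["Sepal.Width"] is not None else -1', row
)

print(ratio, filled)
```

Writing the transformation as a plain string is what makes it portable: the same text can be evaluated in Python during fitting and inspected by a converter afterwards.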

Sklearn-Pandas

Examples: main.py

  • sklearn_pandas.CategoricalImputer
  • sklearn_pandas.DataFrameMapper

StatsModels

Examples: main-statsmodels.py

TPOT

Examples: extensions/tpot.py

  • tpot.builtins.stacking_estimator.StackingEstimator

XGBoost

Examples: main-xgboost.py, extensions/category_encoders-xgboost.py and extensions/categorical.py

Prerequisites

The Python side of operations

Validating Python installation:

import joblib, sklearn, sklearn_pandas, sklearn2pmml

print(joblib.__version__)
print(sklearn.__version__)
print(sklearn_pandas.__version__)
print(sklearn2pmml.__version__)
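
The import-based check above stops at the first missing package. As a possible alternative, the following sketch queries installed distribution metadata instead, so that every package is reported even when some are absent. It assumes Python 3.8 or newer and the PyPI distribution names listed in REQUIRED; the helper name installed_versions is hypothetical.

```python
from importlib import metadata

# PyPI distribution names; note that these differ from the import names
REQUIRED = ["joblib", "scikit-learn", "sklearn-pandas", "sklearn2pmml"]


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'not installed'}")
```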

The JPMML-SkLearn side of operations

  • Java 1.8 or newer.
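
The Java requirement can also be checked from Python. The sketch below parses the report printed by java -version, assuming the common 'version "..."' format used by OpenJDK and Oracle builds; the helper names parse_java_major and installed_java_major are hypothetical.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Legacy numbering: "1.8.0_292" denotes Java 8
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major():
    """Return the major version of the `java` found on PATH, or None."""
    try:
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    except FileNotFoundError:
        return None
    # `java -version` normally writes its report to stderr
    return parse_java_major(result.stderr + result.stdout)


print(parse_java_major('java version "1.8.0_292"'))             # 8
print(parse_java_major('openjdk version "17.0.1" 2021-10-19'))  # 17
```

A result of 8 or greater from installed_java_major() would satisfy the requirement stated above.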

Installation

Enter the project root directory and build using Apache Maven:

mvn clean install

The build produces a library JAR file pmml-sklearn/target/pmml-sklearn-1.8-SNAPSHOT.jar, and an executable uber-JAR file pmml-sklearn-example/target/pmml-sklearn-example-executable-1.8-SNAPSHOT.jar.

Usage

A typical workflow can be summarized as follows:

  1. Use Python to train a model.
  2. Serialize the model in pickle data format to a file in a local filesystem.
  3. Use the JPMML-SkLearn command-line converter application to turn the pickle file into a PMML file.

The Python side of operations

Loading data into a pandas.DataFrame object:

import pandas

df = pandas.read_csv("Iris.csv")

iris_X = df[df.columns.difference(["Species"])]
iris_y = df["Species"]

First, creating a sklearn_pandas.DataFrameMapper object, which performs column-oriented feature engineering and selection work:

from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler
from sklearn2pmml.decoration import ContinuousDomain

column_preprocessor = DataFrameMapper([
    (["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], [ContinuousDomain(), StandardScaler()])
])

Second, creating Transformer and Selector objects, which perform table-oriented feature engineering and selection work:

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn2pmml import SelectorProxy

table_preprocessor = Pipeline([
    ("pca", PCA(n_components = 3)),
    ("selector", SelectorProxy(SelectKBest(k = 2)))
])

Please note that stateless Scikit-Learn selector objects need to be wrapped into an sklearn2pmml.SelectorProxy object.

Third, creating an Estimator object:

from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(min_samples_leaf = 5)

Combining the above objects into a sklearn2pmml.pipeline.PMMLPipeline object, and running the experiment:

from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline([
    ("columns", column_preprocessor),
    ("table", table_preprocessor),
    ("classifier", classifier)
])
pipeline.fit(iris_X, iris_y)

Recording feature importance information in a pickle data format-compatible manner:

classifier.pmml_feature_importances_ = classifier.feature_importances_
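
One plausible reading of why the copy is needed: feature_importances_ on Scikit-Learn tree models is a computed property, and by default pickle stores only the instance attributes held in __dict__. The standard-library sketch below shows the effect with ToyTree, a hypothetical stand-in class that is not part of Scikit-Learn.

```python
class ToyTree:
    """Hypothetical stand-in for an estimator; not a Scikit-Learn class."""

    def __init__(self, raw_scores):
        self.raw_scores = raw_scores

    @property
    def feature_importances_(self):
        # Computed on every access, never stored on the instance
        total = sum(self.raw_scores)
        return [score / total for score in self.raw_scores]


toy = ToyTree([3.0, 1.0])

# Copy the computed values into a plain instance attribute
toy.pmml_feature_importances_ = toy.feature_importances_

# Default pickling saves the instance __dict__, which vars() exposes
state = vars(toy)
print(sorted(state))  # ['pmml_feature_importances_', 'raw_scores']
```

Only the copied attribute appears in the saved state, so only it would be visible to a tool that reads the pickle file without running Python.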

Embedding model verification data:

pipeline.verify(iris_X.sample(n = 15))

Storing the fitted PMMLPipeline object in pickle data format:

import joblib

joblib.dump(pipeline, "pipeline.pkl.z", compress = 9)

Please see the test script file main.py for more classification (binary and multi-class) and regression workflows.

The JPMML-SkLearn side of operations

Converting the pipeline pickle file pipeline.pkl.z to a PMML file pipeline.pmml:

java -jar pmml-sklearn-example/target/pmml-sklearn-example-executable-1.8-SNAPSHOT.jar --pkl-input pipeline.pkl.z --pmml-output pipeline.pmml
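
When the conversion needs to run as part of a Python script, the same command line can be launched through the standard subprocess module. This is a minimal sketch that uses only the --pkl-input and --pmml-output options shown above; the helper names build_convert_command and convert are hypothetical, and the JAR path assumes the build layout described under Installation.

```python
import subprocess


def build_convert_command(jar_path, pkl_path, pmml_path):
    """Assemble the converter command line as an argument list."""
    return [
        "java", "-jar", str(jar_path),
        "--pkl-input", str(pkl_path),
        "--pmml-output", str(pmml_path),
    ]


def convert(jar_path, pkl_path, pmml_path):
    """Run the converter; raises CalledProcessError on a non-zero exit code."""
    command = build_convert_command(jar_path, pkl_path, pmml_path)
    subprocess.run(command, check=True)


command = build_convert_command(
    "pmml-sklearn-example/target/pmml-sklearn-example-executable-1.8-SNAPSHOT.jar",
    "pipeline.pkl.z",
    "pipeline.pmml",
)
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems with file paths that contain spaces.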

Getting help:

java -jar pmml-sklearn-example/target/pmml-sklearn-example-executable-1.8-SNAPSHOT.jar --help

Documentation

Integrations:

Extensions:

Miscellaneous:

Archived:

License

JPMML-SkLearn is licensed under the terms and conditions of the GNU Affero General Public License, Version 3.0.

If you would like to use JPMML-SkLearn in a proprietary software project, then it is possible to enter into a licensing agreement which makes JPMML-SkLearn available under the terms and conditions of the BSD 3-Clause License instead.

Additional information

JPMML-SkLearn is developed and maintained by Openscoring Ltd, Estonia.

Interested in using Java PMML API software in your company? Please contact info@openscoring.io