Java library and command-line application for converting Scikit-Learn pipelines to PMML.
- Supported Estimator and Transformer types:
- Clustering:
- Composite Estimators:
- Matrix Decomposition:
- Discriminant Analysis:
- Dummies:
- Ensemble Methods:
ensemble.AdaBoostRegressor
ensemble.BaggingClassifier
ensemble.BaggingRegressor
ensemble.ExtraTreesClassifier
ensemble.ExtraTreesRegressor
ensemble.GradientBoostingClassifier
ensemble.GradientBoostingRegressor
ensemble.HistGradientBoostingClassifier
ensemble.HistGradientBoostingRegressor
ensemble.IsolationForest
ensemble.RandomForestClassifier
ensemble.RandomForestRegressor
ensemble.StackingClassifier
ensemble.StackingRegressor
ensemble.VotingClassifier
ensemble.VotingRegressor
- Feature Extraction:
- Feature Selection:
feature_selection.GenericUnivariateSelect (only via sklearn2pmml.SelectorProxy)
feature_selection.RFE (only via sklearn2pmml.SelectorProxy)
feature_selection.RFECV (only via sklearn2pmml.SelectorProxy)
feature_selection.SelectFdr (only via sklearn2pmml.SelectorProxy)
feature_selection.SelectFpr (only via sklearn2pmml.SelectorProxy)
feature_selection.SelectFromModel (either directly or via sklearn2pmml.SelectorProxy)
feature_selection.SelectFwe (only via sklearn2pmml.SelectorProxy)
feature_selection.SelectKBest (either directly or via sklearn2pmml.SelectorProxy)
feature_selection.SelectPercentile (only via sklearn2pmml.SelectorProxy)
feature_selection.VarianceThreshold (only via sklearn2pmml.SelectorProxy)
- Impute:
- Isotonic regression:
- Generalized Linear Models:
linear_model.ARDRegression
linear_model.BayesianRidge
linear_model.ElasticNet
linear_model.ElasticNetCV
linear_model.GammaRegressor
linear_model.HuberRegressor
linear_model.Lars
linear_model.LarsCV
linear_model.Lasso
linear_model.LassoCV
linear_model.LassoLars
linear_model.LassoLarsCV
linear_model.LinearRegression
linear_model.LogisticRegression
linear_model.LogisticRegressionCV
linear_model.OrthogonalMatchingPursuit
linear_model.OrthogonalMatchingPursuitCV
linear_model.PoissonRegressor
linear_model.Ridge
linear_model.RidgeCV
linear_model.RidgeClassifier
linear_model.RidgeClassifierCV
linear_model.SGDClassifier
linear_model.SGDRegressor
linear_model.TheilSenRegressor
- Model Selection:
- Multiclass classification:
- Naive Bayes:
- Nearest Neighbors:
- Pipelines:
- Neural network models:
- Preprocessing and Normalization:
preprocessing.Binarizer
preprocessing.FunctionTransformer
preprocessing.Imputer
preprocessing.LabelBinarizer
preprocessing.LabelEncoder
preprocessing.MaxAbsScaler
preprocessing.MinMaxScaler
preprocessing.OneHotEncoder
preprocessing.OrdinalEncoder
preprocessing.PolynomialFeatures
preprocessing.RobustScaler
preprocessing.StandardScaler
- Support Vector Machines:
- Decision Trees:
- Supported third-party Estimator and Transformer types:
- Category Encoders:
- H2O.ai:
- Imbalanced-Learn (imblearn):
imblearn.combine.SMOTEENN
imblearn.combine.SMOTETomek
imblearn.ensemble.BalancedBaggingClassifier
imblearn.ensemble.BalancedRandomForestClassifier
imblearn.over_sampling.ADASYN
imblearn.over_sampling.BorderlineSMOTE
imblearn.over_sampling.KMeansSMOTE
imblearn.over_sampling.RandomOverSampler
imblearn.over_sampling.SMOTE
imblearn.over_sampling.SMOTENC
imblearn.over_sampling.SVMSMOTE
imblearn.pipeline.Pipeline
imblearn.under_sampling.AllKNN
imblearn.under_sampling.ClusterCentroids
imblearn.under_sampling.CondensedNearestNeighbour
imblearn.under_sampling.EditedNearestNeighbours
imblearn.under_sampling.InstanceHardnessThreshold
imblearn.under_sampling.NearMiss
imblearn.under_sampling.NeighbourhoodCleaningRule
imblearn.under_sampling.OneSidedSelection
imblearn.under_sampling.RandomUnderSampler
imblearn.under_sampling.RepeatedEditedNearestNeighbours
imblearn.under_sampling.TomekLinks
- LightGBM:
- Mlxtend:
- SkLearn2PMML:
sklearn2pmml.EstimatorProxy
sklearn2pmml.SelectorProxy
sklearn2pmml.decoration.Alias
sklearn2pmml.decoration.CategoricalDomain
sklearn2pmml.decoration.ContinuousDomain
sklearn2pmml.decoration.DateDomain
sklearn2pmml.decoration.DateTimeDomain
sklearn2pmml.decoration.MultiDomain
sklearn2pmml.decoration.OrdinalDomain
sklearn2pmml.ensemble.GBDTLMRegressor
- The GBDT side: All Scikit-Learn decision tree ensemble regressors, LGBMRegressor, XGBRegressor, XGBRFRegressor.
- The LM side: A Scikit-Learn linear regressor (e.g. ElasticNet, LinearRegression, SGDRegressor).
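As an illustration, a minimal sketch of composing such a model, assuming the GBDTLMRegressor(gbdt, lm) constructor signature:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import ElasticNet
from sklearn2pmml.ensemble import GBDTLMRegressor

# Pair a decision tree ensemble regressor (the GBDT side) with a linear regressor (the LM side)
regressor = GBDTLMRegressor(gbdt = GradientBoostingRegressor(n_estimators = 31, max_depth = 3), lm = ElasticNet())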
sklearn2pmml.ensemble.GBDTLRClassifier
- The GBDT side: All Scikit-Learn decision tree ensemble classifiers, LGBMClassifier, XGBClassifier, XGBRFClassifier.
- The LR side: A Scikit-Learn binary linear classifier (e.g. LinearSVC, LogisticRegression, SGDClassifier).
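Similarly, a minimal sketch of a GBDT plus LR classifier, assuming the GBDTLRClassifier(gbdt, lr) constructor signature:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn2pmml.ensemble import GBDTLRClassifier

# Pair a decision tree ensemble classifier (the GBDT side) with a binary linear classifier (the LR side)
classifier = GBDTLRClassifier(gbdt = GradientBoostingClassifier(n_estimators = 31, max_depth = 2), lr = LogisticRegression())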
sklearn2pmml.ensemble.SelectFirstClassifier
sklearn2pmml.ensemble.SelectFirstRegressor
sklearn2pmml.feature_selection.SelectUnique
sklearn2pmml.pipeline.PMMLPipeline
sklearn2pmml.preprocessing.Aggregator
sklearn2pmml.preprocessing.CastTransformer
sklearn2pmml.preprocessing.ConcatTransformer
sklearn2pmml.preprocessing.CutTransformer
sklearn2pmml.preprocessing.DaysSinceYearTransformer
sklearn2pmml.preprocessing.ExpressionTransformer
- Ternary conditional expression <expression_true> if <condition> else <expression_false>.
- Array indexing expressions X[<column index>] and X[<column name>].
- String concatenation expressions.
- String slicing expressions <str>[<start>:<stop>].
- Arithmetic operators +, -, *, / and %.
- Identity comparison operators is None and is not None.
- Comparison operators in <list>, not in <list>, <=, <, ==, !=, > and >=.
- Logical operators and, or and not.
- Value missingness check functions pandas.isnull and pandas.notnull.
- Numpy universal functions.
- String functions lower, upper and strip.
- String length function len(<str>).
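For example, a minimal sketch that derives a ratio feature using ternary conditional, array indexing, arithmetic and identity comparison expressions (the column indices are illustrative):
from sklearn2pmml.preprocessing import ExpressionTransformer

# Divide the first input column by the second one, falling back to zero when the denominator is missing
ratio_transformer = ExpressionTransformer("X[0] / X[1] if X[1] is not None else 0")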
sklearn2pmml.preprocessing.IdentityTransformer
sklearn2pmml.preprocessing.LookupTransformer
sklearn2pmml.preprocessing.MatchesTransformer
sklearn2pmml.preprocessing.MultiLookupTransformer
sklearn2pmml.preprocessing.PMMLLabelBinarizer
sklearn2pmml.preprocessing.PMMLLabelEncoder
sklearn2pmml.preprocessing.PowerFunctionTransformer
sklearn2pmml.preprocessing.ReplaceTransformer
sklearn2pmml.preprocessing.SecondsSinceMidnightTransformer
sklearn2pmml.preprocessing.SecondsSinceYearTransformer
sklearn2pmml.preprocessing.StringNormalizer
sklearn2pmml.preprocessing.SubstringTransformer
sklearn2pmml.preprocessing.WordCountTransformer
sklearn2pmml.preprocessing.h2o.H2OFrameCreator
sklearn2pmml.preprocessing.scipy.BSplineTransformer
sklearn2pmml.ruleset.RuleSetClassifier
- Sklearn-Pandas:
sklearn_pandas.CategoricalImputer
sklearn_pandas.DataFrameMapper
- TPOT:
tpot.builtins.stacking_estimator.StackingEstimator
- XGBoost:
- Production quality:
- Complete test coverage.
- Fully compliant with the JPMML-Evaluator library.
- Python 2.7, 3.4 or newer.
- scikit-learn 0.16.0 or newer.
- sklearn-pandas 0.0.10 or newer.
- sklearn2pmml 0.14.0 or newer.
Validating Python installation:
import sklearn, sklearn.externals.joblib, sklearn_pandas, sklearn2pmml
print(sklearn.__version__)
print(sklearn.externals.joblib.__version__)
print(sklearn_pandas.__version__)
print(sklearn2pmml.__version__)
- Java 1.8 or newer.
Enter the project root directory and build using Apache Maven:
mvn clean install
The build produces an executable uber-JAR file target/jpmml-sklearn-executable-1.6-SNAPSHOT.jar.
A typical workflow can be summarized as follows:
- Use Python to train a model.
- Serialize the model in pickle data format to a file in a local filesystem.
- Use the JPMML-SkLearn command-line converter application to turn the pickle file into a PMML file.
Loading data into a pandas.DataFrame object:
import pandas
df = pandas.read_csv("Iris.csv")
iris_X = df[df.columns.difference(["Species"])]
iris_y = df["Species"]
First, creating a sklearn_pandas.DataFrameMapper object, which performs column-oriented feature engineering and selection work:
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler
from sklearn2pmml.decoration import ContinuousDomain
column_preprocessor = DataFrameMapper([
(["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], [ContinuousDomain(), StandardScaler()])
])
Second, creating Transformer and Selector objects, which perform table-oriented feature engineering and selection work:
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn2pmml import SelectorProxy
table_preprocessor = Pipeline([
("pca", PCA(n_components = 3)),
("selector", SelectorProxy(SelectKBest(k = 2)))
])
Please note that stateless Scikit-Learn selector objects need to be wrapped into an sklearn2pmml.SelectorProxy object.
Third, creating an Estimator object:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(min_samples_leaf = 5)
Combining the above objects into a sklearn2pmml.pipeline.PMMLPipeline object, and running the experiment:
from sklearn2pmml.pipeline import PMMLPipeline
pipeline = PMMLPipeline([
("columns", column_preprocessor),
("table", table_preprocessor),
("classifier", classifier)
])
pipeline.fit(iris_X, iris_y)
Embedding model verification data:
pipeline.verify(iris_X.sample(n = 15))
Storing the fitted PMMLPipeline object in pickle data format:
from sklearn.externals import joblib
joblib.dump(pipeline, "pipeline.pkl.z", compress = 9)
Please see the test script file main.py for more classification (binary and multi-class) and regression workflows.
Converting the pipeline pickle file pipeline.pkl.z to a PMML file pipeline.pmml:
java -jar target/jpmml-sklearn-executable-1.6-SNAPSHOT.jar --pkl-input pipeline.pkl.z --pmml-output pipeline.pmml
Getting help:
java -jar target/jpmml-sklearn-executable-1.6-SNAPSHOT.jar --help
Up-to-date:
- Converting Scikit-Learn based TF(-IDF) pipelines to PMML documents
- Converting Scikit-Learn based Imbalanced-Learn (imblearn) pipelines to PMML documents
- Extending Scikit-Learn with date and datetime features
- Extending Scikit-Learn with feature specifications
- Converting logistic regression models to PMML documents
- Stacking Scikit-Learn, LightGBM and XGBoost models
- Converting Scikit-Learn hyperparameter-tuned pipelines to PMML documents
- Extending Scikit-Learn with GBDT plus LR ensemble (GBDT+LR) model type
- Converting Scikit-Learn based TPOT automated machine learning (AutoML) pipelines to PMML documents
- Converting Scikit-Learn based LightGBM pipelines to PMML documents
- Extending Scikit-Learn with business rules (BR) model type
Slightly outdated:
JPMML-SkLearn is licensed under the terms and conditions of the GNU Affero General Public License, Version 3.0.
If you would like to use JPMML-SkLearn in a proprietary software project, then it is possible to enter into a licensing agreement which makes JPMML-SkLearn available under the terms and conditions of the BSD 3-Clause License instead.
JPMML-SkLearn is developed and maintained by Openscoring Ltd, Estonia.
Interested in using Java PMML API software in your company? Please contact info@openscoring.io.