Java library and command-line application for converting Scikit-Learn pipelines to PMML.
- Functionality:
- Three times more supported Python packages, transformers and estimators than all the competitors combined!
- Thorough collection, analysis and encoding of feature information:
- Names.
- Data and operational types.
- Valid, invalid and missing value spaces.
- Descriptive statistics.
- Pipeline extensions:
- Pruning.
- Decision engineering (prediction post-processing).
- Model verification.
- Conversion options.
- Extensibility:
- Rich Java APIs for developing custom converters.
- Automatic discovery and registration of custom converters based on META-INF/sklearn2pmml.properties resource files.
- Direct interfacing with other JPMML conversion libraries such as JPMML-H2O, JPMML-LightGBM and JPMML-XGBoost.
- Production quality:
- Complete test coverage.
- Fully compliant with the JPMML-Evaluator library.
Scikit-Learn
Examples: main.py
- Clustering:
- Composite estimators:
- Matrix decomposition:
- Discriminant analysis:
- Dummies:
- Ensemble methods:
ensemble.AdaBoostRegressor
ensemble.BaggingClassifier
ensemble.BaggingRegressor
ensemble.ExtraTreesClassifier
ensemble.ExtraTreesRegressor
ensemble.GradientBoostingClassifier
ensemble.GradientBoostingRegressor
ensemble.HistGradientBoostingClassifier
ensemble.HistGradientBoostingRegressor
ensemble.IsolationForest
ensemble.RandomForestClassifier
ensemble.RandomForestRegressor
ensemble.StackingClassifier
ensemble.StackingRegressor
ensemble.VotingClassifier
ensemble.VotingRegressor
- Feature extraction:
- Feature selection:
feature_selection.GenericUnivariateSelect (only via sklearn2pmml.SelectorProxy)
feature_selection.RFE (only via sklearn2pmml.SelectorProxy)
feature_selection.RFECV (only via sklearn2pmml.SelectorProxy)
feature_selection.SelectFdr (only via sklearn2pmml.SelectorProxy)
feature_selection.SelectFpr (only via sklearn2pmml.SelectorProxy)
feature_selection.SelectFromModel (either directly or via sklearn2pmml.SelectorProxy)
feature_selection.SelectFwe (only via sklearn2pmml.SelectorProxy)
feature_selection.SelectKBest (either directly or via sklearn2pmml.SelectorProxy)
feature_selection.SelectPercentile (only via sklearn2pmml.SelectorProxy)
feature_selection.VarianceThreshold (only via sklearn2pmml.SelectorProxy)
- Impute:
- Isotonic regression:
- Generalized linear models:
linear_model.ARDRegression
linear_model.BayesianRidge
linear_model.ElasticNet
linear_model.ElasticNetCV
linear_model.GammaRegressor
linear_model.HuberRegressor
linear_model.Lars
linear_model.LarsCV
linear_model.Lasso
linear_model.LassoCV
linear_model.LassoLars
linear_model.LassoLarsCV
linear_model.LinearRegression
linear_model.LogisticRegression
linear_model.LogisticRegressionCV
linear_model.OrthogonalMatchingPursuit
linear_model.OrthogonalMatchingPursuitCV
linear_model.PoissonRegressor
linear_model.Ridge
linear_model.RidgeCV
linear_model.RidgeClassifier
linear_model.RidgeClassifierCV
linear_model.SGDClassifier
linear_model.SGDRegressor
linear_model.TheilSenRegressor
- Model selection:
- Multiclass classification:
- Naive Bayes:
- Nearest neighbors:
- Pipelines:
- Neural network models:
- Preprocessing and normalization:
preprocessing.Binarizer
preprocessing.FunctionTransformer
preprocessing.Imputer
preprocessing.KBinsDiscretizer
preprocessing.LabelBinarizer
preprocessing.LabelEncoder
preprocessing.MaxAbsScaler
preprocessing.MinMaxScaler
preprocessing.OneHotEncoder
preprocessing.OrdinalEncoder
preprocessing.PolynomialFeatures
preprocessing.PowerTransformer
preprocessing.RobustScaler
preprocessing.StandardScaler
- Support vector machines:
- Decision trees:
Category Encoders
Examples: extensions/category_encoders.py
H2O.ai
Examples: main-h2o.py
h2o.estimators.gbm.H2OGradientBoostingEstimator
h2o.estimators.glm.H2OGeneralizedLinearEstimator
h2o.estimators.isolation_forest.H2OIsolationForestEstimator
h2o.estimators.random_forest.H2ORandomForestEstimator
h2o.estimators.stackedensemble.H2OStackedEnsembleEstimator
h2o.estimators.xgboost.H2OXGBoostEstimator
Imbalanced-Learn
Examples: extensions/imblearn.py
- Under-sampling methods:
imblearn.under_sampling.AllKNN
imblearn.under_sampling.ClusterCentroids
imblearn.under_sampling.CondensedNearestNeighbour
imblearn.under_sampling.EditedNearestNeighbours
imblearn.under_sampling.InstanceHardnessThreshold
imblearn.under_sampling.NearMiss
imblearn.under_sampling.NeighbourhoodCleaningRule
imblearn.under_sampling.OneSidedSelection
imblearn.under_sampling.RandomUnderSampler
imblearn.under_sampling.RepeatedEditedNearestNeighbours
imblearn.under_sampling.TomekLinks
- Over-sampling methods:
- Combination of over- and under-sampling methods:
- Ensemble methods:
- Pipeline:
LightGBM
Examples: main-lightgbm.py
Scikit-Lego
Examples: extensions/sklego.py
sklego.meta.EstimatorTransformer
- Predict functions apply, decision_function, predict.
sklego.preprocessing.IdentityTransformer
SkLearn2PMML
Examples: main.py and extensions/sklearn2pmml.py
- Helpers:
sklearn2pmml.EstimatorProxy
sklearn2pmml.SelectorProxy
- Feature specification and decoration:
sklearn2pmml.decoration.Alias
sklearn2pmml.decoration.CategoricalDomain
sklearn2pmml.decoration.ContinuousDomain
sklearn2pmml.decoration.ContinuousDomainEraser
sklearn2pmml.decoration.DateDomain
sklearn2pmml.decoration.DateTimeDomain
sklearn2pmml.decoration.DiscreteDomainEraser
sklearn2pmml.decoration.MultiDomain
sklearn2pmml.decoration.OrdinalDomain
- Ensemble methods:
sklearn2pmml.ensemble.GBDTLMRegressor
- The GBDT side: All Scikit-Learn decision tree ensemble regressors, LGBMRegressor, XGBRegressor, XGBRFRegressor.
- The LM side: A Scikit-Learn linear regressor (e.g. ElasticNet, LinearRegression, SGDRegressor).
sklearn2pmml.ensemble.GBDTLRClassifier
- The GBDT side: All Scikit-Learn decision tree ensemble classifiers, LGBMClassifier, XGBClassifier, XGBRFClassifier.
- The LR side: A Scikit-Learn binary linear classifier (e.g. LinearSVC, LogisticRegression, SGDClassifier).
sklearn2pmml.ensemble.SelectFirstClassifier
sklearn2pmml.ensemble.SelectFirstRegressor
- Feature selection:
sklearn2pmml.feature_selection.SelectUnique
- Neural networks:
sklearn2pmml.neural_network.MLPTransformer
- Pipeline:
sklearn2pmml.pipeline.PMMLPipeline
- Postprocessing:
sklearn2pmml.postprocessing.BusinessDecisionTransformer
- Preprocessing:
sklearn2pmml.preprocessing.Aggregator
sklearn2pmml.preprocessing.CastTransformer
sklearn2pmml.preprocessing.ConcatTransformer
sklearn2pmml.preprocessing.CutTransformer
sklearn2pmml.preprocessing.DaysSinceYearTransformer
sklearn2pmml.preprocessing.ExpressionTransformer
- Ternary conditional expression <expression_true> if <condition> else <expression_false>.
- Array indexing expressions X[<column index>] and X[<column name>].
- String concatenation expressions.
- String slicing expressions <str>[<start>:<stop>].
- Arithmetic operators +, -, *, / and %.
- Identity comparison operators is None and is not None.
- Comparison operators in <list>, not in <list>, <=, <, ==, !=, > and >=.
- Logical operators and, or and not.
- Value missingness check functions numpy.isnan, pandas.isnull and pandas.notnull.
- NumPy universal functions.
- String functions startswith(<prefix>), endswith(<suffix>), lower, upper and strip.
- String length function len(<str>).
sklearn2pmml.preprocessing.FilterLookupTransformer
sklearn2pmml.preprocessing.LookupTransformer
sklearn2pmml.preprocessing.MatchesTransformer
sklearn2pmml.preprocessing.MultiLookupTransformer
sklearn2pmml.preprocessing.PMMLLabelBinarizer
sklearn2pmml.preprocessing.PMMLLabelEncoder
sklearn2pmml.preprocessing.PowerFunctionTransformer
sklearn2pmml.preprocessing.ReplaceTransformer
sklearn2pmml.preprocessing.SecondsSinceMidnightTransformer
sklearn2pmml.preprocessing.SecondsSinceYearTransformer
sklearn2pmml.preprocessing.StringNormalizer
sklearn2pmml.preprocessing.SubstringTransformer
sklearn2pmml.preprocessing.WordCountTransformer
sklearn2pmml.preprocessing.h2o.H2OFrameCreator
sklearn2pmml.preprocessing.scipy.BSplineTransformer
sklearn2pmml.util.Reshaper
- Rule sets:
sklearn2pmml.ruleset.RuleSetClassifier
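The expression syntax listed above for sklearn2pmml.preprocessing.ExpressionTransformer is a subset of ordinary Python. As a rough illustration only, the sketch below evaluates a few such expressions with Python's built-in eval against a single row bound to the name X. The evaluate helper and the sample row are hypothetical and are not part of SkLearn2PMML; the library itself may evaluate and translate expressions differently.

```python
def evaluate(expr, row):
    """Evaluate an expression string against one row, exposed as X.

    Hypothetical helper for illustration; not the library's implementation.
    """
    # Expose only len(); the row is reachable through the name X.
    return eval(expr, {"__builtins__": {}, "len": len}, {"X": row})


# A row addressed by column name (a list would allow X[<column index>]).
row = {"Sepal.Length": 6.0, "Sepal.Width": 3.0, "Species": " Setosa ", "Color": None}

expressions = [
    # Ternary conditional with arithmetic and comparison operators.
    "X['Sepal.Length'] / X['Sepal.Width'] if X['Sepal.Width'] != 0 else -1",
    # String functions followed by string concatenation.
    "X['Species'].strip().lower() + '_iris'",
    # String slicing and string length.
    "X['Species'][1:4]",
    "len(X['Species'])",
    # Identity comparison, membership test and logical operators.
    "X['Color'] is None and X['Sepal.Width'] in [2.0, 3.0]",
]

results = [evaluate(expr, row) for expr in expressions]
for expr, result in zip(expressions, results):
    print(expr, "->", result)
```

Trying an expression in plain Python first is a cheap way to catch syntax mistakes before it is embedded into a pipeline step.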
XGBoost
Examples: main-xgboost.py
- Python 2.7, 3.4 or newer.
- scikit-learn 0.16.0 or newer.
- sklearn-pandas 0.0.10 or newer.
- sklearn2pmml 0.14.0 or newer.
Validating Python installation:
# Recent Scikit-Learn versions no longer bundle joblib as sklearn.externals.joblib; import it directly
import joblib, sklearn, sklearn_pandas, sklearn2pmml
print(sklearn.__version__)
print(joblib.__version__)
print(sklearn_pandas.__version__)
print(sklearn2pmml.__version__)
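The printed versions still need to be compared against the stated minimums by eye, which is easy to get wrong (as strings, "0.0.9" sorts after "0.0.10"). A minimal sketch of an automated check follows. It assumes Python 3.8 or newer for importlib.metadata, and all function names are hypothetical helpers rather than part of any of the listed packages.

```python
import re
from importlib import metadata  # available in Python 3.8+

# Minimum versions as stated in the prerequisites list.
MINIMUMS = {
    "scikit-learn": "0.16.0",
    "sklearn-pandas": "0.0.10",
    "sklearn2pmml": "0.14.0",
}


def version_tuple(version):
    """Turn '0.0.10' into (0, 0, 10); stop at the first non-numeric part."""
    parts = []
    for part in version.split("."):
        match = re.match(r"\d+", part)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad so that '0.16' and '0.16.0' compare as equal.
    return tuple(parts + [0] * (3 - len(parts)))


def installed_versions(names):
    """Look up installed distribution versions; None marks a missing package."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def outdated(installed, minimums=MINIMUMS):
    """Return the names that are missing or older than the stated minimum."""
    return [
        name
        for name, minimum in minimums.items()
        if installed.get(name) is None
        or version_tuple(installed[name]) < version_tuple(minimum)
    ]


# An empty list means every prerequisite is present and recent enough.
print(outdated(installed_versions(MINIMUMS)))
```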
- Java 1.8 or newer.
Enter the project root directory and build using Apache Maven:
mvn clean install
The build produces a library JAR file pmml-sklearn/target/pmml-sklearn-1.7-SNAPSHOT.jar, and an executable uber-JAR file pmml-sklearn-example/target/pmml-sklearn-example-executable-1.7-SNAPSHOT.jar.
A typical workflow can be summarized as follows:
- Use Python to train a model.
- Serialize the model in pickle data format to a file in a local filesystem.
- Use the JPMML-SkLearn command-line converter application to turn the pickle file into a PMML file.
Loading data into a pandas.DataFrame object:
import pandas
df = pandas.read_csv("Iris.csv")
iris_X = df[df.columns.difference(["Species"])]
iris_y = df["Species"]
First, creating a sklearn_pandas.DataFrameMapper object, which performs column-oriented feature engineering and selection work:
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler
from sklearn2pmml.decoration import ContinuousDomain
column_preprocessor = DataFrameMapper([
(["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], [ContinuousDomain(), StandardScaler()])
])
Second, creating Transformer and Selector objects, which perform table-oriented feature engineering and selection work:
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn2pmml import SelectorProxy
table_preprocessor = Pipeline([
("pca", PCA(n_components = 3)),
("selector", SelectorProxy(SelectKBest(k = 2)))
])
Please note that stateless Scikit-Learn selector objects need to be wrapped into an sklearn2pmml.SelectorProxy object.
Third, creating an Estimator object:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(min_samples_leaf = 5)
Combining the above objects into a sklearn2pmml.pipeline.PMMLPipeline object, and running the experiment:
from sklearn2pmml.pipeline import PMMLPipeline
pipeline = PMMLPipeline([
("columns", column_preprocessor),
("table", table_preprocessor),
("classifier", classifier)
])
pipeline.fit(iris_X, iris_y)
Recording feature importance information in a pickle data format-compatible manner:
classifier.pmml_feature_importances_ = classifier.feature_importances_
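The assignment above copies a value onto the estimator instance. One plausible reading of why this matters: in Scikit-Learn's decision tree estimators feature_importances_ is a property computed on access, so it is not part of the instance state that pickle stores by default, whereas a plain attribute is. The sketch below illustrates the difference with a hypothetical ToyEstimator class. It does not use Scikit-Learn, and it inspects the instance state directly instead of writing a pickle file.

```python
class ToyEstimator:
    """Hypothetical stand-in for an estimator with computed importances."""

    def __init__(self, weights):
        self._weights = weights

    @property
    def feature_importances_(self):
        # Computed on every access; never stored on the instance.
        total = sum(self._weights)
        return [weight / total for weight in self._weights]


estimator = ToyEstimator([3, 1])

# By default, pickle serializes the instance __dict__, which vars() returns.
state_before = sorted(vars(estimator))

# Copy the computed value into a plain attribute, as in the line above.
estimator.pmml_feature_importances_ = estimator.feature_importances_
state_after = sorted(vars(estimator))

print(state_before)  # the property is absent from the stored state
print(state_after)   # the copied attribute is now part of it
```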
Embedding model verification data:
pipeline.verify(iris_X.sample(n = 15))
Storing the fitted PMMLPipeline object in pickle data format:
import joblib
joblib.dump(pipeline, "pipeline.pkl.z", compress = 9)
Please see the test script file main.py for more classification (binary and multi-class) and regression workflows.
Converting the pipeline pickle file pipeline.pkl.z to a PMML file pipeline.pmml:
java -jar pmml-sklearn-example/target/pmml-sklearn-example-executable-1.7-SNAPSHOT.jar --pkl-input pipeline.pkl.z --pmml-output pipeline.pmml
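When the conversion is one step of a larger Python script, the same command can be launched with the standard subprocess module. This is a minimal sketch: the function names are hypothetical, only the --pkl-input and --pmml-output options shown above are used, and it assumes that java is on the PATH and that the executable JAR has already been built.

```python
import subprocess

# Path of the executable uber-JAR produced by the Maven build.
JAR = "pmml-sklearn-example/target/pmml-sklearn-example-executable-1.7-SNAPSHOT.jar"


def build_command(pkl_path, pmml_path, jar_path=JAR):
    """Assemble the converter command line as an argument list."""
    return ["java", "-jar", jar_path, "--pkl-input", pkl_path, "--pmml-output", pmml_path]


def convert(pkl_path, pmml_path, jar_path=JAR):
    """Run the converter; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(pkl_path, pmml_path, jar_path), check=True)


# Only builds and prints the command; call convert(...) to actually run it.
command = build_command("pipeline.pkl.z", "pipeline.pmml")
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems when file paths contain spaces.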
Getting help:
java -jar pmml-sklearn-example/target/pmml-sklearn-example-executable-1.7-SNAPSHOT.jar --help
Up-to-date:
- Benchmarking Scikit-Learn against JPMML-Evaluator in Java and Python environments
- Extending Scikit-Learn with outlier detector transformer type
- Analyzing Scikit-Learn feature importances via PMML
- Training Scikit-Learn based TF(-IDF) plus XGBoost pipelines
- Converting Scikit-Learn based TF(-IDF) pipelines to PMML documents
- Converting Scikit-Learn based Imbalanced-Learn (imblearn) pipelines to PMML documents
- Extending Scikit-Learn with date and datetime features
- Extending Scikit-Learn with feature specifications
- Converting logistic regression models to PMML documents
- Stacking Scikit-Learn, LightGBM and XGBoost models
- Converting Scikit-Learn hyperparameter-tuned pipelines to PMML documents
- Extending Scikit-Learn with GBDT plus LR ensemble (GBDT+LR) model type
- Converting Scikit-Learn based TPOT automated machine learning (AutoML) pipelines to PMML documents
- Converting Scikit-Learn based LightGBM pipelines to PMML documents
- Extending Scikit-Learn with business rules (BR) model type
JPMML-SkLearn is licensed under the terms and conditions of the GNU Affero General Public License, Version 3.0.
If you would like to use JPMML-SkLearn in a proprietary software project, then it is possible to enter into a licensing agreement which makes JPMML-SkLearn available under the terms and conditions of the BSD 3-Clause License instead.
JPMML-SkLearn is developed and maintained by Openscoring Ltd, Estonia.
Interested in using Java PMML API software in your company? Please contact info@openscoring.io