Warning: this package has reached the end of its life. It is incompatible with recent versions of scikit-learn, and I (Ralph Haygood) have neither the time nor the interest to solve this problem. If you wish to have a go at it yourself, by all means fork the GitHub repository and proceed.
Ultimately, the problem may be one or more incompatibilities between recent versions of scikit-learn and/or NumPy and a wad (25,644 lines) of C code not written by me, which I was forced to add to this package when the maintainers of scikit-learn replaced the sklearn.ensemble.partial_dependence.partial_dependence function with the not-fully-equivalent sklearn.inspection.partial_dependence function.
This package depended on the grid argument of the former, which is missing from the latter.
I worked around this defect by extracting the code that had implemented the grid argument and integrating it into this package.
However, this code may well depend on internal characteristics of scikit-learn or NumPy that have changed since then.
This package provides a Python module for computing Friedman and Popescu's H statistics, in order to look for interactions among variables in scikit-learn gradient-boosting models (http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting).
See Jerome H. Friedman and Bogdan E. Popescu, 2008, "Predictive learning via rule ensembles", Ann. Appl. Stat. 2:916-954, http://projecteuclid.org/download/pdfview_1/euclid.aoas/1223908046, s. 8.1.
pip install sklearn-gbmi
On some systems, if you wish to use this package with Python 3, then you must install with pip3 rather than pip.
In case of difficulties with installing or using this package, consult "Advanced installation" below.
Given a scikit-learn gradient-boosting model gbm that has been fitted to a NumPy array or pandas data frame array_or_frame and a list of indices of columns of the array or columns of the data frame indices_or_columns, the H statistic of the variables represented by the elements of array_or_frame and specified by indices_or_columns can be computed via
from sklearn_gbmi import *
h(gbm, array_or_frame, indices_or_columns)
Alternatively, the two-variable H statistic of each pair of variables represented by the elements of array_or_frame and specified by indices_or_columns can be computed via
from sklearn_gbmi import *
h_all_pairs(gbm, array_or_frame, indices_or_columns)
(Compared to iteratively calling h, calling h_all_pairs avoids redundant computations.)
indices_or_columns is optional, with default value 'all'. If it is 'all', then all columns of array_or_frame are used.
NaN is returned if a computation is spoiled by weak main effects and rounding errors.
H varies from 0 to 1. The larger H, the stronger the evidence for an interaction among the variables.
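To make the statistic concrete, here is a minimal NumPy sketch of the two-variable H as defined in Friedman and Popescu's s. 8.1, using brute-force partial dependence over the data. It is an illustration only, not this package's implementation (which relies on compiled code). The names centered_partial_dependence and h_two_way and the toy prediction functions are hypothetical, and any callable that maps an array of rows to predictions could stand in for a fitted model's predict method.

```python
import numpy as np


def centered_partial_dependence(predict, X, cols):
    """Partial dependence on `cols`, evaluated at each row of X and centered.

    For row i, the columns in `cols` are fixed at row i's values across the
    whole data set, and the predictions are averaged over the other columns.
    """
    n_rows = X.shape[0]
    values = np.empty(n_rows)
    for i in range(n_rows):
        X_fixed = X.copy()
        X_fixed[:, cols] = X[i, cols]
        values[i] = predict(X_fixed).mean()
    # Friedman and Popescu center each partial dependence function at zero.
    return values - values.mean()


def h_two_way(predict, X, j, k):
    """Two-variable H for columns j and k (brute force, O(n^2) predictions)."""
    pd_jk = centered_partial_dependence(predict, X, [j, k])
    pd_j = centered_partial_dependence(predict, X, [j])
    pd_k = centered_partial_dependence(predict, X, [k])
    denominator = np.sum(pd_jk ** 2)
    if denominator == 0.0:
        # No joint effect at all, so the ratio is undefined.
        return float("nan")
    numerator = np.sum((pd_jk - pd_j - pd_k) ** 2)
    return float(np.sqrt(numerator / denominator))


# Toy data and toy "models" (assumptions for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Purely additive in columns 0 and 1: no interaction expected.
h_additive = h_two_way(lambda A: A[:, 0] + A[:, 1], X, 0, 1)

# Pure product of columns 0 and 1: strong interaction expected.
h_product = h_two_way(lambda A: A[:, 0] * A[:, 1], X, 0, 1)

print(h_additive, h_product)
```

In this sketch the additive function gives a value that is zero up to rounding, while the product gives a value close to 1, which matches the interpretation of H given above.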
See the Jupyter notebook example.ipynb (https://github.com/ralphhaygood/sklearn-gbmi/blob/master/example.ipynb) for a complete example of how to use this package.
- Per Friedman and Popescu, only variables with strong main effects should be examined for interactions. Strengths of main effects are available as gbm.feature_importances_ once gbm has been fitted.
- Per Friedman and Popescu, collinearity among variables can lead to interactions in gbm that are not present in the target function. To forestall such spurious interactions, check for strong correlations among variables before fitting gbm.
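One possible way to check for strong correlations before fitting is sketched below with NumPy alone. The function name strongly_correlated_pairs, the 0.8 threshold, and the toy data are assumptions chosen for illustration; they are not part of this package, and a suitable threshold depends on your data.

```python
import numpy as np


def strongly_correlated_pairs(X, threshold=0.8):
    """Return (j, k, r) for column pairs whose |Pearson r| >= threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns are variables
    n_cols = corr.shape[0]
    return [
        (j, k, float(corr[j, k]))
        for j in range(n_cols)
        for k in range(j + 1, n_cols)
        if abs(corr[j, k]) >= threshold
    ]


# Toy data: column 2 is nearly a copy of column 0 (assumption for the demo).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=500)

pairs = strongly_correlated_pairs(X)
for j, k, r in pairs:
    print(f"columns {j} and {k}: r = {r:.3f}")
```

Pairs flagged this way are candidates for dropping or combining one of the variables before fitting, so that any interaction later reported between them is less likely to be an artifact of collinearity.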
Installing this package requires NumPy, so if installation fails with a complaint that NumPy is missing, add it to the install command:
pip install numpy sklearn-gbmi
For performance, this package is partly implemented using Cython (C extensions for Python). It includes a C file that was generated by Cython, which is compiled for your system when you install the package. Normally, this C file is fine, but occasionally, it may not compile, or the result may not run. In the first case, installing the package fails, while in the second case, using the package fails, typically with a cryptic error message; for example:
ValueError: sklearn.tree._criterion.Criterion size changed, may indicate binary incompatibility.
In such a case, you may still be able to install and use the package by regenerating the C file, as follows.
First, if this package is installed (i.e., installation succeeds, but usage fails), uninstall it:
pip uninstall sklearn-gbmi
Then, install Cython:
pip install cython
Next, set the environment variable USE_CYTHONIZE to 1. For bash and similar shells:
export USE_CYTHONIZE=1
For csh and similar shells:
setenv USE_CYTHONIZE 1
Finally, reinstall this package:
pip install sklearn-gbmi --no-cache-dir
The C file should be regenerated and compiled for your system, hopefully making this package usable on your system.