scikit-learn gradient-boosting-model interactions

Primary language: Jupyter Notebook. License: MIT.

Warning: this package has reached the end of its life. It is incompatible with recent versions of scikit-learn, and I (Ralph Haygood) have neither the time nor the interest to solve this problem. If you wish to have a go at it yourself, by all means fork the GitHub repository and proceed.

Ultimately, the problem may be one or more incompatibilities between recent versions of scikit-learn and/or NumPy and a wad (25,644 lines) of C code not written by me, which I was forced to add to this package when the maintainers of scikit-learn replaced the sklearn.ensemble.partial_dependence.partial_dependence function with the not-fully-equivalent sklearn.inspection.partial_dependence function. This package depended on the grid argument of the former, which is missing from the latter. I worked around this defect by extracting the code that had implemented the grid argument and integrating it into this package. However, this code may well depend on internal characteristics of scikit-learn or NumPy that have changed since then.

sklearn-gbmi: scikit-learn gradient-boosting-model interactions

This package provides a Python module for computing Friedman and Popescu's H statistics, in order to look for interactions among variables in scikit-learn gradient-boosting models (http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting).

See Jerome H. Friedman and Bogdan E. Popescu, 2008, "Predictive learning via rule ensembles", Ann. Appl. Stat. 2:916-954, http://projecteuclid.org/download/pdfview_1/euclid.aoas/1223908046, section 8.1.
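For orientation, the two-variable statistic defined in that section can be written as below. Here F_j, F_k, and F_jk stand for the model's partial dependence functions on x_j, on x_k, and on the pair (x_j, x_k), each centered to have mean zero over the N data points. This is a transcription for reference, so the notation may differ cosmetically from the paper's; H itself is the square root of the quantity shown.

```latex
H_{jk}^2 =
  \frac{\sum_{i=1}^{N} \left[ \hat{F}_{jk}(x_{ij}, x_{ik})
        - \hat{F}_j(x_{ij}) - \hat{F}_k(x_{ik}) \right]^2}
       {\sum_{i=1}^{N} \hat{F}_{jk}^2(x_{ij}, x_{ik})}
```

If x_j and x_k do not interact, the joint partial dependence is the sum of the two single-variable ones, so the numerator vanishes.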

Installation

pip install sklearn-gbmi

On some systems, if you wish to use this package with Python 3, then you must install with pip3 rather than pip.

In case of difficulties with installing or using this package, consult "Advanced installation" below.

Usage

Suppose gbm is a scikit-learn gradient-boosting model that has been fitted to array_or_frame, which is a NumPy array or a pandas data frame, and indices_or_columns is a list of column indices of the array or of columns of the data frame. Then the H statistic of the variables specified by indices_or_columns can be computed via

from sklearn_gbmi import *

h(gbm, array_or_frame, indices_or_columns)

Alternatively, the two-variable H statistic of each pair of variables specified by indices_or_columns can be computed via

from sklearn_gbmi import *

h_all_pairs(gbm, array_or_frame, indices_or_columns)

(Compared to iteratively calling h, calling h_all_pairs avoids redundant computations.)

indices_or_columns is optional, with default value 'all'. If it is 'all', then all columns of array_or_frame are used.

NaN is returned if a computation is spoiled by weak main effects and rounding errors.
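Because some results may be NaN, it can help to filter them out before ranking pairs. The sketch below assumes that the result of h_all_pairs is a dict mapping pairs of indices or columns to H values; this README does not state the return type, so treat that as an assumption and adapt as needed. The helper name rank_pairs and the example values are hypothetical.

```python
import math


def rank_pairs(h_by_pair):
    """Drop NaN entries and sort the remaining pairs by H, largest first.

    h_by_pair is ASSUMED to be a dict {(var_a, var_b): h_value}.
    """
    usable = {pair: h for pair, h in h_by_pair.items() if not math.isnan(h)}
    return sorted(usable.items(), key=lambda item: item[1], reverse=True)


# Hypothetical values standing in for the result of h_all_pairs(...).
example = {("a", "b"): 0.42, ("a", "c"): float("nan"), ("b", "c"): 0.07}
ranked = rank_pairs(example)
for pair, h_value in ranked:
    print(pair, h_value)
```

Note that NaN never compares equal to itself, which is why the filter uses math.isnan rather than an equality test.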

H varies from 0 to 1. The larger H, the stronger the evidence for an interaction among the variables.
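To make the meaning of H concrete, here is a minimal brute-force sketch of the two-variable statistic for an arbitrary prediction function, using only the standard library. It is not how this package computes H (the package works on the fitted trees of a gradient-boosting model); the function names, the toy model, and the random data are all illustrative assumptions.

```python
import random


def centered_partial_dependence(predict, rows, cols):
    """Partial dependence of predict on cols, evaluated at each row and
    centered to have mean zero over the rows (brute force, O(n^2))."""
    n = len(rows)
    values = []
    for target in rows:
        total = 0.0
        for other in rows:
            x = list(other)
            for c in cols:
                x[c] = target[c]  # fix the chosen variables, average over the rest
            total += predict(x)
        values.append(total / n)
    mean = sum(values) / n
    return [v - mean for v in values]


def h_pair(predict, rows, j, k):
    """Two-variable H statistic for variables j and k."""
    f_jk = centered_partial_dependence(predict, rows, (j, k))
    f_j = centered_partial_dependence(predict, rows, (j,))
    f_k = centered_partial_dependence(predict, rows, (k,))
    numer = sum((a - b - c) ** 2 for a, b, c in zip(f_jk, f_j, f_k))
    denom = sum(a ** 2 for a in f_jk)
    return (numer / denom) ** 0.5 if denom > 0 else float("nan")


def toy_model(x):
    """Hypothetical model: x[0] is purely additive, x[1] and x[2] interact."""
    return x[0] + x[1] * x[2]


rng = random.Random(0)
rows = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(100)]

h_01 = h_pair(toy_model, rows, 0, 1)  # no interaction between x[0] and x[1]
h_12 = h_pair(toy_model, rows, 1, 2)  # x[1] and x[2] interact
print(h_01, h_12)
```

In this toy setting, the non-interacting pair gives a value that is zero up to rounding, while the interacting pair gives a clearly larger one.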

Example

See the Jupyter notebook example.ipynb (https://github.com/ralphhaygood/sklearn-gbmi/blob/master/example.ipynb) for a complete example of how to use this package.

Notes

  1. Per Friedman and Popescu, only variables with strong main effects should be examined for interactions. Strengths of main effects are available as gbm.feature_importances_ once gbm has been fitted.

  2. Per Friedman and Popescu, collinearity among variables can lead to interactions in gbm that are not present in the target function. To forestall such spurious interactions, check for strong correlations among variables before fitting gbm.
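The two checks in these notes can be sketched in plain Python as follows. The helper names, the cutoff of three variables, and the correlation threshold of 0.9 are illustrative assumptions, not recommendations from the paper or part of this package. In practice, the importances would come from gbm.feature_importances_, and the columns from the data used to fit gbm.

```python
import math


def strong_main_effects(importances, names, top=3):
    """Return the names of the `top` variables with the largest importances."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:top]]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for each pair of columns with |r| >= threshold.

    columns is a dict mapping a variable name to its list of values.
    """
    names = list(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                flagged.append((a, b, r))
    return flagged


# Made-up data: x2 is nearly a multiple of x1, x3 is unrelated.
columns = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.8],
    "x3": [5, 1, 4, 2, 3],
}
flagged = correlated_pairs(columns)
candidates = strong_main_effects([0.5, 0.1, 0.4], ["x1", "x2", "x3"], top=2)
print(flagged, candidates)
```

Pairs flagged by correlated_pairs are the ones for which an apparent interaction in gbm deserves extra scrutiny.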

Advanced installation

Installing this package requires NumPy, so if installation fails with a complaint that NumPy is missing, add it to the install command:

pip install numpy sklearn-gbmi

For performance, this package is partly implemented using Cython (C extensions for Python). It includes a C file that was generated by Cython, which is compiled for your system when you install the package. Normally, this C file is fine, but occasionally, it may not compile, or the result may not run. In the first case, installing the package fails, while in the second case, using the package fails, typically with a cryptic error message; for example:

ValueError: sklearn.tree._criterion.Criterion size changed, may indicate binary incompatibility.

In such a case, you may still be able to install and use the package by regenerating the C file, as follows.

First, if this package is installed (i.e., installation succeeds, but usage fails), uninstall it:

pip uninstall sklearn-gbmi

Then, install Cython:

pip install cython

Next, set the environment variable USE_CYTHONIZE to 1. For bash and similar shells:

export USE_CYTHONIZE=1

For csh and similar shells:

setenv USE_CYTHONIZE 1

Finally, reinstall this package:

pip install sklearn-gbmi --no-cache-dir

The C file should be regenerated and compiled for your system, hopefully making this package usable on your system.
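The same steps can be driven from Python, which also sidesteps the pip versus pip3 question by invoking pip through the interpreter that is currently running. This is a rough sketch, not part of the package: the function names are hypothetical, and by default it only prints the commands rather than running them.

```python
import os
import subprocess
import sys


def reinstall_commands(package="sklearn-gbmi"):
    """Build the uninstall / install-Cython / reinstall commands as argument lists."""
    pip = [sys.executable, "-m", "pip"]  # pip for the running interpreter
    return [
        pip + ["uninstall", "--yes", package],
        pip + ["install", "cython"],
        pip + ["install", package, "--no-cache-dir"],
    ]


def reinstall(dry_run=True):
    """Print the commands, or run them with USE_CYTHONIZE=1 if dry_run is False."""
    env = dict(os.environ, USE_CYTHONIZE="1")
    commands = reinstall_commands()
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True, env=env)
    return commands


commands = reinstall()  # dry run: prints the three commands without executing them
```

Call reinstall(dry_run=False) to actually execute the steps; the environment variable is set only for the child processes, not for your shell.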