/ClusterShapley

Explaining dimensionality results using SHAP values

Primary LanguageJupyter NotebookBSD 3-Clause "New" or "Revised" LicenseBSD-3-Clause

pypi_version pypi_downloads

ClusterShapley

ClusterShapley is a technique to explain non-linear dimendionality reduction results. You can explain the cluster formation after reducing the dimensionality to 2D. Read the preprint or publisher versions for further details.

Installation

ClusterShapley depends upon common machine learning libraries, such as scikit-learn and NumPy. It also depends on SHAP.

Requirements:

  • shap
  • numpy
  • scipy
  • scikit-learn
  • pybind11

If you have these requirements installed, use PyPI:

pip install cluster-shapley

Usage examples

ClusterShapley package follows the same idea of sklearn classes, in which you need to fit and transform data.

Explaining cluster formation

Suppose you want to investigate the decisions of a dimensionality reduction (DR) technique to impose a projection on 2D. The first thing to do is to project the dataset.

import umap

import matplotlib.pyplot as plt

from sklearn.datasets import load_iris


data = load_iris()
X, y = data.data, data.target

reducer = umap.UMAP(verbose=0, random_state=0)
embedding = reducer.fit_transform(X)
plt.scatter(embedding[:, 0], embedding[:, 1], c=y)

UMAP embedding of the Iris dataset

Compute explanations

Now, you can generate explanations to understand why UMAP (or any other DR technique) imposed that cluster formation.

import random
import numpy as np
# our library
import dr_explainer as dre


# fit the dataset
clusterShapley = dre.ClusterShapley()
clusterShapley.fit(X, y)

# compute explanations for data subset

to_explain = np.array(random.sample(X.tolist(), int(X.shape[0] * 0.2)))

shap_values = clusterShapley.transform(to_explain)
The matrix shap_values of shape (3, 30, 4) contains:
  • the features' contributions for each class (3);
  • upon the samples used to generate explanations (30);
  • for each feature (4).

Visualize the contributions using SHAP plot

For now, you can rely on SHAP library to visualize the contributions

klass = 0
c_exp = shap.Explanation(shap_values[klass], data=to_explain, feature_names=data.feature_names)
shap.plots.beeswarm(c_exp)

Contributions for the embedding of class 0

The plot shows the contributions of each feature for the cohesion of the selected class. Example for 'petal length (cm)':

  • Low feature values (blue) contribute for the cohesion of the selected class.
  • Higher feature values (red) do not contribute for the cohesion.

Defining your own clusters

Suppose you want to investigate why UMAP clustered 2 classes together while projecting the third one distant in 2D.

To understand that, we can use ClusterShapley to explain how the features contribute to these two major clusters.

# fit KMeans with two clusters (see notebooks/ for the complete code)

Two clusters returned by KMeans on the embedding

Lets generate explanations knowing that cluster 0 is on right and cluster 1 is on left.

clusterShapley = dre.ClusterShapley()
clusterShapley.fit(X, kmeans.labels_)

shap_values = clusterShapley.transform(to_explain)

*For the right cluster*

c_exp = shap.Explanation(shap_values[0], data=to_explain, feature_names=data.feature_names)
shap.plots.beeswarm(c_exp)

Features' contributions for cluster 0

The right cluster is characterized by the low values of petal length (cm), petal width (cm), sepal length (cm).

*For the left cluster*

c_exp = shap.Explanation(shap_values[1], data=to_explain, feature_names=data.feature_names)
shap.plots.beeswarm(c_exp)

Features' contributions for cluster 1

On the other hand, the left cluster (composed by two classes) is characterized by high values of petal length (cm), petal width (cm), sepal length (cm).

Citation

Please, use the following reference to further details and to cite ClusterShapley in your work:

@article{MarcilioJr2021_ClusterShapley,
    title = {Explaining dimensionality reduction results using Shapley values},
    journal = {Expert Systems with Applications},
    volume = {178},
    pages = {115020},
    year = {2021},
    issn = {0957-4174},
    doi = {https://doi.org/10.1016/j.eswa.2021.115020},
    url = {https://www.sciencedirect.com/science/article/pii/S0957417421004619},
    author = {Wilson E. Marcílio-Jr and Danilo M. Eler}
    }

License

ClusterShapley follows the 3-clause BSD license.

ClusterShapley uses the open-source SHAP implementation from SHAP.