clusteval

clusteval is a Python package for unsupervised cluster evaluation. Three methods are implemented for evaluating clusterings: silhouette, dbindex, and derivative. Four clustering methods can be used: agglomerative, kmeans, dbscan, and hdbscan.

Installation
Requirements
Quick Start
Contribute
Citation

Installation

Install clusteval from PyPI (recommended). clusteval is compatible with Python 3.6+ and runs on Linux, MacOS X and Windows.
It is distributed under the MIT license.
A new environment can be created as follows:

conda create -n env_clusteval python=3.6
conda activate env_clusteval

pip install clusteval

The beta version can be installed from the GitHub source:

git clone https://github.com/erdogant/clusteval
cd clusteval
pip install -U .

Import clusteval package

from clusteval import clusteval

Create example data set

# Generate random data
from sklearn.datasets import make_blobs
X, labx_true = make_blobs(n_samples=750, centers=4, n_features=2, cluster_std=0.5)

Cluster validation using Silhouette score

# Determine the optimal number of clusters

ce = clusteval(method='silhouette')
ce.fit(X)
ce.plot()
ce.dendrogram()
ce.scatter(X)
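
As a rough illustration of what a silhouette-based search does, the sketch below scans candidate cluster counts with scikit-learn directly. It is not clusteval's internal code: the helper name `silhouette_scan`, the range 2 to 10, ward linkage, and the explicit, well-separated blob centers are assumptions chosen to keep the example reproducible.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumption: four explicit, well-separated centers and a fixed seed,
# so the example is deterministic (the README data uses random centers).
CENTERS = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, _ = make_blobs(n_samples=750, centers=CENTERS, cluster_std=0.5, random_state=0)


def silhouette_scan(X, k_min=2, k_max=10):
    """Hypothetical helper: map each k to the mean silhouette score."""
    scores = {}
    for k in range(k_min, k_max + 1):
        labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


scores = silhouette_scan(X)
# A higher silhouette score means tighter, better separated clusters.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The silhouette score lies between -1 and 1, so the candidate with the highest score is taken as the preferred number of clusters.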

Cluster validation using Davies-Bouldin index

# Determine the optimal number of clusters
ce = clusteval(method='dbindex')
ce.fit(X)
ce.plot()
ce.scatter(X)
ce.dendrogram()
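
The Davies-Bouldin index can be explored in the same way with scikit-learn. The sketch below is an assumption-laden illustration, not clusteval's implementation: the helper name `dbindex_scan`, the use of k-means, and the fixed blob centers are all choices made for this example. Unlike the silhouette score, a lower Davies-Bouldin index indicates a better clustering.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Assumption: explicit centers and a fixed seed for a deterministic example.
CENTERS = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, _ = make_blobs(n_samples=750, centers=CENTERS, cluster_std=0.5, random_state=0)


def dbindex_scan(X, k_min=2, k_max=10):
    """Hypothetical helper: map each k to its Davies-Bouldin index."""
    scores = {}
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = davies_bouldin_score(X, labels)
    return scores


scores = dbindex_scan(X)
# Lower is better, so take the minimum rather than the maximum.
best_k = min(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```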

Cluster validation using derivative method

# Determine the optimal number of clusters
ce = clusteval(method='derivative')
ce.fit(X)
ce.plot()
ce.scatter(X)
ce.dendrogram()
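
One common reading of a derivative criterion is to inspect the merge heights of a hierarchical clustering and pick the cluster count where those heights change most sharply. The sketch below follows that reading with SciPy; whether clusteval computes exactly this is an assumption, and the window of 10 merges and the ward linkage are illustrative choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import make_blobs

# Assumption: explicit centers and a fixed seed for a deterministic example.
CENTERS = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, _ = make_blobs(n_samples=750, centers=CENTERS, cluster_std=0.5, random_state=0)

# Column 2 of the linkage matrix holds the distance at which each merge happens.
Z = linkage(X, method="ward")

# Heights of the last 10 merges, reversed so that index 0 is the final merge
# (2 clusters -> 1), index 1 is the merge 3 -> 2, and so on.
heights = Z[-10:, 2][::-1]

# Second-order difference of the merge heights.
acceleration = np.diff(heights, 2)

# Index i of the second difference is mapped to k = i + 2 clusters.
best_k = int(acceleration.argmax()) + 2
print(best_k)
```

Because this criterion only looks at merge heights, it needs a hierarchical (agglomerative) clustering and no separate score per candidate.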

Cluster validation using dbscan

# Determine the optimal number of clusters using dbscan and silhouette
ce = clusteval(cluster='dbscan')
ce.fit(X)
ce.plot()
ce.scatter(X)
ce.dendrogram()
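
DBSCAN has no parameter for the number of clusters, so a search has to vary a density parameter instead. The sketch below scans `eps` and scores each result with the silhouette score. It is a minimal illustration under stated assumptions, not clusteval's search: the helper name `dbscan_scan`, the `eps` grid, `min_samples=5`, and the choice to exclude noise points before scoring are all made up for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumption: explicit centers and a fixed seed for a deterministic example.
CENTERS = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, _ = make_blobs(n_samples=750, centers=CENTERS, cluster_std=0.5, random_state=0)


def dbscan_scan(X, eps_grid, min_samples=5):
    """Hypothetical helper: list of (eps, n_clusters, silhouette) tuples."""
    results = []
    for eps in eps_grid:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        mask = labels != -1  # DBSCAN labels noise points as -1
        n_clusters = len(set(labels[mask]))
        if n_clusters < 2:
            continue  # the silhouette score needs at least two clusters
        score = silhouette_score(X[mask], labels[mask])
        results.append((float(eps), n_clusters, score))
    return results


results = dbscan_scan(X, np.linspace(0.2, 2.0, 10))
best_eps, best_n, best_score = max(results, key=lambda r: r[2])
print(best_eps, best_n, round(best_score, 3))
```

Excluding noise before scoring tends to favor small `eps` values that discard many points, so in practice the fraction of noise is worth checking alongside the score.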

Cluster validation using hdbscan

hdbscan must be installed separately. It is not included in the clusteval setup file because it frequently causes installation issues.

pip install hdbscan

# Determine the optimal number of clusters
ce = clusteval(cluster='hdbscan')
ce.fit(X)
ce.plot()
ce.scatter(X)
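
Because hdbscan is an optional dependency, a script may want to check for it before requesting it. The sketch below uses only the standard library; the helper name `has_module` and the fallback to dbscan are assumptions for illustration, not behavior provided by clusteval.

```python
import importlib.util


def has_module(name):
    """Hypothetical helper: True if a top-level module can be imported."""
    return importlib.util.find_spec(name) is not None


# Fall back to dbscan when the optional hdbscan package is missing.
cluster = "hdbscan" if has_module("hdbscan") else "dbscan"
print(cluster)
```

The resulting string could then be passed on, for example as `clusteval(cluster=cluster)`.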

Citation

Please cite clusteval in your publications if it is useful for your research. Here is an example BibTeX entry:

@misc{erdogant2019clusteval,
  title={clusteval},
  author={Erdogan Taskesen},
  year={2019},
  howpublished={\url{https://github.com/erdogant/clusteval}},
}

TODO

Use ARI when the ground truth clustering has large, equal-sized clusters
Use AMI when the ground truth clustering is unbalanced and small clusters exist
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html
https://scikit-learn.org/stable/auto_examples/cluster/plot_adjusted_for_chance_measures.html#sphx-glr-auto-examples-cluster-plot-adjusted-for-chance-measures-py
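
Both measures from the notes above are available in scikit-learn and compare predicted labels against a known ground truth. The sketch below is only an illustration of the two calls. It is not part of clusteval, and the k-means step and the fixed blob centers are assumptions made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

# Assumption: explicit centers and a fixed seed for a deterministic example.
CENTERS = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, labx_true = make_blobs(n_samples=750, centers=CENTERS, cluster_std=0.5, random_state=0)

labx_pred = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Both scores are adjusted for chance and do not depend on how the
# cluster labels happen to be numbered.
ari = adjusted_rand_score(labx_true, labx_pred)
ami = adjusted_mutual_info_score(labx_true, labx_pred)
print(round(ari, 3), round(ami, 3))
```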

Maintainer

Erdogan Taskesen, github: erdogant
Contributions are welcome.
If you wish to buy me a coffee for this work, it is very much appreciated :) Star it if you like it!
