clusteval
clusteval is a Python package for unsupervised cluster evaluation. Three methods are implemented to evaluate clusterings: silhouette, dbindex, and derivative. Four clustering methods can be used: agglomerative, kmeans, dbscan and hdbscan.
Installation
- Install clusteval from PyPI (recommended). clusteval is compatible with Python 3.6+ and runs on Linux, macOS and Windows.
- It is distributed under the MIT license.
- A new environment can be created as follows:
conda create -n env_clusteval python=3.6
conda activate env_clusteval
pip install clusteval
- The beta version can be installed from the GitHub source:
git clone https://github.com/erdogant/clusteval
cd clusteval
pip install -U .
Import clusteval package
from clusteval import clusteval
Create example data set
# Generate random data
from sklearn.datasets import make_blobs
X, labx_true = make_blobs(n_samples=750, centers=4, n_features=2, cluster_std=0.5)
Cluster validation using Silhouette score
# Determine the optimal number of clusters
ce = clusteval(method='silhouette')
ce.fit(X)
ce.plot()
ce.dendrogram()
ce.scatter(X)
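For intuition about what a silhouette-based search does, the sketch below scans a range of cluster counts with scikit-learn directly and keeps the count with the highest mean silhouette score. It is an illustration under assumptions, not clusteval's internal implementation: the fixed blob centers, the `random_state`, the Ward linkage and the search range of 2 to 9 clusters are all choices made here for reproducibility.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumption: four fixed, well-separated centers so the outcome is reproducible.
centers = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, _ = make_blobs(n_samples=750, centers=centers, cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 10):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    # Mean silhouette over all samples; higher means tighter, better-separated clusters.
    scores[k] = silhouette_score(X, labels)

# Keep the number of clusters with the highest silhouette score.
best_k = max(scores, key=scores.get)
print(f"best number of clusters: {best_k}")
```

The silhouette score lies between -1 and 1, so scores from different cluster counts can be compared directly.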
Cluster validation using Davies-Bouldin index
# Determine the optimal number of clusters
ce = clusteval(method='dbindex')
ce.fit(X)
ce.plot()
ce.scatter(X)
ce.dendrogram()
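The Davies-Bouldin index can also be computed with scikit-learn alone, which may help when interpreting the result above. The sketch below is only an illustration and does not claim to reproduce clusteval's internals: the fixed centers, the use of k-means and the search range are assumptions made for this example. Unlike the silhouette score, a lower Davies-Bouldin index indicates a better clustering.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Assumption: four fixed, well-separated centers so the outcome is reproducible.
centers = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, _ = make_blobs(n_samples=750, centers=centers, cluster_std=0.5, random_state=0)

db_scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Average similarity of each cluster to its most similar cluster; lower is better.
    db_scores[k] = davies_bouldin_score(X, labels)

# Keep the number of clusters with the lowest Davies-Bouldin index.
best_k = min(db_scores, key=db_scores.get)
print(f"best number of clusters: {best_k}")
```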
Cluster validation using derivative method
# Determine the optimal number of clusters
ce = clusteval(method='derivative')
ce.fit(X)
ce.plot()
ce.scatter(X)
ce.dendrogram()
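One common way to read a derivative-based criterion is to look at the merge heights of a hierarchical clustering and pick the point where they jump most sharply. The sketch below follows that idea with SciPy. It is an assumption about the general technique, not a description of what clusteval does internally; the Ward linkage, the fixed centers and the window of the last 10 merges are choices made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Assumption: four fixed, well-separated centers so the outcome is reproducible.
centers = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, _ = make_blobs(n_samples=750, centers=centers, cluster_std=0.5, random_state=0)

# Hierarchical clustering; column 2 of Z holds the height of each merge.
Z = linkage(X, method="ward")

# Heights of the last 10 merges, ordered from the final merge backwards,
# so heights[j] is the merge that reduces j + 2 clusters to j + 1.
heights = Z[-10:, 2][::-1]

# Second-order difference ("acceleration") of the merge heights.
acceleration = np.diff(heights, n=2)

# A large acceleration[i] means the merge from i + 2 to i + 1 clusters is far
# more expensive than the merges before it, which suggests i + 2 clusters.
best_k = int(np.argmax(acceleration)) + 2

# Cut the tree into at most best_k flat clusters.
labels = fcluster(Z, t=best_k, criterion="maxclust")
print(f"best number of clusters: {best_k}")
```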
Cluster validation using dbscan
# Determine the optimal number of clusters using dbscan and silhouette
ce = clusteval(cluster='dbscan')
ce.fit(X)
ce.plot()
ce.scatter(X)
ce.dendrogram()
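DBSCAN has no parameter for the number of clusters, so a search typically varies the neighborhood radius `eps` instead and scores each result. The sketch below does this with scikit-learn and the silhouette score. It is a minimal illustration under assumptions (the `eps` grid, `min_samples=5`, fixed centers, and scoring only the non-noise points), not clusteval's own search procedure.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumption: four fixed, well-separated centers so the outcome is reproducible.
centers = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, _ = make_blobs(n_samples=750, centers=centers, cluster_std=0.5, random_state=0)

best = {"eps": None, "score": -1.0, "n_clusters": 0}
for eps in np.linspace(0.2, 3.0, 15):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    mask = labels != -1  # DBSCAN marks noise points with the label -1
    n_clusters = len(set(labels[mask]))
    if n_clusters < 2:
        continue  # the silhouette score needs at least two clusters
    # Score only the points that were assigned to a cluster.
    score = silhouette_score(X[mask], labels[mask])
    if score > best["score"]:
        best = {"eps": float(eps), "score": float(score), "n_clusters": n_clusters}

print(best)
```

Scoring only the non-noise points can favor small `eps` values that discard many samples, so it is worth checking how many points end up labeled as noise before trusting the selected setting.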
Cluster validation using hdbscan
hdbscan must be installed separately before it can be used. It is not included in the clusteval
setup file because installing it frequently causes problems.
pip install hdbscan
# Determine the optimal number of clusters
ce = clusteval(cluster='hdbscan')
ce.fit(X)
ce.plot()
ce.scatter(X)
Citation
Please cite clusteval in your publications if it is useful for your research. Here is an example BibTeX entry:
@misc{erdogant2019clusteval,
title={clusteval},
author={Erdogan Taskesen},
year={2019},
howpublished={\url{https://github.com/erdogant/clusteval}},
}
TODO
- Use ARI when the ground-truth clustering has large, equal-sized clusters
- Use AMI when the ground-truth clustering is unbalanced and contains small clusters
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html
- https://scikit-learn.org/stable/auto_examples/cluster/plot_adjusted_for_chance_measures.html#sphx-glr-auto-examples-cluster-plot-adjusted-for-chance-measures-py
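Until these items are implemented, both scores can be computed with scikit-learn when ground-truth labels are available. The sketch below uses small hand-made label lists (hypothetical data, chosen only for illustration) to show that both scores ignore the names of the labels and drop below 1 when a sample is misassigned.

```python
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

labels_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
labels_same = [1, 1, 1, 0, 0, 0, 2, 2, 2]  # same partition, different label names
labels_off = [1, 1, 0, 0, 0, 0, 2, 2, 2]   # one sample assigned to the wrong cluster

# Adjusted Rand index: pairwise agreement, corrected for chance.
ari_same = adjusted_rand_score(labels_true, labels_same)
ari_off = adjusted_rand_score(labels_true, labels_off)

# Adjusted mutual information: shared information, corrected for chance.
ami_same = adjusted_mutual_info_score(labels_true, labels_same)
ami_off = adjusted_mutual_info_score(labels_true, labels_off)

print(f"ARI: identical partition={ari_same:.3f}, one error={ari_off:.3f}")
print(f"AMI: identical partition={ami_same:.3f}, one error={ami_off:.3f}")
```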