ClustPy provides a simple way to perform clustering in Python. For this purpose, it offers a variety of algorithms from different domains. Additionally, ClustPy includes methods that are often needed for research purposes, such as plots, clustering metrics or evaluation methods. Further, it integrates various frequently used datasets (e.g., from the UCI repository) through largely automated loading options.

The focus of the ClustPy package is not on efficiency (for that, we recommend e.g. pyclustering), but on the possibility to try out a wide range of modern scientific methods. In particular, this should also make lesser-known methods accessible in a simple and convenient way.

Since it largely follows the implementation conventions of sklearn clustering, it can be combined with many other packages (see below).

Installation

For Users

Stable Version

The current stable version can be installed by the following command:

pip install clustpy

If you want to install the complete package including all data loader functions, you should use:

pip install clustpy[full]

Note that a gcc compiler is required for installation. Therefore, in case of an installation error, make sure that:

Windows: Microsoft C++ Build Tools is installed
Linux/Mac: Python dev is installed (e.g., by running apt-get install python-dev; the exact command may differ depending on the Linux distribution)

The error messages may look like this:

'error: command 'gcc' failed: No such file or directory'
'Could not build wheels for clustpy, which is required to install pyproject.toml-based projects'
'Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools"'
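
Before installing, it can help to check whether a C compiler is visible on the PATH at all. The following is a minimal sketch using only the Python standard library. The helper name find_c_compiler and the list of candidate compiler names are assumptions made for illustration and are not part of ClustPy. On Windows, the MSVC compiler is often only on the PATH inside a developer command prompt, so a negative result there is not conclusive.

```python
import shutil


def find_c_compiler(candidates=("gcc", "cc", "clang", "cl")):
    """Return (name, path) of the first candidate found on PATH, else None.

    Hypothetical helper for illustration; the candidate names are assumptions.
    """
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return name, path
    return None


compiler = find_c_compiler()
if compiler is None:
    print("No C compiler found on PATH; see the installation hints above.")
else:
    print("Found C compiler:", compiler[0], "at", compiler[1])
```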

Development Version

The current development version can be installed directly from git by executing:

sudo pip install git+https://github.com/collinleiber/ClustPy.git

Alternatively, clone the repository, go to the directory and execute:

sudo python setup.py install

If you have no sudo rights you can use:

python setup.py install --prefix ~/.local

For Developers

Clone the repository, go to the directory and do the following (NumPy must be installed beforehand).

Install package locally and compile C files:

python setup.py install --prefix ~/.local

Copy compiled C files to correct file location:

python setup.py build_ext --inplace

Remove clustpy via pip to avoid ambiguities during development, e.g., when changing files in the code:

pip uninstall clustpy

Components

Clustering Algorithms

Partition-based Clustering

Algorithm	Publication	Published at	Original Code	Docs
DipInit (incl. DipExt)	Utilizing Structure-Rich Features to Improve Clustering	ECML PKDD 2020	Link (R)	Link
DipMeans	Dip-means: an incremental clustering method for estimating the number of clusters	NIPS 2012	Link (Matlab)	Link
Dip'n'sub (incl. TailoredDip)	Extension of the Dip-test Repertoire - Efficient and Differentiable p-value Calculation for Clustering	SIAM SDM 2023	Link (Python)	Link
GapStatistic	Estimating the number of clusters in a data set via the gap statistic	RSS: Series B 2002	-	Link
G-Means	Learning the k in k-means	NIPS 2003	-	Link
LDA-K-Means	Adaptive dimension reduction using discriminant analysis and K-means clustering	ICML 2007	-	Link
PG-Means	PG-means: learning the number of clusters in data	NIPS 2006	-	Link
Projected Dip-Means	The Projected Dip-means Clustering Algorithm	SETN 2018	-	Link
SkinnyDip (incl. UniDip)	Skinny-dip: Clustering in a Sea of Noise	KDD 2016	Link (R)	Link
SpecialK	k Is the Magic Number—Inferring the Number of Clusters Through Nonparametric Concentration Inequalities	ECML PKDD 2019	Link (Python)	Link
SubKmeans	Towards an Optimal Subspace for K-Means	KDD 2017	Link (Scala)	Link
X-Means	X-means: Extending k-means with efficient estimation of the number of clusters	ICML 2000	-	Link

Density-based Clustering

Algorithm	Publication	Published at	Original Code	Docs
Multi Density DBSCAN	Multi Density DBSCAN	IDEAL 2011	-	Link

Hierarchical Clustering

Algorithm	Publication	Published at	Original Code	Docs
DIANA	Finding Groups in Data: An Introduction to Cluster Analysis	JASA 1991	-	Link

Alternative Clustering / Non-redundant Clustering

Algorithm	Publication	Published at	Original Code	Docs
AutoNR	Automatic Parameter Selection for Non-Redundant Clustering	SIAM SDM 2022	Link (Python)	Link
NR-Kmeans	Discovering Non-Redundant K-means Clusterings in Optimal Subspaces	KDD 2018	Link (Scala)	Link
Orth1 + Orth2	Non-redundant multi-view clustering via orthogonalization	ICDM 2007	-	Link

Deep Clustering

Algorithm	Publication	Published at	Original Code	Docs
ACe/DeC	Details (Don't) Matter: Isolating Cluster Information in Deep Embedded Spaces	IJCAI 2021	Link (Python + PyTorch)	Link
AEC	Auto-encoder based data clustering	CIARP 2013	Link (Matlab)	Link
DCN	Towards K-means-friendly spaces: simultaneous deep learning and clustering	ICML 2017	Link (Python + Theano)	Link
DDC	Deep density-based image clustering	Knowledge-Based Systems 2020	Link (Python + Keras)	Link
DEC	Unsupervised deep embedding for clustering analysis	ICML 2016	Link (Python + Caffe)	Link
DeepECT	Deep embedded cluster tree	ICDM 2019	Link (Python + PyTorch)	Link
DipDECK	Dip-based Deep Embedded Clustering with k-Estimation	KDD 2021	Link (Python + PyTorch)	Link
DipEncoder	The DipEncoder: Enforcing Multimodality in Autoencoders	KDD 2022	Link (Python + PyTorch)	Link
DKM	Deep k-Means: Jointly clustering with k-Means and learning representations	Pattern Recognition Letters 2020	Link (Python + Tensorflow)	Link
ENRC	Deep Embedded Non-Redundant Clustering	AAAI 2020	Link (Python + PyTorch)	Link
IDEC	Improved Deep Embedded Clustering with Local Structure Preservation	IJCAI 2017	Link (Python + Keras)	Link
N2D	N2d:(not too) deep clustering via clustering the local manifold of an autoencoded embedding	ICPR 2021	Link (Python + Keras)	Link
VaDE	Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering	IJCAI 2017	Link (Python + Keras)	Link

Neural Networks

Algorithm	Publication	Published at	Original Code	Docs
Convolutional Autoencoder (ResNet)	Deep Residual Learning for Image Recognition	CVPR 2016	-	Link
Feedforward Autoencoder	Modular Learning in Neural Networks	AAAI 1987	-	Link
Neighbor Encoder	Representation Learning by Reconstructing Neighborhoods	arXiv 2018	-	Link
Stacked Autoencoder	Greedy Layer-Wise Training of Deep Networks	NIPS 2006	-	Link
Variational Autoencoder	Auto-Encoding Variational Bayes	ICLR 2014	-	Link

Other implementations

Metrics
- Confusion Matrix [Docs]
- Fair Normalized Mutual Information (FNMI) [Publication] [Docs]
- Hierarchical Metrics
  - Dendrogram Purity [Publication] [Docs]
  - Leaf Purity [Publication] [Docs]
- Information-Theoretic External Cluster-Validity Measure (DOM) [Publication] [Docs]
- Pair Counting Scores (f1, rand, jaccard, recall, precision) [Publication] [Docs]
- Purity [Publication] [Docs]
- Scores for multiple labelings (see alternative clustering algorithms)
  - Multiple Labelings Confusion Matrix [Docs]
  - Multiple Labelings Pair Counting Scores [Publication] [Docs]
- Unsupervised Clustering Accuracy [Publication] [Docs]
- Variation of information [Publication] [Docs]
Utils
- Automatic evaluation methods [Docs]
- Hartigan's Dip-test [Publication] [Docs]
- Various plots [Docs]
Datasets
- Synthetic dataset creators
  - For common subspace clustering [Docs]
  - For alternative clustering [Docs]
- Real-world dataset loaders (e.g., Iris, Wine, Mice protein, Optdigits, MNIST, ...)
  - UCI Repository [Website]
  - UEA & UCR Time Series Classification Repository [Website]
  - MedMNIST [Website]
  - Torchvision Datasets [Website]
  - Sklearn Datasets [Website]
  - Others
- Dataset loaders for datasets with multiple labelings
  - ALOI (subset) [Website]
  - CMU Face [Website]
  - Dancing Stickfigures [Publication]
  - Fruit [Publication]
  - NRLetters [Publication]
  - WebKB [Website]

Python environments

ClustPy utilizes global Python environment variables in some places. These can be defined using os.environ['VARIABLE_NAME'] = VARIABLE_VALUE. The following variable names are used:

'CLUSTPY_DATA': Defines the path where downloaded datasets should be saved.
'CLUSTPY_DEVICE': Defines the device to be used for PyTorch applications. Example: os.environ['CLUSTPY_DEVICE'] = 'cuda:1'
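
As a minimal sketch, both variables can be set at the top of a script. The directory name used here is a hypothetical placeholder, and setting the variables before the first ClustPy import is a cautious assumption rather than a documented requirement. Note that os.environ only accepts string values.

```python
import os

# Hypothetical download directory; any writable path can be used instead.
os.environ["CLUSTPY_DATA"] = os.path.join(os.path.expanduser("~"), "clustpy_datasets")

# Device string taken from the example above; "cpu" is an alternative
# if no GPU is available.
os.environ["CLUSTPY_DEVICE"] = "cuda:1"

print("Datasets will be stored in:", os.environ["CLUSTPY_DATA"])
print("PyTorch device:", os.environ["CLUSTPY_DEVICE"])
```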

Compatible packages

We stick as closely as possible to the implementation conventions of sklearn clustering. Therefore, our methods are compatible with many other packages. Examples are:

sklearn clustering
- K-Means
- Affinity propagation
- Mean-shift
- Spectral clustering
- Ward hierarchical clustering
- Agglomerative clustering
- DBSCAN
- OPTICS
- Gaussian mixtures
- BIRCH
kmodes
- k-modes
- k-prototypes
HDBSCAN
- HDBSCAN
scikit-learn-extra
- k-medoids
- Density-Based common-nearest-neighbors clustering
Density Peak Clustering
- DPC
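
In practice, this compatibility means that estimators can be handled uniformly as long as they offer fit(X) and store their result in a labels_ attribute, as the examples in this README do. The following sketch illustrates the pattern using only sklearn estimators on toy data. The helper fit_all is hypothetical, and the cluster centers are chosen arbitrarily. A ClustPy estimator such as SubKmeans(3) could presumably be added to the dictionary in the same way.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Toy data with three well-separated clusters (centers chosen arbitrarily).
X, y = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

estimators = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
}


def fit_all(estimators, X):
    """Fit each estimator and collect its labels_ attribute (hypothetical helper)."""
    results = {}
    for name, estimator in estimators.items():
        estimator.fit(X)
        results[name] = np.asarray(estimator.labels_)
    return results


labels_by_name = fit_all(estimators, X)
for name, labels in labels_by_name.items():
    print(name, adjusted_rand_score(y, labels))
```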

Coding Examples

1)

In this first example, the subspace algorithm SubKmeans is run on a synthetic subspace dataset. Afterward, the clustering accuracy is calculated to evaluate the result.

from clustpy.partition import SubKmeans
from clustpy.data import create_subspace_data
from clustpy.metrics import unsupervised_clustering_accuracy as acc

data, labels = create_subspace_data(1000, n_clusters=4, subspace_features=[2,5])
sk = SubKmeans(4)
sk.fit(data)
acc_res = acc(labels, sk.labels_)
print("Clustering accuracy:", acc_res)
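
For readers unfamiliar with the metric: clustering accuracy is commonly defined as the share of correctly assigned samples under the best one-to-one mapping between predicted cluster ids and ground-truth classes. The following is a minimal sketch of that common definition using NumPy and SciPy. It is not ClustPy's implementation, and the function name clustering_accuracy_sketch is an assumption made for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def clustering_accuracy_sketch(labels_true, labels_pred):
    """Accuracy under the best one-to-one matching of cluster ids to classes."""
    labels_true = np.asarray(labels_true).ravel()
    labels_pred = np.asarray(labels_pred).ravel()
    # Map arbitrary label values to consecutive indices 0..k-1.
    true_ids, true_idx = np.unique(labels_true, return_inverse=True)
    pred_ids, pred_idx = np.unique(labels_pred, return_inverse=True)
    # Contingency table: rows = predicted clusters, columns = true classes.
    contingency = np.zeros((len(pred_ids), len(true_ids)), dtype=np.int64)
    np.add.at(contingency, (pred_idx, true_idx), 1)
    # Hungarian method: choose the matching that maximizes the matched count.
    rows, cols = linear_sum_assignment(contingency, maximize=True)
    return contingency[rows, cols].sum() / labels_true.shape[0]


# Permuted cluster ids still count as a perfect result.
perfect = clustering_accuracy_sketch([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
# One of four samples ends up in the wrong cluster.
partial = clustering_accuracy_sketch([0, 0, 1, 1], [0, 0, 0, 1])
print(perfect, partial)
```

Because the matching is one-to-one, the score does not depend on how the algorithm happens to number its clusters.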

2)

The second example covers the topic of non-redundant/alternative clustering. Here, the NrKmeans algorithm is run on the Fruit dataset. Note that NrKmeans, as a non-redundant clustering algorithm, returns multiple labelings. Therefore, we calculate the confusion matrix by comparing each combination of labels using the normalized mutual information (NMI). The confusion matrix is printed, and finally the best matching NMI is stated for each set of labels.

from clustpy.alternative import NrKmeans
from clustpy.data import load_fruit
from clustpy.metrics import MultipleLabelingsConfusionMatrix
from sklearn.metrics import normalized_mutual_info_score as nmi
import numpy as np

data, labels = load_fruit(return_X_y=True)
nk = NrKmeans([3, 3])
nk.fit(data)
mlcm = MultipleLabelingsConfusionMatrix(labels, nk.labels_, nmi)
mlcm.rearrange()
print(mlcm.confusion_matrix)
print(np.max(mlcm.confusion_matrix, axis=1))
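
To make the idea of comparing each combination of labels concrete, the following sketch builds such a matrix by hand on toy data. It is an illustration of the concept and not ClustPy's implementation. The function name pairwise_score_matrix is hypothetical, and the layout (rows = ground-truth labelings, columns = predicted labelings, one labeling per column of the input arrays) is an assumption.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi


def pairwise_score_matrix(labels_true, labels_pred, metric=nmi):
    """Score every ground-truth labeling against every predicted labeling."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    # Treat a single labeling as a matrix with one column.
    if labels_true.ndim == 1:
        labels_true = labels_true.reshape(-1, 1)
    if labels_pred.ndim == 1:
        labels_pred = labels_pred.reshape(-1, 1)
    scores = np.zeros((labels_true.shape[1], labels_pred.shape[1]))
    for i in range(labels_true.shape[1]):
        for j in range(labels_pred.shape[1]):
            scores[i, j] = metric(labels_true[:, i], labels_pred[:, j])
    return scores


# Toy data: four samples, two ground-truth labelings, two predicted labelings.
gt = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
# The predictions recover both labelings, but in swapped column order.
pred = np.array([[1, 0], [0, 0], [1, 1], [0, 1]])

scores = pairwise_score_matrix(gt, pred)
print(scores)
print(np.max(scores, axis=1))  # best matching score per ground-truth labeling
```

Taking the maximum per row reports, for each ground-truth labeling, the predicted labeling that matches it best, regardless of the order in which the algorithm returns its labelings.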

3)

One notable feature of the ClustPy package is the ability to run various modern deep clustering algorithms out of the box. For example, the following code runs the DEC algorithm on the Optdigits dataset. To evaluate the result, we compute the adjusted Rand index (ARI).

from clustpy.deep import DEC
from clustpy.data import load_optdigits
from sklearn.metrics import adjusted_rand_score as ari

data, labels = load_optdigits(return_X_y=True)
dec = DEC(10)
dec.fit(data)
my_ari = ari(labels, dec.labels_)
print(my_ari)

4)

In this more complex example, we use ClustPy's evaluation functions, which automatically run the specified algorithms multiple times on previously defined datasets. All results of the given metrics are stored in a Pandas dataframe.

from clustpy.utils import EvaluationDataset, EvaluationAlgorithm, EvaluationMetric, evaluate_multiple_datasets
from clustpy.partition import ProjectedDipMeans, SubKmeans
from sklearn.metrics import normalized_mutual_info_score as nmi, silhouette_score
from sklearn.cluster import KMeans, DBSCAN
from clustpy.data import load_breast_cancer, load_iris, load_wine
from clustpy.metrics import unsupervised_clustering_accuracy as acc
from sklearn.decomposition import PCA
import numpy as np

def reduce_dimensionality(X, dims):
    pca = PCA(dims)
    X_new = pca.fit_transform(X)
    return X_new

def znorm(X):
    return (X - np.mean(X)) / np.std(X)

def minmax(X):
    return (X - np.min(X)) / (np.max(X) - np.min(X))

datasets = [
    EvaluationDataset("Breast_pca_znorm", data=load_breast_cancer, preprocess_methods=[reduce_dimensionality, znorm],
                      preprocess_params=[{"dims": 0.9}, {}], ignore_algorithms=["pdipmeans"]),
    EvaluationDataset("Iris_pca", data=load_iris, preprocess_methods=reduce_dimensionality,
                      preprocess_params={"dims": 0.9}),
    EvaluationDataset("Wine", data=load_wine),
    EvaluationDataset("Wine_znorm", data=load_wine, preprocess_methods=znorm)]

algorithms = [
    EvaluationAlgorithm("SubKmeans", SubKmeans, {"n_clusters": None}),
    EvaluationAlgorithm("pdipmeans", ProjectedDipMeans, {}),  # Determines n_clusters automatically
    EvaluationAlgorithm("dbscan", DBSCAN, {"eps": 0.01, "min_samples": 5}, preprocess_methods=minmax,
                        deterministic=True),
    EvaluationAlgorithm("kmeans", KMeans, {"n_clusters": None}),
    EvaluationAlgorithm("kmeans_minmax", KMeans, {"n_clusters": None}, preprocess_methods=minmax)]

metrics = [EvaluationMetric("NMI", nmi), EvaluationMetric("ACC", acc),
           EvaluationMetric("Silhouette", silhouette_score, use_gt=False)]

df = evaluate_multiple_datasets(datasets, algorithms, metrics, n_repetitions=5,
                                aggregation_functions=[np.mean, np.std, np.max, np.min],
                                add_runtime=True, add_n_clusters=True, save_path=None,
                                save_intermediate_results=False)
print(df)

Citation

If you use the ClustPy package in the context of a scientific publication, please cite it as follows:

Leiber, C., Miklautz, L., Plant, C., Böhm, C. (2023, December). Benchmarking Deep Clustering Algorithms With ClustPy. 2023 IEEE International Conference on Data Mining Workshops (ICDMW). [DOI]

BibTeX:

@inproceedings{leiber2023benchmarking,
  title = {Benchmarking Deep Clustering Algorithms With ClustPy},
  author = {Leiber, Collin and Miklautz, Lukas and Plant, Claudia and Böhm, Christian},
  booktitle = {2023 IEEE International Conference on Data Mining Workshops (ICDMW)},
  year = {2023},
  pages = {625-632},
  publisher = {IEEE},
  doi = {10.1109/ICDMW60847.2023.00087}
}

Publications using ClustPy

Application of Deep Clustering Algorithms (10/2023)
Benchmarking Deep Clustering Algorithms With ClustPy (12/2023)
Data with Density-Based Clusters: A Generator for Systematic Evaluation of Clustering Algorithms (08/2024)