/flakylib

Flaky Clustering Library (Minimum Sum-Of-Squares Clustering)

Primary LanguageJupyter NotebookMIT LicenseMIT

flakylib

Flaky Clustering Library (Minimum Sum-Of-Squares Clustering)

Clustering is one of the main methods for getting insight on the underlying nature and structure of data in unsupervised way. The purpose of clustering is organizing a set of data into clusters, such that the elements in each cluster are similar and different from those in other clusters.

K-means is one of the most used and fastest clustering algorithms. Actually, in nowadays K-means represents a big family of algorithms aimed to solving Minimum Sum-of-Square Clustering problem. FlakyLib is a Python library of K-means algorithm family optimized for Big Data clustering. Most of the algorithms in the FlakyLib are parallelized and optimized for high-performance computing with Numba, which translates Python functions to optimized machine code at runtime using the industry-standard LLVM compiler library. All FlakyLib algorithms are aimed to solving Minimum Sum-Of-Squares Clustering problem.

In solving large size problems, there are two major drawbacks of standard K-means technique: (i) since it has to process the Large/Big input dataset, it has heavy computational costs and (ii) it has a tendency to converge to one of the local minima of poor quality. In order to reduce the computational complexity, we collect a clustering techniques that utilize parallelization, memory and computational optimizations. To avoid the local minima convergence problem the different Variable Neighbourhood Search (VNS) techniques was used. Using FlakyLib it is possible to clusterize a tens/hundreds of millions of entities in a reasonable amount of time with efficient memory usage.

In most of the cases the naive K-means algorithm is not provide the best possible solution getting stuck in the nearest pit of local minimum. But using different VNS-based approaches (metaheuristics) it is possible to force K-means moving forward and on every iteration increasing solution quality. VNS allow us to transform the local search algorithms (like naive K-means) to global ones.