/Parallel-K-Means-Clustering

A parallel implementation of the K-means clustering algorithm in Python

Primary LanguagePython

Parallel-K-Means-Clustering

In this report, we propose a parallel version of the K-means clustering algorithm and implement it using Python’s multiprocessing library. We then run a series of simulations to compare the runtime and speedup between the basic K-means clustering algorithm and our parallel version. We also explore the impact of incorporating more CPUs into our parallel algorithm. Our findings confirm what we intuitively thought would happen based on our understanding of parallel computing and the “embarrassingly parallel” nature of the K-means clustering algorithm. For datasets with fewer than 5,000 observations, the basic K-means clustering algorithm outperforms the parallel version. However, as the number of data points increases, our parallel K-means clustering algorithm significantly reduces overall time complexity. Added measures for speedup, scaleup and sizeup provide further evidence that our parallel K-means clustering algorithm is both robust and scalable.