Intro

This is an implementation of k-means clustering algorithm for distributed systems using Hadoop. You will need to manually specify the number of clusters K when launching the program.

Generating dataset

You can generate an N points dataset using datasetgen.py. You have to write also the number of clusters K and the standard deviation. The command will be like:

python datasetgen.py N K STD

example:

python datasetgen.py 1000 3 0.45

Plotting

If you are running the program on a single node for testing purposes, you can also plot the result using:

python plot.py

Other k-means versions

We made also:

Sequential version in C++
GPU accelerated version in CUDA/C++

Acknowledgments

Parallel Computing - Computer Engineering Master Degree @University of Florence.