Project for Cornell CS4320 that implements the k-means clustering algorithm using Hadoop MapReduce.
Giri Kuncoro (gk256@cornell.edu), Batu Inal (bi49@cornell.edu)
The k-means clustering algorithm follows these steps:
- Pick k points to serve as the initial cluster centroids
- For every point Pk, find the closest centroid Ci (using Euclidean distance) and associate Pk with Ci
- Update each centroid Ci by setting it to the mean of all points associated with it in the previous step
- Repeat for a specific number of iterations or until the centroids stop changing, whichever comes first
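The steps above can be sketched in plain Java as a single-machine, in-memory loop. This is only an illustration and not the project's Hadoop code: the class name KMeansSketch and all method names are hypothetical, the assignment step stands in for what a mapper would do, and the update step stands in for what a reducer would do. Keeping the old centroid when a cluster receives no points is an assumption made here; the steps above do not say how to handle that case.

```java
import java.util.Arrays;

/** In-memory sketch of the k-means loop; hypothetical names, no Hadoop dependency. */
final class KMeansSketch {

    /** Squared Euclidean distance; it picks the same nearest centroid as the true distance. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Assignment step (a mapper's role): index of the centroid closest to the point. */
    static int closestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.length; i++) {
            double dist = squaredDistance(point, centroids[i]);
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }

    /** Update step (a reducer's role): each new centroid is the mean of its assigned points. */
    static double[][] updateCentroids(double[][] points, double[][] centroids) {
        int k = centroids.length;
        int dims = centroids[0].length;
        double[][] sums = new double[k][dims];
        int[] counts = new int[k];
        for (double[] p : points) {
            int c = closestCentroid(p, centroids);
            counts[c]++;
            for (int d = 0; d < dims; d++) {
                sums[c][d] += p[d];
            }
        }
        double[][] next = new double[k][];
        for (int i = 0; i < k; i++) {
            if (counts[i] == 0) {
                // Assumption: a cluster with no points keeps its previous centroid.
                next[i] = centroids[i].clone();
            } else {
                next[i] = new double[dims];
                for (int d = 0; d < dims; d++) {
                    next[i][d] = sums[i][d] / counts[i];
                }
            }
        }
        return next;
    }

    /** Repeats both steps until the centroids stop changing or maxIterations is reached. */
    static double[][] run(double[][] points, double[][] initial, int maxIterations) {
        double[][] centroids = initial;
        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] next = updateCentroids(points, centroids);
            if (Arrays.deepEquals(next, centroids)) {
                break; // centroids are unchanged, so further iterations would do nothing
            }
            centroids = next;
        }
        return centroids;
    }
}
```

For example, calling KMeansSketch.run with the points (0,0), (0,2), (10,10), (10,12), the initial centroids (0,0) and (10,10), and a limit of 10 iterations stops after the second pass with centroids (0,1) and (10,11). The sketch compares centroids for exact equality; a tolerance-based check is a common alternative when the data makes exact convergence unlikely.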
The implementation follows the algorithm given in the CS4320 Homework 4 instructions and the Hadoop documentation, particularly the sections on the WritableComparator, Mapper, and Reducer classes.