Primary language: Java. License: MIT.

MapReduce-Kmeans

An implementation of the k-means algorithm using Hadoop and HDFS written in Java.
The program was developed and tested on a Windows 10 machine using hadoop-3.3.5 and a Maven project structure, with K = 3.

  1. Description
  2. Installing & Configuring Hadoop Locally
  3. Running K-Means on Hadoop
  4. Results
  5. Notes

Description

This project implements the k-means clustering algorithm on Hadoop, using synthetic data as a sample. The data can be found at src/main/resources/data.txt. It was generated by the DataGenerator.java component, biased towards the 3 initial centers stored in src/main/resources/centroid.txt. A visual representation of the data can be obtained by running the DataPlotter.java file.
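To make the algorithm concrete, here is a minimal plain-Java sketch of a single k-means iteration, split the way a MapReduce job would split it: a map step that assigns each point to its nearest center, and a reduce step that averages the points of each group. It does not use the Hadoop API and is not the repository's code. The class and method names (KMeansIterationSketch, nearestCentroid, mean, iterate) are illustrative assumptions, as is the choice to leave a center unchanged when no points are assigned to it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** One k-means iteration written as a map phase and a reduce phase (plain Java, no Hadoop). */
final class KMeansIterationSketch {

    /** Map phase: index of the centroid nearest to the point (squared Euclidean distance). */
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.length; i++) {
            double dx = point[0] - centroids[i][0];
            double dy = point[1] - centroids[i][1];
            double dist = dx * dx + dy * dy;
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }

    /** Reduce phase: the new centroid is the mean of all points assigned to it. */
    static double[] mean(List<double[]> points) {
        double sumX = 0;
        double sumY = 0;
        for (double[] p : points) {
            sumX += p[0];
            sumY += p[1];
        }
        return new double[] { sumX / points.size(), sumY / points.size() };
    }

    /** One full iteration: group points by nearest centroid, then average each group. */
    static double[][] iterate(double[][] points, double[][] centroids) {
        Map<Integer, List<double[]>> groups = new HashMap<>();
        for (double[] p : points) {
            groups.computeIfAbsent(nearestCentroid(p, centroids), k -> new ArrayList<>()).add(p);
        }
        double[][] updated = new double[centroids.length][];
        for (int i = 0; i < centroids.length; i++) {
            List<double[]> assigned = groups.get(i);
            // Assumption: a centroid with no assigned points keeps its previous position.
            updated[i] = (assigned == null) ? centroids[i].clone() : mean(assigned);
        }
        return updated;
    }

    public static void main(String[] args) {
        double[][] points = { {0, 0}, {2, 0}, {10, 10}, {12, 10} };
        double[][] centroids = { {1, 1}, {9, 9} };
        for (double[] c : iterate(points, centroids)) {
            System.out.println(c[0] + "," + c[1]); // prints 1.0,0.0 then 11.0,10.0
        }
    }
}
```

In a Hadoop job, the grouping done here with a HashMap would be handled by the shuffle between the mapper and the reducer, with the centroid index as the key.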

Installing & Configuring Hadoop Locally

Windows

  1. Watch this video and follow the steps closely.
  2. Open the Windows Command Prompt (cmd) as an administrator.
  3. Navigate to the folder where you installed Hadoop, e.g. C:\hadoop-3.3.5.
  4. From there, navigate to the sbin subfolder (hadoop/sbin).
  5. Type start-all.cmd to start all the Hadoop services (daemons).
  6. To confirm that it is working, open http://localhost:9870/ in your browser. Keep this tab open; it will come in handy later.

Warnings!

  1. When setting environment variables, make sure the JAVA_HOME and HADOOP_HOME paths don't contain any spaces.
  2. Hadoop runs on Java 8 or later; Hadoop 3.3.x officially supports Java 8 and Java 11 (runtime only).
  3. If you still get errors, especially Java exceptions, try searching for them on the web.

Ubuntu Linux

You can install Hadoop on Ubuntu by following this article.

Running K-Means on Hadoop

Before you start

Put the data.txt and centroid.txt files from the resources folder into the same directory in HDFS. You can do that by opening a terminal and running:

$ hdfs dfs -copyFromLocal <path-to-data.txt> <destination-folder-in-hdfs>
$ hdfs dfs -copyFromLocal <path-to-centroid.txt> <destination-folder-in-hdfs>

1. Clone this repository and navigate to the folder:

$ git clone https://github.com/nickkatsios/MapReduce-Kmeans.git
$ cd MapReduce-Kmeans

2. Build project using Maven:

$ mvn install

A target folder should be generated, containing the project's JAR file (MapReduce-Kmeans-1.0-SNAPSHOT.jar).

3. Run the k-means algorithm using:

$ cd target
$ hadoop jar KmeansTest-1.0-SNAPSHOT.jar gr.aueb.dmst.nickkatsios.KMeans <input-hdfs-directory> <output-hdfs-directory>

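The input directory is the HDFS directory where you put your data.txt and centroid.txt files. The output directory is the base name from which the per-iteration output folders are named. If the JAR that Maven generated in target has a different name from the one in the command above, use the generated name.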

You are done! With the example data and centroid files, convergence should be reached after ~10 iterations.
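The idea of iterating until convergence can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the project's driver: the stopping rule (no center moves more than THRESHOLD between two iterations), the MAX_ITERATIONS cap, and all class and method names are hypothetical, since the stopping criterion is not described here.

```java
/** Driver-loop sketch: repeat k-means iterations until the centers stop moving. */
final class KMeansDriverSketch {

    // Both limits are assumptions for illustration, not values taken from the project.
    static final double THRESHOLD = 1e-4;
    static final int MAX_ITERATIONS = 100;

    /** Final centers plus the number of iterations that were run. */
    static final class Result {
        final double[][] centroids;
        final int iterations;

        Result(double[][] centroids, int iterations) {
            this.centroids = centroids;
            this.iterations = iterations;
        }
    }

    static double squaredDistance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }

    /** One iteration (the work of one MapReduce job): assign points, then average each group. */
    static double[][] step(double[][] points, double[][] centroids) {
        int k = centroids.length;
        double[][] sums = new double[k][2];
        int[] counts = new int[k];
        for (double[] p : points) {
            int best = 0;
            for (int i = 1; i < k; i++) {
                if (squaredDistance(p, centroids[i]) < squaredDistance(p, centroids[best])) {
                    best = i;
                }
            }
            sums[best][0] += p[0];
            sums[best][1] += p[1];
            counts[best]++;
        }
        double[][] next = new double[k][];
        for (int i = 0; i < k; i++) {
            // Assumption: an empty cluster keeps its previous center.
            next[i] = (counts[i] == 0)
                    ? centroids[i].clone()
                    : new double[] { sums[i][0] / counts[i], sums[i][1] / counts[i] };
        }
        return next;
    }

    /** True when no center moved farther than THRESHOLD between two iterations. */
    static boolean converged(double[][] previous, double[][] next) {
        for (int i = 0; i < previous.length; i++) {
            if (Math.sqrt(squaredDistance(previous[i], next[i])) > THRESHOLD) {
                return false;
            }
        }
        return true;
    }

    /** Runs iterations until convergence or until MAX_ITERATIONS is reached. */
    static Result run(double[][] points, double[][] initialCentroids) {
        double[][] current = initialCentroids;
        for (int iteration = 1; iteration <= MAX_ITERATIONS; iteration++) {
            double[][] next = step(points, current);
            boolean done = converged(current, next);
            current = next;
            if (done) {
                return new Result(current, iteration);
            }
        }
        return new Result(current, MAX_ITERATIONS);
    }

    public static void main(String[] args) {
        double[][] points = { {0, 0}, {2, 0}, {10, 10}, {12, 10} };
        double[][] start = { {1, 1}, {9, 9} };
        Result result = run(points, start);
        System.out.println("iterations: " + result.iterations); // prints iterations: 2
        for (double[] c : result.centroids) {
            System.out.println(c[0] + "," + c[1]);
        }
    }
}
```

In the Hadoop version, each pass of this loop corresponds to one MapReduce job, and the centers produced by one job serve as the input centers of the next.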

Results

  1. In the browser tab where http://localhost:9870/ (the NameNode web UI) is open, navigate to Utilities --> Browse the file system.
  2. After convergence, one folder per iteration should have been generated, each containing that iteration's output and named after the output path specified when running the jar. Navigate to the most recent one.
  3. Download the part-r-00000 file and open it with a text editor. It should contain the final centers (x,y).

Screenshot: the cmd output for each iteration (one iteration = one MapReduce job).

Screenshot: the state of the filesystem after running the jar.

Screenshot: the final directory, with the final centers in the part-r-00000 file.

Screenshot: the part-r-00000 file opened in Notepad.

Notes

This project was made as an assignment for the Big Data Management Systems course at DMST AUEB.

Team members
Nikolaos Katsios 8200071
Theodoros Skondras Mexis 8200156