Inverted Index formulation using TF-IDF on a Hadoop Cluster

Cluster Configuration

Create a user called "hadoopusr" or similar, on all nodes which are to be part of the cluster. ALL HADOOP OPERATIONS ARE PERFORMED THROUGH THIS USER
As the newly created hadoopusr, setup the node for pseudo-distributed mode as hown in the link from the apache site. Repeat for all nodes which will form the cluster
Once the setup is complete, identify one of the nodes to be a master. The remaining nodes are slaves
Add the master's public key to all the slave nodes through the following commands:
On Master:

cp ~/.ssh/id_rsa.pub /tmp/master

On each slave:

scp <master IP>:/tmp/master /tmp/master
cat /tmp/master >> ~/.ssh/authorized_keys

Configure the node for operation

For each node, the following three changes need to be made. All the below files are located in the $HADOOP_HOME/etc/hadoop folder
core-site.xml: Add the master node's hostname for the property field
yarn-site.xml: Add a property for the YARN hostname, and set it to the hostname of the master
hdfs-site.xml: Add the local disk path where the HDFS files will be stored. One property for each namenode as well as datanode.

In addition to these, on the master node, add the hostname of each of the slaves to the slaves file.

Configure /etc/hosts

On each node, open the /etc/hosts file in sudo mode.
Add entries for IP address hostname for each node if the cluster
Uncomment the line near the top of the file which sets the loopback IP address to the hostname
For e.g.:

localhost 127.0.0.1
# ubuntu 127.0.0.1

Configure Job History Server

Configure the yarn-site.xml file to support log aggregation to store history of past applications run on the cluster.

Sample configuration files are included in the conf folder for each of the above

Running Hadoop Cluster

If the cluster is being setup for the first time, format the namenode through the command:
hdfs namenode -format
This needs to be run everytime there is a change to the hdfs-site.xml file.

Some utility scripts are included in the scripts folder to start and stop the hadoop cluster. They need to be modified as needed for the cluster to run correctly.
On the master node, execute the start-all.sh script included in the repo to start the Cluster. stop-all.sh will stop the cluster.

If needed, additionally run the following command on each slave node to start the history server:
mr-jobhistory-daemon.sh --config $HADOOP_HOME/etc/hadoop start historyserver

Compiling the code

The code can be compiled into a jar using the following commands:

hadoop com.sun.tools.javac.Main InvertedIndex.java -d inverted-index
jar cf InvertedIndex.jar inverted-index/*.class

Similarly, the same commands can be repeated for the QueryDocs.java file to create the Query.jar file

Running the code

Upload the dataset to the HDFS using the following commands:

hdfs dfs -mkdir -p /usr/input/dataset
hdfs dfs -put <path to dataset on local FS> /usr/input/dataset

Once the dataset is uploaded to HDFS, the map-reduce job can be started through:
hadoop jar InvertedIndex.jar InvertedIndex <HDFS Input path> <Hdfs output path>

To fetch the result, run hdfs dfs -cat <path to output>/part-*

Once the index is built, the Query can be run using the command run-query.sh <Inverted Index location on HDFS> <Query Output path on HDFS>

prashravoor/abda-inverted-index