Implementation of the HITS Algorithm using Giraph

Setting up Giraph

Giraph runs on top of Hadoop. Download the Giraph binaries (make sure to get the bin-hadoop2 version).
Then configure Giraph and run a few of its sample programs to confirm that the installation works.

Add the following environment variables to your ~/.bashrc file:

GIRAPH_HOME
HADOOP_HOME
HADOOP_CONF_DIR

If needed, add the bin folder of Giraph to PATH.
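
As a rough sketch, the variables above could be set like this in ~/.bashrc. The install locations /usr/local/hadoop and /usr/local/giraph are placeholders, not paths taken from this project, and the etc/hadoop configuration directory assumes a standard Hadoop 2 layout; adjust all of them to match your machine.

```shell
# Hypothetical install locations; replace with the paths on your machine.
export HADOOP_HOME=/usr/local/hadoop
export GIRAPH_HOME=/usr/local/giraph

# Assumes the standard Hadoop 2 layout, where configuration lives in etc/hadoop.
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"

# Optional: make the Giraph scripts available on PATH.
export PATH="$PATH:$GIRAPH_HOME/bin"
```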

Common errors

All the Giraph libraries need to be copied into the Hadoop directory for the examples to work. Run this command to copy them:

cp $GIRAPH_HOME/*.jar $GIRAPH_HOME/lib/*.jar $HADOOP_HOME/share/hadoop/yarn/lib

In addition, copy the application jar that you run (see Building the code) to $HADOOP_HOME/share/hadoop/mapreduce/lib.
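
The jar copy can be sketched as a small shell helper that checks both paths before copying. The function name install_app_jar is made up for illustration and is not part of this project; the second argument is an optional override for HADOOP_HOME.

```shell
# Hypothetical helper: copy an application jar into Hadoop's MapReduce lib folder.
# Usage: install_app_jar <jar> [hadoop_home]
# Example, using the jar produced by the build:
#   install_app_jar target/HitsAlgorithm-1.0-SNAPSHOT-jar-with-dependencies.jar
install_app_jar() {
  jar=$1
  dest="${2:-$HADOOP_HOME}/share/hadoop/mapreduce/lib"

  # Fail early with a clear message instead of a bare cp error.
  [ -f "$jar" ] || { echo "jar not found: $jar" >&2; return 1; }
  [ -d "$dest" ] || { echo "missing directory: $dest" >&2; return 1; }

  cp "$jar" "$dest/"
}
```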

Building the code

Build the code and the target jar with the following command:

mvn clean install assembly:single

This produces HitsAlgorithm-1.0-SNAPSHOT-jar-with-dependencies.jar in the target folder. Use this jar to run the Giraph application.

Running the application

Ensure the Hadoop cluster is up and running; Giraph requires it.
Use the script run-giraph.sh to run the Giraph application. The script takes the following arguments, in order:

Path to the jar file created through the build
The fully qualified class name
Path to the input file (on HDFS)
Path to the output folder (on HDFS)

For example:

./run-giraph.sh target/HitsAlgorithm-1.0-SNAPSHOT-jar-with-dependencies.jar \
    com.pes.giraph.App \
    /usr/input/input_small.txt \
    /usr/output/hits_small

Number of supersteps

The maximum number of supersteps defaults to 50. To change it, edit the value of max.num.steps in the run-giraph.sh script.

Getting the output

The output is written to the output folder specified for the run. Sample output from a run against the small.txt dataset:

A       (Hub,Auth) = (1.0000,0.0000)
B       (Hub,Auth) = (0.8422,0.4564)
C       (Hub,Auth) = (0.0000,0.5801)
D       (Hub,Auth) = (0.3431,0.5801)
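
To see which vertices score highest, output lines in the format above can be reshaped and sorted with standard shell tools. The following is a minimal sketch, not part of this project: the function name rank_by_authority is made up for illustration, and the part-* file pattern in the usage comment assumes the usual Hadoop naming of output files.

```shell
# Hypothetical helper: read lines such as
#   B       (Hub,Auth) = (0.8422,0.4564)
# from stdin and print "vertex hub auth", highest authority score first.
# Ties on the authority score are broken by vertex name.
#
# Possible usage (assumes output files follow the part-* convention):
#   hdfs dfs -cat /usr/output/hits_small/part-* | rank_by_authority
rank_by_authority() {
  sed -E 's/^([^[:space:]]+)[[:space:]]+\(Hub,Auth\) = \(([^,]+),([^)]+)\)$/\1 \2 \3/' |
    LC_ALL=C sort -k3,3nr -k1,1
}
```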