This Project developed for the university course of "Cloud Computing" reproduce the PageRank algorithm, that was the first algorithm used in Google to rank web pages in their search engine results. The project reproduced PageRank using both Spark and Hadoop, two of the most used analytics engine for large-scale data processing
Firstly be sure that 'wiki-micro.txt' is present in HDFS with:
hadoop fs -ls
Then, in the folder where the 'PageRank_Spark.py' is present execute:
$SPARK_HOME/bin/spark-submit PageRank_Spark.py <input> <output> <iteration> <alpha>
With input 'wiki-micro.txt', output will be the folder in which the program will write the results divided into different file 'part-0000x' inside the output folder. To check result:
hadoop fs -cat output/part-0000x
In the PageRank folder execute
mvn clean package
Then:
hadoop jar target/PageRank-1.0-SNAPSHOT.jar it.unipi.hadoop.PageRank <input> <output directory> <# of iterations> <# of reducers> <random jump probability>
You will retrieve the output in the ouput directory.