- Project 3 of Big Data subject
- Implemented by:
- Theodoros Mandilaras
- cs2.190018
- MSc DI/EKPA 2019-2020
Implementation and evaluation of two different PageRank algorithms on the Google-web dataset.
p = M⋅p
where p is a vector of size Nx1 (where N is the number of nodes in the graph). p[i,1] represents the PageRank of page i. Initially every page’s PageRank is equal to 1. We repeat the above computation, i.e., multiplying M with p and storing in p for a pre-specified number of steps.
p = α⋅M⋅p + (1-α)⋅In
where α is a numeric value, N is the number of pages (nodes) in the Web graph, and In is a unary vector (i.e., all values are set to 1) of size Nx1.
The program can be executed in three different modes:
In the default mode, the program gets from the arguments the desired method, the path for the graph and the number of
the iterations (if improved method has been selected, it also gets the a
numeric value). Then, performs the selected
algorithm for the desired number of iterations.
Output: In the end, it stores the top and bottom 20 in a txt file in the results directory, and also it makes the
histogram for the current ranks.
The evaluate mode implemented to execute and store the outputs(top20/bottom20 and the plots) for all the requested iterations (10, 50, 100, 200) with a single run.
In this mode, the selected algorithm iterates until it converge. That means that, it iterates until the sum of the absolute differences between last rank vector and the new one is less than a small value.
The program creates a directory called “results” in the working directory. Inside it, also creates three folders:
- results: In this folder the top and bottom 20 csv are stored. Files format:
<method>_<a value>_<iterations>_<bottom20 or top20>.txt
- plots: In this folder the histogram plots are stored. Files format:
<method>_<a value>_<iterations>.png
- ranks: That was a folder for development purpose. It is no more used.
The program makes use of the Argparse module to parse the arguments. The required arguments are:
-p
or--path
: the path for the .txt file with the graph-m
or--method
: to set the selected method. Options are 'simple', 'improved'
Optional arguments:
-i
or--iterations
: to set how many iterations the program to perform.-a
or--a
: The numeric value used for the improved algorithm (it is required for this algorithm).-c
or--converge
: To enable the mode with the convergence.-e
or--eval
: To execute the evaluate mode..
Simple execution for the default mode with the simple algorithm for 100 iterations:
python page_ranking.py -m simple -p web-Google.txt -i 100
Similarly for the improved algorithm
python page_ranking.py -m improved -p web-Google.txt -a 0.2 -i 100
Evaluate mode:
python page_ranking.py -m improved -p web-Google.txt -a=0.85 -e
Convergence mode. If iterations mentioned, they will be ignored.
python page_ranking.py -m simple -p web-Google.txt -c