The goal of this project is to use PageRank to derive a ranking of words in a document based on their PageRank scores.
The PageRank score of a word serves as an indicator of the importance of the word in the text.
I have used a collection consisting of titles and abstracts of research articles published in the WWW conference.
The dataset is available for download from the repository. Each document is POS tagged.
There are two command line inputs
-
data_path
: Path the files of the data. -
w
: w parameter
python main.py --data_path ./www/www/ --w 6
K = 1 --> MMR : 0.0
K = 2 --> MMR : 0.6421052631578947
K = 3 --> MMR : 0.6255639097744361
K = 4 --> MMR : 0.5319548872180455
K = 5 --> MMR : 0.4689223057644111
K = 6 --> MMR : 0.4561403508771925
K = 7 --> MMR : 0.4510651629072684
K = 8 --> MMR : 0.3342337987826705
K = 9 --> MMR : 0.32255639097744243
K = 10 --> MMR : 0.2771631459601365
Note
: You can also find this dictionary in a JSON file (mmr_map.json) that gets created at every run.