- Isha Potnis (ipotnis1@umbc.edu)
- Akanksha Bhosale (akanksh1@umbc.edu)
- Shantanu Sengupta (ssen1@umbc.edu)
In the PageRank algorithm, the rank of a page depends only on the inbound links associated with that page. We evaluate and analyze the impact of four important factors on ranking a web page: page popularity, page history, user preferences, and the content of the document. These factors contribute differently to the ranking, since some are more dominant than others in returning relevant pages. In this research, we use a user study to determine which factor is most dominant in returning the most relevant results, and by how much. We found that content carries the most weight in returning high-quality responses, compared to the popularity, history, and domain of the web page.
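To make the idea of weighting factors concrete, here is a minimal Python sketch that combines per-factor scores into one ranking score. It assumes a simple normalized weighted sum, which this README does not confirm as the project's actual formula. The function name `combined_score`, the factor keys, the weights, and the page scores are all hypothetical values chosen for illustration, not results from the user study.

```python
def combined_score(scores, weights):
    """Weighted sum of per-factor scores, with weights normalized to sum to 1.

    Assumption: a linear combination; the project may combine factors differently.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * scores[name] for name in weights) / total


# Hypothetical weights and per-factor scores in [0, 1]; illustrative only.
weights = {"content": 0.4, "popularity": 0.3, "history": 0.2, "domain": 0.1}
pages = {
    "a.html": {"content": 0.9, "popularity": 0.2, "history": 0.1, "domain": 0.5},
    "b.html": {"content": 0.3, "popularity": 0.9, "history": 0.6, "domain": 0.5},
}

# Sort pages from highest to lowest combined score.
ranked = sorted(pages, key=lambda p: combined_score(pages[p], weights), reverse=True)
print(ranked)
```

Raising the `content` weight relative to the others shifts the ordering toward pages with stronger content scores, which is the kind of effect the user study is meant to measure.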
Command for cloning the repository with git:
git clone https://github.com/shaansen/PageRank.git
If you download the repository as a ZIP archive instead, unzip it.
Commands for generating history_log:
python ./generatelog.py
Commands for generating domain_log:
python ./mydomainlog.py
Place both generated files in the input folder.
This completes the initial setup required for the project.
javac -cp "./jars/*" *.java
java -cp ".:./jars/*" Main -f [F parameter] -docs [input directory]
java -cp ".:./jars/*" Main -f 0.7 -docs ./input
- jsoup: Java HTML Parser - https://jsoup.org/
- WebCopy - https://www.cyotek.com/cyotek-webcopy