This project implements the HITS (Hyperlink-Induced Topic Search) algorithm on Wikipedia data in three distributed computing environments: Hive, Spark, and SparkSQL. The goal is to analyze the links between articles to determine the most important Wikipedia pages.
The dataset consists of two files:
- Titles: A list of titles of Wikipedia articles (one title per row)
- Links: A list of links in the format "from1: to11 to12 ..."
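As an illustration of the Links format, the following minimal Python sketch parses one line into a source and its targets. It assumes that identifiers are whitespace-separated tokens and keeps them as strings; the function name `parse_links_line` and the sample lines are hypothetical and not taken from the project.

```python
def parse_links_line(line):
    """Parse one 'from: to1 to2 ...' line into (source, [targets])."""
    # Everything before the first ':' is the source page;
    # the rest is a whitespace-separated list of target pages.
    source, _, targets = line.partition(":")
    return source.strip(), targets.split()


# Hypothetical sample lines in the documented format.
sample_lines = ["1: 2 3", "2: 3", "3: 1"]

# Adjacency list: source page -> list of pages it links to.
out_links = dict(parse_links_line(line) for line in sample_lines)
```

A page with no outgoing links (for example a line such as `4:`) would be parsed into an empty target list under the same assumption.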
Two scores were computed for each page: an authority score and a hub score, both initialized to 1. A good hub is a page that points to many other pages, and a good authority is a page that is linked to by many different hubs.
After initialization, the process consists of:
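- Updating the authority scores: the authority score of a page becomes the sum of the hub scores of the pages that link to it
- Updating the hub scores: the hub score of a page becomes the sum of the authority scores of the pages it links to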
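- Normalizing the authority scores: each authority score is divided by a normalization constant computed from all authority scores
- Normalizing the hub scores: each hub score is divided by a normalization constant computed from all hub scores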
- Iterating: repeating the steps above
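
The steps above can be sketched on a single machine in plain Python, independently of Hive, Spark, or SparkSQL. This is a minimal sketch under stated assumptions, not the project's implementation: the normalization constant is taken to be the Euclidean (L2) norm, as in Kleinberg's original formulation of HITS, the hub update uses the freshly updated authority scores, and the function name `hits`, the `iterations` parameter, and the toy graph are hypothetical.

```python
import math


def hits(out_links, iterations=20):
    """Run HITS on an adjacency list {page: [pages it links to]}.

    Returns (authority, hub) dictionaries keyed by page.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    # Both scores are initialized to 1.
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum of hub scores of pages linking to the page.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in out_links.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # Hub update: sum of authority scores of the pages it links to.
        # Assumption: the just-updated authority scores are used here.
        new_hub = {
            p: sum(new_auth[dst] for dst in out_links.get(p, []))
            for p in pages
        }

        # Normalization. Assumption: L2 norm; fall back to 1.0 so that
        # an all-zero score vector does not cause a division by zero.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}

    return auth, hub


# Hypothetical toy graph: A and B both link to C.
toy_graph = {"A": ["C"], "B": ["C"], "C": []}
authority, hub_scores = hits(toy_graph, iterations=5)
```

On this toy graph, C ends up as the only page with a nonzero authority score, while A and B share the hub weight equally. In a distributed setting the two update loops would typically become joins and aggregations over the link table, but that mapping is not shown here.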
- Maria Daltayanni - Distributed Computing - Course notes