Implementation of the HITS algorithm with Wikipedia data in Hive, Spark and SparkSQL

Implementation of HITS algorithm in Hive, Spark and SparkSQL

Description

This project implements the HITS algorithm on Wikipedia data in three different distributed computing environments: Hive, Spark and SparkSQL. The goal is to analyze the links between articles to determine the most important Wikipedia pages.

Data

The dataset consists of two files:

  • Titles: A list of titles of Wikipedia articles (one title per row)
  • Links: A list of links in the format "from1: to11 to12 ..."
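As an illustration of the links format, the following is a minimal Python sketch of how one line could be split into a source page and its targets. The function name `parse_links_line` is hypothetical and not taken from the project; the sketch assumes the targets are separated by whitespace and keeps identifiers as strings, since the identifier type is not stated above.

```python
def parse_links_line(line):
    """Split one 'from: to1 to2 ...' line into (source, [targets]).

    Assumption: targets are whitespace-separated. Identifiers are kept
    as strings because their type is not specified in the README.
    """
    # Everything before the first colon is the source page.
    source, _, targets = line.partition(":")
    # split() with no argument drops repeated spaces and returns []
    # when a page has no outgoing links.
    return source.strip(), targets.split()
```

For example, `parse_links_line("1: 2 3 4")` returns `("1", ["2", "3", "4"])`.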

Methods

Two scores were computed for each page: the authority score and the hub score, both initialized to 1. A good hub is a page that points to many other pages, and a good authority is a page that is linked to by many different hubs.

After initialization, the process consists of:

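  • Updating the authority scores: the authority score of a page becomes the sum of the hub scores of the pages that link to it

  • Updating the hub scores: the hub score of a page becomes the sum of the authority scores of the pages it links to

  • Normalizing the authority scores: each score is divided by a normalization constant computed over the authority scores of all pages

  • Normalizing the hub scores: each score is divided by a normalization constant computed over the hub scores of all pages

  • Iterating: the update and normalization steps are repeated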
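The process can be sketched on a single machine in plain Python, which may help when reading the distributed versions. This is only an illustration under stated assumptions, not the project's Hive, Spark or SparkSQL code: the function name `hits`, the `links` dictionary layout, the fixed `iterations` count, and the choice of the L2 norm (square root of the sum of squared scores) as the normalization constant are all assumptions.

```python
import math
from collections import defaultdict


def hits(links, iterations=10):
    """Run HITS on `links`, a dict mapping a page to the pages it links to.

    Returns (authority_scores, hub_scores) as two dicts.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Reverse adjacency: for each page, the pages that link to it.
    incoming = defaultdict(list)
    for source, targets in links.items():
        for target in targets:
            incoming[target].append(source)

    # Both scores start at 1 for every page.
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum of hub scores of pages linking to p.
        auth = {p: sum(hub[q] for q in incoming[p]) for p in pages}
        # Hub update: sum of the updated authority scores of pages p links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}

        # Normalization (assumed L2 norm); `or 1.0` avoids division by
        # zero when the graph has no links at all.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / auth_norm for p, v in auth.items()}
        hub = {p: v / hub_norm for p, v in hub.items()}

    return auth, hub
```

On a toy graph such as `{"A": ["C"], "B": ["C"], "C": []}`, page C ends up as the only authority and pages A and B share the hub score equally. A different normalization constant (for example the sum or the maximum of the scores) would change the scale of the scores but not the ranking of the pages.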

References

  • Maria Daltayanni - Distributed Computing - Course notes