PageRank

A simplified implementation of the PageRank algorithm.

From Wikipedia:

PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

Two methods have been implemented here to calculate the PageRank of web pages, given a corpus of web pages:

  1. Sampling
  2. Iteration

Sampling

We consider the behavior of a hypothetical surfer on the internet who clicks on links at random. Consider the corpus of web pages below:

{
    "Page1": {"Page2"},
    "Page2": {"Page1", "Page3"},
    "Page3": {"Page2", "Page4"},
    "Page4": {"Page2"}
}

This dictionary maps a page to a set of all pages linked to by that page.

This method imagines a surfer who starts on a web page at random and then randomly chooses links to follow. If the surfer is on Page2, for example, they would randomly choose between Page1 and Page3 to visit next (duplicate links on the same page are treated as a single link, and links from a page to itself are ignored as well). If they chose Page3, the surfer would then randomly choose between Page2 and Page4 to visit next.

Hence, a page’s PageRank can be described as the probability that a random surfer is on that page at any given time.

One way to interpret this model is as a Markov Chain, where each page represents a state, and each page has a transition model that chooses among its links at random. At each time step, the state switches to one of the pages linked to by the current state.

By sampling states randomly from the Markov Chain, we get an estimate for each page’s PageRank. We start by choosing a page at random, then keep following links at random, keeping track of how many times we’ve visited each page. After we gather all of our samples, the proportion of the time we were on each page gives an estimate for that page’s rank.

To ensure the random surfer can eventually reach every page in the corpus, a damping factor d is used. With probability d (usually around 0.85), the random surfer chooses one of the links on the current page at random. And with probability 1 - d, the random surfer chooses one out of all of the pages in the corpus at random (including the one they are currently on).
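
The behavior described above can be captured in a small function. Below is a minimal sketch assuming the corpus dictionary format shown earlier; the function name transition_model and the treatment of pages with no outgoing links (choosing among all pages) are illustrative assumptions, not necessarily what pagerank.py does.

def transition_model(corpus, page, damping_factor):
    """Probability distribution over which page the surfer visits next."""
    n = len(corpus)
    # With probability 1 - d, the surfer picks any page in the corpus.
    distribution = {p: (1 - damping_factor) / n for p in corpus}
    links = corpus[page]
    if links:
        # With probability d, the surfer picks one of the current
        # page's links, each with equal probability.
        for link in links:
            distribution[link] += damping_factor / len(links)
    else:
        # Assumption for illustration: a page with no outgoing links
        # is treated as linking to every page in the corpus.
        for p in corpus:
            distribution[p] += damping_factor / n
    return distribution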

Hence, the random surfer starts by choosing a page at random. Then, for each additional sample generated, they choose a link from the current page at random with probability d, and choose any page at random with probability 1 - d. We keep track of how many times each page has shown up as a sample, and the proportion of samples that landed on a given page is its estimated PageRank.
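
Continuing the sketch, the sampling procedure might look like the following (again an illustration rather than the repository’s exact code). It reuses the transition_model sketch above and draws each successive page with Python’s random.choices.

import random

def sample_pagerank(corpus, damping_factor, n):
    """Estimate PageRank values from n samples of the random surfer."""
    counts = {p: 0 for p in corpus}
    # The first sample is a page chosen uniformly at random.
    page = random.choice(list(corpus))
    counts[page] += 1
    for _ in range(n - 1):
        # Each later sample is drawn from the transition model of the
        # previous sample's page.
        distribution = transition_model(corpus, page, damping_factor)
        pages = list(distribution)
        weights = [distribution[p] for p in pages]
        page = random.choices(pages, weights=weights)[0]
        counts[page] += 1
    # The proportion of samples spent on each page estimates its rank.
    return {p: count / n for p, count in counts.items()}

For example, with corpus bound to the dictionary shown earlier, sample_pagerank(corpus, 0.85, 10000) would estimate the rank of each of the four pages.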

Iteration

A page’s PageRank is also defined using a recursive mathematical expression. Let PR(p) be the PageRank of a given page p: the probability that a random surfer ends up on that page. Now, there are two ways that a random surfer could end up on the page:

  • With probability 1 - d, the surfer chose a page at random and ended up on page p.
  • With probability d, the surfer followed a link from a page i to page p.

For the first condition, the 1 - d probability of choosing a page at random is split evenly among all N possible pages. So, the probability of arriving at page p this way is (1 - d) / N.

For the second condition, we consider each possible page i that links to page p. Each such page i has its own PageRank, PR(i), representing the probability that we are on page i at any given time, and NumLinks(i) denotes the number of links on page i. Since from page i we travel to any of that page’s links with equal probability, we divide PR(i) by NumLinks(i) to get the probability that we were on page i and chose the link to page p.

So the PageRank for a page p is defined as:

PR(p) = (1 - d) / N + d * Σ ( PR(i) / NumLinks(i) )

where:

  • d is the damping factor,
  • N is the total number of pages in the corpus,
  • i ranges over all pages that link to page p,
  • NumLinks(i) is the number of links present on page i.
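
As a concrete illustration, take the sample corpus above with d = 0.85 and N = 4, and suppose every page currently has a PageRank of 0.25. Page1, Page3, and Page4 all link to Page2 (with 1, 2, and 1 links respectively), so the updated value for Page2 would be:

PR(Page2) = (1 - 0.85) / 4 + 0.85 * (0.25/1 + 0.25/2 + 0.25/1)
          = 0.0375 + 0.85 * 0.625
          = 0.56875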

Hence, we start by assuming the PageRank of every page is 1 / N (i.e., we are equally likely to be on any page). Then, we use the above formula to calculate new PageRank values for each page based on the previous PageRank values. We keep repeating this process, calculating a new set of PageRank values from the previous set, until the values converge (i.e., change by no more than a small threshold with each iteration; in this case, 0.001).
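
A minimal sketch of this loop, under the same illustrative assumptions as before (the function name iterate_pagerank and the handling of pages with no links are not taken from the repository):

def iterate_pagerank(corpus, damping_factor, threshold=0.001):
    """Compute PageRank values by repeated application of the formula."""
    n = len(corpus)
    # Start by assuming every page is equally likely: PR = 1 / N.
    ranks = {p: 1 / n for p in corpus}
    while True:
        new_ranks = {}
        for page in corpus:
            # Sum PR(i) / NumLinks(i) over every page i linking here.
            total = 0.0
            for i, links in corpus.items():
                if page in links:
                    total += ranks[i] / len(links)
                elif not links:
                    # Same illustrative assumption as before: a page
                    # with no links counts as linking to every page.
                    total += ranks[i] / n
            new_ranks[page] = (1 - damping_factor) / n + damping_factor * total
        # Converged once no value changes by more than the threshold.
        if all(abs(new_ranks[p] - ranks[p]) <= threshold for p in corpus):
            return new_ranks
        ranks = new_ranks

For a corpus like the sample above, the returned values should sum to approximately 1, which makes a handy sanity check.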

How to run

python pagerank.py <corpus_name>

Example:

python pagerank.py corpus1

Resources:

Wikipedia, Stanford CS54N Handout#24, CS50