Hierarchical Clustering and Visualization of Wikipedia

(Final Project for Big Data Class)

Brought to you by The Sloth Collective: Peter Li, Justin Mao-Jones, Maya Rotmensch.

Introduction

Jimmy Wales and Larry Sanger founded Wikipedia in January 2001 with one goal: to offer a free-access, free-content Internet encyclopedia that is editable by the same people who access it. From this simple idea, Wikipedia has grown to more than 35 million articles in nearly 250 different languages. With 500 million unique visitors each month and 18 billion total page views, Wikipedia ranks as the Internet's sixth most popular site.

Like all encyclopedias, Wikipedia is “a reference work that contains information on all branches of knowledge”. Unlike traditional paper versions, however, Wikipedia exists wholly in the digital realm. This difference in medium makes Wikipedia an entirely different type of reference. Since the format is digital and editable by almost everyone, articles can be added and updated with much greater speed, and new findings can be added as soon as they are discovered. More importantly, Wikipedia's digital format allows for hyperlinks that connect one article to another. As a result, encyclopedia entries are no longer separate and unconnected. Instead, the links between pages create a rich and complicated graph structure.

For our project, we examine the Wikipedia link graph and its community structure. We propose a hierarchical community model built using a stochastic block model. We also develop an interactive visualization that lets users explore the Wikipedia graph.

Final Report

Project Makeup

The project is made up of three main parts:

  1. Data Processing with MapReduce
  2. Modeling of the link graph
  3. A D3 visualization of the processed graph
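
As a rough illustration of what the first step produces, the sketch below extracts a link graph from article text in a map/reduce style. It is not the project's actual MapReduce job: the function names (`map_links`, `reduce_links`, `build_graph`) and the in-memory `articles` dictionary are hypothetical, and the code assumes links appear in standard wikitext form (`[[Target]]` or `[[Target|label]]`). As a simplification, any target containing a colon (for example `File:` or `Category:` links) is skipped.

```python
import re
from collections import defaultdict

# Matches [[Target]], [[Target#Section]] and [[Target|label]];
# only the target title is captured.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def map_links(title, wikitext):
    """Map step: emit one (source, target) pair per internal link."""
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        # Simplification: drop namespaced links such as File: or Category:.
        if target and ":" not in target:
            yield title, target


def reduce_links(pairs):
    """Reduce step: group targets by source into a sorted adjacency list."""
    adjacency = defaultdict(set)
    for source, target in pairs:
        adjacency[source].add(target)
    return {source: sorted(targets) for source, targets in adjacency.items()}


def build_graph(articles):
    """Run the map step over every article, then reduce the emitted pairs."""
    pairs = (
        pair
        for title, text in articles.items()
        for pair in map_links(title, text)
    )
    return reduce_links(pairs)


if __name__ == "__main__":
    # Hypothetical toy input: article title -> raw wikitext.
    articles = {
        "A": "See [[B]] and [[C|the C page]] and [[B#History]].",
        "B": "Back to [[A]]. [[File:pic.png|thumb|A picture]]",
    }
    print(build_graph(articles))
```

In a real MapReduce job the map and reduce functions would run on separate workers over a Wikipedia dump rather than over a dictionary in memory; the adjacency list they produce is the kind of input the modeling step would consume.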

To run the project and reproduce our results, visit each folder in the order numbered above and follow the instructions in its README.

If you want to jump straight to the cool visualization, please follow the link directly to the interactive demo:

http://slothbigdatademo-env-vvjkvbvppp.elasticbeanstalk.com/