DCU Cloud project

1. Large Dataset Analysis

“Stack Exchange is a network of question and answer websites on diverse topics in many different fields, each site covering a specific topic, where questions, answers, and users are subject to a reputation award process. The sites are modeled after Stack Overflow, a forum for computer programming questions that was the original site in this network.”

Stack Exchange Data Explorer (SEDE) https://data.stackexchange.com/stackoverflow/query/new

[Task 1] Data Acquisition:

  • We are required to acquire the top 200,000 posts by view count from the Stack Exchange site. The problem is that we can only download 50,000 records at a time, so the data has to be fetched in several batches.
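One way to work around the 50,000-record limit is to split the download into four paged queries. The sketch below only builds the query strings; it assumes the SEDE `Posts` table with its `ViewCount` and `Id` columns, and it assumes that SEDE accepts the T-SQL `OFFSET ... FETCH` clause. The function name `build_batch_queries` is made up for illustration. `Id` is added to the ordering so that posts with equal view counts always land in the same batch.

```python
def build_batch_queries(total=200_000, batch_size=50_000):
    """Return one T-SQL query string per batch of at most `batch_size` rows."""
    queries = []
    for offset in range(0, total, batch_size):
        # The last batch may be smaller than batch_size.
        rows = min(batch_size, total - offset)
        queries.append(
            "SELECT * FROM Posts "
            "ORDER BY ViewCount DESC, Id ASC "  # Id breaks ties deterministically
            f"OFFSET {offset} ROWS FETCH NEXT {rows} ROWS ONLY;"
        )
    return queries


if __name__ == "__main__":
    # Print the queries so each one can be pasted into SEDE and saved as CSV.
    for query in build_batch_queries():
        print(query)
```

If paging by offset is not accepted, a similar effect could be achieved by filtering each batch on a `ViewCount` range taken from the last row of the previous batch.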

[Task 2] Data Cleaning with PIG

  • Extract, transform and load the data as applicable
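The project performs this step in Pig. As a rough illustration of what the cleaning might involve, here is a Python stand-in (not Pig code) that strips HTML tags, decodes entities, and collapses line breaks in the text columns so that each post fits on one line. The column names `Title` and `Body` follow the SEDE `Posts` table; the helper names `clean_text` and `clean_rows` are hypothetical.

```python
import csv
import html
import io
import re

TAG_RE = re.compile(r"<[^>]+>")   # anything that looks like an HTML tag
SPACE_RE = re.compile(r"\s+")     # runs of whitespace, including newlines


def clean_text(raw):
    """Strip tags, decode HTML entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", raw or "")
    text = html.unescape(text)
    return SPACE_RE.sub(" ", text).strip()


def clean_rows(rows, text_fields=("Title", "Body")):
    """Yield a cleaned copy of each row (a dict keyed by column name)."""
    for row in rows:
        cleaned = dict(row)
        for field in text_fields:
            if field in cleaned:
                cleaned[field] = clean_text(cleaned[field])
        yield cleaned


if __name__ == "__main__":
    sample = 'Id,Title,Body\n1,"A &amp; B","<p>line one\nline two</p>"\n'
    for row in clean_rows(csv.DictReader(io.StringIO(sample))):
        print(row)
```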

[Task 3] Querying with HIVE

    1. The top 10 posts by score?
    2. The top 10 users by score?
    3. The number of distinct users who used the word ‘hadoop’ in one of their posts?
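The logic behind the three queries can be sketched in plain Python before it is written in Hive. This is only an illustration over in-memory records, not the Hive solution. It assumes each post is a dict with the SEDE column names `Id`, `OwnerUserId`, `Score`, `Title` and `Body`, interprets a user's score as the sum of that user's post scores, and treats "used the word" as a case-insensitive substring match; all three are assumptions.

```python
from collections import defaultdict


def top_posts_by_score(posts, n=10):
    """Return the n posts with the highest Score."""
    return sorted(posts, key=lambda p: p["Score"], reverse=True)[:n]


def top_users_by_score(posts, n=10):
    """Return (user id, total score) pairs for the n highest-scoring users."""
    totals = defaultdict(int)
    for post in posts:
        totals[post["OwnerUserId"]] += post["Score"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


def count_users_mentioning(posts, word="hadoop"):
    """Count distinct users with the word in the title or body of any post."""
    needle = word.lower()
    users = set()
    for post in posts:
        text = f'{post.get("Title") or ""} {post.get("Body") or ""}'.lower()
        if needle in text:
            users.add(post["OwnerUserId"])
    return len(users)
```

In Hive, the same ideas would roughly correspond to an `ORDER BY ... LIMIT 10`, a `GROUP BY` with `SUM`, and a `COUNT(DISTINCT ...)` with a `LIKE` filter.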

[Task 4] Calculate the per-user TF-IDF with HIVE

  • Find the top 10 terms used by each of the top 10 users by post score
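To make the calculation concrete, here is a small Python sketch of per-user TF-IDF; it is not the Hive implementation. It assumes that all posts by one user are concatenated into a single document, that TF is the term count divided by the user's total term count, and that IDF is the natural log of (number of users / number of users who use the term). Other TF-IDF variants exist, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def top_terms_per_user(docs_by_user, n=10):
    """Map each user id to its n highest TF-IDF (term, score) pairs.

    docs_by_user maps a user id to that user's combined post text.
    """
    term_counts = {user: Counter(tokenize(text)) for user, text in docs_by_user.items()}
    num_docs = len(term_counts)

    # Document frequency: in how many users' documents each term appears.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    result = {}
    for user, counts in term_counts.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(num_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically for stable output.
        result[user] = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    return result
```

With this definition, a term used by every user gets an IDF of zero, so only terms that distinguish a user from the others rank highly.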

2. ETL with AWS Data Warehousing