“Stack Exchange is a network of question and answer websites on diverse topics in many different fields, each site covering a specific topic, where questions, answers, and users are subject to a reputation award process. The sites are modeled after Stack Overflow, a forum for computer programming questions that was the original site in this network.”
Stack Exchange Data Explorer (SEDE): https://data.stackexchange.com/stackoverflow/query/new
[Task 1] Data Acquisition
- We are required to acquire the top 200,000 posts by view count from the Stack Exchange site. The problem is that we can only download 50,000 records at a time, so the data has to be fetched in at least four batches.
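One way to work around the 50,000-row limit is to page through the result set. The sketch below only generates the query text to paste into SEDE; it does not call any API. It assumes the SEDE `Posts` table with its `ViewCount` and `Id` columns, and it assumes the SQL Server `OFFSET ... FETCH` clause is available on SEDE. The function name `build_paged_queries` is made up for this illustration.

```python
def build_paged_queries(total_rows=200_000, page_size=50_000):
    """Return one query string per page, ordered by ViewCount descending."""
    queries = []
    for offset in range(0, total_rows, page_size):
        # The last page may be shorter if total_rows is not a multiple of page_size.
        rows = min(page_size, total_rows - offset)
        # Id acts as a tie-breaker so rows with equal ViewCount
        # keep a stable order from one page to the next.
        queries.append(
            "SELECT * FROM Posts "
            "ORDER BY ViewCount DESC, Id "
            f"OFFSET {offset} ROWS FETCH NEXT {rows} ROWS ONLY;"
        )
    return queries


queries = build_paged_queries()
for query in queries:
    print(query)
```

Each of the four queries can be run separately and its result downloaded as CSV. Without the tie-breaker on `Id`, posts with the same view count could appear on two pages or on none.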
[Task 2] Data Cleaning with Pig
- Extract, transform and load the data as applicable
[Task 3] Querying with Hive
- The top 10 posts by score?
- The top 10 users by score?
- The number of distinct users who used the word ‘hadoop’ in one of their posts?
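The three questions above can be prototyped on a handful of rows before running them in Hive. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in engine; the SQL is kept to constructs that HiveQL should also accept, but that is an assumption to verify. The table name `posts`, its columns, and the sample rows are invented for illustration, and "top users by score" is read here as the sum of each user's post scores.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER, owner_user_id INTEGER, score INTEGER, body TEXT)"
)
# Toy rows, not real Stack Exchange data.
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?)",
    [
        (1, 10, 50, "How do I install Hadoop on Ubuntu?"),
        (2, 11, 70, "Sorting a list in Python"),
        (3, 10, 30, "hadoop streaming with Python"),
        (4, 12, 95, "What is a monad?"),
        (5, 11, 20, "Pig versus Hive on HADOOP"),
    ],
)

# Top 10 posts by score.
TOP_POSTS = "SELECT id, score FROM posts ORDER BY score DESC LIMIT 10"

# Top 10 users by total score over all of their posts (assumed interpretation).
TOP_USERS = (
    "SELECT owner_user_id, SUM(score) AS total_score FROM posts "
    "GROUP BY owner_user_id ORDER BY total_score DESC LIMIT 10"
)

# Distinct users with 'hadoop' anywhere in a post body, ignoring case.
# This is a substring match, so it would also count e.g. 'hadoop2'.
HADOOP_USERS = (
    "SELECT COUNT(DISTINCT owner_user_id) FROM posts "
    "WHERE LOWER(body) LIKE '%hadoop%'"
)

top_posts = conn.execute(TOP_POSTS).fetchall()
top_users = conn.execute(TOP_USERS).fetchall()
hadoop_users = conn.execute(HADOOP_USERS).fetchone()[0]
print(top_posts, top_users, hadoop_users)
```

Lower-casing the body before matching matters because the cleaned data may contain "Hadoop", "hadoop", and "HADOOP".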
[Task 4] Calculate the per-user TF-IDF with Hive
- Find the top 10 terms used by each of the top 10 users by post score
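To make the per-user TF-IDF concrete, here is a small reference sketch in plain Python rather than Hive. It assumes one common formulation: each user's posts are merged into a single document, term frequency is the term count divided by the user's total term count, and inverse document frequency is log(N / df), where N is the number of users and df is the number of users who used the term. The function name, the tokenizer, and the sample data are all made up for illustration.

```python
import math
import re
from collections import Counter


def top_terms_per_user(posts_by_user, k=10):
    """Map each user id to their k highest-scoring (term, tf_idf) pairs."""
    # One document per user: all of that user's posts tokenized together.
    term_counts = {
        user: Counter(re.findall(r"[a-z0-9]+", " ".join(posts).lower()))
        for user, posts in posts_by_user.items()
    }
    n_users = len(term_counts)

    # Document frequency: how many users used each term at least once.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    result = {}
    for user, counts in term_counts.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(n_users / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        result[user] = ranked[:k]
    return result


sample_posts = {
    "alice": ["hadoop cluster setup", "hadoop yarn tuning"],
    "bob": ["python list sorting", "python cluster"],
}
example = top_terms_per_user(sample_posts, k=3)
for user, terms in example.items():
    print(user, [term for term, _ in terms])
```

In this toy data "cluster" is used by both users, so its IDF is log(2/2) = 0 and it never ranks among the top terms. A Hive version would compute the same three quantities (per-user term counts, per-term user counts, and the number of users) with grouped aggregations and joins.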