This project was required for us as a part of the "Big Data Analytics" course that we studied at University of Science and Technology, Zewail City (Mini Project 1 - Hadoop Ecosystem).
The Dataset consists of a list of comments taken for the Reddit public API from various subreddits.
-
The dataset is one file with each comment in JSON format in one line
-
Total Comments: 53,851,542
-
Compression Type: bzip2 (5,452,413,560 bytes compressed | 31,648,374,104 bytes uncompressed)
-
Dataset can be found here
-
Do NOT open the dataset file on an editor as it may cause a crash or a freeze to your application or your system
-
You can explore/generate your own samples using these commands
Windows (Powershell):
$ gc <file_name> | select -first <line_number> >> sample.out
Linux:
$ head -n <line_number> <file_name> >> sample.out
The details are mentioned in the Report.pdf document.
1. The top subreddits.
2. The most popular topics in top subreddits.
3. The topics that yield the highest number of upvotes.
4. The top users in top subreddits.
5. The posts reach on top subreddits by day and hour.
A heatmap showing the posts engagement on "r/AskReddit" subreddit:
Each job is placed in a folder where a script specific to this job is available to run it and pass the necessary parameters to the mapper and reducer. If a task requires two jobs to produce the final output, then the script will run the first job and then feed its output to the second job.
To use the compressed the dataset we edited the file HADOOP HOME/etc/hadoop/core- site.xml to add the bzip2 codec. The jobs were running with the compressed dataset used as input without any problems. However, later we discovered that we did not need to modify anything for Hadoop to read from a bzip2 file, and we did give a try by removing the bzip2 codec from the core-site.xml and once again the jobs were able to run with the compressed dataset as input without any problems.
We used code snippets from the below to achieve certain functionalities in our jobs.
Special Thanks to Dr. Elsayed Hemayed, Eng. Ahmed Sameh Elsayed, and Eng. Muhammad AlAref.