/team11-DE-project

Team project repository for Data Engineering I

Presidential candidate analysis

This repository contains the code for a project in the course Data Engineering I. The project set out to perform a simple analysis of how often presidential candidates are mentioned on the online forum Reddit. Much analysis has already been done on this type of data set, so the main purpose of the project was to perform scalability experiments by manually increasing the capacity of the Spark cluster and the size of the input data held on HDFS.
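
As a rough illustration of the kind of mention counting described above, here is a minimal sketch in plain Python. It is not the project's actual code: the candidate list, the `count_mentions` helper, and the assumption that each input line is a JSON record with a `body` field are all hypothetical, chosen only to keep the example self-contained. In the project itself the equivalent logic would run on the Spark cluster against data stored on HDFS.

```python
import json
import re
from collections import Counter

# Hypothetical candidate list; the names actually analysed are not given in this README.
CANDIDATES = ["Biden", "Trump", "Sanders"]


def count_mentions(lines, candidates=CANDIDATES):
    """Count how many comments mention each candidate at least once.

    Assumes (hypothetically) that each line is a JSON object whose
    comment text is stored under the key "body".
    """
    # Whole-word, case-insensitive pattern per candidate name.
    patterns = {
        name: re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        for name in candidates
    }
    counts = Counter({name: 0 for name in candidates})
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed records
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        body = str(record.get("body") or "")
        for name, pattern in patterns.items():
            if pattern.search(body):
                counts[name] += 1
    return dict(counts)


sample = [
    '{"body": "Trump and Biden debated last night."}',
    '{"body": "I think biden won."}',
    "not json",
]
print(count_mentions(sample))
```

Counting each comment at most once per candidate is a design choice made for this sketch; counting every occurrence within a comment would be an equally plausible definition of a "mention".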