Sparkify-Project

Installation
Motivation
Files
Results
Licensing

Installation

There are two ways to run the attached Project scripts.

Locally

Spark must be installed on your machine
Python and the PySpark API must be installed as well
The environment variables required by Spark must be set correctly
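
A quick way to sanity-check a local setup is sketched below. It is only an illustration: the variable names JAVA_HOME and SPARK_HOME are an assumption about a typical Spark installation (this README does not name the variables), and a pip-installed PySpark may work without SPARK_HOME.

```python
import importlib.util
import os

# Assumption: these are the variables a typical local Spark setup relies on;
# the README itself does not name them.
EXPECTED_ENV_VARS = ("JAVA_HOME", "SPARK_HOME")


def check_local_setup(env=None):
    """Return a list of readable problems; an empty list means none were found."""
    env = os.environ if env is None else env
    problems = []
    for name in EXPECTED_ENV_VARS:
        if not env.get(name):
            problems.append(f"environment variable {name} is not set")
    # find_spec returns None when the package cannot be located.
    if importlib.util.find_spec("pyspark") is None:
        problems.append("the pyspark package is not installed")
    return problems


if __name__ == "__main__":
    for problem in check_local_setup():
        print("WARNING:", problem)
```

Running the script prints one warning per detected problem and nothing if the checks pass.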

In the Cloud (easier way)

You should have an Amazon AWS account
(All other requirements are handled by AWS)

Motivation

Sparkify is an imaginary music streaming company with thousands of users who generate vast amounts of data
The data is difficult to analyze on a single machine because of its volume
Therefore, a Big Data tool such as Spark is needed to analyze it
The end goal is to predict customer Churn.

The Project code is inside the .ipynb Notebook
The Notebook consists of the following parts:

Importing PySpark modules; Installation of some Python modules
PySpark session creation
Data Import
Data Cleaning
Data Aggregation and Preparation
Prediction of customer Churn with Classification Algorithms
Evaluation of the results based on common metrics
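
The parts listed above could look roughly like the following PySpark sketch. It is not the Notebook's code: the function names, the userId column, the cleaning rule, the "churn" label column and the 80/20 split are all assumptions made for illustration. The PySpark imports sit inside the functions so the file can be loaded on a machine without Spark.

```python
def create_session(app_name="Sparkify"):
    """Create (or reuse) a Spark session."""
    from pyspark.sql import SparkSession  # needs a working Spark installation

    return SparkSession.builder.appName(app_name).getOrCreate()


def clean_events(events):
    """Drop events that cannot be tied to a user.

    Assumption: the event log has a string column named "userId".
    """
    from pyspark.sql import functions as F

    return events.dropna(subset=["userId"]).filter(F.col("userId") != "")


def train_and_evaluate(user_df, feature_cols, label_col="churn", seed=42):
    """Fit a gradient-boosted tree classifier and return F1 on held-out users.

    Assumption: user_df has one row per user, numeric feature columns and a
    numeric 0/1 label column.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import GBTClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import VectorAssembler

    train, test = user_df.randomSplit([0.8, 0.2], seed=seed)
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    gbt = GBTClassifier(featuresCol="features", labelCol=label_col, seed=seed)
    model = Pipeline(stages=[assembler, gbt]).fit(train)

    evaluator = MulticlassClassificationEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="f1"
    )
    return evaluator.evaluate(model.transform(test))
```

The aggregation step that turns the cleaned event log into one row per user is left out here because it depends on the features chosen in the Notebook.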

Files

The following files are attached to this repository:

Sparkify_Big.ipynb - Jupyter Notebook with the complete analysis of the full dataset (12GB)
Should be run on Amazon AWS
Sparkify_Big.html - the HTML page of the Sparkify_Big.ipynb file
Sparkify_Small.ipynb - Jupyter Notebook with the analysis of the sample dataset (128MB)
Sparkify_Small.html - the HTML page of the Sparkify_Small.ipynb file
mini_sparkify_event_data.zip - sample data set

The Full Data set is stored on the S3 server:

"s3n://udacity-dsnd/sparkify/sparkify_event_data.json"

The Sample Data set is also available on the S3 server:

"s3n://udacity-dsnd/sparkify/mini_sparkify_event_data.json"

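As a minimal sketch, either data set could be read into a Spark DataFrame as shown below. The two paths are the ones given above; the helper names dataset_path and load_events are hypothetical, and `spark` is assumed to be an already created SparkSession.

```python
FULL_DATA = "s3n://udacity-dsnd/sparkify/sparkify_event_data.json"
SAMPLE_DATA = "s3n://udacity-dsnd/sparkify/mini_sparkify_event_data.json"


def dataset_path(use_full_dataset=False):
    """Pick the S3 path of the full (12GB) or the sample (128MB) data set."""
    return FULL_DATA if use_full_dataset else SAMPLE_DATA


def load_events(spark, use_full_dataset=False):
    """Read the JSON event log; `spark` is an existing SparkSession."""
    return spark.read.json(dataset_path(use_full_dataset))
```

Whether a local installation can read these S3 paths depends on how its S3 connector and credentials are configured, so the sample file attached to this repository may be the simpler starting point.
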
Results

sparkify_event_data is a 12GB data set with 25 million rows. It contains data on about 25K users.

The data contains information about:

pages that the user visited
all timestamps
time spent on each page
user gender
user location
etc.

The data also indicates whether the user has cancelled the music service (Churn)
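
To illustrate how a per-user Churn label can be derived from an event log, here is a small sketch in plain Python on a handful of made-up records, so it runs without a cluster. The field names userId and page and the page value "Cancellation Confirmation" are assumptions about the data, not taken from this README. The same grouping can be expressed with a PySpark groupBy.

```python
from collections import defaultdict

# Assumption: a cancellation appears as an event whose "page" has this value.
CANCEL_PAGE = "Cancellation Confirmation"


def label_churn(events):
    """Map each userId to 1 if the user has a cancellation event, else 0."""
    labels = defaultdict(int)
    for event in events:
        user = event["userId"]
        labels[user] = max(labels[user], int(event["page"] == CANCEL_PAGE))
    return dict(labels)


# Hypothetical records, only for demonstration.
sample_events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Cancellation Confirmation"},
    {"userId": "2", "page": "NextSong"},
    {"userId": "2", "page": "Home"},
]
print(label_churn(sample_events))  # {'1': 1, '2': 0}
```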

The data was used to engineer 14 features, which served as input to classification models that predict Churn

Three models were used on the full dataset (12GB). The best performing model was the Gradient Boosted Tree Classifier, which yielded an F1 score of 0.81
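
For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it for a single class from confusion counts; the counts in the example are made up and are not the project's results. This README does not state whether 0.81 refers to the Churn class alone or to an average over both classes.

```python
def f1_score(tp, fp, fn):
    """F1 for one class from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 churners found, 2 false alarms, 2 churners missed.
print(round(f1_score(tp=8, fp=2, fn=2), 2))  # 0.8
```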

A detailed step-by-step analysis with results is available here:
https://towardsdatascience.com/machine-learning-with-pyspark-and-amazon-emr-3149dbc847ae

Licensing

The dataset was provided by Udacity https://www.udacity.com/
For any questions or concerns regarding the dataset, please contact Udacity