Sparkify Capstone Project

This is the final project for the Udacity Data Scientist Nanodegree, where we predict churn for a fictional streaming service called Sparkify.

The complete analysis and discussion are available here.

Do you want to churn?

Table of Contents

  1. Requirements
  2. Project Overview
  3. File Descriptions
  4. Running the project
  5. Results

Requirements

  • Python 3
  • The complete list of requirements can be found in requirements.txt

Project Overview

In this project, we try to predict whether a user will churn (cancel the service) given information about their interactions with the service.

In the future, we could also try to predict whether a user will downgrade their subscription (become a free user).
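Both labels can be derived directly from the event log. The sketch below is a minimal, plain-Python illustration (the project itself uses PySpark); it assumes, as a hypothesis about the dataset, that a page value of "Cancellation Confirmation" marks churn and "Submit Downgrade" marks a downgrade:

```python
# Hypothetical sketch: derive churn/downgrade flags from one user's events.
# Assumption: the page values "Cancellation Confirmation" and "Submit Downgrade"
# mark the churn and downgrade events (check the actual dataset's page values).

def label_user(events):
    """Return (churned, downgraded) flags for one user's list of events."""
    pages = {e["page"] for e in events}
    churned = "Cancellation Confirmation" in pages
    downgraded = "Submit Downgrade" in pages
    return churned, downgraded

# Toy example with made-up events:
events = [
    {"userId": "42", "page": "NextSong"},
    {"userId": "42", "page": "Thumbs Up"},
    {"userId": "42", "page": "Cancellation Confirmation"},
]
print(label_user(events))  # (True, False)
```

In Spark, the same idea becomes a groupBy on userId with a max over a flag column, but the labeling logic is identical.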

I've applied four classification algorithms and several techniques to work with the data.

To make viewing easy, I've hosted the notebook HTML files online, so you can see a static version without opening the .ipynb files on GitHub (which is usually slow and hangs for large files).

For a more complete analysis, I recommend checking my article here.

File Descriptions

.
├── results/ # Folder with the static version of the notebooks - Also hosted online (See at the project overview)
├── pyspark.sh # The file to run and config PySpark for local mode
├── visualizations.py # The implementation of some visualizations on Plotly, for a more interactive heatmap (See on the data exploration notebook)
├── jupyter_utils.py # A script to config pandas for a standard view between all the notebooks
├── Data Exploration.ipynb # Notebook with the exploration of the raw data and after the feature engineering
├── Sparkify.ipynb # Notebook with the exploration and feature engineering
├── Results - Spark.ipynb # Notebook with all the visualizations related to the results of training and the GridSearch
├── requirements.txt # The project dependencies

Dataset:

The dataset was provided by Udacity. I've hosted it on my own S3 bucket to make it easier to download and work with across my environments.

  • The full dataset is available here (12 GB).

  • I've created a smaller version without some columns (firstName, lastName, location, userAgent), but with all the events, available here (2 GB).

Raw Dataset features

  • ts: Event timestamp in milliseconds
  • gender: M or F
  • firstName: First name of the user
  • lastName: Last name of the user
  • length: Length of the song
  • level: Level of subscription: free or paid
  • registration: User registration timestamp
  • userId: User id at the service
  • auth: Whether the user is logged in
  • page: Action of the event (next song, thumbs up, thumbs down)
  • sessionId: Id of the session
  • location: Location of the event
  • userAgent: Browser/web agent of the event
  • song: Name of the song
  • artist: Name of the artist
  • method: HTTP method of the event
  • status: HTTP status of the request (200, 404)
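Both ts and registration are millisecond Unix timestamps, so deriving readable dates or a user's tenure is a division by 1000 away. A stdlib-only sketch, using made-up timestamp values:

```python
from datetime import datetime, timezone

def ms_to_datetime(ms):
    """Convert a millisecond Unix timestamp to a UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

# Hypothetical values in the dataset's format:
registration = 1538352000000  # ms -> 2018-10-01 00:00 UTC
ts = 1539561600000            # ms -> 2018-10-15 00:00 UTC

tenure_days = (ms_to_datetime(ts) - ms_to_datetime(registration)).days
print(ms_to_datetime(ts).date(), tenure_days)  # 2018-10-15 14
```

Tenure at the time of an event (days since registration) is a natural feature for the churn model.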

Running the project

To run the training code, run the pyspark.sh file, then open the Sparkify notebook. The next step is to decide which version of the data fits your needs: there are three variations of the load step, the medium dataset, the entire dataset as JSON, or the entire dataset as CSV (my version of it, as mentioned in the files section).
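The three load variants differ only in path and format, so a small helper keeps the switch in one place. The paths below are hypothetical placeholders (substitute your own copies), and the actual Spark read is shown only as a comment since it needs a running session:

```python
# Hypothetical helper to pick one of the three dataset variants mentioned above.
# The paths are placeholders, not the project's real S3 locations.
DATASETS = {
    "medium": ("data/medium-sparkify-event-data.json", "json"),
    "full-json": ("data/sparkify-event-data.json", "json"),
    "full-csv": ("data/sparkify-event-data.csv", "csv"),
}

def dataset_for(variant):
    """Return (path, format) for the chosen dataset variant."""
    path, fmt = DATASETS[variant]
    return path, fmt

path, fmt = dataset_for("full-csv")
print(path, fmt)
# With a live session: df = spark.read.format(fmt).load(path)
```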

To run locally, I recommend downloading the dataset first, so Spark won't need to download it each time.

If you want to run the visualizations, don't forget to install the requirements, especially the Plotly library.

Results

Best results for each model



Heatmap - Absolute value


Heatmap - Feature vs. music listening time