Udacity-Capstone-Spark-project

Sparkify-project

### Customer Churn Prediction for a Music App Using PySpark

This repository contains the results of the Data Science Nanodegree Sparkify Capstone Project. Its purpose is to give the reviewers access to the code. More information can be found in a Medium blog post.

Table of Contents

  • Installation
  • Project Motivation
  • Files Description
  • Result
  • Licensing, Authors, and Acknowledgements

Installation

The following applications and Python libraries are required for this project:

  • Python
  • Spark
  • PySpark
  • pandas
  • Matplotlib
  • Seaborn

You will also need a Jupyter Notebook environment, such as the one bundled with Anaconda, to run the notebook.

Project Motivation

This project develops skills in:

  • Loading and manipulating large datasets in Spark using Spark SQL and Spark DataFrames
  • Using the Spark ML machine learning APIs to build and tune models
  • Integrating the skills learned in the Spark course and the Data Scientist Nanodegree program

Files Description

  • Sparkify.ipynb: the main notebook of the project. It demonstrates how PySpark is used to explore the data and build the model.

Result

We split the feature and target variable data into training and test sets, built a pipeline, and trained three machine learning models. Because churned users are a fairly small subset of all users, we used the F1 score as the optimization metric. GBTClassifier performed better than the other models.
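To illustrate why F1 is a more informative metric than accuracy when churned users are rare, here is a minimal pure-Python sketch. The labels and predictions are hypothetical and are not taken from the Sparkify data. The helpers `f1_score` and `accuracy` are illustrative names, and the F1 shown is the positive-class (churn) F1, which may differ from the exact evaluator setting used in the notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: 100 users, of whom 10 churned (label 1).
y_true = [1] * 10 + [0] * 90

# A trivial model that predicts "no churn" for everyone.
always_stay = [0] * 100

# A model that finds 6 of the 10 churners and raises 4 false alarms.
some_model = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 86

for name, preds in [("always_stay", always_stay), ("some_model", some_model)]:
    print(name, "accuracy:", accuracy(y_true, preds), "F1:", f1_score(y_true, preds))
```

In this toy data the trivial model reaches high accuracy while identifying no churners at all, so its F1 is zero. The second model has only slightly higher accuracy but a much higher F1, which is the behavior that makes F1 a reasonable choice for imbalanced churn data.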

I wrote a blog post covering the details; you can find it here.

Licensing, Authors, and Acknowledgements

Credit goes to Udacity for the project. The instructions in the notebook were also well prepared by the Udacity team.