Udacity-Capstone-Spark-project

Sparkify-project

### Customer Churn Prediction for a Music App Using PySpark

This repository contains the results of the Data Science Nanodegree Sparkify Capstone Project. Its purpose is to give the reviewers access to the code. More information can be found in a Medium blog post.

Installation
Project Motivation
Files Description
Result
Licensing, Authors, and Acknowledgements

Installation

This project requires the following applications and Python libraries:

Python
Spark
PySpark
pandas
Matplotlib
Seaborn

You also need a Jupyter Notebook environment, such as the one that ships with Anaconda, to run the notebook.

Project Motivation

This project was a chance to develop the following skills:

Loading and manipulating large datasets in Spark using Spark SQL and Spark DataFrames
Using the Spark ML machine learning APIs to build and fine-tune models
Integrating the skills I learned in the Spark course and the Data Scientist Nanodegree program

Files Description

The Sparkify.ipynb notebook is the main file of the project.
It demonstrates the process of using PySpark to explore the data and build the model.

Result

We split the feature and target variable data into training and test sets, then built a pipeline and trained three machine learning models. Because churned users are a fairly small subset of all users, we used the F1 score as the optimization metric. By that metric, GBTClassifier performed better than the other models.
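
To see why accuracy can mislead when churned users are rare, here is a small plain-Python sketch with made-up labels (10 churners out of 100 users). The numbers and the helper names `accuracy` and `f1_score` are illustrative assumptions, not results from the project.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive (churned) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no churner was found, so precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up data: the first 10 users churned, the other 90 stayed.
y_true = [1] * 10 + [0] * 90

# A useless model that predicts "stays" for everyone.
always_stay = [0] * 100

# A model that finds 6 of the 10 churners and raises 4 false alarms.
some_model = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 86

for name, pred in [("always_stay", always_stay), ("some_model", some_model)]:
    print(f"{name}: accuracy={accuracy(y_true, pred):.2f} f1={f1_score(y_true, pred):.2f}")
# always_stay: accuracy=0.90 f1=0.00
# some_model: accuracy=0.92 f1=0.60
```

The two models are nearly tied on accuracy, but F1 on the churned class separates them clearly. Note that this is F1 for the positive class only, whereas Spark's `MulticlassClassificationEvaluator` reports an F1 averaged over classes.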

I wrote a blog post with more details; you can find it here.

Licensing, Authors, and Acknowledgements

Credit goes to Udacity for the project. The instructions in the notebook were also well prepared by the Udacity team.

bibekuchiha/Udacity-Capstone-Spark-project