###Customer Churn Using PySpark Prediction for Music App.
This repository contains the results of the Data Science Nanodegree SpThis repository includes the findings of the Sparkify Capstone Project Datascience Nanodegree. Its aim is to make the code available to the reviewers. See a Medium Blog Post for more information.arkify Capstone Project. Its’s purpose is to give the reviewers access to the code. More information can be found on a Medium Blog Post.
- Installation
- Project Motivation
- Files Description
- Result
- Licensing, Authors, and Acknowledgements
The following applications and Python libraries needs to be included in this project:
- Python
- Spark
- Pyspark
- pandas
- Matplotlib
- Seaborn
You would also need to have Jupyter Notebook applications enabled like Anaconda to run and execute.
- Loading and manipulating huge sets of data into Spark using Spark SQL and Spark Dataframes
- Use the Spark ML machine learning APIs to create and fine-tune models -Integrating the skills I learned in the Spark course and the Nanodegree program for Data Scientists
- Sparkify.ipynb Notebook is main file of the project.
- It demonstrates the process of using pyspark to explore the data and build the model.
We split the collection of function & goal variable data into train, test and then developed a pipeline and implemented 3 models of machine learning. Because the churned users are a fairly small subset, we used F1 performance as the optimizing metric, and we found a better model for GBTClassifier compared to another.
I post a blog about the detail, you can find it here.
Must give credit to Udacity for the project. And instructions in the notebook are also well prepared by Udacity team.