This is the repository for the Data Scientist Capstone Project, part of the Data Scientist Nanodegree Program by Udacity.
Churn prediction is a common problem across many types of businesses. It helps minimize customer defection by predicting which customers are likely to cancel a subscription to a service. Though churn prediction was originally used in the telecommunications industry, it has become common practice in many other sectors, such as banking and insurance.
In this project, I build an end-to-end machine learning model that predicts customer churn from a sample dataset of Sparkify user data, using the Apache Spark Machine Learning framework. The model predicts which users are likely to churn from the music application service.
The goal of this project is to build an end-to-end model that predicts which users of the Sparkify music application will churn. The tasks involved are the following:
- Preprocess (load, clean, and transform) the raw dataset in JSON format with PySpark
- Analyze the data to define the set of features that can be used to train a predictive model
- Train classifiers that can determine whether a user has churned, using the Apache Spark Machine Learning framework
- Select the best model and tune it to improve its results
- Present the end-to-end process of building an ML model with Apache Spark Machine Learning in a Medium blog post (this post)
- F-1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test. The traditional F-measure, or balanced F-score, is the harmonic mean of precision and recall: F-1 = 2 × (precision × recall) / (precision + recall).
- Accuracy is a common metric for binary classifiers; it weights true positives and true negatives equally: accuracy = (true_positives + true_negatives) / dataset_size.
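To make the two formulas concrete, here is a small plain-Python sketch that computes both metrics from lists of true and predicted labels. The function name `f1_and_accuracy` and the sample labels are hypothetical and chosen only for illustration; label `1` is assumed to mean "churned".

```python
def f1_and_accuracy(y_true, y_pred):
    """Return (f1, accuracy) for binary labels, where 1 marks a churned user."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs)
    return f1, accuracy


# Hypothetical labels: 3 churned users out of 10.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
f1, accuracy = f1_and_accuracy(y_true, y_pred)
print(f"F-1 = {f1:.3f}, accuracy = {accuracy:.3f}")
```

On these sample labels, a model that predicts "not churned" for everyone would still reach 70% accuracy while its F-1 score would be 0, which illustrates why it is useful to report both metrics.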
The project runs in an Apache Spark environment. Refer to Apache Spark for instructions on setting up a Spark environment.
- PySpark
- Spark MLlib
- seaborn
Files in this repository:
- datasources
- images
- repository
  - Sparkify.ipynb
  - Sparkify.html
- README.md
The final churn prediction model achieves an F-1 score of 78% and an accuracy of 80%. The project report is presented in the blog post Building Churn Prediction Model with Apache Spark Machine Learning.
- Apache Spark
- The Data Scientist Guide to Apache Spark, Databricks