Sparkify - Data Science Nanodegree CAPSTONE

Overview of Project

This repository contains the capstone project for the Udacity Data Science Nanodegree. It includes a Jupyter notebook with all the code for each part required to pass the project.

The assignment was to identify users of a fictional service called Sparkify who would churn, that is, leave the service. The data contains a variety of numerical and categorical variables that had to be cleaned, transformed, and processed before each part of the project could be completed.

Required steps

Read and clean the data
Exploratory Data Analysis
Feature Engineering
Machine Learning - Classification of users
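
As a rough illustration of the first three steps, here is a minimal sketch in pandas on a tiny made-up event log. The column names (userId, page), the "Cancellation Confirmation" page used as the churn marker, and the label_churn helper are all assumptions made for this example; the notebook itself may define churn and features differently.

```python
import pandas as pd

# Hypothetical event log: column names and page values are assumptions.
events = pd.DataFrame({
    "userId": ["1", "1", "2", "2", "", "3"],
    "page": ["NextSong", "Cancellation Confirmation", "NextSong",
             "Thumbs Up", "Home", "NextSong"],
})


def label_churn(events, churn_page="Cancellation Confirmation"):
    """Clean the log and build one row per user with a churn label."""
    # Cleaning: drop events that cannot be tied to a user.
    has_user = events["userId"].notna() & (events["userId"] != "")
    cleaned = events[has_user].copy()

    # Per-event flags that are later aggregated per user.
    cleaned["is_song"] = (cleaned["page"] == "NextSong").astype(int)
    cleaned["is_churn"] = (cleaned["page"] == churn_page).astype(int)

    # Feature engineering: simple per-user counts plus the churn label.
    return cleaned.groupby("userId", as_index=False).agg(
        n_events=("page", "size"),
        n_songs=("is_song", "sum"),
        churn=("is_churn", "max"),
    )


users = label_churn(events)
print(users)
```

The resulting table has one row per user, which is the shape a classifier expects; real features would be richer than these two counts.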

Required packages

pandas - data analysis library
numpy - numerical computing and array processing library
matplotlib - data visualization library
pyspark - Python API for Apache Spark, a distributed framework for processing big data
seaborn - visualization library built on top of matplotlib that makes plots quicker to customize

Results of the project

After loading, cleaning, and exploring the data and engineering features, four classifiers were trained:

Logistic Regression Classifier
Support Vector Classifier
Random Forest Classifier
Boosted Trees Classifier

All classifiers were run on a dataset with 21 features and on a dataset with 29 features, in order to compare how the number of features influences prediction accuracy, even though only 8 new features were added.

The detailed results can be found in the article below: https://medium.com/@m.dyngosz/sparkify-will-churn-or-will-not-churn-this-is-the-question-88a1b2668283?sk=3d3bc4503dfcc6c6435b97baee2a689e

Included Files