This repository contains the capstone project for the Udacity Data Science Nanodegree. It includes a Jupyter notebook with all the code for each part required to pass the project.
The assignment was to identify users of the fictional service Sparkify who are likely to churn. The data contains a variety of numerical and categorical variables that had to be cleaned, transformed, and processed to complete each part of the project:
- Read and clean the data
- Exploratory Data Analysis
- Feature Engineering
- Machine Learning - Classification of users

The following libraries were used:
- pandas - data analysis library
- numpy - numerical computing and data processing library
- matplotlib - data visualization library
- pyspark - Python API for Apache Spark, a distributed framework for big data processing and analysis
- seaborn - visualization library built on top of matplotlib that makes plots quicker to customize
After loading, cleaning, and exploring the data and engineering the features, four classifiers were trained:
- Logistic Regression Classifier
- Support Vector Classifier
- Random Forest Classifier
- Boosted Trees Classifier
All classifiers were run on a dataset with 21 features and on a dataset with 29 features to compare how the number of features influences prediction accuracy, even though only 8 new features were added.
The detailed results can be found in this article: https://medium.com/@m.dyngosz/sparkify-will-churn-or-will-not-churn-this-is-the-question-88a1b2668283?sk=3d3bc4503dfcc6c6435b97baee2a689e
- README.md - read me file
- Sparkify.ipynb - notebook with all the code used for this project