This repository holds the data and code for my Bachelor's thesis: "Exploring machine learning models for churn prediction of membership subscriptions". The goal is to test different ML models for predicting whether a member will renew their yearly membership.
The metric of choice for this classification task (predicting non-renewals) is precision, with an acceptance threshold of 75%. That said, the metric and threshold may be re-evaluated after performing some EDA on the data and determining the class balance.
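As a rough illustration of the acceptance check, the sketch below computes precision for the non-renewal class and compares it with the 75% threshold. The label encoding (1 = did not renew), the function name `precision`, and the toy labels are assumptions made for this example; they are not taken from the thesis code.

```python
def precision(y_true, y_pred, positive=1):
    """Precision for the `positive` class: TP / (TP + FP)."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    pred_pos = sum(1 for p in y_pred if p == positive)
    # Precision is undefined when nothing is predicted positive;
    # this sketch returns 0.0 in that case (a convention, not a rule).
    return true_pos / pred_pos if pred_pos else 0.0


PRECISION_THRESHOLD = 0.75  # acceptance threshold stated in this README

# Toy labels (made up): 1 = member did not renew, 0 = member renewed.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

score = precision(y_true, y_pred)
meets_threshold = score >= PRECISION_THRESHOLD
print(f"precision={score:.2f}, meets threshold: {meets_threshold}")
```

Precision only counts how many of the predicted non-renewals were correct, so it says nothing about missed non-renewals. This is one reason the metric may need revisiting once the class balance is known.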
Devise an ML model for predicting who won't renew their membership, and determine whether a non-renewing member can be converted 3 months prior to their renewal date.
Extra: If a probabilistic model performs well, identify users with a 30%-50% predicted chance of renewing. Those users have the biggest potential to be converted from non-renewals to renewals.
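A minimal sketch of that selection step, assuming the model outputs one renewal probability per member. The function name `members_to_target`, the member IDs, and the probabilities are hypothetical and only illustrate the 30%-50% band.

```python
def members_to_target(renewal_probs, low=0.30, high=0.50):
    """Return member IDs whose predicted renewal probability lies in
    [low, high], ordered from highest to lowest probability."""
    in_band = {member: p for member, p in renewal_probs.items() if low <= p <= high}
    return sorted(in_band, key=in_band.get, reverse=True)


# Hypothetical model output: member ID -> predicted probability of renewing.
renewal_probs = {
    "m001": 0.92,
    "m002": 0.45,
    "m003": 0.31,
    "m004": 0.12,
    "m005": 0.50,
}

targets = members_to_target(renewal_probs)
print(targets)
```

Both band edges are treated as inclusive here. Whether they should be, and whether the probabilities are calibrated well enough to read as real chances, would have to be checked against the actual model.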
General facts:
- The available data is inspired by PALMS data from a company named Business Network International (BNI)
- BNI creates local groups of entrepreneurs who form relationships and refer each other's businesses
- Company business model: yearly membership subscription
- BNI groups meet on a weekly basis
- Data from 2016-03 to 2021-02 in a monthly format
- Each month's PALMS data contains information about member and chapter performance
Here are a couple of links with a legend/explanation of the PALMS data: Link 1 and Link 2.
General information about all members.
Information about member drop and join dates.
- 1. Prepare for thesis
- Get data
- Business objective
- Create a cleaning log
- Decide on metric and threshold of choice
- 2. Prepare & clean data ("cleaning_log.md")
- Anonymize data
- Check control sums to ensure PALMS data hasn't been duplicated
- Concatenate PALMS data
- Create a master dataset - merge
- Ensure data integrity:
- Fix member records with two or fewer months with negative year of membership
- Fix member records with more than two months with negative year of membership
- Remove duplicate records
- Aggregate 3-, 6-, and 9-month datasets
- Re-merge datasets
- Label: renewing/not renewing
- 3. Explore data (each of the 3-, 6-, and 9-month aggregated datasets)
- Feature engineering:
- Seat popularity rate
- Chapter retention rate
- Chapter size
- Feature selection
- Exploratory Data Analysis:
- Summary Statistics
- Outliers
- Normality
- Visual representations
- Scale
- [Extra] Data Analysis:
- Which features are most indicative of whether a member will renew?
- Which seats are the most profitable?
- Feature engineering:
- 4. Create ML models
- Split data: train, validate, test
- Hyperparameter tuning (cross-validate & plot)
- Learning curve
- Power analysis
- Test results
- 5. Meet with promotor to discuss results and get feedback
- 6. Write LaTeX thesis
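
The "Split data: train, validate, test" step in part 4 could look roughly like the sketch below. It splits within each class so that all three subsets keep about the same share of non-renewals. The 60/20/20 fractions, the fixed seed, the function name `stratified_split`, and the toy records are assumptions for illustration, not decisions made in the thesis.

```python
import random
from collections import defaultdict


def stratified_split(records, label_of, fractions=(0.6, 0.2, 0.2), seed=42):
    """Split records into train/validate/test lists, class by class."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_label = defaultdict(list)
    for rec in records:
        by_label[label_of(rec)].append(rec)

    train, validate, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_val = round(len(group) * fractions[1])
        train.extend(group[:n_train])
        validate.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # remainder goes to test
    return train, validate, test


# Toy data (made up): (member_id, label) with 1 = did not renew.
records = [(f"m{i:03d}", 0) for i in range(20)]
records += [(f"m{i:03d}", 1) for i in range(20, 30)]

train, validate, test = stratified_split(records, label_of=lambda r: r[1])
print(len(train), len(validate), len(test))
```

If one member contributes several rows to a dataset, splitting by member rather than by row would be the safer choice, since the same member could otherwise appear in both the training and the test set.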