Logistic Regression using NumPy - Churn Classification

Business Objective

In this project, we aim to predict customer churn for a service-providing company, XYZ. Churned customers are those who have decided to end their relationship with the company; XYZ wants to know whether its customers will renew their subscriptions for the coming year.


Data Description

The dataset (Data_regression.csv) consists of approximately 2,000 rows and 16 columns with the following features (a short loading sketch follows the list):

  1. Year
  2. Customer_id (unique ID)
  3. Phone_no (customer phone number)
  4. Gender (Male/Female)
  5. Age
  6. No of days subscribed (number of days since the customer subscribed)
  7. Multi-screen (whether the customer has a single- or multi-screen subscription)
  8. Mail subscription (whether the customer receives mail)
  9. Weekly mins watched (number of minutes watched weekly)
  10. Minimum daily mins (minimum minutes watched per day)
  11. Maximum daily mins (maximum minutes watched per day)
  12. Weekly nights max mins (number of minutes watched at night during a week)
  13. Videos watched (total number of videos watched)
  14. Maximum_days_inactive (maximum number of days the customer was inactive)
  15. Customer support calls (number of customer support calls)
  16. Churn (1 - Yes, 0 - No)
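
A minimal loading-and-inspection sketch; the CSV name comes from the Modular Code Overview below, and the churn column name is an assumption:

```python
# Load the churn dataset and take a first look.
import pandas as pd

df = pd.read_csv("input/Data_regression.csv")    # path per the repo's input folder
print(df.shape)                                  # expected: roughly (2000, 16)
print(df.dtypes)
print(df["churn"].value_counts(normalize=True))  # churn rate; column name assumed
```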

Aim

The goal of this project is to build a logistic regression model from scratch with NumPy on the given dataset to determine whether a customer will churn. A minimal sketch of the core computation follows.
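
Since the core of the project is implementing logistic regression with plain NumPy, here is a minimal sketch of the technique (sigmoid plus batch gradient descent on the log-loss). The function names and toy data are illustrative, not the repository's actual code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """Fit weights by batch gradient descent on the log-loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)            # predicted churn probabilities
        grad_w = X.T @ (p - y) / n_samples
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(X, w, b, threshold=0.5):
    return (sigmoid(X @ w + b) >= threshold).astype(int)

# Toy usage with random data, just to show the shapes involved:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
w, b = fit_logistic(X, y)
print("train accuracy:", np.mean(predict(X, w, b) == y))
```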


Tech Stack

  • Language: Python
  • Libraries: numpy, pandas, matplotlib, seaborn, scikit-learn, pickle, imblearn, statsmodels
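
All third-party dependencies can be installed with pip; note that pickle ships with the Python standard library, and imblearn is published on PyPI as imbalanced-learn.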

Approach

  1. Import the required libraries and read the dataset.
  2. Perform Exploratory Data Analysis (EDA) with data visualization.
  3. Inspect and clean the data, including encoding the categorical variables.
  4. Conduct Feature Engineering by dropping unwanted columns.
  5. Split the dataset into training and testing sets.
  6. Build a Logistic Regression Model using the statsmodels library (see the first sketch after this list).
  7. Validate the model's predictions using the accuracy score, confusion matrix, ROC curve and AUC, recall score, precision score, and F1-score.
  8. Handle the imbalanced data using several methods: balanced and custom class weights, and resampling with SMOTE (see the rebalancing sketch below).
  9. Perform feature selection with different methods, such as threshold-based selection and the RFE method (see the RFE sketch below).
  10. Save the best model as a pickle file (see the persistence sketch below).
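
The following sketches illustrate the main steps under stated assumptions; they are illustrative, not the repository's exact code. The first covers steps 5-7: splitting, fitting a statsmodels Logit, and validating. It assumes a cleaned, all-numeric DataFrame df with a binary churn column (the column name is an assumption).

```python
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, confusion_matrix, roc_auc_score,
                             recall_score, precision_score, f1_score)

X = df.drop(columns=["churn"])
y = df["churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# statsmodels needs the intercept added explicitly.
model = sm.Logit(y_train, sm.add_constant(X_train)).fit()
print(model.summary())                            # coefficients with p-values

proba = model.predict(sm.add_constant(X_test))    # churn probabilities
pred = (proba >= 0.5).astype(int)

print("accuracy :", accuracy_score(y_test, pred))
print("confusion:\n", confusion_matrix(y_test, pred))
print("AUC      :", roc_auc_score(y_test, proba))
print("recall   :", recall_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("F1       :", f1_score(y_test, pred))
```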
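
A minimal rebalancing sketch for step 8, assuming X_train and y_train from the split above. SMOTE resamples only the training split; class weighting is shown as the scikit-learn alternative.

```python
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

# Oversample the minority (churn) class in the training data only.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Alternative: weight the classes instead of resampling.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
```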
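
An RFE sketch for step 9; the number of features to keep (8) is an illustrative choice, not the project's setting.

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Recursively eliminate the weakest features until 8 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=8)
rfe.fit(X_train, y_train)
selected = X_train.columns[rfe.support_]
print("selected features:", list(selected))
```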
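
A persistence sketch for step 10: pickling the best model so it can be reloaded for future predictions without retraining. The file path is illustrative.

```python
import pickle

# Save the fitted model to the output folder.
with open("output/best_model.pkl", "wb") as f:
    pickle.dump(clf, f)

# Reload it later and predict without retraining.
with open("output/best_model.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded.predict(X_test[:5]))
```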

Modular Code Overview

  1. input: Contains all the data files for analysis, such as Data_regression.csv.
  2. src: Contains the modularized code for each step of the pipeline, organized into Engine.py and the ML_Pipeline folder (see the layout sketch below).
  3. output: Contains the best-fitted models trained on the data. These models can be loaded and used for future predictions without retraining.
  4. lib: A reference folder containing the original IPython notebook used in the project.
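
Based on the folders described above, the repository layout looks roughly like this (the contents of ML_Pipeline are not specified, so they are omitted):

```
Churn-Prediction-Logistic-Regression-NumPy/
├── input/
│   └── Data_regression.csv
├── src/
│   ├── Engine.py
│   └── ML_Pipeline/
├── output/          # best-fitted pickled models
└── lib/             # original Jupyter notebook
```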