This is a Customer Conversion Prediction project that I found during the Guvi Hackathon. With the help of this project, any startup insurance company can predict whether a customer will subscribe to their insurance or not.
Author = Vishal Kumar Mridha
Domain = Insurance
Level = Intermediate
Accuracy Score = 89%
Project Type = End To End Project
- Problem Statement
- Data Collection
- Data Description
- Data Preprocessing
- Handle Outliers
- Exploratory Data Analysis (EDA)
- Data Encoding
- Feature Selection
- Data Balancing
- Data Scaling
- Model Building
- Model Performances & Feature Importance
- Make Pickle File
- Create New Environment
- Create Web App With Streamlit
- Upload All Files In Github Repository
- Deploy Model On Heroku
Web App Overview
The problem statement is to build an ML model that predicts whether a client will subscribe to the insurance or not. We need to use the historical data to reduce the cost of telephonic marketing, which is still effective but incurs a lot of cost. It is also important to identify beforehand the customers who are most likely to convert, so that they can be specifically targeted via call.
In simple words, we have to build a model that helps any startup insurance company understand what strategy it should follow to gain more and more customers via calls; the company can also use this model to check whether a customer will subscribe to its insurance or not.
I found this dataset during the Guvi Hackathon. I first downloaded it from Google Docs to my local storage, then uploaded it to the GitHub repo, and finally imported it into a Jupyter notebook:
import pandas as pd

# Load the dataset directly from the GitHub repository
df = pd.read_excel('https://raw.githubusercontent.com/vishalkrishna90/CUSTOMER-CONVERSION-PREDICTION/main/Customer%20Conversion%20Prediction.xlsx')
Data Overview
The Customer Conversion Prediction dataset has 45211 rows and 11 columns. The data frame contains the following columns:
age - Age of a person
job - type of job
marital - marital status
educational_qual - education status
call_type - contact communication type
day - last contact day of the month (numeric)
mon - last contact month of year
dur - last contact duration, in seconds (numeric)
num_calls - number of contacts performed during this campaign and for this client
prev_outcome - outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")
y - target: whether the client subscribed to the insurance or not
In this step, I first checked whether any null or duplicate values were present. I found no null values, but there were some duplicates, which I dropped. I then checked for incorrect data, found a small amount of it, and corrected it.
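A minimal sketch of these checks (the correction of the wrong values is dataset-specific and not shown here):

# Check for nulls and duplicates, then drop the duplicate rows
print(df.isnull().sum())
print(df.duplicated().sum())
df = df.drop_duplicates()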
After correcting the incorrect data, I checked whether any outliers were present. Some values initially looked like outliers, but when I verified them with the IQR method for more clarity, it turned out there were no true outliers.
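For illustration, an IQR check on a numeric column could look like this (using 'age' as an assumed example):

# Flag values lying outside 1.5 * IQR of the quartiles
q1, q3 = df['age'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(df[(df['age'] < lower) | (df['age'] > upper)].shape[0])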
In this step, I analyzed the data deeply and efficiently: first I checked the correlation between features, then the feature distributions, and finally the relation between features and the target using graphs.
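A sketch of this kind of EDA, assuming matplotlib, seaborn, and a recent pandas version are available:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation between the numeric features
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()

# Distribution of a numeric feature, split by the target
sns.histplot(data=df, x='age', hue='y')
plt.show()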
After EDA, I encoded the data with label and one-hot encoders and created a new data frame for the next steps.
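A hedged sketch of that encoding step; the exact split between label and one-hot encoding is an assumption, and 'result' matches the target column name used below:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# One-hot encode the nominal features, label-encode the binary target
cat_cols = ['job', 'marital', 'educational_qual', 'call_type', 'mon', 'prev_outcome']
n_df = pd.get_dummies(df.drop('y', axis=1), columns=cat_cols)
n_df['result'] = LabelEncoder().fit_transform(df['y'])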
In this step, I first split the data into features and target, and then into train and test sets:
# Split data into features and target
X = n_df.drop('result', axis=1)
y = n_df['result']

# Import train_test_split to create train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
There was a big problem with this project: the dataset was imbalanced, so before building a model I first had to balance the data. For balancing, I used SMOTETomek from the imblearn library:
# Combine SMOTE over-sampling with Tomek-link under-sampling
from imblearn.combine import SMOTETomek
smt = SMOTETomek(sampling_strategy='all')
X_train, y_train = smt.fit_resample(X_train, y_train)
After balancing the data, I scaled the train and test features:
# import standard scaler to scale data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
In this step, I built several different models, checked their performance, and chose the best model for the dataset.
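The exact models compared are not listed here, but a comparison loop along these lines would fit (XGBoost and a random forest are plausible candidates given the installed libraries and the model name used later):

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Train each candidate model and compare test accuracy
models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'RandomForest': RandomForestClassifier(),
    'XGBoost': XGBClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))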
After building the models, I compared them by accuracy score, took the model with the highest score as the final model, and then checked the important features based on that model.
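A sketch of the feature-importance check, assuming the final model ('greed_rfc', the name used in the pickling step below) is a fitted tree-based estimator exposing feature_importances_:

import pandas as pd

# Rank features by their importance in the final model
importances = pd.Series(greed_rfc.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))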
After getting the score from the best model, I created a pickle file for the web app:
import pickle as pkl

# Serialize the final model so the web app can load it later
with open('ccp_model.pkl', 'wb') as f:
    pkl.dump(greed_rfc, f)
After making the pickle file, I created a new virtual environment for the project, installed the required libraries, and created the web app in the VS Code IDE:
conda create -p customerconversion python=3.9 -y
pip install streamlit numpy pandas scikit-learn xgboost
After installing all the required libraries and dependencies, I created the web app.
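A minimal, hypothetical sketch of such a Streamlit app; the input fields shown here are only examples, and the real app must supply every encoded and scaled feature in training order:

import pickle as pkl
import numpy as np
import streamlit as st

st.title('Customer Conversion Prediction')

# Load the pickled model created earlier
model = pkl.load(open('ccp_model.pkl', 'rb'))

age = st.number_input('Age', min_value=18, max_value=100, value=30)
dur = st.number_input('Last contact duration (seconds)', min_value=0, value=60)
num_calls = st.number_input('Number of calls this campaign', min_value=0, value=1)

if st.button('Predict'):
    # Placeholder feature vector; the real app must match the training features
    features = np.array([[age, dur, num_calls]])
    pred = model.predict(features)[0]
    st.write('Customer will subscribe' if pred == 1 else 'Customer will not subscribe')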
After creating the web app, I uploaded all the files to the GitHub repository using the Git CLI:
git config --global user.name "FIRST_NAME LAST_NAME"
git config --global user.email "myemail@gmail.com"
git add files_name
git commit -m "about the commit"
git push origin main
In the end, I deployed the model on Heroku so that anybody can use the web app.
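For reference, Heroku needs a Procfile telling it how to start the app; a typical one for a Streamlit app looks like this (the file name 'app.py' is an assumption):

web: streamlit run app.py --server.port=$PORT --server.address=0.0.0.0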
Customer Conversion Prediction Web App
Challenges Faced
- Imbalanced data
- The data appeared to contain some outliers, but after applying a few approaches I found they were not true outliers
- There were many categorical values present, and I had to encode them one by one
- Creating the web app and deploying it on Heroku