/touchpoint-prediction

Completed project - predicting customer touchpoints using XGBoost tuned with GridSearchCV


Aim: Based on a customer's profile, predict which type of touchpoint has the highest probability of resulting in a purchase

Project overview:

  1. Created a tool that predicts the best touchpoint for a customer based on their profile
  2. Optimized Random Forest and XGBoost classifiers with GridSearchCV to select the best model
  3. Made the model deployment-ready with Pickle

Workflow:

data-cleaning_and_eda.ipynb -> model-building.ipynb

Data Cleaning

1. Checked for missing values in every column and dropped duplicates

2. Removed rows with a missing touchpoint value or nTouchpoints = 0
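A minimal sketch of these cleaning steps in pandas (the column names here are hypothetical, since the dataset schema isn't shown in this README):

```python
import pandas as pd

# Hypothetical toy data -- the real dataset's columns are not shown here
df = pd.DataFrame({
    "age": [34, 34, 45, 29, 51],
    "income": [52000, 52000, 61000, None, 48000],
    "nTouchpoints": [3, 3, 0, 2, 1],
})

df = df.drop_duplicates()          # drop exact duplicate rows
print(df.isna().sum())             # inspect missing values per column

# remove rows with a missing touchpoint count or nTouchpoints = 0
df = df[df["nTouchpoints"].notna() & (df["nTouchpoints"] > 0)]
print(len(df))
```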

EDA

1. Explored the relationships between the segment variable and the other variables in the dataset

(Figures: income line plot, average spending line plot)

2. Checked for the presence and degree of multicollinearity with a correlation heatmap

(Figure: collinearity heatmap)
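The multicollinearity check can be reproduced with a pandas correlation matrix (synthetic data below; the heatmap itself would be rendered with a plotting library such as seaborn):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = rng.normal(40, 10, 200)
income = age * 1000 + rng.normal(0, 5000, 200)  # deliberately correlated with age
spending = rng.normal(300, 50, 200)             # independent of the other two

df = pd.DataFrame({"age": age, "income": income, "avg_spending": spending})
corr = df.corr()  # pairwise Pearson correlations
# sns.heatmap(corr, annot=True) would render this matrix as a heatmap
print(corr.round(2))
```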

3. Visualized the distributions of variables with distribution plots and bar plots

(Figures: marital, segment, social media, credit rating, nTouchpoints, and age plots; age, income, and average spending distribution plots)

4. One-hot encoded categorical variables

Each categorical variable was expanded into one binary (0/1) column per category.
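One-hot encoding can be sketched with pd.get_dummies (the marital column and its values are hypothetical examples):

```python
import pandas as pd

df = pd.DataFrame({
    "marital": ["single", "married", "unknown", "single"],  # hypothetical values
    "age": [25, 40, 33, 29],
})

# each category becomes its own 0/1 indicator column
encoded = pd.get_dummies(df, columns=["marital"])
print(encoded.columns.tolist())
```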

Model Building

Metrics for evaluating models:

  1. Multiclass logloss: since the model predicts a probability distribution over touchpoint types, logloss measures the average divergence between the predicted and true distributions.
  2. F1-score (micro), since the label classes are imbalanced.
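Both metrics are available in scikit-learn; a small illustration with made-up predictions over three touchpoint classes:

```python
import numpy as np
from sklearn.metrics import f1_score, log_loss

y_true = [0, 1, 2, 2, 1, 0]     # true touchpoint classes
y_proba = np.array([            # predicted probability distributions
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.3, 0.3, 0.4],
    [0.4, 0.5, 0.1],
    [0.6, 0.3, 0.1],
])
y_pred = y_proba.argmax(axis=1)  # hard labels for the F1-score

ll = log_loss(y_true, y_proba)                   # multiclass logloss
f1 = f1_score(y_true, y_pred, average="micro")   # micro-averaged F1
print(ll, f1)
```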

1a. Standardize/normalize numerical data

(Figures: age, income, and average spending distribution plots)
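Standardization can be done with scikit-learn's StandardScaler (toy numbers below; in practice the scaler should be fit on the training set only to avoid leakage):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[25, 40000],
              [40, 65000],
              [33, 52000]], dtype=float)  # e.g. age, income

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # zero mean, unit variance per column
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```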

1b. Stratified train test split

I wrote a custom script to split the dataset into train, validation, and test sets using stratified sampling: 80% train, 10% validation, and 10% test.
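One way to sketch such a stratified 80/10/10 split is two chained calls to train_test_split (this is an illustration, not the exact custom script used here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 30 + [2] * 20)  # imbalanced toy labels

# 80% train, then split the remaining 20% evenly into validation and test,
# stratifying on the labels at each step to preserve class proportions
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```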

2. Try baseline ensemble model: Random Forest


I picked the RF classifier as a baseline simply because it trains quickly, which lets me iterate efficiently with GridSearchCV toward the best possible model. After initializing and tuning my RandomForestClassifier with GridSearchCV, I got a train accuracy of 1.0 and a test accuracy of 0.77688, which indicates overfitting.
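A minimal sketch of tuning a RandomForestClassifier with GridSearchCV (synthetic data and an illustrative parameter grid, not the grid actually used in the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# stand-in data; the real features come from the customer profiles
X, y = make_classification(n_samples=300, n_classes=3,
                           n_informative=5, random_state=0)

param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=3, scoring="f1_micro")
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)
print(grid.best_estimator_.feature_importances_)  # basis for the FI plot
```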

(Figure: Random Forest feature importances)

Our RF classifier seems to pay the most attention to average spending, income, and age.

3. Explore ensemble model: XGBoost


Initial XGB model

(Figures: mean logloss and mean error plots for the initial model)

XGB model after tuning max_depth, min_child_weight, and reg_alpha with GridSearchCV

(Figures: mean logloss and mean error plots after tuning)

(Figure: XGBoost feature importances)

Our XGBoost model assigns high importance to the 'unknown' marital status feature. This may be because only 44 customers have an 'unknown' marital status, so to reduce bias the model compensates by weighting that rare feature more heavily.

XGBoost Accuracy: 0.9678972712680578

XGBoost F1-Score (Micro): 0.9678972712680578

I will pick the final XGBoost model since it gives a significantly higher F1-score and accuracy. Overfitting can also be controlled easily by further tuning the reg_alpha value.

Model Deployment

I included a pickle file for future deployment of the model behind a Flask API. For productionization, a Flask API endpoint can be hosted on a server; it would take in a list of values from a customer's profile and return the recommended touchpoint.
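Saving and reloading the model with pickle looks roughly like this (a RandomForestClassifier on synthetic data stands in for the final XGBoost model):

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_classes=3,
                           n_informative=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

with open("model.pkl", "wb") as f:   # serialize the trained model
    pickle.dump(model, f)

with open("model.pkl", "rb") as f:   # a Flask endpoint would load this at startup
    loaded = pickle.load(f)

print((loaded.predict(X) == model.predict(X)).all())
```

The Flask endpoint would then call predict_proba on the incoming profile values and return the touchpoint class with the highest probability.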