
Predictive modelling of consumers' choice of Stage 2 infant formula with Machine Learning models, highlighting the most influential features.

Consumer Retention Prediction

1. Introduction

  • Forecasted the probability of a consumer choosing our Stage 2 (infant) product with 87.7% accuracy.
  • Collected approximately 2k records from the customer data platform and a series of consumer surveys.
  • Conducted data cleansing, EDA (exploratory data analysis), and feature engineering.
  • Implemented ML classification models (Logistic Regression and Random Forest), showcasing the most crucial features.

Due to a non-disclosure agreement (NDA), the dataset has been modified.

[Figure: Logistic Regression feature importance, Case 2]

2. Code and Packages

  • Python Version: 3.9

  • Packages: pandas, numpy, sklearn, matplotlib, seaborn

3. Objectives

  • Consumer insight: Stage 2 users are 12% more likely to become Stage 3 users.

  • Business goal: Understand the variables that trigger our enrolled members to be Stage 2 formula users.

4. Data Sources

The samples were collected from the customer data platform and a set of surveys, covering the period from Q1 2020 to Q1 2021.

5. Methodology

  • Dependent variable: Current Stage 2 formula brand.

  • Independent variables:

    • Behavioral variables: previous product brand, enrollment type, enrollment age, email open rate, click through rate, coupon redemption rate.

    • Demographic variables: hospital zone, province, educational status, number of children.

  • Machine Learning Models: Logistic Regression and Random Forest.

  • Oversampling due to the imbalanced dataset.

  • GridSearchCV to tune the hyperparameters.

6. Data Cleaning

  • Encoded the output variable as binary ("MJN" as 1, other brands as 0).

  • Created dummy variables to transform categorical variables to numeric ones.

  • Scaled the columns to ensure they are within the same range (0 to 1).
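
The three cleaning steps above can be sketched as follows. This is a minimal illustration on toy data, not the project's actual code: the column names and values are assumptions made up for the example, since the real dataset is under NDA.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data; column names and values are illustrative assumptions.
raw = pd.DataFrame({
    "stage2_brand": ["MJN", "Nestle", "MJN", "Abbott"],
    "previous_brand": ["MJN", "Nestle", "Abbott", "Nestle"],
    "enrollment_type": ["self", "hospital", "self", "self"],
    "num_children": [1, 2, 1, 3],
    "email_open_rate": [0.60, 0.10, 0.45, 0.20],
})

# Binary target: 1 if the current Stage 2 brand is MJN, otherwise 0.
y = (raw["stage2_brand"] == "MJN").astype(int)

# One-hot encode the categorical predictors into numeric dummy columns.
X = pd.get_dummies(
    raw.drop(columns="stage2_brand"),
    columns=["previous_brand", "enrollment_type"],
    dtype=int,
)

# Rescale every column to the 0-to-1 range.
X_scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(X), columns=X.columns, index=X.index
)

# Transposed for easier reading, as with the final clean data.
print(X_scaled.T)
```

In a modelling workflow, the scaler would usually be fitted on the training split only and then applied to the test split, so that no information from the test data leaks into training.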

Final clean data (the dataframe is transposed in order to be better displayed):


7. EDA

7.1 Output Variable (MJN = 1, others = 0):

[Figure: Stage 2 brand distribution]

  • The classes are imbalanced, so oversampling is applied in the modelling section.

7.2 Input Variables:

The visualizations below follow a segment analysis (our Stage 2 product user group vs. the non-user group).

Previous Brand (behavioral):

[Figure: previous brand, non-user group (left) vs MJN Stage 2 user group (right)]

  • On the left-hand side, for the non-MJN Stage 2 consumers, 34.6% of them used Nestle Stage 1 as their previous brand.
  • On the right-hand side, within the MJN Stage 2 user group, 68.8% selected MJN Stage 1 as their previous brand.
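
The percentages in this kind of segment analysis are within-group shares. A minimal sketch of how they could be computed with pandas is shown here; the data and the column names (`is_mjn_stage2`, `previous_brand`) are hypothetical and do not reproduce the figures quoted above.

```python
import pandas as pd

# Toy records; 1 = MJN Stage 2 user, 0 = non-user (illustrative values only).
df = pd.DataFrame({
    "is_mjn_stage2": [1, 1, 1, 0, 0, 0, 0],
    "previous_brand": ["MJN", "MJN", "Nestle", "Nestle", "Nestle", "Abbott", "MJN"],
})

# normalize="index" makes each row (user group) sum to 1;
# multiplying by 100 turns the shares into percentages.
share = pd.crosstab(
    df["is_mjn_stage2"], df["previous_brand"], normalize="index"
) * 100

print(share.round(1))
```

Each row of `share` describes one group, so the values are comparable across groups even when the groups differ in size.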

Hospital Zone & Province (demographic):


  • The left graph shows a notable lift in the share of MJN hospital zones within the MJN Stage 2 user group compared to the non-user group (72% vs 58%).
  • The province distribution does not differ notably between the MJN Stage 2 user group and the non-user group, which suggests that this feature may be less important in the predictive model.

Education & Number of Children (demographic):


  • According to the left plot, the overall educational status is higher within the MJN user group.
  • The distribution of number of children between the user and the non-user group is similar.

Enrollment Type & Enrollment Time by Stage (behavioral):


  • The proportion of "self-enrolled" consumers is greater in the user group compared to the non-user group (90% vs 79%).
  • Consumers who enrolled in "Stage 0" (prenatally) make up a larger share of the non-user group, while those who enrolled in "Stage 1" (0-6 months) account for a larger share of the user group.

Email OR, CTR & Coupon Redemption Rate (behavioral):


  • In terms of engagement for the "Prenatal" and "Newborn" emails, the OR and CTR in the user group are drastically higher.
  • Similarly, regarding the "Stage 1" coupon performance, the redemption rate is substantially greater in the user group.

8. Modelling

8.1 Procedure

  • Train-test-split
  • Cross validation
  • Oversampling
  • Machine learning algorithms (Logistic Regression and Random Forest)
  • GridSearchCV
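
The procedure above can be sketched end to end as follows. This is an illustration under stated assumptions rather than the project's actual code: the data is synthetic (generated with `make_classification`), the hyperparameter grids are arbitrary examples, and oversampling is done with `sklearn.utils.resample` because the package list does not include a dedicated resampling library.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.utils import resample

# Synthetic, imbalanced stand-in for the ~2k-record dataset (an assumption).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=42
)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
y = pd.Series(y, name="target")

# 1. Hold out a stratified test set before any resampling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2. Oversample the minority class in the training set only.
train = pd.concat([X_train, y_train], axis=1)
minority_label = train["target"].value_counts().idxmin()
minority = train[train["target"] == minority_label]
majority = train[train["target"] != minority_label]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
balanced = pd.concat([majority, minority_up])
X_bal, y_bal = balanced.drop(columns="target"), balanced["target"]

# 3. Tune each model with 5-fold cross-validated grid search.
candidates = {
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.1, 1.0, 10.0]},
    ),
    "random_forest": (
        RandomForestClassifier(n_estimators=50, random_state=42),
        {"max_depth": [5, None]},
    ),
}
results = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5, scoring="accuracy")
    search.fit(X_bal, y_bal)
    results[name] = {
        "best_params": search.best_params_,
        "test_accuracy": search.score(X_test, y_test),
    }
    print(name, results[name])
```

One caveat of this simple ordering: because duplicated minority rows are created before cross-validation, copies of the same record can land in both the training and validation folds, which tends to make the cross-validated scores optimistic. The untouched test set remains the more reliable estimate.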

8.2 Model Performance:


  • Considering both prediction accuracy and model simplicity, Logistic Regression in Case 2 is the preferred model: it is more straightforward to interpret and still achieves relatively high prediction accuracy.

8.3 Confusion Matrix (Logistic Regression):
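
As a reminder of how a confusion matrix is read, here is a small sketch using scikit-learn. The labels and predictions are made-up values for illustration, not the model's actual output.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative labels: 1 = MJN Stage 2 user, 0 = other brands.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print("accuracy:", accuracy_score(y_true, y_pred))
print("TN, FP, FN, TP:", tn, fp, fn, tp)
```

On an imbalanced dataset, the off-diagonal cells (false positives and false negatives) are worth inspecting alongside accuracy, since accuracy alone can hide weak performance on the minority class.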


8.4 Feature Importance (Logistic Regression):

[Figure: Logistic Regression feature importance, Case 2]

  • Positive features: previous brand is MJN. Negative features: previous brand is another brand.
  • One exception is "Abbott Specialty": when it is the previous brand, it has a positive impact on current MJN Stage 2 brand choice.
  • Number of children negatively influences MJN Stage 2 brand selection.
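
For Logistic Regression, one common way to read feature importance is through the sign and size of the fitted coefficients, which is meaningful here because the inputs were scaled to the same 0-to-1 range. The sketch below uses toy data and hypothetical column names; it illustrates the idea and is not the project's fitted model.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy features already scaled to [0, 1]; names and values are assumptions.
X = pd.DataFrame({
    "previous_brand_MJN": [1, 1, 1, 0, 0, 0, 1, 0],
    "num_children": [0.0, 0.0, 0.5, 1.0, 0.5, 1.0, 0.0, 0.5],
})
y = [1, 1, 1, 0, 0, 0, 1, 0]  # 1 = MJN Stage 2 user

model = LogisticRegression().fit(X, y)

# A positive coefficient pushes the prediction towards class 1 (MJN),
# a negative coefficient pushes it towards class 0.
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values()
print(coefs)
```

Sorting the coefficients and plotting them as a horizontal bar chart (for example with `coefs.plot.barh()`) gives a feature importance view similar in spirit to the figure in this section.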

9. Conclusion

  • Drive MFB members to become Stage 1 users, since the previous brand is a primary indicator.
  • Maintain or boost the quantity and quality of email campaigns, since email performance is an essential driver.
  • Run marketing experiments targeting users with only 1 child, because the number of children negatively impacts the outcome.

10. Next Steps

  • Modify the output and input variables to fit other business use cases.
  • Experiment further with additional variables and models.