- Forecasted the probability of a consumer choosing our Stage 2 (infant) product with 87.7% accuracy.
- Collected approximately 2k records from the customer data platform and a series of consumer surveys.
- Conducted data cleansing, EDA (exploratory data analysis), and feature engineering.
- Implemented ML classification models (Logistic Regression and Random Forest), showcasing the most crucial features.
Due to NDA (Non-Disclosure Agreement), the dataset has been modified.
- Python Version: 3.9
- Packages: pandas, numpy, sklearn, matplotlib, seaborn
- Consumer insight: Stage 2 users are 12% more likely to become Stage 3 users.
- Business goal: Understand the variables that trigger our enrolled members to be Stage 2 formula users.
The samples were collected from the customer data platform and a set of surveys, covering Q1 2020 to Q1 2021.
- Dependent variable: Current Stage 2 formula brand.
- Independent variables:
  - Behavioral variables: previous product brand, enrollment type, enrollment age, email open rate, click through rate, coupon redemption rate.
  - Demographic variables: hospital zone, province, educational status, number of children.
- Machine Learning Models: Logistic Regression and Random Forest.
- Oversampling to address the imbalanced dataset.
- GridSearchCV to tune the hyperparameters.
- Recoded the output variable (set "MJN" as 1, other brands as 0).
- Created dummy variables to transform categorical variables into numeric ones.
- Scaled the columns to ensure they are within the same range (0 to 1).
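The three preprocessing steps above can be sketched roughly as follows. This is a minimal illustration on made-up data, not the project's actual code: the column names (`current_brand`, `previous_brand`, `num_children`, `email_open_rate`) and the values are hypothetical, and `MinMaxScaler` is assumed as the scaling method because the stated target range is 0 to 1.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data; the real (NDA-protected) schema is not shown here.
df = pd.DataFrame({
    "current_brand": ["MJN", "Nestle", "MJN", "Abbott"],
    "previous_brand": ["MJN", "Nestle", "Abbott", "Abbott"],
    "num_children": [1, 2, 1, 3],
    "email_open_rate": [0.75, 0.25, 0.5, 0.0],
})

# Binary target: 1 for MJN, 0 for any other brand.
y = (df["current_brand"] == "MJN").astype(int)

# One-hot encode the categorical predictor into numeric dummy columns.
X = pd.get_dummies(
    df.drop(columns="current_brand"), columns=["previous_brand"], dtype=int
)

# Rescale every column to the 0-1 range.
X_scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(X), columns=X.columns, index=X.index
)
print(X_scaled.T)  # transposed for display, as in the table below
```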
Final clean data (the dataframe is transposed in order to be better displayed):
- Will oversample in the next section because the dataset is imbalanced.
The visualizations below follow a segment analysis (our Stage 2 product user group vs. the non-user group).
Previous Brand (behavioral):
- On the left-hand side, for the non-MJN Stage 2 consumers, 34.6% of them used Nestle Stage 1 as their previous brand.
- On the right-hand side, within the MJN Stage 2 user group, 68.8% selected MJN Stage 1 as their previous brand.
Hospital Zone & Province (demographic):
- The left graph shows a notable lift in the share of MJN hospital zones within the MJN Stage 2 user group compared to the non-user group (72% vs 58%).
- The province distribution does not differ noticeably between the MJN Stage 2 user group and the non-user group, which could mean that this feature is less critical in the predictive model.
Education & Number of Children (demographic):
- According to the left plot, the overall educational status is higher within the MJN user group.
- The distribution of number of children between the user and the non-user group is similar.
Enrollment Type & Enrollment Time by Stage (behavioral):
- The proportion of "self-enrolled" consumers is greater in the user group compared to the non-user group (90% vs 79%).
- Consumers who enrolled in "Stage 0" (prenatally) make up a larger share of the non-user group, while those who enrolled in "Stage 1" (0-6 months) account for a larger share of the user group.
Email OR, CTR & Coupon Redemption Rate (behavioral):
- In terms of engagement for the "Prenatal" and "Newborn" emails, the OR and CTR in the user group are drastically higher.
- Similarly, regarding the "Stage 1" coupon performance, the redemption rate is substantially greater in the user group.
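The within-segment percentages quoted in this segment analysis are the kind of figure a row-normalized cross-tabulation produces. A minimal sketch on made-up records follows; the column names `is_stage2_user` and `previous_brand` are hypothetical and the numbers are not the project's results.

```python
import pandas as pd

# Hypothetical toy records; values are illustrative only.
df = pd.DataFrame({
    "is_stage2_user": [1, 1, 1, 1, 0, 0, 0, 0],
    "previous_brand": ["MJN", "MJN", "MJN", "Nestle",
                       "Nestle", "Nestle", "Abbott", "MJN"],
})

# normalize="index" converts counts into shares within each segment,
# so every row (user group / non-user group) sums to 1.
share = pd.crosstab(
    df["is_stage2_user"], df["previous_brand"], normalize="index"
)
print(share.round(3))
```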
- Train-test-split
- Cross validation
- Oversampling
- Machine learning algorithms (Logistic Regression and Random Forest)
- GridSearchCV
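The modelling steps listed above could be wired together along these lines. This is a sketch under stated assumptions rather than the project's code: the data is synthetic (`make_classification`), the hyperparameter grids are arbitrary, and oversampling is done with `sklearn.utils.resample` because no dedicated resampling package appears in the package list.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data standing in for the real dataset (assumption).
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Hold out a stratified test set before any resampling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Oversample the minority class (label 1) in the training set only.
minority = y_train == 1
n_majority = int((~minority).sum())
X_up, y_up = resample(
    X_train[minority], y_train[minority],
    replace=True, n_samples=n_majority, random_state=0,
)
X_bal = np.vstack([X_train[~minority], X_up])
y_bal = np.concatenate([y_train[~minority], y_up])

# Illustrative grids; the grids actually used are not given in this README.
candidates = {
    "logistic_regression": (
        LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}
    ),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [3, None]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # 5-fold cross-validated grid search on the balanced training data.
    search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
    search.fit(X_bal, y_bal)
    results[name] = {
        "best_params": search.best_params_,
        "cv_accuracy": search.best_score_,
        "test_accuracy": search.score(X_test, y_test),
    }

for name, res in results.items():
    print(name, res)
```

One caveat with this simple ordering: because duplicated minority rows can land in both the training and validation folds, the cross-validated score tends to be optimistic, so the untouched test-set accuracy is the safer number to compare.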
- Considering both prediction accuracy and model simplicity, Logistic Regression in Case 2 is the optimal model: it is more straightforward to interpret and has relatively high prediction accuracy.
- Positive features: previous brand is MJN. Negative features: previous brand is another brand.
- A previous brand of "Abbott Specialty" has a positive impact on the current MJN Stage 2 brand choice.
- Number of children negatively influences MJN Stage 2 brand selection.
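Positive and negative features like those above are typically read off the sign of the fitted Logistic Regression coefficients. The sketch below shows that reading on simulated data; the feature names and effect sizes are invented for illustration and are not the project's estimates.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Simulated, already 0-1 scaled features (hypothetical names).
X = pd.DataFrame({
    "prev_brand_MJN": rng.integers(0, 2, n),
    "num_children_scaled": (rng.integers(1, 4, n) - 1) / 2,
    "province_A": rng.integers(0, 2, n),
})

# Simulate an outcome where a previous MJN brand helps and
# more children hurts; province has no built-in effect.
logit = 2.0 * X["prev_brand_MJN"] - 2.0 * X["num_children_scaled"]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Positive coefficients push toward class 1, negative ones away from it.
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values()
print(coefs)
```

Coefficient magnitudes are only comparable across features when the inputs share a scale, which is one reason to scale the columns to the 0 to 1 range first.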
- Drive MFB members to become Stage 1 users, since the previous brand is a primary indicator.
- Maintain/boost the quantity/quality of email campaigns, since the email performance is an essential driver.
- Run marketing experiments targeting users with only 1 child, because the number of children negatively impacts the outcome.
- Modify the output and input variables to tailor the model to other business use cases.
- Experiment further with variables and models.