st_powerUser

About the Data Source

The [Google Merchandise Store](https://shop.merch.google/) is an online store that sells Google-branded merchandise. The site uses Google Analytics 4's standard web ecommerce implementation along with enhanced measurement. The ga4_obfuscated_sample_ecommerce dataset, available through the BigQuery Public Datasets program, contains a sample of obfuscated BigQuery event export data covering three months, from 2020-11-01 to 2021-01-31.
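
For reference, the sketch below pulls a small daily purchase summary from this public dataset with the google-cloud-bigquery Python client. It is a minimal sketch, assuming you have a Google Cloud project with BigQuery access configured; the query itself is illustrative and not part of this project's pipeline.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses your default GCP credentials and project

# Daily purchasers and purchase revenue over the three-month sample window.
query = """
SELECT
  event_date,
  COUNT(DISTINCT user_pseudo_id) AS purchasers,
  SUM(ecommerce.purchase_revenue) AS purchase_revenue
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20201101' AND '20210131'
  AND event_name = 'purchase'
GROUP BY event_date
ORDER BY event_date
"""

df = client.query(query).to_dataframe()
print(df.head())
```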

Power User Modeling for Mailing Campaign

Welcome to the Power User Modeling for Mailing Campaign repository! This project features a Streamlit interface to analyze user behavior and predict user spending. Below you will find information about the project, data, models, and how to run the Streamlit application.

Streamlit Interface

You can access the Streamlit interface for this project here.

Through this interface, you can analyze a user's 90-day revenue based on their behavior in the first 15 days and predict whether the user is likely to spend more.

About the Data

The dataset used in this project includes user transaction data and various features related to user behavior.

  • LTV Prediction: Predicting the Lifetime Value (LTV) of users over 90 days.
  • Power User Prediction: Initially, power users were identified with a z-score method, but this flagged too few users to be useful. Based on the spend distribution, users who spent more than $110 were therefore labeled as power users, and this label serves as the classification target (see the labeling sketch below).
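
For illustration, the sketch below applies the $110 spend threshold and contrasts it with a z-score cut on a toy table. The column names user_id and revenue_90d are hypothetical stand-ins for the project's actual feature table.

```python
import pandas as pd

# Hypothetical aggregated table: one row per user with their 90-day revenue.
users = pd.DataFrame({
    "user_id": range(6),
    "revenue_90d": [0.0, 12.5, 48.0, 75.0, 130.0, 260.0],
})

# z-score approach: only extreme outliers are flagged, which yields very few power users.
z = (users["revenue_90d"] - users["revenue_90d"].mean()) / users["revenue_90d"].std()
users["power_user_zscore"] = (z > 2).astype(int)

# Threshold approach used as the classification target: spend above $110.
users["power_user"] = (users["revenue_90d"] > 110).astype(int)

print(users)
```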

LTV Prediction

We used a Random Forest Regressor, tuned via hyperparameter optimization, to predict LTV.
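
A minimal sketch of this kind of setup with scikit-learn is shown below; the synthetic data and the parameter grid are illustrative placeholders, not the project's actual features or search space.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in for the real table of first-15-day features (X) and 90-day revenue (y).
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Illustrative grid; the project's actual search space may differ.
param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```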

Power User Prediction

1. Model Training and Oversampling

  • Training Dataset Preparation: The training dataset was adjusted with several oversampling methods to handle the class imbalance.
  • Oversampling Techniques: RandomOverSampler, SMOTE, and ADASYN were employed to balance the dataset, ensuring that the models could learn effectively from both the majority and minority classes (see the sketch below).
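
For illustration, here is a minimal sketch of these three oversamplers from the imbalanced-learn package, applied to a synthetic imbalanced training set that stands in for the real features:

```python
from collections import Counter

from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler
from sklearn.datasets import make_classification

# Synthetic stand-in for the real training data: roughly 5% positive (power-user) class.
X_train, y_train = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)

samplers = {
    "RandomOverSampler": RandomOverSampler(random_state=42),
    "SMOTE": SMOTE(random_state=42),
    "ADASYN": ADASYN(random_state=42),
}

for name, sampler in samplers.items():
    # Only the training data is resampled; evaluation data is never oversampled.
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    print(name, Counter(y_res))
```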

2. Model Selection

  • Hyperparameter Optimization: GridSearchCV was used for hyperparameter tuning to find the best model configurations.
  • Comparison of Models: Several models were compared on their performance metrics, including KNN, XGBoost, Logistic Regression, and Random Forest. The XGBoost model with ADASYN oversampling showed the best performance (full results in the table below; a sketch of the comparison setup follows the table).
| Model | Recall | Precision | Log Loss |
| --- | --- | --- | --- |
| KNN | 51.65% | 88.70% | 6.88% |
| KNN_RandomOverSample | 47.25% | 23.50% | 15.45% |
| KNN_SMOTE | 62.64% | 19.40% | 19.13% |
| KNN_ADASYN | 28.57% | 35.60% | 24.31% |
| XGBoost | 71.43% | 100.00% | 5.23% |
| XGBoost_RandomOverSample | 71.43% | 55.60% | 2.70% |
| XGBoost_SMOTE | 71.43% | 97.00% | 1.41% |
| XGBoost_ADASYN | 71.43% | 100.00% | 1.33% |
| Logistic Regression | 67.03% | 89.70% | 1.21% |
| Logistic Regression_RandomOverSample | 78.02% | 10.50% | 14.46% |
| Logistic Regression_SMOTE | 76.92% | 11.60% | 12.64% |
| Logistic Regression_ADASYN | 84.62% | 6.40% | 24.08% |
| RandomForest | 70.33% | 100.00% | 4.49% |
| RandomForest_RandomOverSample | 63.74% | 87.90% | 5.12% |
| RandomForest_SMOTE | 72.53% | 42.00% | 3.16% |
| RandomForest_ADASYN | 74.73% | 32.70% | 4.43% |
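
The sketch below shows one way such a comparison could be wired up for a single candidate (XGBoost with ADASYN) using an imbalanced-learn pipeline inside GridSearchCV, scored with recall, precision, and log loss on a held-out set. The data and grid are illustrative placeholders, not the project's exact configuration.

```python
from imblearn.over_sampling import ADASYN
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for the real power-user training data.
X, y = make_classification(n_samples=3000, n_features=12, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# One candidate combination: oversampling is applied only inside the training folds.
pipe = Pipeline([
    ("oversample", ADASYN(random_state=42)),
    ("clf", XGBClassifier(eval_metric="logloss", random_state=42)),
])

# Illustrative grid; the actual search space may have differed.
param_grid = {"clf__n_estimators": [200, 400], "clf__max_depth": [3, 5]}

search = GridSearchCV(pipe, param_grid, scoring="recall", cv=5, n_jobs=-1)
search.fit(X_train, y_train)

proba = search.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)
print("recall   :", recall_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("log loss :", log_loss(y_test, proba))
```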

3. Threshold Adjustment and Model Evaluation

  • Threshold Tuning: The threshold value of the model was adjusted to optimize performance. Despite testing various thresholds, the default value of 0.5 was retained as it provided the best balance between recall and precision.
  • ROC AUC Curve Analysis: The performance of the XGBoost ADASYN model was further validated using the ROC AUC curve, demonstrating strong predictive capabilities.
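
As an illustration of this step, the sketch below sweeps candidate thresholds over the predicted probabilities and reports the ROC AUC. It assumes a fitted model and held-out data such as the search, X_test, and y_test objects from the comparison sketch above (placeholder names).

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# `search`, `X_test`, `y_test` are assumed to come from the comparison sketch above.
proba = search.predict_proba(X_test)[:, 1]

# Sweep candidate thresholds and inspect the recall/precision trade-off.
for threshold in np.arange(0.3, 0.71, 0.1):
    pred = (proba >= threshold).astype(int)
    print(
        f"threshold={threshold:.1f} "
        f"recall={recall_score(y_test, pred):.3f} "
        f"precision={precision_score(y_test, pred, zero_division=0):.3f}"
    )

# ROC AUC is threshold-independent and summarizes how well the model ranks users.
print("ROC AUC:", roc_auc_score(y_test, proba))
```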

Running the Streamlit Application

Run the Streamlit app by uploading the powerUser_streamlit_v002_df_pickle.py file to Streamlit.io. The requirements.txt file in the repository contains all the necessary dependencies.