Contributors: Greer Inns, Sarah Kim, Nolan Tonthat, Sam Spillane, Nazeli Mnatsakanyan, Jing Sy
Setting the right price for products is essential to maximize profits and meet business goals. However, determining the optimal price for each product can be a complex and time-consuming task. The challenge many businesses face is efficiently leveraging their data to make informed pricing decisions that balance profitability with market demand.
We plan to address this problem by utilizing a linear regression model to analyze products and categories, competitive pricing, and customer ratings from the Amazon Product Sales Dataset to accurately recommend optimal prices for various products. This will enable retailers to make strategic business decisions.
- Utilize `Pandas`:
  - Load data, basic exploration (data structure), data cleaning (e.g., `df.isnull()`), data transformation (e.g., `df.drop`, `pd.to_datetime`, `df.sort_values`)
- Utilize `Matplotlib`, `Seaborn`:
  - Visualize and identify overall trends and patterns before building models:
    - Correlation Heatmap
    - Total Price by Quantity Sold
    - Total Price by Quantity Sold in Garden Tools
    - Sales Volume, Revenue, and Customers Across the Product Categories
- Utilize `sqlite3`:
  - Load cleaned data into a SQL database titled 'pricing_opt.db'
  - Export the pandas DataFrame as the table 'retail'
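The SQLite step above can be sketched as follows. This is a minimal example, not the project's actual script: the column names and rows here are placeholders standing in for the cleaned retail data.

```python
import sqlite3
import pandas as pd

# Hypothetical cleaned DataFrame standing in for the project's retail data
df = pd.DataFrame({
    "product_id": ["bed1", "bed2", "garden1"],
    "unit_price": [45.9, 89.9, 120.0],
    "qty_sold": [10, 4, 7],
})

# Write the DataFrame to the 'retail' table of pricing_opt.db
conn = sqlite3.connect("pricing_opt.db")
df.to_sql("retail", conn, if_exists="replace", index=False)

# Read it back to confirm the round trip before modeling
retail = pd.read_sql("SELECT * FROM retail", conn)
conn.close()
```

`if_exists="replace"` drops and recreates the table on each run, which keeps the load step idempotent during development.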
After reading in the cleaned data from our SQLite database, we tested both cumulative and sliding window train/test selection methods to optimize our initial regression model:
- For cumulative training, we started with an initial training dataset covering the beginning of our time series up to a defined point:
  - We trained our linear regression model on this dataset
  - After training, we made predictions for a future time period
  - We then appended the actual observed data for that period to the training dataset
  - We made predictions for the next time step and repeated the process
- For sliding window, we started with a fixed-size training window that slides forward through the time series:
  - We trained our linear regression model on this window
  - After training, we made predictions for a future time period
  - We then advanced the window, dropping the oldest observations and adding the actual observed data for that period, so the training set stayed a fixed size
  - We made predictions for the next time step and repeated the process
- We found better optimization performance with the sliding window; we hypothesize this is because, for price optimization, historical data eventually loses some relevance.
- After selecting this method, we created a window-size optimization visual and chose a window size of 5.
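The difference between the two split strategies can be sketched as index generators. This is an illustrative sketch, not the project's code; the function names are ours.

```python
import numpy as np

def cumulative_splits(n_samples, initial_size):
    """Training set grows: always starts at 0, test is the next point."""
    for end in range(initial_size, n_samples):
        yield np.arange(0, end), np.array([end])

def sliding_splits(n_samples, window_size):
    """Training set is a fixed-size window sliding one step at a time."""
    for start in range(n_samples - window_size):
        yield np.arange(start, start + window_size), np.array([start + window_size])

# With a window size of 5 on an 8-sample series, the sliding method
# yields 3 splits; the first trains on samples 0-4 and tests on sample 5.
sliding = list(sliding_splits(8, 5))
cumulative = list(cumulative_splits(8, 5))
```

Note that the last cumulative split trains on 7 samples while every sliding split trains on exactly 5, which is what lets the sliding method "forget" stale pricing history.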
We then utilized three supervised machine learning models to identify ideal price points, all using the sliding window method for data selection:
- Step 1: Define Window Size (5)
- Step 2: Group by product and set features/target variable
- Step 3: Sliding Window training and testing to predict data points
- Step 4: Linear Regression Model Training using training data (features_train, target_train)
- Step 5: Unit price predictions (features_test)
- Step 6: Evaluate using Mean Squared Error (MSE) - actual vs. predicted
- Step 7: Calculate avg predicted price and MSE and loop to find price with min MSE (Optimal Price)
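Steps 1-6 of the linear regression pipeline above can be sketched like this. The feature matrix is synthetic toy data standing in for a product's history (the real features, such as freight price, come from the retail table); the step comments map back to the list above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
# Toy history for one product: two numeric features driving unit price
features = rng.uniform(10, 20, size=(30, 2))
target = features @ np.array([0.7, 0.3]) + rng.normal(0, 0.5, 30)

window = 5                                  # Step 1: define window size
preds, actuals = [], []
for start in range(len(target) - window):   # Step 3: slide the window
    train = slice(start, start + window)
    test = start + window
    model = LinearRegression().fit(features[train], target[train])  # Step 4
    preds.append(model.predict(features[[test]])[0])                # Step 5
    actuals.append(target[test])

mse = mean_squared_error(actuals, preds)    # Step 6: actual vs. predicted
avg_pred_price = float(np.mean(preds))      # Step 7 input: avg predicted price
```

Step 7 then repeats this evaluation across candidate prices and keeps the one with the minimum MSE.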
- Step 1: Define Window Size (5)
- Step 2: Group by product and set features/target variable
- Step 3: Sliding Window training and testing to predict data points
- Step 4: Random Forest Regressor Model Training using training data (features_train, target_train)
- Step 5: Unit price predictions (features_test)
- Step 6: Evaluate using Mean Squared Error (MSE) - actual vs. predicted
- Step 7: Calculate avg predicted price and MSE and loop to find price with min MSE (Optimal Model for Predicting Price)
- In our random forest model, we also needed to determine the optimal number of decision trees, which we explored through the visualization below.
- As the visualization did not show a significant effect of the number of trees on the results, we set the number of trees to 50.
- Additionally, we examined feature importance across the 415 iterations of the model over the given window and time series; the iterations are shown by the colors in the stacked bar graph below.
- Here we see features such as volume and product rating have little to no impact on our model.
- Freight price and competitor freight price have the highest impact.
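The feature-importance check can be sketched on toy data. This is an illustrative example, not the project's 415-iteration run: the synthetic target is deliberately driven by freight price so the importances separate, and the feature names are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic data: price is driven by freight price; rating is pure noise
freight = rng.uniform(5, 25, 200)
rating = rng.uniform(1, 5, 200)
features = np.column_stack([freight, rating])
target = 2.0 * freight + rng.normal(0, 1, 200)

# 50 trees, matching the project's choice; fixed seed for reproducibility
rf = RandomForestRegressor(n_estimators=50, random_state=1).fit(features, target)

importances = dict(zip(["freight_price", "product_rating"],
                       rf.feature_importances_))
# freight_price should carry nearly all of the importance here
```

In the project, collecting `rf.feature_importances_` from each sliding-window iteration is what produces the stacked bar graph described above.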
- Step 1: Define Window Size (5)
- Step 2: Group by product and set features/target variable
- Step 3: Sliding Window training and testing to predict data points
- Step 4: Create an XGBoost regression model with 50 estimators and a fixed random state
- Step 5: Unit price predictions (features_test)
- Step 6: Evaluate using Mean Squared Error (MSE) - actual vs. predicted
- Step 7: Calculate avg predicted price and MSE and loop to find price with min MSE (Optimal Model for Predicting Price)
- Visualize overall EDA process
- Visualize and compare Model accuracy across all 3 methods
- Visualize Model performance across iterations/samples
See Public Dashboards 1 and 2: https://public.tableau.com/app/profile/jing.sy/viz/Price_Optimization_Final/Dashboard1?publish=yes
Average accuracies for the prediction phase of each model are outlined below. However, since price optimization was the goal of the project, our visualizations focus on MSE and on selecting the price with the minimum MSE; accuracy measurements carry less weight there, as there is no known optimal price to compare against:
- Linear Regression Window: 56.6566%
- RF: 56.6566%
- XGBoost: 60.0414%
- Linear Regression Cumulative: 56.3470%