Contributors: Greer Inns, Sarah Kim, Nolan Tonthat, Sam Spillane, Nazeli Mnatsakanyan, Jing Sy
Setting the right price for products is essential to maximize profits and meet business goals. However, determining the optimal price for each product can be a complex and time-consuming task. The challenge many businesses face is efficiently leveraging their data to make informed pricing decisions that balance profitability with market demand.
We plan to address this problem by utilizing a linear regression model to analyze products and categories, competitive pricing, and customer ratings from the Amazon Product Sales Dataset to accurately recommend optimal prices for various products. This will enable retailers to make strategic business decisions.
- Utilize `Pandas`:
  - Load data, basic exploration (data structure), data cleaning (e.g., `df.isnull()`), data transformation (e.g., `df.drop`, `pd.to_datetime`, `df.sort_values`)
- Utilize `Matplotlib`, `Seaborn`:
  - Visualize and identify overall trends and patterns before building models:
    - Correlation Heatmap
    - Total Price by Quantity Sold
    - Total Price by Quantity Sold in Garden Tools
    - Sales Volume, Revenue, and Customers Across the Product Categories
- Utilize `sqlite3`:
  - Load cleaned data into a SQL database titled 'pricing_opt.db'
  - Export the pandas DataFrame as the table 'retail'
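The SQLite step above can be sketched as follows. This is a minimal example, not the project's actual script: the column names and rows here are placeholders standing in for the cleaned retail data.

```python
import sqlite3
import pandas as pd

# Hypothetical cleaned DataFrame standing in for the project's retail data
df = pd.DataFrame({
    "product_id": ["bed1", "bed2", "garden1"],
    "unit_price": [45.9, 89.9, 120.0],
    "qty_sold": [10, 4, 7],
})

# Write the DataFrame to the 'retail' table of pricing_opt.db
conn = sqlite3.connect("pricing_opt.db")
df.to_sql("retail", conn, if_exists="replace", index=False)

# Read it back to confirm the round trip before modeling
retail = pd.read_sql("SELECT * FROM retail", conn)
conn.close()
```

`if_exists="replace"` drops and recreates the table on each run, which keeps the load step idempotent during development.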
After reading in the cleaned data from our SQLite database, we tested both cumulative and sliding window train/test selection methods to optimize our initial regression model:
- For cumulative training, we started with an initial training dataset covering the beginning of our time series up to a defined point:
  - We trained our linear regression model on this dataset
  - After training, we made predictions for a future time period
  - We then appended the actual observed data for that period to the training dataset
  - We made predictions for the next time step and repeated the process
- For sliding window, we started with a fixed-size training window that slides forward through the time series:
  - We trained our linear regression model on this window
  - After training, we made predictions for a future time period
  - We then advanced the window, dropping the oldest observations and adding the actual observed data for that period, so the training set stayed a fixed size
  - We made predictions for the next time step and repeated the process
- We found better optimization performance with the sliding window; we hypothesize this is because, for price optimization, historical data eventually loses some relevance.
- After selecting this method, we created a window-size optimization visual and chose a window size of 5.
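The difference between the two split strategies can be sketched as index generators. This is an illustrative sketch, not the project's code; the function names are ours.

```python
import numpy as np

def cumulative_splits(n_samples, initial_size):
    """Training set grows: always starts at 0, test is the next point."""
    for end in range(initial_size, n_samples):
        yield np.arange(0, end), np.array([end])

def sliding_splits(n_samples, window_size):
    """Training set is a fixed-size window sliding one step at a time."""
    for start in range(n_samples - window_size):
        yield np.arange(start, start + window_size), np.array([start + window_size])

# With a window size of 5 on an 8-sample series, the sliding method
# yields 3 splits; the first trains on samples 0-4 and tests on sample 5.
sliding = list(sliding_splits(8, 5))
cumulative = list(cumulative_splits(8, 5))
```

Note that the last cumulative split trains on 7 samples while every sliding split trains on exactly 5, which is what lets the sliding method "forget" stale pricing history.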
We then utilized three supervised machine learning models to identify ideal price points, all using the sliding window method for data selection:
- Step 1: Define Window Size (5)
- Step 2: Group by product and set features/target variable
- Step 3: Sliding Window training and testing to predict data points
- Step 4: Linear Regression Model Training using training data (features_train, target_train)
- Step 5: Unit price predictions (features_test)
- Step 6: Evaluate using Mean Squared Error (MSE) - actual vs. predicted
- Step 7: Calculate avg predicted price and MSE and loop to find price with min MSE (Optimal Price)
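Steps 1-6 of the linear regression pipeline above can be sketched like this. The feature matrix is synthetic toy data standing in for a product's history (the real features, such as freight price, come from the retail table); the step comments map back to the list above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
# Toy history for one product: two numeric features driving unit price
features = rng.uniform(10, 20, size=(30, 2))
target = features @ np.array([0.7, 0.3]) + rng.normal(0, 0.5, 30)

window = 5                                  # Step 1: define window size
preds, actuals = [], []
for start in range(len(target) - window):   # Step 3: slide the window
    train = slice(start, start + window)
    test = start + window
    model = LinearRegression().fit(features[train], target[train])  # Step 4
    preds.append(model.predict(features[[test]])[0])                # Step 5
    actuals.append(target[test])

mse = mean_squared_error(actuals, preds)    # Step 6: actual vs. predicted
avg_pred_price = float(np.mean(preds))      # Step 7 input: avg predicted price
```

Step 7 then repeats this evaluation across candidate prices and keeps the one with the minimum MSE.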
- Step 1: Define Window Size (5)
- Step 2: Group by product and set features/target variable
- Step 3: Sliding Window training and testing to predict data points
- Step 4: Random Forest Regressor Model Training using training data (features_train, target_train)
- Step 5: Unit price predictions (features_test)
- Step 6: Evaluate using Mean Squared Error (MSE) - actual vs. predicted
- Step 7: Calculate avg predicted price and MSE and loop to find price with min MSE (Optimal Model for Predicting Price)
- In our random forest model, we also needed to determine the optimal number of decision trees, which we explored through the visualization below.
- As the visualization did not show a significant effect of the number of trees on the results, we set the number of trees to 50.
- Additionally, we examined feature importance across the 415 iterations of the model over the given window and time series; the iterations are shown by the colors in the stacked bar graph below.
- Here we see features such as volume and product rating have little to no impact on our model.
- Freight price and competitor freight price have the highest impact.
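The feature-importance check can be sketched on toy data. This is an illustrative example, not the project's 415-iteration run: the synthetic target is deliberately driven by freight price so the importances separate, and the feature names are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic data: price is driven by freight price; rating is pure noise
freight = rng.uniform(5, 25, 200)
rating = rng.uniform(1, 5, 200)
features = np.column_stack([freight, rating])
target = 2.0 * freight + rng.normal(0, 1, 200)

# 50 trees, matching the project's choice; fixed seed for reproducibility
rf = RandomForestRegressor(n_estimators=50, random_state=1).fit(features, target)

importances = dict(zip(["freight_price", "product_rating"],
                       rf.feature_importances_))
# freight_price should carry nearly all of the importance here
```

In the project, collecting `rf.feature_importances_` from each sliding-window iteration is what produces the stacked bar graph described above.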
- Step 1: Define Window Size (5)
- Step 2: Group by product and set features/target variable
- Step 3: Sliding Window training and testing to predict data points
- Step 4: Create an XGBoost regression model with 50 estimators and a fixed random state
- Step 5: Unit price predictions (features_test)
- Step 6: Evaluate using Mean Squared Error (MSE) - actual vs. predicted
- Step 7: Calculate avg predicted price and MSE and loop to find price with min MSE (Optimal Model for Predicting Price)
- Visualize overall EDA process
- Visualize and compare Model accuracy across all 3 methods
- Visualize Model performance across iterations/samples
See Public Dashboards 1 and 2: https://public.tableau.com/app/profile/jing.sy/viz/Price_Optimization_Final/Dashboard1?publish=yes
Average accuracies for the prediction phase of each model are outlined below. However, since price optimization was the goal of the project, our visualizations focus on MSE and on selecting the price with the minimum MSE; accuracy measurements carry less weight there, as there is no known optimal price to compare against:
- Linear Regression Window: 56.6566%
- RF: 56.6566%
- XGBoost: 60.0414%
- Linear Regression Cumulative: 56.3470%