This project takes my Amazon purchase history for the past 8 years and applies the CRISP-DM methodology for statistical analysis and predictive modeling.
My main motivation for this project was to get hold of data that would be 1) unique and 2) interesting to explore, and then to practice applying the above-mentioned framework as well as making predictions with sklearn and statsmodels.
Below is a set of questions that were answered during the analysis:
- How much did I spend by year? What year did I go on a shopping spree and spend the most? (answered with seaborn catplot & relplot)
- What was purchased during that binge-shopping year? (barplot)
- Which categories do most of my expenses fall into? (countplot by qty and barplot by amount)
- If we pick the top 6 categories, do I tend to spend the same amount every year? (two-level relplot)
- What is the maximum amount I’ve ever spent on the most common expense categories? (boxplot)
- Lastly, what is my predicted purchase total for 2021? (basic sklearn as well as statsmodels)
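The first question in the list (spend by year, and the highest-spend year) comes down to a group-and-sum over order dates. Below is a minimal sketch of that step, not the notebook's actual code: the column names `order_date` and `item_total` are hypothetical, and the sample rows are made up for illustration rather than taken from the real purchase history.

```python
import pandas as pd

# Made-up sample rows; the real export's column names may differ.
orders = pd.DataFrame({
    "order_date": ["2017-03-02", "2018-06-15", "2018-11-23", "2019-01-09"],
    "item_total": [25.00, 120.50, 79.50, 40.00],
})

# Parse the dates, then sum spending per calendar year.
orders["year"] = pd.to_datetime(orders["order_date"]).dt.year
spend_by_year = orders.groupby("year")["item_total"].sum()

# Year with the highest total spend in this sample.
top_year = spend_by_year.idxmax()
print(spend_by_year)
print("Highest-spend year:", top_year)
```

The resulting `spend_by_year` series is the kind of input a seaborn catplot or relplot would then visualize.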
Libraries used:
- pandas for dataframe manipulations
- numpy for analysis
- matplotlib.pyplot for data visualizations
- seaborn for data visualizations
for predictive modeling:
- scikit-learn
- statsmodels.api
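As a rough illustration of how a basic scikit-learn model could produce a 2021 estimate, the sketch below fits a straight line to yearly totals and extrapolates one year ahead. It is an assumption about the approach, not a copy of the notebook's model, and the yearly totals are made up (chosen to lie exactly on a line) rather than being the project's actual figures.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up yearly totals that lie exactly on a line; not the real data.
years = np.array([2017, 2018, 2019, 2020]).reshape(-1, 1)
totals = np.array([100.0, 150.0, 200.0, 250.0])

# Fit total spend as a linear function of the year.
model = LinearRegression().fit(years, totals)

# Extrapolate one year past the observed range.
predicted_2021 = model.predict(np.array([[2021]]))[0]
print(round(predicted_2021, 2))
```

With only a handful of yearly data points, an extrapolation like this is best read as a trend indicator rather than a reliable forecast.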
Files in the repository: ipynb workbook (Amazon_Purchase_History_Analysis_Workbook.ipynb) and its html copy (Amazon_Purchase_History_Analysis_Workbook.html)
Acknowledgments: my personal Amazon purchase history used as the dataset.
Summary of the analysis:
The year in which most purchases were made was 2018. Most of the merchandise acquired that year was pet-related goods and electronic devices. The categories into which most of the purchases fall throughout the analyzed period are Beauty, Pet Products, Shoes & Clothes, Sporting Goods, and Electronic Devices. The purchase frequency varies year over year. The maximum amount spent in those categories does not exceed $450.
Blog post featured in Medium's Towards Data Science publication