Sample project: Time-series analysis

Today I got a note from Tim Seres, the store manager for the downtown branch.


We have daily sales data for the last four years, and we'd love some insight into how to schedule our salespeople. I certainly can't find any pattern in the data, but maybe you can with your 'machine learning' stuff? Let me know if you have any insights.



🎯 Goals

  • Explore the dataset
  • If possible, create a model which can predict store sales to inform staffing decisions
  • Plot the sales predictions against the actual sales in the test set

πŸ— Dependencies

  • Python 3.9.7
  • matplotlib.pyplot
  • matplotlib.ticker
  • numpy
  • pandas
  • seaborn
  • holiday from pandas.tseries
  • RandomForestRegressor from sklearn.ensemble
  • permutation_importance from sklearn.inspection
  • acf, pacf from statsmodels.tsa.stattools
  • plot_acf, plot_pacf from

📂 Data

CSV in repository

(Original source unknown)

💡 Insights and Impact

  • Decreased monthly staffing expenses by more than 36%.
  • Increased salesperson satisfaction and retention by ensuring adequate staffing on busy days.
  • The model can be used for staffing decisions ~6 weeks into the future.
  • When spikes in sales volume are predicted, ~40% should be added to that predicted number when scheduling sales reps.
  • The general contour of the predictions closely matches the actual sales, suggesting that when a spike is predicted, more sales reps should be scheduled, even if the actual size of the spike isn't perfectly accurate.

[Figure: sales predictions vs. actual sales]

  • The model tends to underestimate sales spikes by ~40%. For example, if one rep can handle ~$1000 in daily sales and a spike of $3000 is predicted, the adjusted estimate is about $4200 ($3000 × 1.4), so 4 reps should be scheduled.
  • The hybrid model's predictive accuracy declines significantly beyond 100 days into the future, so long-term hiring decisions are better informed by the simple linear model.
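The staffing rule in the first bullet above can be written as a small helper. This is a minimal sketch, not code from the notebook: the function name `reps_to_schedule` and its parameters are hypothetical, the 40% uplift and the ~$1000-per-rep capacity are the figures quoted above, and rounding to the nearest whole rep is an assumption chosen because it reproduces the 4-rep example.

```python
import math


def reps_to_schedule(predicted_spike, uplift=0.40, sales_per_rep=1000.0):
    """Return how many reps to schedule for a predicted daily sales spike.

    uplift: fraction added to the prediction (the ~40% underestimate noted above).
    sales_per_rep: daily sales one rep can handle (the ~$1000 example above).
    Both defaults are illustrative assumptions, not values read from the model.
    """
    adjusted_sales = predicted_spike * (1.0 + uplift)
    # Round to the nearest whole rep (halves round up); never schedule fewer than one.
    return max(1, math.floor(adjusted_sales / sales_per_rep + 0.5))


print(reps_to_schedule(3000))  # 3000 * 1.4 = 4200, which rounds to 4 reps
```

Rounding to the nearest rep matches the example in the text. A more cautious scheduler could swap in `math.ceil`, which would schedule 5 reps for the same prediction.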

[Figure: linear model shows trend]
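For readers who want a feel for what a simple linear trend model does, here is a minimal sketch. It assumes the trend is an ordinary least-squares line fit to sales against a day index. The notebook's actual fitting method is not described here, and the helper names and the synthetic sales series are made up for illustration.

```python
def fit_linear_trend(values):
    """Least-squares line through the points (0, values[0]), (1, values[1]), ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    var_x = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def extrapolate(slope, intercept, day_index):
    """Value of the fitted trend line at a given day index."""
    return intercept + slope * day_index


# Synthetic daily sales, for illustration only: a steady rise plus a weekly bump.
history = [2000 + 1.5 * day + (100 if day % 7 == 5 else 0) for day in range(365)]
slope, intercept = fit_linear_trend(history)

# Trend level 180 days past the last observed day.
projection = extrapolate(slope, intercept, len(history) - 1 + 180)
print(f"daily growth: {slope:.2f}, projected trend level: {projection:.0f}")
```

The line ignores spikes entirely, which is why it suits slow questions such as headcount planning rather than day-to-day scheduling. With numpy, `numpy.polyfit(days, sales, 1)` returns the same slope and intercept.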

  • For further study:
      • Why is there a 50-day lagging trend in sales?
      • Can I get the same predictive power with only the lagging data, to make sure there's no information leakage in the model?
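One way to start on the second question is to build a feature table from lagged sales only and to split it chronologically, so that no test-period value can reach the training set. The sketch below is an assumption about how that experiment could be set up, not the notebook's code: the helper names are hypothetical, and the lag set (1, 7, 50) is an example that includes the 50-day lag mentioned above.

```python
def make_lag_rows(sales, lags):
    """Build (features, target) pairs where features are past sales only.

    For day t, features are [sales[t - lag] for lag in lags], so nothing
    from day t or later appears among the inputs.
    """
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(sales)):
        features = [sales[t - lag] for lag in lags]
        rows.append((features, sales[t]))
    return rows


def chronological_split(rows, test_fraction=0.2):
    """Hold out the most recent rows as the test set (no shuffling)."""
    n_test = round(len(rows) * test_fraction)
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]


# Toy series for illustration: day index stands in for sales.
sales = list(range(100))
rows = make_lag_rows(sales, lags=[1, 7, 50])
train, test = chronological_split(rows, test_fraction=0.2)

print(rows[0])                 # ([49, 43, 0], 50)
print(len(train), len(test))   # 40 10
```

In pandas the same features come from `Series.shift(lag)`. Note that a forecast made h days ahead can only use lags of at least h days, so a model meant for the ~6-week horizon would need to rely on lags of 42 days or more, such as the 50-day lag.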

🛠 Want to dig into my code?

Here's the notebook for your perusal, fully annotated.