PyData NYC Timeseries Forecasting Tutorial

This repo contains the material for my PyData NYC tutorial on Large Scale Timeseries Forecasting.

In this tutorial, we use the data from the M5 Forecasting Accuracy competition, which contains Walmart sales data in the USA for over 3,000 products. We will use distributed computing to run multiple models on each timeseries and pick the best forecast for each one.
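To make "multiple models for each timeseries" concrete, here is a minimal sketch of per-series model selection. It uses only the Python standard library rather than statsforecast, and the series names, values, and the three toy models are assumptions chosen for illustration. Each series holds out its last `h` points, every model is scored on them with mean absolute error, and the lowest score wins.

```python
from statistics import mean


def naive(history, h):
    """Repeat the last observed value h times."""
    return [history[-1]] * h


def seasonal_naive(history, h, season=7):
    """Repeat the last full season (weekly by default)."""
    last_season = history[-season:]
    return [last_season[i % season] for i in range(h)]


def historic_average(history, h):
    """Forecast the mean of the training history for every step."""
    return [mean(history)] * h


# Toy stand-ins for the statistical models used in the tutorial.
MODELS = {
    "naive": naive,
    "seasonal_naive": seasonal_naive,
    "historic_average": historic_average,
}


def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def select_best_model(series, h):
    """Hold out the last h points and return (best model name, its MAE)."""
    train, valid = series[:-h], series[-h:]
    scores = {name: mae(valid, model(train, h)) for name, model in MODELS.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


# Hypothetical daily sales for two items (not M5 data).
sales = {
    "item_weekly": [10, 12, 15, 11, 9, 20, 25] * 4,
    "item_alternating": [4, 6] * 7,
}

best_per_series = {uid: select_best_model(y, h=7)[0] for uid, y in sales.items()}
print(best_per_series)
```

Because each series is scored independently, the loop over `sales` is the part that can be split across workers when the number of series grows.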

We will use Nixtla's lightning-fast statsforecast library to run statistical and econometric models at scale. To preprocess data, we will use Fugue to define logic in Python or Pandas, and then port it to Spark, Dask, or Ray. The combination of these two tools will allow us to develop models on large datasets. Because Fugue is framework-agnostic, the approach illustrated here will work for Spark, Dask, and Ray with minimal tweaks.

The fourth section of the tutorial focuses on Hierarchical Forecasting, where we want to make sure that the forecasts at different levels (store/region/state) are consistent with each other when we add them up.
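As a small illustration of what "consistent when we add them up" means, the sketch below applies bottom-up aggregation, which is one simple way to get coherent forecasts and not necessarily the method used in the tutorial. The store identifiers and forecast values are hypothetical.

```python
from collections import defaultdict

# Hypothetical base forecasts for a single future day, keyed by (state, store).
store_forecasts = {
    ("CA", "CA_1"): 120.0,
    ("CA", "CA_2"): 80.0,
    ("TX", "TX_1"): 50.0,
}


def bottom_up(store_level):
    """Derive state and total forecasts by summing the store-level forecasts."""
    state_level = defaultdict(float)
    for (state, _store), value in store_level.items():
        state_level[state] += value
    total = sum(state_level.values())
    return dict(state_level), total


state_forecasts, total_forecast = bottom_up(store_forecasts)
print(state_forecasts, total_forecast)
```

Since the upper levels are computed from the bottom level, the hierarchy is coherent by construction. Forecasting each level independently gives no such guarantee, which is why a reconciliation step is needed.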

The last part covers distributing model training on a Dask cluster managed by Coiled, including best practices for passing data to workers.

Contact Us

If you want me to give this tutorial at some Meetup or event, feel free to reach out! It took a lot of work to compile this tutorial so I'm more than happy to speak about it anywhere (even in company knowledge sharing sessions).

Fugue Slack

Nixtla Slack

My email: kdykho@gmail.com
