2023 MinneMUDAC Student Data Science Challenge

Attendance-Prediction-for-MLB

Winner of 2023 MinneMUDAC Student Data Science Challenge

Build predictive models of game-by-game home attendance for every MLB team in the 2023 season, and identify the factors that tend to influence home-game attendance.

Team Members

A big thanks and credit to all the team members who made this project possible:

Deliverables

Project Overview

Attendance is a crucial factor in the success of MLB teams. Accurate attendance predictions can significantly improve both long-term and short-term profitability and operational efficiency. However, attendance depends on many interacting factors, which makes it hard to predict, and inaccurate forecasts can lead to suboptimal business decisions.

To address this issue:

  • First, we use a pre-season attendance prediction model to help MLB teams forecast attendance before the season starts. The predictions can support long-term business planning such as game scheduling, staff hiring and season ticket price adjustment.
  • Second, we provide an in-season attendance model that includes data collected during the new season, such as the latest game performance and player list, to predict attendance dynamically and support short-term decision making.
  • Lastly, to identify the important factors and understand how they affect attendance across teams, we interpret them using the feature importance graph.

image

Feature Engineering

We integrate data from multiple sources and build features that can be grouped into three buckets:

  • Team performance
  • Player
  • Calendar

image
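As a rough illustration of the calendar and team-performance buckets, the sketch below derives a few features using only the Python standard library. The function names, feature names, the 10-game window and the Saturday/Sunday definition of a weekend are all assumptions made for this example, not the project's actual feature set.

```python
from datetime import date

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def calendar_features(game_date: date) -> dict:
    """Calendar bucket: features known as soon as the schedule is published."""
    weekday = game_date.weekday()  # Monday == 0 ... Sunday == 6
    return {
        "day_of_week": DAY_NAMES[weekday],
        "is_weekend": weekday >= 5,  # assumption: weekend means Sat/Sun
        "month": game_date.month,
    }

def rolling_win_pct(results, window=10):
    """Team-performance bucket: win rate over up to `window` PRIOR games.

    `results` is a chronological list of 1 (win) / 0 (loss). The current
    game is excluded so the feature never leaks the outcome being played.
    Returns None for games with no prior history.
    """
    out = []
    for i in range(len(results)):
        prior = results[max(0, i - window):i]
        out.append(sum(prior) / len(prior) if prior else None)
    return out

opening_weekend = calendar_features(date(2023, 4, 1))
recent_form = rolling_win_pct([1, 0, 1, 1], window=2)
```

Excluding the current game from the rolling window matters most for the in-season model, where a feature computed from the game's own result would not be available at prediction time.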

Based on the features, we deliver the following models and results.

Temporal Fusion Transformer (TFT) Model and 2023 Attendance Predictions

TFT Structure image

We use the Temporal Fusion Transformer for long-term attendance prediction because the model has the following advantages:

  • Capable of processing multiple heterogeneous time series data simultaneously.
  • Takes into account the impact of all historical data when forecasting time series, resulting in more accurate predictions.
  • Achieves high performance across multi-horizon forecasting, providing accurate predictions across different time horizons.
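To make "multi-horizon" concrete, the sketch below slices a chronological attendance series into (history, future) pairs, the general shape of training example a multi-horizon forecaster consumes. It is not the TFT itself and not the project's data pipeline; `make_windows`, `history_len` and `horizon` are hypothetical names chosen for illustration.

```python
def make_windows(series, history_len, horizon):
    """Slice a chronological series into (history, future) training pairs.

    Each pair holds `history_len` past observations and the next `horizon`
    observations the model is asked to predict in a single pass.
    """
    windows = []
    for start in range(len(series) - history_len - horizon + 1):
        split = start + history_len
        windows.append((series[start:split], series[split:split + horizon]))
    return windows

# Toy series standing in for per-game attendance (values are placeholders).
pairs = make_windows([1, 2, 3, 4, 5], history_len=2, horizon=2)
```

For a pre-season forecast, the horizon would presumably cover a team's whole home schedule, so every future game is predicted from information available before opening day.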

Model Outcome

image

LightGBM Model for In-Season Attendance Prediction

We use LightGBM for the following characteristics:

  • It processes a large number of features efficiently and generates feature importance automatically, giving businesses insight into the most important factors behind the predictions.
  • Its fast training speed allows quick iteration and retraining when new data arrives, making it a flexible and adaptable tool for short-term time series forecasting.
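The retraining idea can be sketched as a walk-forward loop: before each new game, the model is refit on every game played so far and then asked to predict the next one. To keep the example dependency-free, a mean-attendance baseline stands in for LightGBM here; `fit_mean_baseline`, `walk_forward`, `min_train` and the attendance figures are all invented for illustration.

```python
from statistics import mean

def fit_mean_baseline(rows):
    """Stand-in trainer (NOT LightGBM): predicts the mean training attendance."""
    avg = mean(r["attendance"] for r in rows)
    return lambda row: avg

def walk_forward(rows, fit, min_train=3):
    """Retrain before each game using only earlier games.

    Returns a list of (actual, predicted) pairs, one per game after the
    first `min_train` games.
    """
    pairs = []
    for i in range(min_train, len(rows)):
        model = fit(rows[:i])  # only games before game i are visible
        pairs.append((rows[i]["attendance"], model(rows[i])))
    return pairs

# Invented toy attendance values, in chronological order.
games = [{"attendance": a} for a in (20000, 22000, 24000, 30000, 26000)]
pairs = walk_forward(games, fit_mean_baseline)
```

Because `fit` is just a parameter, any trainer that returns a row-to-prediction callable can be swapped in, and a fast trainer keeps the per-game refit cheap.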

Model Outcome

image

Feature Importance, Partial Dependence Plot and Interpretation

The deliverables are:

  • Feature importance scores from the LightGBM model to select important factors
  • Partial Dependence Plots (PDP) to show the relationship between attendance and each factor
  • Selected teams based on clustering results, with a detailed analysis of the differences across teams
  • Recommendations to optimize game schedule, marketing timing and marketing content
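A partial dependence curve can be computed by hand: for each candidate value of one feature, overwrite that feature in every row, predict, and average the predictions. The sketch below shows this with a made-up linear predictor standing in for the trained model; `toy_predict`, its coefficients, the feature names and the two sample rows are assumptions for illustration only.

```python
from statistics import mean

def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each value in `grid`.

    All other features keep their observed values, so the curve isolates
    the average effect of the chosen feature.
    """
    curve = {}
    for value in grid:
        curve[value] = mean(predict({**row, feature: value}) for row in rows)
    return curve

def toy_predict(row):
    """Made-up predictor (not the trained model): weekend bump plus recent form."""
    weekend_bump = 5000 if row["day_of_week"] in {"Sat", "Sun"} else 0
    return 20000 + weekend_bump + 100 * row["wins_last_10"]

rows = [
    {"day_of_week": "Mon", "wins_last_10": 4},
    {"day_of_week": "Sat", "wins_last_10": 6},
]
curve = partial_dependence(toy_predict, rows, "day_of_week", ["Mon", "Sat"])
```

Plotting `curve` against the grid values gives the PDP for that feature; with the toy predictor, the gap between the two points is exactly the weekend bump built into it.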

Overall Feature Importance

image

Important Factor #1: Day of Week

image