Winner of 2023 MinneMUDAC Student Data Science Challenge
Build predictive models of game-by-game home attendance for all MLB teams in the 2023 season, and identify the factors that tend to influence home-game attendance.
A big thanks and credit to all the team members who made this project possible:
- Congyi Zhang (zhan8373@umn.edu)
- Jichen Liu (liu02354@umn.edu)
- Lan Chen (chen7613@umn.edu)
- Rio Pan (pan00246@umn.edu)
- Simin Liao (liao0150@umn.edu)
Attendance is a crucial factor for the success of MLB teams. Accurately predicting attendance can have significant impacts on both long-term and short-term profitability and operational efficiency. However, the prediction of MLB attendance can be complicated due to multiple factors, leading to inaccurate forecasts and potentially suboptimal business decisions.
To address this issue:
- First, we build a pre-season attendance prediction model that lets MLB teams forecast attendance before the season starts. These forecasts support long-term business planning such as game scheduling, staff hiring, and season-ticket price adjustments.
- Second, we provide an in-season attendance model that incorporates data collected during the new season, such as the latest game performance and player list, to predict attendance dynamically and support short-term decision-making.
- Lastly, to identify the important factors and understand how they affect attendance across teams, we interpret them using a feature importance graph.
We integrate data from multiple sources and build features that fall into three buckets:
- Team performance
- Player
- Calendar
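As an illustration of two of these buckets, here is a minimal sketch of how calendar and team-performance features could be derived. The function names, the feature names (`month`, `day_of_week`, `is_weekend`), the 10-game window, and the neutral default of 0.5 are all assumptions made for this example, not the features the team actually built.

```python
from datetime import date
from typing import Dict, List


def calendar_features(game_date: date) -> Dict[str, int]:
    """Hypothetical calendar features for a single home game."""
    weekday = game_date.weekday()  # Monday=0 ... Sunday=6
    return {
        "month": game_date.month,
        "day_of_week": weekday,
        "is_weekend": int(weekday >= 5),  # Saturday or Sunday
    }


def rolling_win_pct(results: List[int], window: int = 10) -> List[float]:
    """Win percentage over the previous `window` games.

    `results` holds 1 for a win and 0 for a loss, in game order.
    The current game is excluded so the feature never leaks its own outcome.
    """
    out = []
    for i in range(len(results)):
        prior = results[max(0, i - window):i]
        # Assumption: use a neutral 0.5 when no earlier games exist.
        out.append(sum(prior) / len(prior) if prior else 0.5)
    return out
```

For example, `calendar_features(date(2023, 4, 1))` flags the game as a weekend game, and `rolling_win_pct([1, 0, 1, 1], window=2)` returns one value per game based only on the games played before it.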
Based on the features, we deliver the following models and results.
We use the Temporal Fusion Transformer for long-term attendance prediction because it offers the following advantages:
- Capable of processing multiple heterogeneous time series data simultaneously.
- Takes into account the impact of all historical data when forecasting time series, resulting in more accurate predictions.
- Performs well in multi-horizon forecasting, giving accurate predictions across different time horizons.
We use LightGBM for short-term attendance prediction because of the following characteristics:
- Capable of efficiently processing a large number of features and automatically generating feature importance, allowing businesses to gain insights into the most important factors affecting their predictions.
- Fast training speed allows for quick iteration and retraining of the model when new data is received, making it a flexible and adaptable tool for short-term time series forecasting.
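A minimal sketch of what fitting such a model and reading off its feature importances might look like with LightGBM's scikit-learn interface. The hyperparameter values and the helper names `train_in_season_model` and `rank_features` are assumptions for illustration and are not taken from the project's code.

```python
from typing import Dict, List, Sequence, Tuple


def train_in_season_model(X, y, feature_names: Sequence[str]):
    """Fit a LightGBM regressor and return (model, importance per feature)."""
    # Imported lazily so the ranking helper below works without LightGBM installed.
    import lightgbm as lgb

    model = lgb.LGBMRegressor(
        n_estimators=200,        # illustrative values, not the team's settings
        learning_rate=0.05,
        importance_type="gain",  # total gain of the splits that use each feature
    )
    model.fit(X, y)
    importances = dict(zip(feature_names, model.feature_importances_))
    return model, importances


def rank_features(
    importances: Dict[str, float], top_k: int = 10
) -> List[Tuple[str, float]]:
    """Return the `top_k` features sorted by descending importance."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```

Because training is fast, `train_in_season_model` could simply be called again on the enlarged dataset whenever new games are played.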
The deliverables are:
- Feature importance scores from the LightGBM model to select important factors
- Partial Dependence Plots (PDPs) to show the relationship between attendance and the factors
- Selected teams based on clustering results, with a detailed analysis of differences across teams
- Recommendations to optimize game scheduling, marketing timing, and marketing content
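To make the partial dependence idea concrete, here is a small sketch that computes a partial dependence curve by hand: for each candidate value of one feature, that value is written into every row and the model's predictions are averaged. The `toy_predict` function and its coefficients are invented stand-ins for a trained model, and the feature names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, float]


def partial_dependence(
    predict: Callable[[Row], float],
    rows: Sequence[Row],
    feature: str,
    grid: Sequence[float],
) -> List[float]:
    """Average prediction over `rows` with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the input rows stay unchanged
            modified[feature] = value  # override only the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_predict(row: Row) -> float:
    """Made-up attendance model, used only to exercise the function above."""
    return 20000 + 5000 * row["is_weekend"] + 100 * row["win_pct"]


rows = [
    {"is_weekend": 0, "win_pct": 50.0},
    {"is_weekend": 1, "win_pct": 60.0},
]
curve = partial_dependence(toy_predict, rows, "is_weekend", [0, 1])
```

Plotting `curve` against the grid values gives the PDP for that feature. In this toy case the gap between the two points equals the weekend coefficient, because the stand-in model is additive.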