An analysis of different time series techniques to predict the performance of a cricket team in the next match
-
Dataset:
- Cricket matches between four teams (India, Australia, South Africa, and England) played from 1976 to 2018.
- Training Data: 1976-2010
- Testing Data: 2011-2018
- Features: runs scored by each team, balls used by each team, wickets taken by each team, venue of the match, and the match result.
- We calculated the performance of the team based on a weighted formula:
-
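The year-based train/test split listed above can be sketched as follows. This is a minimal illustration only: the record layout, the `year` key, and the helper name `split_by_year` are assumptions made for the example, not the project's actual data format.

```python
def split_by_year(matches, last_train_year=2010):
    """Split match records chronologically so no test-period match leaks into training."""
    train = [m for m in matches if m["year"] <= last_train_year]
    test = [m for m in matches if m["year"] > last_train_year]
    return train, test


# Hypothetical records; real rows would also carry runs, balls, wickets, venue, and result.
matches = [
    {"year": 1976, "team": "India"},
    {"year": 2010, "team": "Australia"},
    {"year": 2011, "team": "England"},
    {"year": 2018, "team": "South Africa"},
]

train, test = split_by_year(matches)
print(len(train), len(test))
```

Splitting on the year rather than at random keeps the evaluation honest for a time series: the models only ever see matches that precede the ones they are tested on.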
Models used:
- Windowed Linear Regression: Since the data is not stationary, we cannot use traditional ANOVA techniques. Hence, we use a windowed regression model with a window size of 5. We train two models, one on shuffled blocks of 5 data points each and one on the unshuffled blocks. On the unshuffled data, we obtain an MSE of 5.13 (on an output in the range 1-12). On the shuffled data, we obtain an MSE of around 3.9.
- MLP: Using the same window size, we train an MLP model. The hyperparameters are the number of layers, the number of hidden neurons, and the learning rate. The best model has 2 hidden layers with 50 neurons each and a learning rate of 2e-4. We obtain an MSE of 3.93 on the unshuffled dataset.
- Random Forest: Using the same window size, we train a random forest model. The number of trees is the hyperparameter. The best model has 570 trees. We obtain an MSE of 3.97 on the unshuffled dataset.
- LSTM: Using the original dataset, we develop a single-layer LSTM model with 17 hidden neurons. The MSE of the LSTM model is 2.42.
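The windowing shared by the first three models can be sketched as below. This is one possible reading, not the project's code: it assumes each sample pairs the previous 5 performance values with the next value, and that "shuffled" means permuting those samples while keeping each window intact. The helper names and the toy series are hypothetical, and the regressor itself is left out.

```python
import random


def make_windows(series, window=5):
    """Build (previous `window` values, next value) pairs from a 1-D series."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(list(series[i:i + window]))
        y.append(series[i + window])
    return X, y


def shuffle_samples(X, y, seed=0):
    """Permute the samples; each window stays intact and aligned with its target."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    return [X[i] for i in order], [y[i] for i in order]


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy performance scores (hypothetical values in the 1-12 output range).
series = [3, 4, 4, 5, 6, 6, 7, 8]
X, y = make_windows(series, window=5)
X_shuffled, y_shuffled = shuffle_samples(X, y, seed=42)
```

Any regressor (linear regression, MLP, or random forest) can then be fit on either `X, y` or `X_shuffled, y_shuffled` and scored with `mse` on the held-out years.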
The data shows significant variance in the time average of the performance. The performance formula is not time dependent, so earlier matches, which were often low-scoring affairs, are rated lower. The LSTM is best at identifying this long-term creep in performance and hence performs best.