The goalmodel package let you model the number of goals scored in sport games. The models are primarily aimed at modelling and predicting football (soccer) scores, but could also be applicable for similar sports, such as hockey and handball.

Installation

install.packages("devtools")
devtools::install_github("opisthokonta/goalmodel")

The default model

The goalmodel package models the goal scoring intensity so that it is a function of the two opposing teams attack and defense ratings. Let λ1 and λ2) be the goal scoring intensities for the two sides. The default model in the goalmodel package then models these as

log(λ1) = μ + hfa + attack1 − defense2

log(λ2) = μ + attack2 − defense1

where μ is the intercept (approximately the average number of goals scored) and hfa is the home field advantage given to team 1. By default the number of goals scored, Y1 and Y2 are distributed as

Y1 ∼ Poisson(λ1)

Y2 ∼ Poisson(λ2)

The default model can be modified in numerous ways as detailed in this vignette.

Fitting the default model

Models are fitted with the goalmodel function. The minimum required input is vectors containing the number of goals scored and the names of the teams. Here I use data from the excellent engsoccerdata package.

library(goalmodel)
library(dplyr) # Useful for data manipulation. 
library(engsoccerdata) # Some data.

# Load data from English Premier League, 2011-12 season.
england %>% 
  filter(Season %in% c(2011),
         tier==c(1)) %>% 
  mutate(Date = as.Date(Date)) -> england_2011


# Fit the default model, with the home team as team1.

gm_res <- goalmodel(goals1 = england_2011$hgoal, goals2 = england_2011$vgoal,
                     team1 = england_2011$home, team2=england_2011$visitor)

# Show the estimated attack and defense ratings and more.
summary(gm_res)
## Model sucsessfully fitted in 1.01 seconds
## 
## Number of matches           380 
## Number of teams              20 
## 
## Model                     Poisson 
## 
## Log Likelihood            -1088.99 
## AIC                        2259.98 
## R-squared                  0.18 
## Parameters (estimated)       41 
## Parameters (fixed)            0 
## 
## Team                      Attack   Defense
## Arsenal                    0.30     0.03 
## Aston Villa               -0.39    -0.01 
## Blackburn Rovers          -0.10    -0.41 
## Bolton Wanderers          -0.15    -0.40 
## Chelsea                    0.17     0.10 
## Everton                   -0.10     0.26 
## Fulham                    -0.13     0.01 
## Liverpool                 -0.16     0.26 
## Manchester City            0.51     0.53 
## Manchester United          0.47     0.41 
## Newcastle United           0.02     0.01 
## Norwich City              -0.04    -0.25 
## Queens Park Rangers       -0.23    -0.24 
## Stoke City                -0.42    -0.01 
## Sunderland                -0.20     0.12 
## Swansea City              -0.22     0.02 
## Tottenham Hotspur          0.18     0.21 
## West Bromwich Albion      -0.19    -0.00 
## Wigan Athletic            -0.25    -0.18 
## Wolverhampton Wanderers   -0.28    -0.45 
## -------
## Intercept                  0.19 
## Home field advantage       0.27

The Dixon-Coles model

The model in the paper by Dixon and Coles (1997) they introduce a model modifies the probabilities for low scores, where each team scores at most 1 goal each. The amount of adjustment is determined by the parameter ρ (rho), which can be estimated from the data. The Dixon-Coles model can be fitted by setting the dc argument to TRUE.

# Fit the Dixon-Coles model.

gm_res_dc <- goalmodel(goals1 = england_2011$hgoal, goals2 = england_2011$vgoal,
                     team1 = england_2011$home, team2=england_2011$visitor,
                     dc=TRUE)

summary(gm_res_dc)
## Model sucsessfully fitted in 1.14 seconds
## 
## Number of matches           380 
## Number of teams              20 
## 
## Model                     Poisson 
## 
## Log Likelihood            -1087.36 
## AIC                        2258.72 
## R-squared                  0.18 
## Parameters (estimated)       42 
## Parameters (fixed)            0 
## 
## Team                      Attack   Defense
## Arsenal                    0.31     0.03 
## Aston Villa               -0.37    -0.03 
## Blackburn Rovers          -0.12    -0.41 
## Bolton Wanderers          -0.14    -0.40 
## Chelsea                    0.17     0.09 
## Everton                   -0.12     0.27 
## Fulham                    -0.13     0.01 
## Liverpool                 -0.17     0.25 
## Manchester City            0.50     0.55 
## Manchester United          0.46     0.43 
## Newcastle United           0.03     0.00 
## Norwich City              -0.04    -0.25 
## Queens Park Rangers       -0.24    -0.23 
## Stoke City                -0.42    -0.01 
## Sunderland                -0.20     0.12 
## Swansea City              -0.21     0.01 
## Tottenham Hotspur          0.18     0.21 
## West Bromwich Albion      -0.20     0.00 
## Wigan Athletic            -0.25    -0.17 
## Wolverhampton Wanderers   -0.27    -0.46 
## -------
## Intercept                  0.18 
## Home field advantage       0.27 
## Dixon-Coles adj. (rho)    -0.13

Notice that the estimated ρ parameter is listed together with the other parameters.

The Rue-Salvesen adjustment

In a paper by Rue and Salvesen (2001) they introduce an adjustment to the goals scoring intesities λ1 and λ2:

log(λ1adj) = log(λ1) − γΔ1,2

log(λ2adj) = log(λ2) + γΔ1,2

where

Δ1,2 = (attack1+defense1−attack2−defense2)/2

and γ is a parameter that modifies the amount of adjustment. This model can be fitted by setting the rs argument to TRUE.

# Fit the model with the Rue-Salvesen adjustment.

gm_res_rs <- goalmodel(goals1 = england_2011$hgoal, goals2 = england_2011$vgoal,
                     team1 = england_2011$home, team2=england_2011$visitor,
                     rs=TRUE)

summary(gm_res_rs)
## Model sucsessfully fitted in 1.31 seconds
## 
## Number of matches           380 
## Number of teams              20 
## 
## Model                     Poisson 
## 
## Log Likelihood            -1088.99 
## AIC                        2261.98 
## R-squared                  0.18 
## Parameters (estimated)       42 
## Parameters (fixed)            0 
## 
## Team                      Attack   Defense
## Arsenal                    0.31     0.04 
## Aston Villa               -0.40    -0.02 
## Blackburn Rovers          -0.12    -0.43 
## Bolton Wanderers          -0.16    -0.41 
## Chelsea                    0.18     0.11 
## Everton                   -0.09     0.26 
## Fulham                    -0.13     0.01 
## Liverpool                 -0.16     0.26 
## Manchester City            0.54     0.57 
## Manchester United          0.50     0.44 
## Newcastle United           0.03     0.01 
## Norwich City              -0.04    -0.26 
## Queens Park Rangers       -0.24    -0.25 
## Stoke City                -0.43    -0.02 
## Sunderland                -0.20     0.12 
## Swansea City              -0.22     0.01 
## Tottenham Hotspur          0.19     0.23 
## West Bromwich Albion      -0.20    -0.01 
## Wigan Athletic            -0.27    -0.19 
## Wolverhampton Wanderers   -0.30    -0.47 
## -------
## Intercept                  0.19 
## Home field advantage       0.27 
## Rue-Salvesen adj. (gamma)  0.06

Making predictions

There is little use in fitting the model without making predictions. Several functions are provided that make different types of predictions.

We can get the expected goals with the predict_expg function. Here the return_df is set to TRUE, which returns a convenient data.frame.

to_predict1 <- c('Arsenal', 'Manchester United', 'Bolton Wanderers')
to_predict2 <- c('Fulham', 'Chelsea', 'Liverpool')

predict_expg(gm_res_dc, team1=to_predict1, team2=to_predict2, return_df = TRUE)
##                               team1     team2    expg1     expg2
## Arsenal                     Arsenal    Fulham 2.119342 1.0249899
## Manchester United Manchester United   Chelsea 2.291206 0.9289649
## Bolton Wanderers   Bolton Wanderers Liverpool 1.069256 1.5093093

We can also get the probabilities of the final outcome (team1 win, draw, team2 win) with the predict_result function.

predict_result(gm_res_dc, team1=to_predict1, team2=to_predict2, return_df = TRUE)
##               team1     team2        p1        pd        p2
## 1           Arsenal    Fulham 0.6119676 0.2266340 0.1613984
## 2 Manchester United   Chelsea 0.6684588 0.2051687 0.1263725
## 3  Bolton Wanderers Liverpool 0.2522542 0.2902563 0.4574895

Other functions are predict_goals which return matrices of the probabilities of each scoreline, and predict_ou for getting the probabilities for over/under total scores.

Weights

For predicton purposes it is a good idea to weight the influence of each match on the estimates so that matches far back in time have less influence than the more recent ones. The Dixon-Coles paper describe a function that can help set the weights. This function is implemented in the weights_dc function. The degree more recent matches are weighted relative to earlier ones are determined by the parameter xi. Good values of xi is usually somewhere between 0.001 and 0.003

Here we calculate the weights, plot them, and fit the default model with the weights.

my_weights <- weights_dc(england_2011$Date, xi=0.0019)

length(my_weights)
## [1] 380
plot(england_2011$Date, my_weights)

gm_res_w <- goalmodel(goals1 = england_2011$hgoal, goals2 = england_2011$vgoal, 
                     team1 = england_2011$home, team2=england_2011$visitor,
                     weights = my_weights)

Additional covariates

It is possible to use additional variables to predict the outcome. One particular useful thing this can be used for is to have greater control of the home field advantage. For example if you use data from several leagues you might want to have a separate home field advantage for each league, or be able to turn of the advantage for games played on neutral ground.

Here is how to specify the default model with the home field advantage specified as additional covariates. Notice how the hfa argument is set to FALSE, so that the default home field advantage factor is droped. Also notice that only covariates for the first team is given, implicitly setting all values of this covariate to 0 for team2.

# Create a n-by-1 matrix, containing only 1's, to indicate that all 
# games have a home team.
xx1_hfa <- matrix(1, nrow=nrow(england_2011), ncol=1)
colnames(xx1_hfa) <- 'hfa' # must be named.

# Fit the default model, with the home team as team1.
gm_res_hfa <- goalmodel(goals1 = england_2011$hgoal, goals2 = england_2011$vgoal,
                     team1 = england_2011$home, team2=england_2011$visitor,
                    hfa = FALSE, # Turn of the built-in Home Field Advantage
                    x1 = xx1_hfa) # Provide the covariate matrix.

# the coefficients for the additional covariates are found in the 'beta' element in 
# the parameter list. They will also show up in the summary.
gm_res_hfa$parameters$beta
##       hfa 
## 0.2679329

The Negative Binomial model

All other models in this vignette has assumed that the number of goals scored are distributed according to the Poisson distribution. The Poisson distribution has a lot of nice properties, but also some limitations. For instance there is only one parameter that describes both the expected value (the average) and the variance. The Negative Binomial (NB) model is a more flexible model that allows higher variance than the Poisson mode. The NB model introduces another parameter called the "dispersion" parameter. This parameter has to be positive, and the greater the value of this parameter, the more variance there is relative to the Poisson model with the same λ (called overdispersion). When the parameter is close to zero, it the result is approximately the Poisson. The NB model can be fitted by setting the "model" argument to 'negbin'. In this example the dispersion parameter is close to zero.

# Fit the model with overdispersion.

gm_res_nb <- goalmodel(goals1 = england_2011$hgoal, goals2 = england_2011$vgoal,
                     team1 = england_2011$home, team2=england_2011$visitor,
                     model='negbin')


gm_res_nb$parameters$dispersion
## [1] 4.333503e-05

Fixed parameters - Two step estimation.

If you want to, you can set some of the parameters to a fixed value that is kept constant during model fitting. In other words, you don't estimate the fixed parameters from the data, but set them manually beforehand and all other parameters are estimated with this in mind.

This feature can be useful for doing two-step estimation. Sometimes the model fitting can be unstable, especially when the model is complicated. It can then be a good idea to first fit a simpler model, like the default model, and then with the parameters from the simpler model kept constant, estimate the rest of the parameters in a second step.

Here is an example of two-step fitting of the Dixon-Coles model.

# Fit the Dixon-Coles model, with most of the parameters fixed to the values in the default model.
gm_res_dc_2s <- goalmodel(goals1 = england_2011$hgoal, goals2 = england_2011$vgoal,
                     team1 = england_2011$home, team2=england_2011$visitor, 
                     dc=TRUE, fixed_params = gm_res$parameters)

# Compare the two estimates of rho. They are pretty similar. 
gm_res_dc_2s$parameters$rho
## [1] -0.1261845
gm_res_dc$parameters$rho
## [1] -0.1334634

Offset - Varying playing time

Sometimes the playing time can be extended to extra time. In football this usually happens when there is a draw in a knock-out tournament, where an additional 30 minutes is added. To handle this, we must add what is called an offset to the model. You can read more about the details on wikipedia.

There is no offset option in the goalmodel package, but we can add it as an extra covariate, and fix the parameter for that covariate to be 1. In the example below we add a game from the 2011-12 FA Cup to the data set which was extended to extra time. Then we create the offset variable and fit the model with the fixed parameter. Note that the offset variable must be log-transformed. In this example, the extra game that is added is the only game in the data set where Middlesbrough is playing, so the estimates will be a bit unstable. Also note that the number of minutes played is divided by 90, the ordinary playing time, otherwise the model fitting tend to get unstable.

# Get matches from the FA cup that was extended to extra time.
facup %>% 
  filter(Season == 2011,
         round >= 4, 
         aet == 'yes') %>% 
  select(Date, home, visitor, hgoal, vgoal, aet) %>% 
  mutate(Date = as.Date(Date)) -> facup_2011

facup_2011
##         Date          home    visitor hgoal vgoal aet
## 1 2012-02-08 Middlesbrough Sunderland     1     2 yes
# Merge the additional game with the origianl data frame.
england_2011 %>% 
  bind_rows(facup_2011) %>% 
  mutate(aet = replace(aet, is.na(aet), 'no')) -> england_2011_2

# Create empty matrix for additional covariates
xx_offset <- matrix(nrow=nrow(england_2011_2))
colnames(xx_offset) <- 'Offset'

# Add data
xx_offset[,'Offset'] <- 90/90 # Per 90 minutes

# Extra time is 30 minutes added.
xx_offset[england_2011_2$aet == 'yes','Offset'] <- (90+30)/90

# Offset must be log-transformed.
xx_offset <- log(xx_offset)

# Take a look.
tail(xx_offset)
##           Offset
## [376,] 0.0000000
## [377,] 0.0000000
## [378,] 0.0000000
## [379,] 0.0000000
## [380,] 0.0000000
## [381,] 0.2876821
# fit the model
gm_res_offset <- goalmodel(goals1 = england_2011_2$hgoal, goals2 = england_2011_2$vgoal, 
                           team1 = england_2011_2$home, team2=england_2011_2$visitor,
                          # The offset must be added to both x1 and x2.
                          x1 = xx_offset, x2 = xx_offset,
                          # The offset parameter must be fixed to 1.
                          fixed_params = list(beta = c('Offset' = 1)))

summary(gm_res_offset)
## Model sucsessfully fitted in 1.17 seconds
## 
## Number of matches           381 
## Number of teams              21 
## 
## Model                     Poisson 
## 
## Log Likelihood            -1091.30 
## AIC                        2268.60 
## R-squared                  0.18 
## Parameters (estimated)       43 
## Parameters (fixed)            1 
## 
## Team                      Attack   Defense
## Arsenal                    0.34     0.05 
## Aston Villa               -0.35     0.01 
## Blackburn Rovers          -0.07    -0.39 
## Bolton Wanderers          -0.11    -0.38 
## Chelsea                    0.20     0.12 
## Everton                   -0.06     0.28 
## Fulham                    -0.10     0.03 
## Liverpool                 -0.13     0.28 
## Manchester City            0.55     0.55 
## Manchester United          0.51     0.43 
## Middlesbrough             -0.58    -0.40 
## Newcastle United           0.06     0.03 
## Norwich City               0.00    -0.23 
## Queens Park Rangers       -0.19    -0.22 
## Stoke City                -0.38     0.01 
## Sunderland                -0.16     0.14 
## Swansea City              -0.18     0.04 
## Tottenham Hotspur          0.21     0.23 
## West Bromwich Albion      -0.16     0.02 
## Wigan Athletic            -0.22    -0.16 
## Wolverhampton Wanderers   -0.25    -0.43 
## -------
## Intercept                  0.17 
## Home field advantage       0.27 
## Offset                     1.00

Modifying parameters

In addition to fixing parameters before the model is fit, you can also manually set the parameters after the model is fitted. This can be useful if you believe the attack or defence paramters given by the model is inacurate because you have some additional information that is not captured in the data.

Modifying the paramters can also be useful due to the same reasons as you might want to fit the model in two (or more) steps. For intance, if you look at the log-likelihood of the default model, and the model with the Rue-Salvesen adjustment, you will notice that they are almost identical. This is probably due to the model being nearly unidentifiable. That does not neccecarily mean that the Rue-Salvesen adjustment does not improve prediction, it might mean that it is hard to estimate the amount of adjustment based on the available data. In the paper by Rue & Salvesen they find that setting γ to 0.1 is a good choice. Here is how this can be done:

# Copy the default model.
gm_res_rs2 <- gm_res

# set the gamma parameter to 0.1
gm_res_rs2$parameters$gamma <- 0.1

# Make a few predictions to compare.
predict_result(gm_res, team1=to_predict1, team2=to_predict2, return_df = TRUE)
##               team1     team2        p1        pd        p2
## 1           Arsenal    Fulham 0.6195517 0.2036457 0.1768027
## 2 Manchester United   Chelsea 0.6735208 0.1844373 0.1420419
## 3  Bolton Wanderers Liverpool 0.2611863 0.2567973 0.4820164
predict_result(gm_res_rs2, team1=to_predict1, team2=to_predict2, return_df = TRUE)
##               team1     team2        p1        pd        p2
## 1           Arsenal    Fulham 0.6045881 0.2084048 0.1870071
## 2 Manchester United   Chelsea 0.6538287 0.1918135 0.1543579
## 3  Bolton Wanderers Liverpool 0.2780517 0.2603784 0.4615699

Miscellaneous

It is also possible to use model = 'gaussian' to fit models using the Gaussian (or normal) distribution. This option is intended to be used when the scores are decimal numbers. This could be useful if instead of actual goals scored, you wanted to fit a model to expected goals. You can't combine this with the Dixon-Coles adjustment, except perhaps using a two step procedure, first estimating the attack and defence parameters using expected goals, then estimating the Dixon-Coles adjustment using actual goals. If you use a Gaussian model to make predictions, the Poisson distribution will be used, and a warning will be given.

Other packages

There are some other packages out there that has similar functionality as this package. The regista package implements a more general interface to fit the Dixon-Coles model. The fbRanks package implements only the Dixon-Coles time weighting functionality, but has a lot of other features.

References

  • Mark J. Dixon and Stuart G. Coles (1997) Modelling Association Football Scores and Inefficiencies in the Football Betting Market.
  • Håvard Rue and Øyvind Salvesen (2001) Prediction and Retrospective Analysis of Soccer Matches in a League.