The goal of mvgam
is to use a Bayesian framework to estimate parameters of Generalised Additive Models for discrete time series with dynamic trend components. The motivation for the package and some of its primary objectives are described in detail by Clark & Wells 2022, with additional inspiration on the use of Bayesian probabilistic modelling to quantify uncertainty and advise principled decision making coming from Michael Betancourt, Michael Dietze and Emily Fox, among many others.
Install the development version from GitHub
using: devtools::install_github("nicholasjclark/mvgam")
We can explore the model’s primary functions using a test dataset that is available with all R
installations. We introduce dynamic Generalised Additive Models and some of the key utility functions provided in mvgam
. First, load the lynx
data and plot the series as well as its estimated autocorrelation function
library(mvgam)
#> Loading required package: mgcv
#> Loading required package: nlme
#> This is mgcv 1.8-33. For overview type 'help("mgcv-package")'.
#> Loading required package: rjags
#> Loading required package: coda
#> Linked to JAGS 4.3.0
#> Loaded modules: basemod,bugs
#> Loading required package: parallel
#> Loading required package: runjags
#> Warning: package 'runjags' was built under R version 4.0.5
data(lynx)
lynx_full = data.frame(year = 1821:1934,
population = as.numeric(lynx))
plot(lynx_full$population, type = 'l', ylab = 'Lynx trappings',
xlab = 'Time')
acf(lynx_full$population, main = '')
Along with serial autocorrelation, there is a clear ~19-year cyclic pattern to the data. Create a season
term that can be used to model this effect and give a better representation of the data generating process than we would likely get with a linear model
plot(stl(ts(lynx_full$population, frequency = 19), s.window = 'periodic'))
lynx_full$season <- (lynx_full$year %%19) + 1
For mvgam
models, the response needs to be labelled y
and we also need an indicator of the series name as a factor
variable (if the column series
is missing, this will be added automatically by assuming that all observations are from a single time series). Finally, a time
column is needed to index time
lynx_full$y <- lynx_full$population
lynx_full$time <- 1:NROW(lynx_full)
lynx_full$series <- factor('series1')
Split the data into training (first 50 years) and testing (next 10 years of data) to evaluate multi-step ahead forecasts
lynx_train = lynx_full[1:50, ]
lynx_test = lynx_full[51:60, ]
Now fit an mvgam
model; it fits a GAM in which a cyclic smooth function for season
is estimated jointly with a full time series model for the errors (in this case an AR2
process), rather than relying on smoothing splines that do not incorporate a concept of the future. We assume the outcome follows a Poisson distribution and estimate the model in JAGS
using MCMC sampling (installation links are found here)
lynx_mvgam <- mvjagam(data_train = lynx_train,
data_test = lynx_test,
formula = y ~ s(season, bs = 'cc', k = 19),
knots = list(season = c(0.5, 19.5)),
family = 'poisson',
trend_model = 'AR2',
drift = F,
burnin = 10000,
chains = 4)
#> NOTE: Stopping adaptation
Perform a series of posterior predictive checks to see if the model is able to simulate data for the training period that looks realistic and unbiased. First, examine histograms for posterior predictions (yhat
) and compare to the histogram of the observations (y
)
ppc(lynx_mvgam, series = 1, type = 'hist')
Now plot the distribution of predicted means compared to the observed mean
ppc(lynx_mvgam, series = 1, type = 'mean')
Next examine simulated empirical Cumulative Distribution Functions (CDF) for posterior predictions (yhat
) and compare to the CDF of the observations (y
)
ppc(lynx_mvgam, series = 1, type = 'cdf')
Rootograms are becoming popular graphical tools for checking a discrete model's ability to capture dispersion properties of the response variable. Posterior predictive hanging rootograms can be displayed using the ppc()
function in mvgam
. In the plot below, we bin the unique observed values into 25
bins to prevent overplotting and help with interpretation. This plot compares the frequencies of observed vs predicted values for each bin, which can help to identify aspects of poor model fit. For example, if the gray bars (representing observed frequencies) tend to stretch below zero, this suggests the model's simulations predict the values in that particular bin less frequently than they are observed in the data. A well-fitting model that can generate realistic simulated data will provide a rootogram in which the lower boundaries of the grey bars are generally near zero
ppc(lynx_mvgam, series = 1, type = 'rootogram', n_bins = 25)
Finally look for any biases in predictions by examining a Probability Integral Transform (PIT) histogram. If our predictions are not biased one way or another (i.e. not consistently under- or over-predicting), this histogram should look roughly uniform
ppc(lynx_mvgam, series = 1, type = 'pit')
All of these plots indicate the model is well calibrated against the training data, with no apparent pathological behaviors exhibited. Have a look at this model's summary to see what is being estimated (note that longer MCMC runs would probably be needed to increase effective sample sizes)
summary(lynx_mvgam)
#> GAM formula:
#> y ~ s(season, bs = "cc", k = 19)
#>
#> Family:
#> Poisson
#>
#> Link function:
#> log
#>
#> Trend model:
#> AR2
#>
#> N series:
#> 1
#>
#> N observations:
#> 50
#>
#> Status:
#> Fitted using runjags::run.jags()
#>
#> GAM smooth term estimated degrees of freedom:
#> edf Ref.df
#> s(season) 16.23 17
#>
#> GAM coefficient (beta) estimates:
#> 2.5% 50% 97.5% Rhat n.eff
#> (Intercept) 6.6105156 6.75145688 6.90631540 1.06 79
#> s(season).1 -1.1167398 -0.66686831 -0.19909938 1.15 159
#> s(season).2 -0.2597191 0.21243370 0.68457904 1.02 104
#> s(season).3 0.4239870 0.98865331 1.51423294 1.24 62
#> s(season).4 0.9719459 1.50128572 2.26302065 1.67 53
#> s(season).5 1.3134227 1.77645942 2.56716736 1.53 41
#> s(season).6 0.3180004 1.11004045 1.69276564 1.41 54
#> s(season).7 -0.7186014 -0.11561379 0.62534135 1.23 83
#> s(season).8 -1.2432651 -0.63007374 -0.01391211 1.19 144
#> s(season).9 -1.4660521 -0.78077613 -0.07082851 1.21 138
#> s(season).10 -1.1560260 -0.45833882 0.32669776 1.30 124
#> s(season).11 -0.5575703 0.22134251 0.92264390 1.26 80
#> s(season).12 0.4218646 1.11598797 1.87469199 1.06 42
#> s(season).13 0.5435899 1.28854011 2.17037640 1.07 19
#> s(season).14 0.2201595 1.08458247 1.96196902 1.13 36
#> s(season).15 -0.5518826 0.03652332 0.63087028 1.03 81
#> s(season).16 -1.1719668 -0.68437234 -0.12718952 1.02 192
#> s(season).17 -1.4382975 -0.98883130 -0.52488760 1.07 167
#>
#> GAM smoothing parameter (rho) estimates:
#> 2.5% 50% 97.5% Rhat n.eff
#> s(season) 3.116442 3.924688 4.610654 1.01 1801
#>
#> Latent trend parameter estimates:
#> 2.5% 50% 97.5% Rhat n.eff
#> ar1 0.4343678 0.7345141 1.0307479 1.03 768
#> ar2 -0.4349903 -0.1238582 0.1844518 1.05 368
#> sigma 0.3715101 0.4689521 0.6015719 1.01 715
#>
The plot_mvgam_...()
functions offer more flexibility than the generic S3 plot.mvgam()
functions. For example, we can inpsect traceplots when sampling from a posterior with MCMC
methods. Here for the GAM
component (smoothing parameters).
plot_mvgam_trace(lynx_mvgam, 'rho')
and for the latent trend component parameters
MCMCvis::MCMCtrace(lynx_mvgam$jags_output, c('ar1', 'ar2', 'sigma'), pdf = F, n.eff = T, Rhat = T)
Inspect the model's estimated smooth for the 19-year cyclic pattern, which is shown as a ribbon plot of posterior empirical quantiles. We can also overlay posterior quantiles of partial residuals (shown as ribbon rectangles in red), which represent the leftover variation that the model expects would remain if this smooth term was dropped but all other parameters remained unchanged. Note that these are on a different scale to those from mgcv::plot.gam
as these are randomised quantile residuals that are essentially standard normal in distribution. But either way, a strong pattern in the partial residuals suggests there would be strong patterns left unexplained in the model if we were to drop this term, giving us further confidence that this function is important in the model
plot(lynx_mvgam, type = 'smooths', residuals = T)
First derivatives of smooth functions can also be plotted to inspect how the slope of the function changes across its length. To plot these we use the more flexible plot_mvgam_smooth()
function
plot_mvgam_smooth(lynx_mvgam, 1, 'season', derivatives = T)
We can also view the mvgam's posterior retrodictions and predictions for the entire series (testing and training)
plot(lynx_mvgam, type = 'forecast', data_test = lynx_test)
And the estimated latent trend component, again using the more flexible plot_mvgam_...()
option to show first derivatives of the estimated trend
plot_mvgam_trend(lynx_mvgam, data_test = lynx_test, derivatives = T)
We can also re-do the posterior predictive checks, but this time focusing only on the out of sample period. This will give us better insight into how the model is performing and whether it is able to simulate realistic and unbiased future values
ppc(lynx_mvgam, series = 1, type = 'hist', data_test = lynx_test, n_bins = 150)
ppc(lynx_mvgam, series = 1, type = 'mean', data_test = lynx_test)
ppc(lynx_mvgam, series = 1, type = 'cdf', data_test = lynx_test)
ppc(lynx_mvgam, series = 1, type = 'pit', data_test = lynx_test)
A key aspect of ecological forecasting is to understand how different components of a model contribute to forecast uncertainty. We can estimate relative contributions to forecast uncertainty for the GAM component and the latent trend component using mvgam
plot_mvgam_uncertainty(lynx_mvgam, data_test = lynx_test, legend_position = 'none')
text(1, 0.2, cex = 1.5, label="GAM component",
pos = 4, col="white", family = 'serif')
text(1, 0.8, cex = 1.5, label="Trend component",
pos = 4, col="#7C0000", family = 'serif')
Both components contribute to forecast uncertainty, suggesting we would still need some more work to learn about factors driving the dynamics of the system. But we will leave the model as-is for this example. Diagnostics of the model can also be performed using mvgam
. Have a look at the model's residuals, which are posterior medians of Dunn-Smyth randomised quantile residuals so should follow approximate normality. We are primarily looking for a lack of autocorrelation, which would suggest our AR2 model is appropriate for the latent trend
plot(lynx_mvgam, type = 'residuals')
Another useful utility of mvgam
is the ability to use rolling window forecasts to evaluate competing models that may represent different hypotheses about the series dynamics. Here we will fit a poorly specified model to showcase how this evaluation works. In this model, we ignore the cyclic pattern of seasonality and force it to be fairly non-wiggly. We also use a random walk process for the trend
lynx_mvgam_poor <- mvjagam(data_train = lynx_train,
data_test = lynx_test,
formula = y ~ s(season, bs = 'gp', k = 3),
family = 'poisson',
trend_model = 'RW',
drift = FALSE,
burnin = 10000,
chains = 4)
#> NOTE: Stopping adaptation
We choose a set of timepoints within the training data to forecast from, allowing us to simulate a situation where the model's parameters had already been estimated but we have only observed data up to the evaluation timepoint and would like to generate forecasts from the latent trends. Here we use year 10 as our last observation and forecast ahead for the next 10 years.
mod1_eval <- eval_mvgam(lynx_mvgam, eval_timepoint = 10, fc_horizon = 10)
mod2_eval <- eval_mvgam(lynx_mvgam_poor, eval_timepoint = 10, fc_horizon = 10)
Summary statistics of the two models' out of sample Discrete Rank Probability Score (DRPS) indicate that the well-specified model performs markedly better (far lower DRPS) for this evaluation timepoint
summary(mod1_eval$series1$drps)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1.478 11.706 107.337 92.395 125.152 231.358
summary(mod2_eval$series1$drps)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 42.17 49.49 296.16 283.49 438.72 658.31
Nominal coverages for both models' 90% prediction intervals
mean(mod1_eval$series1$in_interval)
#> [1] 0.9
mean(mod2_eval$series1$in_interval)
#> [1] 0.7
The compare_mvgams
function automates this process by rolling along a set of timepoints for each model, ensuring a more in-depth evaluation of each competing model at the same set of timepoints. There are many more extended uses for mvgam
models, including the ability to fit dynamic factor processes for analysing and forecasting sets of multivariate discrete time series
This project is licensed under an MIT
open source license