Author: Justin Ferguson
In this notebook, a predictive model using OLS (Ordinary Least Squares) multiple linear regression is created to forecast next day COVID-19 cases in the state of Victoria. Such considerations are of great importance in the current health climate. For example, a model to predict next day cases might be important in enabling the timely allocation of resources to hospitals for immediate patient treatment.
The linear regression model considered here uses independent variables to model a dependent variable, known in the machine learning community as the predictors and the response. A model is trained on a data set of predictor variables to produce a response; the error between the response and a set of true values measures the model's accuracy. The challenge in predictive modelling is therefore to find a set of predictors that trains a model to suitably minimise this error. Such a model can then be used, cautiously, to make predictions about the future.
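As a minimal sketch of this idea, a single predictor can be fitted with `lm()` and the error between the response and the true values measured directly. The data here are made up purely for illustration, not drawn from the COVID-19 set:

```r
# Toy illustration only: made-up predictor x and noisy response y
set.seed(1)
x <- 1:20
y <- 3 * x + rnorm(20, sd = 2)   # "true values" with noise

fit <- lm(y ~ x)                 # train an OLS model on the predictor
response <- predict(fit)         # the model response

mae <- mean(abs(response - y))   # error between response and true values
mae
```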
library(readr) # Read csv files
library(ggplot2) # Data visualisation
library(dplyr) # Data frame manipulation
library(cowplot) # For ggplot2 grid formats
library(mlr) # For machine learning regression tasks, model training, predictions and performance metrics
library(forecast) # For Boxcox and inverse Boxcox transformations
library(ggResidpanel) # For ggplot2 visualisation of residuals to check OLS linear regression assumptions
library(stringr) # For string manipulations
library(car) # Load statistical tests
library(DT) # For interactive tables
library(tibble) # For tibble manipulation
library(knitr) # For kable tables
The following model makes use of the Australian COVID-19 data I previously cleaned in this repository.
# Read csv files from my github repository
url <- 'https://raw.githubusercontent.com/j-b-ferguson/covid-19-australia-preprocessing/main/Cleaned%20Data%20after%20Preprocessing/covid19_Australia_data_cleaned.csv'
df <- read_csv(url)
kable(head(df, 5), caption = 'Table 1: First 5 results of the cleaned COVID-19 data imported from my github repository.')
Table 1: First 5 results of the cleaned COVID-19 data imported from my github repository.
Province/State | Date | Cumulative Cases | Cumulative Tests | Daily Cases | Daily Tests | 7 day daily case moving average | 14 day daily case moving average | 7 day daily test moving average | 14 day daily test moving average |
---|---|---|---|---|---|---|---|---|---|
NSW | 2020-03-09 | 48 | 8371 | 10 | 363 | 0 | 0 | 0 | 0 |
NSW | 2020-03-10 | 55 | 10165 | 7 | 1794 | 0 | 0 | 0 | 0 |
NSW | 2020-03-11 | 65 | 10221 | 10 | 56 | 0 | 0 | 0 | 0 |
NSW | 2020-03-12 | 92 | 14856 | 27 | 4635 | 0 | 0 | 0 | 0 |
NSW | 2020-03-13 | 119 | 16685 | 27 | 1829 | 0 | 0 | 0 | 0 |
In creating this model I am interested in predicting next day COVID-19 case numbers in the Australian state of Victoria. To suitably test a predictive model, a truth column `NextDayCases` is created from the given data set. This column is required in order to measure the error created by OLS linear regression between the model response and true values. The truth values consist of the values in the `DailyCases` column led forward in time by one day. More concisely, `NextDayCases[i] = df.vic$DailyCases[i+1]`, where `i` is any given day in the set of allowed days. Overall, this simple model considers whether predictors like `DailyCases`, `SvnDayCaseAvg`, etc., can create a model response that accurately predicts the values in the `NextDayCases` column.
# Make a copy of the original data frame for graph axis labelling
df.new <- df
# Rename to simple column names
colnames(df.new) <- c('State', 'Date',
'CumulativeCases', 'CumulativeTests',
'DailyCases', 'DailyTests',
'SvnDayCaseAvg', 'FrtnDayCaseAvg',
'SvnDayTestAvg', 'FrtnDayTestAvg')
# Convert data type
df.new$State <- df.new$State %>% as.factor()
# Consider just Victorian COVID-19 cases
df.vic <- df.new %>% filter(State == 'VIC') %>% select(-State)
# Filter out observations with zero values
df.vic <- df.vic %>% filter(DailyCases != 0, SvnDayCaseAvg != 0, FrtnDayCaseAvg != 0)
# Create new column and fill with missing values
df.vic$NextDayCases <- NA
# Define values in NextDayCases column as DailyCases[i+1] (the day after)
for (i in 1:nrow(df.vic)) {
if (i < nrow(df.vic)) {
df.vic$NextDayCases[i] <- df.vic$DailyCases[i+1]
} else {
df.vic$NextDayCases[i] <- NA
}
}
# Keep only complete observations
df.vic <- df.vic[complete.cases(df.vic),]
# Now remove the date column as it is an invalid type for the regression task
df.vic <- df.vic %>% select(-Date)
# Show modified table
kable(head(df.vic, 5), caption = 'Table 2: First 5 results of the COVID-19 table showing only Victorian data, with next day case values added.')
Table 2: First 5 results of the COVID-19 table showing only Victorian data, with next day case values added.
CumulativeCases | CumulativeTests | DailyCases | DailyTests | SvnDayCaseAvg | FrtnDayCaseAvg | SvnDayTestAvg | FrtnDayTestAvg | NextDayCases |
---|---|---|---|---|---|---|---|---|
876 | 30300 | 52 | 3000 | 67 | 58 | 2614 | 2164 | 96 |
972 | 33300 | 96 | 3000 | 72 | 63 | 2857 | 2200 | 51 |
1023 | 35300 | 51 | 2000 | 72 | 64 | 3071 | 2271 | 68 |
1091 | 37300 | 68 | 2000 | 74 | 69 | 3157 | 2273 | 49 |
1140 | 39300 | 49 | 2000 | 73 | 73 | 2857 | 2262 | 30 |
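As an aside, the loop used above to build the truth column can be replaced by a vectorized one-liner. A small sketch with made-up daily case counts:

```r
# Vectorized alternative to the for loop: shift DailyCases back by one day
DailyCases <- c(52, 96, 51, 68, 49, 30)   # made-up values for illustration
NextDayCases <- c(DailyCases[-1], NA)     # NextDayCases[i] = DailyCases[i + 1]
NextDayCases
```

The trailing `NA` plays the same role as the `else` branch of the loop and is likewise dropped by `complete.cases()`.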
Before creating the machine learning task, it is helpful to first visualise the bivariate relationships between the `NextDayCases` variable and the potential model predictors to identify simple linear relationships.

Below, the `SvnDayCaseAvg` and `FrtnDayCaseAvg` variables show linearity with respect to the response variable, and the `DailyCases` variable appears mostly linear but is strongly affected by the influence of outliers at high values. Another observation is the increasing error between the data points and the regression lines in the forward direction; this is an indication of unequal variance (heteroscedasticity). Heteroscedasticity is undesirable in a linear regression model and would indicate that some kind of transformation is required.
# Save column names as vectors
Y <- df.vic[9] %>% unlist() %>% as.vector()
colNames <- names(df.vic)[1:8]
# A loop to create plots showing the predictor variables plotted against "NextDayCases"
for (i in colNames) {
plt <- ggplot(data = df.vic, aes_string(x=i, y=Y)) +
geom_point(colour='black') +
geom_smooth(method='loess', se=FALSE, colour='red') +
xlab(str_to_title(names(df[match(i, colNames)+2]))) +
theme(axis.title.y = element_blank())
assign(paste0('gg.', i), plt)
}
# Create grid of plots
plt.grid <- plot_grid(gg.CumulativeCases, gg.CumulativeTests,
gg.DailyCases, gg.DailyTests,
gg.FrtnDayCaseAvg, gg.FrtnDayTestAvg,
gg.SvnDayCaseAvg, gg.SvnDayTestAvg,
nrow = 4,
ncol = 2)
# Plot title
title <- ggdraw() +
draw_label("The Bivariate Relationships between the Next Day Cases Variable and the Potential Model Predictors",
fontface='bold')
# Show plot grid
plt.grid.title <- plot_grid(title, plt.grid, ncol=1, rel_heights=c(0.1, 1))
plt.grid.title
The OLS linear regression model assumptions must now be defined before continuing further.
1. Linearity: There is a linear relationship between the independent variable, x, and the dependent variable, y.
2. Independence: Variables are independent of each other, so as not to be derived from the same source. For time series data, correlation between time-shifted copies of variables must not be present; this is known as autocorrelation.
3. Homoscedasticity: The residuals have constant variance for any value of x.
4. Normality: The residuals are normally distributed.
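Some of these assumptions can also be probed numerically rather than only visually. A minimal base-R sketch on a toy model (hypothetical data, not the COVID-19 set):

```r
# Toy model for illustrating simple numerical assumption checks
set.seed(42)
x <- 1:50
y <- 2 * x + rnorm(50, sd = 3)
fit <- lm(y ~ x)
res <- resid(fit)

# Normality (assumption 4): Shapiro-Wilk test on the residuals;
# a p-value above 0.05 is consistent with normally distributed residuals
shapiro.test(res)$p.value

# Homoscedasticity (assumption 3): correlation between absolute residuals
# and fitted values; values near zero suggest constant variance
cor(abs(res), fitted(fit))
```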
On the basis of assumption 2, a simple linear regression model from the set of bivariate relationships above should be ruled out, because `NextDayCases` is a time-shifted copy of the `DailyCases` variable. Therefore, autocorrelation should exist in the relationship between `NextDayCases` and `DailyCases`.

This is confirmed with the Durbin-Watson hypothesis test for autocorrelation from the `car` package.
H0 (null hypothesis): There is no correlation among residuals.
HA (alternative hypothesis): There is autocorrelation.
# Create a linear regression model and run the Durbin-Watson test
simplemodel <- lm(NextDayCases ~ DailyCases, data = df.vic)
durbinWatsonTest(simplemodel)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.3803768 2.758981 0
## Alternative hypothesis: rho != 0
At the significance level α = 0.05, p < 0.05 in the Durbin-Watson test; the null hypothesis is therefore rejected, confirming a statistically significant autocorrelation in this simple linear model. Further, the corresponding bivariate relationships including `SvnDayCaseAvg` and `FrtnDayCaseAvg` are sourced from values of `DailyCases` and can also be assumed to violate independence.
Autocorrelation can be mitigated by moving to a multiple linear regression model. For example, observe the insignificant result obtained from the Durbin-Watson test for the following set of predictors.
# Create a multiple linear regression model and run the Durbin-Watson test
multiplemodel <- lm(NextDayCases ~ DailyCases + SvnDayCaseAvg, data = df.vic)
durbinWatsonTest(multiplemodel)
## lag Autocorrelation D-W Statistic p-value
## 1 0.03035635 1.937744 0.584
## Alternative hypothesis: rho != 0
We established in the last section that the linear regression model requires at least two predictors in order to satisfy assumption 2. A regression task using the `mlr` package is now created in order to find a set of suitable predictors for the regression model target. Creating the regression task returns the number of observations and features used in the model. The number of features initially equals the number of candidate predictor variables in the `df.vic` data frame, which is 8 at this point (all columns except the target).
# Create a regression task
cases.task <- makeRegrTask(data = df.vic, target = "NextDayCases")
cases.task
## Supervised task: df.vic
## Type: regr
## Target: NextDayCases
## Observations: 188
## Features:
## numerics factors ordered functionals
## 8 0 0 0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
Not all features in a regression task will help the model generalise. In fact, it is common to minimise the number of features in order to avoid overfitting and to prevent the data sparsity associated with the "curse of dimensionality". Both issues are detrimental to model accuracy.
Dimension reduction with feature selection is a technique used to minimise the number of features before training a predictive model. In this multiple linear regression model, a sensible method is to keep features with a sufficiently high Pearson correlation coefficient against the target, removing features whose linear relationship with it is weak; the threshold is arbitrarily chosen here as 0.7.
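The same filter can be sketched in base R: compute each candidate predictor's absolute Pearson correlation with the target and keep those at or above the threshold. The data frame below is made up for illustration; `filterFeatures()` performs this against the real task:

```r
# Toy data frame: 'a' is strongly correlated with the target, 'b' is noise
set.seed(7)
d <- data.frame(a = 1:30, b = rnorm(30), target = 1:30 + rnorm(30))

cors <- sapply(d[setdiff(names(d), "target")],
               function(col) abs(cor(col, d$target)))
keep <- names(cors)[cors >= 0.7]   # same 0.7 threshold as the mlr call
keep                               # only 'a' survives the filter
```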
# Data set dimension reduction using feature selection
cases.task <- filterFeatures(task = cases.task, method = 'linear.correlation', threshold = 0.7)
getTaskFeatureNames(task = cases.task)
## [1] "DailyCases" "SvnDayCaseAvg" "FrtnDayCaseAvg" "SvnDayTestAvg" "FrtnDayTestAvg"
A linear model is now created with the minimised feature set; this model must satisfy the linear regression assumptions mentioned earlier. The `ggResidpanel` package is used below to create a grid of diagnostic plots to check these assumptions with respect to the model.
# Create the multiple linear regression model
linearmodel <- lm(NextDayCases ~ DailyCases + SvnDayCaseAvg + FrtnDayCaseAvg + SvnDayTestAvg + FrtnDayTestAvg, data = df.vic)
# Check OLS assumptions by inspecting the residual plots
residuals <- resid_panel(linearmodel, plots = "R", smoother = TRUE, qqbands = TRUE, theme = "gray")
residuals
Linearity: The model does not satisfy the requirement of linearity as the LOESS lines in the residual and location-scale plots are not horizontal.
Independence: The requirement for independence has been satisfied as the model now contains multiple predictors and does not autocorrelate.
Homoscedasticity: The non-constant variance observed in the residual and location-scale plots shows the requirement of homoscedasticity has been violated.
Normality: The requirement of normality of residuals has been violated as there are many data points that diverge outside of the 95% confidence interval bands in the Q-Q plot.
Therefore, this linear model cannot be used to make predictions about next day COVID-19 cases. All is not lost, however: a Boxcox transformation applied to the independent and dependent variables will normalise their values and perhaps improve this linear model.
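For reference, the Boxcox transformation applied below via the `forecast` package maps a positive value y to (y^λ − 1) / λ for λ ≠ 0, and to log(y) when λ = 0. A minimal base-R sketch with a hypothetical λ of 0.3 (in the actual model each λ is estimated per column with `BoxCox.lambda()`):

```r
# Box-Cox transform and inverse written out directly (hypothetical lambda)
boxcox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
inv_boxcox <- function(z, lambda) {
  if (lambda == 0) exp(z) else (lambda * z + 1)^(1 / lambda)
}

y <- c(30, 49, 51, 68, 96)      # sample daily case counts
z <- boxcox(y, lambda = 0.3)    # compresses the long right tail
inv_boxcox(z, lambda = 0.3)     # recovers the original values
```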
Before continuing, let us first address the residual-leverage plot. Data points with a relatively large Cook's distance have significant influence on the model and might therefore be flagged as potential outliers. In this plot, one observation has a Cook's distance greater than 1; it corresponds to the following observation and is removed from the data set to improve model accuracy going forward.
# Show and remove observations with a Cook's distance greater than 1 in the linear model
linearmodel$model[cooks.distance(linearmodel) > 1,]
## NextDayCases DailyCases SvnDayCaseAvg FrtnDayCaseAvg SvnDayTestAvg FrtnDayTestAvg
## 122 579 694 410 375 24106 25522
df.vic <- df.vic[-122,]
A Boxcox transformation is now applied to the original table of variables to improve normality and thus produce a model that conforms to the required assumptions. The table below shows the transformed values, and the diagnostic plots reveal that all assumptions are now satisfied: the residual and location-scale plots indicate linearity, the residual plot is homoscedastic, and the data points in the Q-Q plot are normally distributed.
# Boxcox transform the original data frame to improve the model
df.vicx <- bind_cols(NextDayCasesx = BoxCox(df.vic$NextDayCases, lambda = BoxCox.lambda(df.vic$NextDayCases, method = 'guerrero')),
DailyCasesx = BoxCox(df.vic$DailyCases, lambda = BoxCox.lambda(df.vic$DailyCases, method = 'guerrero')),
SvnDayCaseAvgx = BoxCox(df.vic$SvnDayCaseAvg, lambda = BoxCox.lambda(df.vic$SvnDayCaseAvg, method = 'guerrero')),
FrtnDayCaseAvgx = BoxCox(df.vic$FrtnDayCaseAvg, lambda = BoxCox.lambda(df.vic$FrtnDayCaseAvg, method = 'guerrero')),
SvnDayTestAvgx = BoxCox(df.vic$SvnDayTestAvg, lambda = BoxCox.lambda(df.vic$SvnDayTestAvg, method = 'guerrero')),
FrtnDayTestAvgx = BoxCox(df.vic$FrtnDayTestAvg, lambda = BoxCox.lambda(df.vic$FrtnDayTestAvg, method = 'guerrero')))
kable(head(round(df.vicx, 3), 5), caption = 'Table 3: First 5 results of the Victoria COVID-19 Boxcox transformed data.')
Table 3: First 5 results of the Victoria COVID-19 Boxcox transformed data.
NextDayCasesx | DailyCasesx | SvnDayCaseAvgx | FrtnDayCaseAvgx | SvnDayTestAvgx | FrtnDayTestAvgx |
---|---|---|---|---|---|
7.068 | 6.210 | 7.822 | 6.737 | 7.439 | 5.854 |
5.712 | 7.735 | 8.048 | 6.951 | 7.518 | 5.863 |
6.310 | 6.165 | 8.048 | 6.992 | 7.582 | 5.881 |
5.631 | 6.853 | 8.135 | 7.191 | 7.607 | 5.881 |
4.687 | 6.072 | 8.092 | 7.342 | 7.518 | 5.879 |
# Create a new linear model with the Boxcox transformed values and then check the OLS linear regression assumptions again
linearmodelx <- lm(NextDayCasesx ~ DailyCasesx + SvnDayCaseAvgx + FrtnDayCaseAvgx + SvnDayTestAvgx + FrtnDayTestAvgx, data = df.vicx)
resid_panel(linearmodelx, "R", smoother = TRUE, qqbands = TRUE, theme = "gray")
With the linear regression assumptions all satisfied, a more advanced form of feature selection can now be used on the transformed model. Backwards elimination is a technique that sequentially removes predictors with p > 0.05, as these predictors are unlikely to contribute meaningfully to the linear model. The result is a model with improved accuracy and performance, as well as reduced overfitting.
Starting Features: `DailyCasesx`, `SvnDayCaseAvgx`, `FrtnDayCaseAvgx`, `SvnDayTestAvgx`, `FrtnDayTestAvgx`
# Create first linear model with all the transformed predictors and target variables
elim1 <- lm(NextDayCasesx ~ DailyCasesx + SvnDayCaseAvgx + FrtnDayCaseAvgx + SvnDayTestAvgx + FrtnDayTestAvgx, data = df.vicx) %>% summary()
# Extract summary information including p values
elim1.summary <- elim1$coefficients %>% round(2) %>% as.data.frame() %>% rownames_to_column()
elim1.summary <- elim1.summary[, c(1,5)]
# Rename and format table
col.names <- c('Predictor', 'Pr (> |t|)')
colnames(elim1.summary) <- col.names
elim1.summary %>% kable(caption = 'Table 4: Backward elimination step one.')
Table 4: Backward elimination step one.
Predictor | Pr (> |t|) |
---|---|
(Intercept) | 0.80 |
DailyCasesx | 0.10 |
SvnDayCaseAvgx | 0.00 |
FrtnDayCaseAvgx | 0.00 |
SvnDayTestAvgx | 0.37 |
FrtnDayTestAvgx | 0.65 |
Predictor selected for elimination: FrtnDayTestAvgx
# Create second linear model with one predictor removed
elim2 <- lm(NextDayCasesx ~ DailyCasesx + SvnDayCaseAvgx + FrtnDayCaseAvgx + SvnDayTestAvgx, data = df.vicx) %>% summary()
# Extract summary information including p values
elim2.summary <- elim2$coefficients %>% round(2) %>% as.data.frame() %>% rownames_to_column()
elim2.summary <- elim2.summary[, c(1,5)]
# Rename and format table
colnames(elim2.summary) <- col.names
elim2.summary %>% kable(caption = 'Table 5: Backward elimination step two.')
Table 5: Backward elimination step two.
Predictor | Pr (> |t|) |
---|---|
(Intercept) | 0.38 |
DailyCasesx | 0.09 |
SvnDayCaseAvgx | 0.00 |
FrtnDayCaseAvgx | 0.00 |
SvnDayTestAvgx | 0.08 |
Predictor selected for elimination: DailyCasesx
# Create third linear model with one predictor removed
elim3 <- lm(NextDayCasesx ~ SvnDayCaseAvgx + FrtnDayCaseAvgx + SvnDayTestAvgx, data = df.vicx) %>% summary()
# Extract summary information including p values
elim3.summary <- elim3$coefficients %>% round(2) %>% as.data.frame() %>% rownames_to_column()
elim3.summary <- elim3.summary[, c(1,5)]
# Rename and format table
colnames(elim3.summary) <- col.names
elim3.summary %>% kable(caption = 'Table 6: Backward elimination step three.')
Table 6: Backward elimination step three.
Predictor | Pr (> |t|) |
---|---|
(Intercept) | 0.34 |
SvnDayCaseAvgx | 0.00 |
FrtnDayCaseAvgx | 0.00 |
SvnDayTestAvgx | 0.05 |
Optimal features: `SvnDayCaseAvgx`, `FrtnDayCaseAvgx`, `SvnDayTestAvgx`

We can now modify the `df.vicx` table to include only the optimal features for model fitting.
# Modify df.vicx data frame as per backward elimination
df.vicx.optimal <- df.vicx[, c('NextDayCasesx', 'SvnDayCaseAvgx', 'FrtnDayCaseAvgx', 'SvnDayTestAvgx')]
A generalised linear predictive model can now be trained and tested with a supervised machine learning task from the `mlr` package. The process is recorded within the subsections below.

A new regression task is first created with the transformed and optimised predictor and target variables. The task has fewer features and observations than the original `cases.task` due to feature selection and outlier removal.
# Update the regression task with the transformed values from the df.vicx.optimal data frame
cases.taskx <- makeRegrTask(data = df.vicx.optimal, target = 'NextDayCasesx')
cases.taskx
## Supervised task: df.vicx.optimal
## Type: regr
## Target: NextDayCasesx
## Observations: 187
## Features:
## numerics factors ordered functionals
## 3 0 0 0
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
A generalised linear model algorithm is now chosen as the learner for the linear regression task.
# Create a generalised linear regression learner
lrn <- makeLearner(cl = 'regr.glm')
lrn
## Learner regr.glm from package stats
## Type: regr
## Name: Generalized Linear Regression; Short name: glm
## Class: regr.glm
## Properties: numerics,factors,se,weights
## Predict-Type: response
## Hyperparameters: family=gaussian,model=FALSE
The learner and task are now used in combination with a leave-one-out (LOO) cross validation (CV) resampling method to fit the model.
Cross validation is a resampling method which randomly splits the data into k subsets (or 'folds'). The model is then fitted k times, each time training on all but one fold and testing on the held-out fold.

When cross validation is used exhaustively, every possible split is explored, so the number of folds equals the total number of observations in the data set. This is otherwise known as LOO CV.
Why use LOO CV in the first place?
Randomly splitting the data into training and test subsets just once is a common approach to fitting a model. However, it is not a reliable approach for testing, as measured model performance can vary between different test sets. LOO CV is a reliable way to measure model performance because every training/test split is explored and the results are averaged.
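The procedure that `resample()` carries out below can be sketched in base R: hold out each observation in turn, fit on the remainder, predict the held-out point, and average the absolute errors. The data here are made up for illustration:

```r
# Leave-one-out cross validation for a toy linear model
set.seed(1)
d <- data.frame(x = 1:25, y = 2 * (1:25) + rnorm(25))

errors <- sapply(seq_len(nrow(d)), function(i) {
  fit <- lm(y ~ x, data = d[-i, ])                # train on all but row i
  abs(predict(fit, newdata = d[i, ]) - d$y[i])    # test on the held-out row
})
mean(errors)   # the leave-one-out estimate of MAE
```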
Model performance is measured in terms of a loss function; a metric to evaluate the average error between the predicted response and true values. Selecting an appropriate loss function requires an understanding of the domain of the machine learning problem; this includes any real-world constraints or conditions.
If the domain of the problem requires large errors (outliers) to carry more weight than smaller errors, then a suitable loss function is mean-square error (MSE). Conversely, if the domain of the problem requires equal weighting for all errors, then mean absolute error (MAE) should be used.
For this linear regression model, error weighting for the transformed COVID-19 cases should scale linearly; and so, MAE is appropriately chosen.
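The difference in weighting is easy to see on a small, made-up error vector containing one outlier:

```r
# Three small errors and one large one (hypothetical residuals)
err <- c(1, 1, 1, 10)

mae <- mean(abs(err))   # every error weighted equally
mse <- mean(err^2)      # squaring lets the outlier dominate

c(MAE = mae, MSE = mse)   # MAE = 3.25, MSE = 25.75
```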
Running the resampling method below to fit the model achieves an average model performance of MAE = 0.7102347.
# Set seed value
set.seed(1234)
# Set resample method as leave-one-out cross validation
rdesc <- makeResampleDesc(method = "LOO")
# Perform resampling of the learner on the task with leave-one-out cross validation.
model.performance <- resample(learner = lrn, task = cases.taskx, resampling = rdesc, measures = list(mae))
model.performance
## Resample Result
## Task: df.vicx.optimal
## Learner: regr.glm
## Aggr perf: mae.test.mean=0.7102347
## Runtime: 0.844511
Most predictive models have limitations, and this one is no exception. To investigate possible constraints, a functional predictive model must first be created from a train/test data split. Note that cross validation was performed for testing purposes only; a working model requires only a conventional data split.
# Split the data into random subsets
set.seed(1234)
n <- nrow(df.vicx.optimal)
train.set <- sample(n, size = 4/5*n)
test.set <- df.vicx.optimal[setdiff(1:n, train.set),]
# Train the learner on the training data set
model.optimal <- train(lrn, cases.taskx, train.set)
In the table below are the Boxcox transformed predictions, 95% upper and lower prediction bounds, and true values of the fitted model.
# Get model prediction and true values
prediction.points <- predict(model.optimal, newdata = test.set) %>%
as.data.frame() %>%
round(3)
# Get 95% prediction interval values
prediction.interval <- predict.lm(
lm(NextDayCasesx ~ SvnDayCaseAvgx + FrtnDayCaseAvgx + SvnDayTestAvgx, data = df.vicx.optimal),
newdata = test.set,
interval = "predict") %>%
as.data.frame() %>%
round(3)
# Combine data into one data frame
predictions <- bind_cols(prediction.points, prediction.interval)
predictions <- predictions[-2]
# Edit names
colnames(predictions) <- c('True Values', 'Predictions', '95% Lower Bound', '95% Upper Bound')
# Output table
kable(head(predictions, 5), caption = 'Table 7: First 5 results of the transformed predictions, true values, and 95% upper and lower prediction bounds.')
Table 7: First 5 results of the table of transformed predictions, true values, and 95% upper and lower prediction bounds.
True Values | Predictions | 95% Lower Bound | 95% Upper Bound |
---|---|---|---|
7.068 | 5.749 | 3.915 | 7.582 |
4.687 | 5.724 | 3.898 | 7.549 |
4.284 | 2.575 | 0.734 | 4.417 |
1.214 | 2.704 | 0.874 | 4.535 |
1.214 | 2.127 | 0.305 | 3.949 |
And below is the mapping between the true values and the Boxcox transformed true values of `NextDayCases` from the `df.vic` and `df.vicx.optimal` tables.
# Merge together the original and transformed next day case values into a new data frame
nextdaycases.compare <- bind_cols(df.vic$NextDayCases, df.vicx.optimal$NextDayCasesx) %>% round(3)
# Tidy and output merged table
colnames(nextdaycases.compare) <- c('True Values', 'Boxcox Transformed')
kable(head(nextdaycases.compare, 5), caption = 'Table 8: First 5 results of the mapping between Boxcox transformed values and true values of next day cases.')
Table 8: First 5 results of the mapping between Boxcox transformed values and true values of next day cases.
True Values | Boxcox Transformed |
---|---|
96 | 7.068 |
51 | 5.712 |
68 | 6.310 |
49 | 5.631 |
30 | 4.687 |
In order to extract meaningful next day COVID-19 case predictions from the model, any prediction must undergo an inverse Boxcox transformation to map its predicted value back to the original scale. The values inside tables 7 and 8 are visualised below and reveal that this mapping rises exponentially. Therefore, for increasingly large Boxcox predictions, the 95% prediction interval will also grow exponentially on the original scale, reducing model accuracy.
# Create linear regression plot with the 95% prediction interval
plot1 <- ggplot(data = predictions, mapping = aes(y = `True Values`)) +
geom_point(mapping = aes(x = `Predictions`)) +
geom_smooth(mapping = aes(x = `Predictions`), method = "lm", se = FALSE, color = 'red') +
geom_ribbon(mapping = aes(x = `Predictions`, ymin = `95% Lower Bound`, ymax = `95% Upper Bound`), fill = 'steelblue', alpha = 0.2)
# Create a plot to compare true predicted values against mapped predicted Boxcox values
plot2 <- ggplot(data = nextdaycases.compare, mapping = aes(x = `Boxcox Transformed`, y = `True Values`)) +
geom_line(color = 'red', size = 1) +
geom_point()
# Combine plots as grid
plt <- plot_grid(plot1, plot2)
# Create a title
title <- ggdraw() + draw_label("The Transformed Linear Regression Model and Mapping between Boxcox Transformed Values and True Values", fontface='bold')
# Output grid plot
plot_grid(title, plt, ncol=1, rel_heights=c(0.1, 1))
A further analysis of this limitation requires the average 95% prediction interval width of the model, given in the output below. Table 9 then uses arbitrary Boxcox prediction values, ranging from 1 to 10, together with the average prediction width to formulate Boxcox prediction intervals. These values and intervals are mapped back into true values, revealing the exponentially increasing true prediction width.
# Calculate and show prediction interval width
prediction.width <- ((sum(predictions$`95% Upper Bound`) - sum(predictions$`95% Lower Bound`)) / nrow(predictions)) %>% round(3)
prediction.width
## [1] 3.625
# Code to produce a table showing 'Boxcox Values', 'Boxcox Prediction Intervals', 'True Values', 'True Prediction Intervals' and 'True Prediction Width'.
width <- prediction.width / 2
boxcox.values <- seq(1, 10, by = 1)
true.values <- InvBoxCox(boxcox.values, lambda = BoxCox.lambda(df.vic$NextDayCases, method = 'guerrero')) %>% round(0)
true.intervals <- c()
boxcox.intervals <- c()
lower.bound <- c()
upper.bound <- c()
true.width <- c()
for (i in 1:10) {
boxcox.intervals[i] <- paste0('[', as.character(i - width), ', ', as.character(i + width), ')')
lower.bound[i] <- InvBoxCox(i - width, lambda = BoxCox.lambda(df.vic$NextDayCases, method = 'guerrero')) %>% round(0)
upper.bound[i] <- InvBoxCox(i + width, lambda = BoxCox.lambda(df.vic$NextDayCases, method = 'guerrero')) %>% round(0)
true.intervals[i] <- paste0('[', as.character(lower.bound[i]), ', ', as.character(upper.bound[i]), ')')
true.width[i] <- upper.bound[i] - lower.bound[i]
}
model.error <- data.frame(boxcox.values, boxcox.intervals, true.values, true.intervals, true.width)
colnames(model.error) <- c('Boxcox Values', 'Boxcox Prediction Intervals', 'True Values', 'True Prediction Intervals', 'True Prediction Width')
kable(model.error, caption = 'Table 9: A table of prediction values and intervals to highlight the increasing width of the true prediction interval.')
Table 9: A table of prediction values and intervals to highlight the increasing width of the true prediction interval.
Boxcox Values | Boxcox Prediction Intervals | True Values | True Prediction Intervals | True Prediction Width |
---|---|---|---|---|
1 | [-0.8125, 2.8125) | 3 | [0, 10) | 10 |
2 | [0.1875, 3.8125) | 6 | [1, 18) | 17 |
3 | [1.1875, 4.8125) | 11 | [3, 32) | 29 |
4 | [2.1875, 5.8125) | 20 | [6, 54) | 48 |
5 | [3.1875, 6.8125) | 35 | [12, 86) | 74 |
6 | [4.1875, 7.8125) | 59 | [23, 132) | 109 |
7 | [5.1875, 8.8125) | 93 | [39, 197) | 158 |
8 | [6.1875, 9.8125) | 143 | [64, 287) | 223 |
9 | [7.1875, 10.8125) | 212 | [101, 408) | 307 |
10 | [8.1875, 11.8125) | 307 | [154, 567) | 413 |
A multiple linear regression model has been created from 3 Boxcox transformed predictors to predict next day COVID-19 cases with Victorian data. Leave-one-out cross validation achieved an average model performance of MAE = 0.7102347. To obtain meaningful predictions, the predicted values must be mapped back to the original scale with an inverse Boxcox transformation. This process reveals an increasingly wide 95% true prediction interval, presumably due to the heteroscedasticity of the original data. Therefore, large predictions should be interpreted cautiously, as model accuracy decreases with prediction magnitude.