Noam Ross 2024-01-18
OK, I want to fit a model that has multiple continuous, correlated outcomes as a multivariate normal using mgcv::mvn. However, data from some of the outcomes are missing. This doc is an exploration of approaches for this.
First let's generate some data. In this case I make a data frame with two input (x) variables and three output (y) variables, with y3 missing 90% of its values. My data simulation function creates some random nonlinear functions with splinefun() and a random covariance matrix for the outcome. These can be retrieved as attributes of the data later. (Hidden here are some data simulation functions.)
# Generate a dataset with missing values from a multivariate normal distribution
# @param n number of observations
# @param nx number of x variables
# @param ny number of outcome variables
# @param shared_fns number of functions to share across outcome variables, up to nx (not used yet)
# @param x_range range of x values
# @param yrange range of y values
# @param k number of knots for each function
# @param coef_mat matrix of coefficients for each function, typically 1 or zero
# @param miss proportion of missing values for each outcome variable
# @param V covariance matrix for the outcome variables
# @param seed random seed
simulate_mvn_missing <- function(n = 300, nx = 2, ny = 3, x_range = c(0, 1), yrange = c(0, 1), k = 4,
                                 coef_mat = matrix(1, nx, ny), miss = c(0, 0, 0.2), shared_fns = 1,
                                 V = generate_cov_matrix(ny, scale = 1), seed = 0) {
  # Generate a series of nonlinear functions
  set.seed(seed)
  fns <- replicate(nx * ny, {
    x <- c(x_range[1], runif(k - 2, x_range[1], x_range[2]), x_range[2])
    y <- runif(1) + runif(1) * x + runif(k, yrange[1], yrange[2])
    splinefun(x = x, y = y, method = "fmm")
  })
  dim(fns) <- c(ny, nx)

  # Random X values
  x <- matrix(0, n, nx)
  set.seed(seed)
  for (i in seq_len(nx)) {
    x[, i] <- runif(n, x_range[1], x_range[2])
  }
  colnames(x) <- paste0("x", seq_len(nx))

  # Generate Y values
  y <- matrix(0, n, ny)
  for (i in seq_len(ny)) {
    for (j in seq_len(nx)) {
      y[, i] <- y[, i] + fns[i, j][[1]](x[, j])
    }
  }
  y <- y + mgcv::rmvn(n, mu = rep(0, ny), V = V)
  colnames(y) <- paste0("y", seq_len(ny))

  # Missing data
  y_miss <- y
  set.seed(seed)
  for (i in seq_len(ny)) {
    y_miss[sample(n, floor(n * miss[i])), i] <- NA
  }

  df <- as.data.frame(cbind(x, y_miss))
  attr(df, "true_V") <- V
  attr(df, "true_fns") <- fns
  attr(df, "true_data") <- as.data.frame(cbind(x, y))
  df
}
generate_cov_matrix <- function(dim, scale = 1) {
  U <- matrix(rnorm(dim^2), dim, dim)
  U[lower.tri(U)] <- 0
  # Ensure diagonal elements are positive
  diag(U) <- abs(diag(U)) + 1e-6 # Adding a small constant for numerical stability
  # Construct the covariance matrix
  covMatrix <- U %*% t(U) * scale
  return(covMatrix)
}
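(A quick illustrative check: because V is built as U %*% t(U) from a triangular factor with a strictly positive diagonal, it should always be positive definite.)
# Sanity check: the smallest eigenvalue should be positive
min(eigen(generate_cov_matrix(4))$values) > 0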
set.seed(0)
V <- matrix(1 + rnorm(9, sd = 0.1), 3) + diag(3) * 0.5
V[lower.tri(V)] <- V[t(lower.tri(V))] # symmetrize
data <- simulate_mvn_missing(n = 300, miss = c(0,0,0.9), seed = 13, V = V)
OK, first strategy. Following the approach in ?mgcv::missing.data, we create new index variables, coded as ordered factors, that indicate whether the outcome is missing, and use by= in the smooth terms. In this case I also center the outcome variables so we don't have to deal with intercepts.
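To see why this works, here's a minimal sketch (toy data; my understanding of the mechanism): with an ordered-factor by variable, mgcv builds the smooth only for levels above the first, so the model-matrix rows for first-level ("missing") observations are all zero and those rows contribute nothing to the term.
toy <- data.frame(x = runif(20),
                  id = ordered(rep(c(0, 1), 10), levels = c("0", "1")))
sm <- mgcv::smoothCon(mgcv::s(x, by = id, k = 4), data = toy)[[1]]
all(sm$X[toy$id == "0", ] == 0) # TRUE: "missing" rows drop out of the fit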
xvars <- c("x1", "x2")
yvars <- c("y1", "y2", "y3")
data_missing <- data # The data we'll fit
data_full <- attr(data, "true_data")
ymeans <- numeric(length(yvars))
idvars <- character(length(yvars))

# Make ordered ID variables (0 = missing), center the outcome variables, and set missing values to zero
for (i in seq_along(yvars)) {
  yvar <- yvars[i]
  idvar <- paste0("id_", yvar)
  idvars[i] <- idvar
  # Center the outcome variables so we don't deal with intercepts, save the means
  ymeans[i] <- mean(data_missing[[yvar]], na.rm = TRUE)
  data_missing[[yvar]] <- data_missing[[yvar]] - ymeans[i]
  data_full[[yvar]] <- data_full[[yvar]] - ymeans[i]
  # Create indicator variables (id_*) for whether to include an observation, as ordered factors
  # with 0 being missing and 1 being present
  data_missing[[idvar]] <- ordered(ifelse(is.na(data_missing[[yvar]]), 0, 1), levels = c("0", "1"))
  # Set missing values to zero
  data_missing[[yvar]][is.na(data_missing[[yvar]])] <- 0
}
# Create no-intercept formulas where all terms are conditional on the id value of the outcome
frms <- lapply(seq_along(yvars), function(i) {
  paste0(yvars[i], " ~ 0 + ",
         paste0("s(", xvars, ", by = ", idvars[i], ", k = 4)", collapse = " + ")) |>
    as.formula()
})
frms
## [[1]]
## y1 ~ 0 + s(x1, by = id_y1, k = 4) + s(x2, by = id_y1, k = 4)
## <environment: 0x7fe2972d5498>
##
## [[2]]
## y2 ~ 0 + s(x1, by = id_y2, k = 4) + s(x2, by = id_y2, k = 4)
## <environment: 0x7fe2972e01b8>
##
## [[3]]
## y3 ~ 0 + s(x1, by = id_y3, k = 4) + s(x2, by = id_y3, k = 4)
## <environment: 0x7fe2956c7978>
# Create formulas for the full model without missing data or index terms
frms_full <- lapply(seq_along(yvars), function(i) {
  paste0(yvars[i], " ~ 0 + ",
         paste0("s(", xvars, ", k = 4)", collapse = " + ")) |>
    as.formula()
})
# Model with missing outcomes
mod_miss <- mgcv::gam(
  frms,
  family = mgcv::mvn(d = length(yvars)),
  data = data_missing,
  method = "REML"
)

# Full model
mod_full <- mgcv::gam(
  frms_full,
  family = mgcv::mvn(d = length(yvars)),
  data = data_full,
  method = "REML"
)
# Plot each model
plot(mod_full, pages = 1, shade = TRUE, ylim = c(-3, 3), xlim = c(0, 1))
plot(mod_miss, pages = 1, shade = TRUE, ylim = c(-3, 3), xlim = c(0, 1))
In mod_miss, in this case, the smooths are different from those in mod_full, and given the correlation between outcomes it makes sense that they differ for more than just the last two smooths (those for the mostly-missing y3). However, the scale of uncertainty is the same between the two models, despite the missing-data model having 90% less data for y3.
I assume this is because in the current model the missing values are zero, and for the rows with missing data the zero-intercept model is very good at estimating a zero value!
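A quick sketch to check this intuition (r3 and imputed are ad hoc names): on the zero-imputed rows, the no-intercept model with switched-off smooths predicts exactly zero, so those residuals should be essentially zero, while the 30 observed rows show real scatter.
r3 <- residuals(mod_miss, type = "response")[, 3]
imputed <- data_missing$id_y3 == "0"
c(sd_imputed = sd(r3[imputed]), sd_observed = sd(r3[!imputed]))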
But the terms with by= seem to have variances estimated as if they had all the values, rather than only the few non-missing ones. Is there a way to let the smooth know that its n is 30 rather than 300? It seems like this might be possible by modifying the penalty matrix somehow.
One option for getting around this could be, instead of replacing the missing values with zeros, replacing them with random values with the same variance as the non-missing values. However, this would change the covariance between the outcomes that the model estimates. (To be fair, I might be doing this already by replacing them with zeros.) I'm interested in the covariance as an outcome. In theory I could also calculate the covariance between the outcomes by running cov(..., use = "pairwise.complete.obs") on the response residuals. The model estimates would still be different, though, and I'm not sure how they would differ.
The other problem, which I've not yet addressed: what if I have a term shared across outcomes, such as 1 + 2 + 3 ~ 0 + s(x4) + s(x5), in the model? The best idea I can come up with is to make several terms, each representing a condition where a different combination of outcomes is missing, and then sum them up (see the sketch below).
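A hypothetical sketch of that idea (pattern is a new, illustrative variable; the id_* factors are the ones built above):
pattern <- interaction(data_missing$id_y1, data_missing$id_y2,
                       data_missing$id_y3, drop = TRUE)
table(pattern) # which combinations of outcomes are observed together
# A shared term might then be split into one smooth per pattern, e.g.
# 1 + 2 + 3 ~ 0 + s(x4, by = pat_111) + s(x4, by = pat_110) + ...
# where each pat_* is an ordered 0/1 indicator for one pattern
OK, let's try the random values approach.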
data_missing_rand <- data

for (i in seq_along(yvars)) {
  yvar <- yvars[i]
  idvar <- paste0("id_", yvar)
  idvars[i] <- idvar
  # Center the outcome variables so we don't deal with intercepts, save the means
  ymeans[i] <- mean(data_missing_rand[[yvar]], na.rm = TRUE)
  data_missing_rand[[yvar]] <- data_missing_rand[[yvar]] - ymeans[i]
  # Create indicator variables (id_*) for whether to include an observation, as ordered factors
  # with 0 being missing and 1 being present
  data_missing_rand[[idvar]] <- ordered(ifelse(is.na(data_missing_rand[[yvar]]), 0, 1), levels = c("0", "1"))
  # Replace missing values with random draws matching the sd of the non-missing values
  data_missing_rand[[yvar]][is.na(data_missing_rand[[yvar]])] <-
    rnorm(sum(is.na(data_missing_rand[[yvar]])), mean = 0,
          sd = sd(data_missing_rand[[yvar]], na.rm = TRUE))
}
mod_miss_random <- mgcv::gam(
  frms,
  family = mgcv::mvn(d = length(yvars)),
  data = data_missing_rand,
  method = "REML"
)
plot(mod_full, pages = 1, shade = TRUE, ylim = c(-3, 3), xlim = c(0, 1))
plot(mod_miss_random, pages = 1, shade = TRUE, ylim = c(-3, 3), xlim = c(0, 1))
OK, the random value approach does give us more appropriate uncertainty for the specific smooth terms. What are the consequences? Let’s look at the covariance matrix:
(V_true <- attr(data, "true_V"))
## [,1] [,2] [,3]
## [1,] 1.6262954 1.127243 0.9071433
## [2,] 1.1272429 1.541464 0.9705280
## [3,] 0.9071433 0.970528 1.4994233
(V_full <- solve(crossprod(mod_full$family$data$R)))
## [,1] [,2] [,3]
## [1,] 1.5294827 0.9250831 0.8932648
## [2,] 0.9250831 1.4173779 0.8996986
## [3,] 0.8932648 0.8996986 1.5900505
(V_miss <- solve(crossprod(mod_miss$family$data$R)))
## [,1] [,2] [,3]
## [1,] 1.57845050 0.95521788 0.05894369
## [2,] 0.95521788 1.43520526 0.04660502
## [3,] 0.05894369 0.04660502 0.09346221
(V_miss_random <- solve(crossprod(mod_miss_random$family$data$R)))
## [,1] [,2] [,3]
## [1,] 1.5347552 0.92767526 0.18493357
## [2,] 0.9276753 1.41796461 0.07082077
## [3,] 0.1849336 0.07082077 1.57273981
The zeros-for-missing-data approach underestimates both variance and covariance. The missing-data-with-random-values approach underestimates only the covariance. Let's look at the covariance if we estimate it from the residuals:
(V_full_res <- cov(residuals(mod_full, type = "response"))) # Same as V_full, as expected
## [,1] [,2] [,3]
## [1,] 1.5345981 0.9281770 0.8962523
## [2,] 0.9281770 1.4221183 0.9027077
## [3,] 0.8962523 0.9027077 1.4939191
For the missing-data cases we estimate the covariance pairwise, using only the non-missing residuals:
# Discard the zero or randomly inserted values for the Y response, using
# the original data (which still has its NAs) to mark the imputed cells
res_miss <- residuals(mod_miss, type = "response")
res_miss[is.na(as.matrix(data[yvars]))] <- NA
(V_miss_res <- cov(res_miss, use = "pairwise.complete.obs")) # Compare to V_miss
## [,1] [,2] [,3]
## [1,] 1.58372960 0.95841259 0.05914083
## [2,] 0.95841259 1.44000528 0.04676089
## [3,] 0.05914083 0.04676089 0.09372410
res_miss_random <- residuals(mod_miss_random, type = "response")
res_miss_random[is.na(as.matrix(data[yvars]))] <- NA
(V_miss_random_res <- cov(res_miss_random, use = "pairwise.complete.obs")) # Compare to V_miss_random
## [,1] [,2] [,3]
## [1,] 1.5398882 0.93077785 0.18555207
## [2,] 0.9307779 1.42270697 0.07105762
## [3,] 0.1855521 0.07105762 1.57737015
These also turn out the same as the estimated values from the model. The random-data approach underestimates variance and covariance less than the zeros-for-missing-data approach, relative to the full model (which itself differs somewhat from the true values). Looking at correlation rather than covariance, though, it's not quite as intuitive: since the zeros approach underestimates the overall variance, it estimates higher correlation than the random-data approach, and both are underestimates:
cov2cor(V_true)
## [,1] [,2] [,3]
## [1,] 1.0000000 0.7119526 0.5809170
## [2,] 0.7119526 1.0000000 0.6383799
## [3,] 0.5809170 0.6383799 1.0000000
cov2cor(V_full)
## [,1] [,2] [,3]
## [1,] 1.0000000 0.6282979 0.5727993
## [2,] 0.6282979 1.0000000 0.5993062
## [3,] 0.5727993 0.5993062 1.0000000
cov2cor(V_miss)
## [,1] [,2] [,3]
## [1,] 1.0000000 0.6346437 0.153463
## [2,] 0.6346437 1.0000000 0.127250
## [3,] 0.1534630 0.1272500 1.000000
cov2cor(V_miss_random) # Way underestimates correlation
## [,1] [,2] [,3]
## [1,] 1.0000000 0.62884515 0.11903310
## [2,] 0.6288452 1.00000000 0.04742414
## [3,] 0.1190331 0.04742414 1.00000000
Anecdotally, the general patterns above are consistent across different random seeds.
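For reference, a rough sketch of how I'd check that (check_seed is a new helper reusing the simulation function, formulas, and V from above; not run here):
check_seed <- function(s) {
  d <- simulate_mvn_missing(n = 300, miss = c(0, 0, 0.9), seed = s, V = V)
  for (yv in yvars) {
    d[[yv]] <- d[[yv]] - mean(d[[yv]], na.rm = TRUE)
    d[[paste0("id_", yv)]] <- ordered(ifelse(is.na(d[[yv]]), 0, 1), levels = c("0", "1"))
    d[[yv]][is.na(d[[yv]])] <- 0
  }
  m <- mgcv::gam(frms, family = mgcv::mvn(d = length(yvars)), data = d, method = "REML")
  solve(crossprod(m$family$data$R)) # compare to attr(data, "true_V")
}
lapply(c(1, 7, 42), check_seed)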
Crap, am I going to have to fit all those latent random effects as in ?mgcv::missing.data? That's both ugly and computationally intense: I'd need a random-effects term for each outcome in each formula, each with all those effect levels, and that will blow up in the real model, which has both more outcomes and more parameters.
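For the record, a heavily hedged sketch of my understanding of what that might look like for one outcome (row and notid_y3 are new, illustrative variables; not run):
data_missing$row <- factor(seq_len(nrow(data_missing)))
data_missing$notid_y3 <- as.numeric(data_missing$id_y3 == "0")
# Each formula would then carry a per-observation "re" term, switched on
# only where that outcome is missing, e.g.
# y3 ~ 0 + s(x1, by = id_y3, k = 4) + s(x2, by = id_y3, k = 4) +
#   s(row, bs = "re", by = notid_y3)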