Add `acf` (and maybe vignette with some `plot` examples)

Question

Add `acf` (and maybe vignette with some `plot` examples)

Opened this issue 10 months ago · 3 comments

In the ideal world the "unstructured" pattern of covariance should save us from complex, hard-to-guess-apriori considerations about the expected residual cov. structure. If only it always converged :)

If it fails to converge, there are some options to account for it in our SAPs:

we can pre-specify the path of "fallback", like: US -> ADH, -> AD -> AR1 H or Toeplitz (stationary-M-dependent(m)) or spatial -> CSH -> CS
guess the covariance structure a priori, based on some prior investigations, literature, common sense. People often assume just AR1 or SP(POW) [for unequal visits/timepoints].
follow the least-resistance line or not care by specifying CS. Seems legit :) Nobody will argue "hey, you ignored the within-subject covariance", easy to fit, easy to talk about, compliant with tons of literature. Only it can be far from reality and even farther from any reasonable scientific justification for the strong assumption that "all subjects' organisms respond with the same rate" (approximately translates to random-intercept LMM, if we enforce only non-negative covariances).
some specify automated selection methods, e.g. based on AIC/BIC/AICC (or QIC/AGPC/GHYC/etc. in GEE). BTW, I found mmrm:::AIC.mmrm_tmb() and mmrm:::BIC.mmrm_tmb() but missed AICC.

In either case, some people (including myself) are curious about the autocorrelation and look at ACF and/or (semi)variogram to check if and how each structure helped (or messed): 1) acf(residuals(model, type="response")) then 2) acf(residuals(model, type="normalized"))

For nlme::gls() we can use either:

acf(residuals(model, type="normalized")), but it loses the information about clusters (subjects)

layout(1:2)
acf(resid(ma, type = "response"))
acf(resid(ma, type = "normalized"))

plot(ACF(ma, form=~1|PatientId, resType = "response"), grid = TRUE), which preserves the number of timepoints (visits)

gridExtra::grid.arrange(plot(ACF(ma, form=~1|PatientId,resType = "response"), alpha = .05),
                        plot(ACF(ma, form=~1|PatientId,resType = "normalized"), alpha = .05))

Optionally also:

gridExtra::grid.arrange(
   plot(Variogram(ma, resType = "response"), smooth = FALSE, sigma = ma$sigma, ylim = c(0, 1.1*ma$sigma)),
   plot(Variogram(ma, resType = "normalized"), smooth = FALSE, sigma = ma$sigma, ylim = c(0, 1.1*ma$sigma))
)

For the mmrm we can still use acf(residuals(model, type="normalized")), but, as mentioned previously, it loses the information about clusters (subjects). OK, if used just for some very rough exploration, it can be so. Some patterns will be visible anyway.

For semi-variogram I can employ the nlme::gls() followed by its Variogram() function with appropriate type of residuals ("response" for the raw, "normalized" accounting for the covariance), which isn't very much convenient in terms of translating correlation + weight terms in gls() into mmrm's us/ar1/csh/cs, etc.

If it can be done better, than why not :)

For example, my assessments look like:

$ICs

struct                          AIC        BIC
------------------------  ---------  ---------
1) unstructured             821.482   868.7004
2) ante-dependence het.    846.7565     871.49
3) ante-dependence         845.5287   859.0197
4) AR(1) het.              845.9508   861.6902
5) AR(1)                   847.4463   851.9432
6) exchangeable het.       867.0957   882.8352
7) exchangeable             869.382    873.879

$ACF_plot

PS: just a very minor thing - for xIC comparisons it would be helpful to have "ind()" (independence) structure in mmrm. You know, something like nlme::gls(correlation = NULL, weights = varIdent(form =~1|Visit)). Just a technical "sugar". Just saying :)

Answer 1 · 2023-11-16T04:52:10.000Z

OK, I extracted the code of nlme:::ACF.gls and made a few quick and dirty adjustments to make it working with mmrm. You may polish it and make it more functional than this rudimentary one:

mmrm_ACF <- function(mmrm, maxLag = NULL, resType = "response") {

  stopifnot(inherits(mmrm, "mmrm_fit"))
  
  res <- resid(mmrm, type = resType)
  grps <- mmrm$data[[mm$formula_parts$subject_var]]
  res <- split(res, grps)
  
  if (is.null(maxLag)) {
    maxLag <- min(c(maxL <- max(lengths(res)) - 1L,
                    as.integer(10 * log10(maxL + 1))))
  }
  
  val <- lapply(res, function(el, maxLag) {
    N <- maxLag + 1L
    tt <- double(N)
    nn <- integer(N)
    N <- min(c(N, n <- length(el)))
    nn[1:N] <- n + 1L - 1:N
    for (i in 1:N) {
      el1 <- el[1:(n - i + 1L)]
      el2 <- el[i:n]
      tt[i] <- sum(el1 * el2)
    }
    array(c(tt, nn), c(length(tt), 2L))
  }, maxLag = maxLag)
  val0 <- apply(sapply(val, function(x) x[, 2L]), 1, sum)
  val1 <- apply(sapply(val, function(x) x[, 1L]), 1, sum)/val0
  val2 <- val1/val1[1L]
  
# I added n.used also to the final data.frame for convenience 
# and left it as an attribute for the existing printing function.
  return(structure(data.frame(lag = 0:maxLag, 
                              ACF = val2, 
                              n.used = val0), 
                   n.used = val0, 
                   class = c("ACF", "data.frame")))
}

Let's check:

> (x <- mmrm_ACF(mm, resType = "normalized"))
  lag          ACF n.used
1   0  1.000000000    357
2   1  0.008928009    287
3   2 -0.000286937    217
4   3  0.020181483    150
5   4 -0.027496040     85
6   5  0.054971756     28
> attributes(x)
$names
[1] "lag" "ACF"

$class
[1] "ACF"        "data.frame"

$row.names
[1] 1 2 3 4 5 6

$n.used
[1] 357 287 217 150  85  28

It assigns the ACF class name, so the standard nlme::plot.ACF can handle it. We will use it to quickly check the result:

gridExtra::grid.arrange(
plot(mmrm_ACF(mm), alpha = 0.05),
plot(mmrm_ACF(mm, resType = "normalized"), alpha = 0.05))

OK, it works. But I much prefer the ggplot2 version, so I made a quick function returning ggplots for the response residuals (to check what's going on before we apply the GLS) and the normalized ones (to check whether it helped), respectively.

It returns pure ggplot2, so it can be modified any way we want. If it's just fine to print both without any changes, we can request displaying the combined (via patchwork) figure.

plot_mmrm_ACF <- function(mmrm, alpha = NULL, print_combined = FALSE) {
  
  response_ACF   <- mmrm_ACF(mmrm, resType = "response")
  normalized_ACF <- mmrm_ACF(mmrm, resType = "normalized")

  plot_ACF <- function(acf, alpha = NULL) {
    acf %>%
      {
        if(!is.null(alpha))
          mutate(., ci_line = qnorm(1 - alpha/2)/sqrt(n.used))
        else 
          mutate(., ci_line = NA_real_)
      } %>% 
      ggplot(aes(x = lag, y = ACF)) +
      geom_hline(aes(yintercept = 0)) +
      geom_segment(mapping = aes(xend = lag, yend = 0), show.legend = FALSE) +
      theme_bw() +
      {
        if(!is.null(alpha))
          list(geom_line(aes(y =  as.numeric(ci_line), group = 1), color = 'darkblue', linetype="dashed"),
               geom_line(aes(y = -as.numeric(ci_line), group = 1),  color = 'darkblue', linetype="dashed"))
      } +
      ylab("Autocorrelation") + 
      xlab("Lag")
  }
  
  plot_response_ACF   <- plot_ACF(acf = response_ACF, alpha = alpha)
  plot_normalized_ACF <- plot_ACF(acf = normalized_ACF, alpha = alpha)
  
  if(print_combined)
    print(patchwork::wrap_plots((plot_response_ACF + ggtitle("ACF of response residuals")) /
                                (plot_normalized_ACF + ggtitle("ACF of normalized residuals"))))
  
  return(
    invisible(
      list(plot_response_ACF = plot_response_ACF,
           plot_normalized_ACF = plot_normalized_ACF)))
}

plot_mmrm_ACF(mmrm = mm, print_combined = TRUE, alpha = 0.05) produces:

If we skip the print_combined parameter, it will return both ggplot2 objects:

> names(plot_mmrm_ACF(mmrm = mm, alpha = 0.05))
[1] "plot_response_ACF"   "plot_normalized_ACF"

which we can adjust:

plot_mmrm_ACF(mmrm = mm)[[1]] + theme_classic() + ggtitle("Some title")

and finally:

Answer 2 · 2023-11-16T17:55:04.000Z

Just by the way (I didn't want to multiply threads), when we speak about the residuals, it would be beneficial IMO to implement some ready-to-use tools for the diagnostics. lm() + nlme + lmer and a few more have them - just type plot(model) and you obtain a set of plots. There are specialized packages dedicated to it, but like olsrr, but dedicated only to the general linear model via OLS (lm()). When dealing with the general linear model via GLS or GEE, the other types of residuals become essential for the assessment what happens / what helped.

It's really cool mmrm now provides the basic types of residuals, so everyone can implement the needed plots ad hoc, but anyway, it's always better to have at least a basic set of tools ready to use, because it's consistent with the rest of the package.

For example, the fitted vs residual plot

plot_residual_vs_fitted <- function(model, res_type = "normalized") {

 plots <- list()
 
 plots[[1]] <- data.frame(norm_resid = resid(model, type = "response"),
                          fitted = fitted(model)) %>% 
  ggplot(aes(x = fitted, y = norm_resid)) +
  geom_point() +
  geom_hline(yintercept=0, linetype="dashed") +
  theme_bw() +
  xlab("Fitted") + ylab("Response (raw) residuals") +
  labs(title = "Response (raw) residuals vs. fitted values diagnostic plot",
       subtitle = sprintf("Breusch-Pagan: %s | Goldfeld-Quandt: %s\nRainbow: %s | Harvey-Collier: %s | Ramsey’s RESET: %s", 
                          scales::pvalue(lmtest::bptest(model)$p.value, add_p = TRUE),
                          scales::pvalue(lmtest::gqtest(model)$p.value, add_p = TRUE),
                          scales::pvalue(lmtest::raintest(model)$p.value, add_p = TRUE),
                          scales::pvalue(lmtest::harvtest(model)$p.value, add_p = TRUE),
                          scales::pvalue(lmtest::resettest(model)$p.value, add_p = TRUE)))
  
 if(!all(class(model) %in% "lm")) {
 
   plots[[2]] <- data.frame(norm_resid = resid(model, type = res_type),
                            fitted = fitted(model)) %>% 
   ggplot(aes(x = fitted, y = norm_resid)) +
   geom_point() +geom_hline(yintercept=0, linetype="dashed") +
   theme_bw() +
   xlab("Fitted") +
   ylab(paste(stringr::str_to_title(res_type), "residuals")) +
   labs(title = paste(stringr::str_to_title(res_type), "residuals vs. fitted values diagnostic plot"))
  }
  patchwork::wrap_plots(plots, ncol = 1)
}

By the way, most tests from the lmtest package work here (I put them on the plot above), or need the lm class to be added (I mean c(class(model), "lm")). Yes, some tests, like Durbin-Watson and similar ones, are more specific to lm() than mmrm(), where without grouping (by patient id / SUBJID) it treats all data independently, but even they can indicate some problem, as I showed in the first post.

Answer 3 · 2023-11-27T14:05:14.000Z

Check also this article: https://aosmith.rbind.io/2018/06/27/uneven-grouped-autocorrelation/

nlme::ACF will work well for balanced and equidistant observations.

I had incomplete observations and took the naive approach from nlme;:ACF, and quickly run into nonsenses (here I used GLS for simplicity, but the same problem was with my adjusted mmrm version)

m <- gls(TiffeneauPostDiff ~ Timepoint,
correlation = corSymm(form=~as.numeric(Timepoint)|PatientId),
weights = varIdent(form =~1|Timepoint),
data = shifts)

> nlme::ACF(m, resType = "response", maxLag = 2)
  lag       ACF
1   0 1.0000000
2   1 0.9014348
3   2 1.4210898 # ???

while it should be:

shifts$res_rep <- residuals(model, type="response")
shifts$time <- as.numeric(shifts$Timepoint)

shifts %>%
  group_by(PatientId) %>%
  complete(time = 1:3)  -> shifts_exp  # Pad gaps, so now every patient has 3 timepoints

> print(acf(shifts_exp$res_rep, na.action = na.pass, lag.max = max(shifts_exp$time)-1))

Autocorrelations of series ‘shifts_exp$res_rep’, by lag

    0     1     2 
1.000 0.688 0.265

Oh man, how easy is to forget basic things...

mmrm_ACF <- function(model, type = "response", conf.level = 0.95) {

  data <- model$data %>% 
          arrange(PatientId, Timepoint)  # don't miss that point
  
  data$residuals <- residuals(model, type = type)
  data$time      <- as.numeric(data$Timepoint)
  
  data_ext <- data %>%
    group_by(PatientId) %>%
    complete(time = seq(unique(Timepoint))) 
  
  do.call(rbind, lapply(seq(unique(data$Timepoint)), function(l) {
    data %>%
      group_by(PatientId) %>%
      arrange(PatientId, Timepoint) %>%
      summarise(lag = list( diff(time, lag = l ) ))
  })) %>% 
    unnest(lag) %>%
    group_by(lag) %>%
    summarise(n = n()) -> lag_n 

  ACF <- acf(data_ext$residuals, na.action = na.pass, lag.max = max(data_ext$time)-1, plot = FALSE)
  
  data.frame(lag = ACF$lag,
             ACF = ACF$acf) %>% 
    left_join(lag_n) %>% 
    mutate(ci_line = qnorm((1-conf.level)/2, lower.tail = FALSE)/sqrt(n))
}