Rank-deficient models
rvlenth opened this issue · 2 comments
Classification:
Enhancement suggestion
Summary
An emmeans user posted a question about emmprep
whereby he wasn't able to use emmprep()
for a model because it had nested fixed effects, and that caused rank deficiencies in the model.
Reproducible Example (if applicable)
library(metafor)
## Loading required package: Matrix
## Loading required package: metadat
## Loading required package: numDeriv
##
## Loading the 'metafor' package (version 4.2-0). For an
## introduction to the package please type: help(metafor)
library(emmeans)
dat <- escalc(measure="RR", ai=tpos, bi=tneg, ci=cpos, di=cneg, data=dat.bcg)
dat <- transform(dat,
acat = factor(ablat < 40, labels = c("high", "low")),
era = factor(1+as.integer((year-1940)/15),
labels = c("early","mid","late")))
with(dat, table(acat, era))
## era
## acat early mid late
## high 3 2 1
## low 0 2 5
mod <- rma(yi, vi, mods = ~ acat * era, data = dat)
## Warning: Redundant predictors dropped from the model.
mod
##
## Mixed-Effects Model (k = 13; tau^2 estimator: REML)
##
## tau^2 (estimated amount of residual heterogeneity): 0.0854 (SE = 0.0810)
## tau (square root of estimated tau^2 value): 0.2923
## I^2 (residual heterogeneity / unaccounted variability): 68.48%
## H^2 (unaccounted variability / sampling variability): 3.17
## R^2 (amount of heterogeneity accounted for): 72.73%
##
## Test for Residual Heterogeneity:
## QE(df = 8) = 23.1259, p-val = 0.0032
##
## Test of Moderators (coefficients 2:5):
## QM(df = 4) = 18.5649, p-val = 0.0010
##
## Model Results:
##
## estimate se zval pval ci.lb ci.ub
## intrcpt -0.9706 0.2437 -3.9823 <.0001 -1.4483 -0.4929 ***
## acatlow 1.1731 0.3627 3.2348 0.0012 0.4623 1.8839 **
## eramid -0.3952 0.4240 -0.9320 0.3513 -1.2262 0.4359
## eralate -0.4710 0.4060 -1.1600 0.2461 -1.2667 0.3248
## acatlow:eramid -0.1059 0.6060 -0.1747 0.8613 -1.2937 1.0819
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RG <- emmprep(mod)
## Error in emmprep(mod): Cannot use function when some redundant predictors were dropped from the model.
# compare with lm:
LM <- lm(yi ~ acat * era, data = dat)
coef(LM)
## (Intercept) acatlow eramid eralate acatlow:eramid
## -1.0869385 1.0049180 -0.2727705 -0.3546127 0.3430389
## acatlow:eralate
## NA
(RG <- ref_grid(LM))
## 'emmGrid' object with variables:
## acat = high, low
## era = early, mid, late
## Some things are non-estimable (null space dim = 1)
confint(RG)
## acat era prediction SE df lower.CL upper.CL
## high early -1.0869 0.331 8 -1.851 -0.323
## low early nonEst NA NA NA NA
## high mid -1.3597 0.406 8 -2.295 -0.424
## low mid -0.0118 0.406 8 -0.947 0.924
## high late -1.4416 0.574 8 -2.765 -0.118
## low late -0.4366 0.257 8 -1.028 0.155
##
## Confidence level used: 0.95
Created on 2023-07-22 with reprex v2.0.2
Created on 2023-07-22 with reprex v2.0.2
Notes
The emmeans package is capable of supporting rank-deficient models. However, it needs information about which predictors were dropped. That information is also useful to some users. With some small changes in the object created it would be possible to enable that support and take advantage of emmeans's ability to handle those cases rather than just issuing an error message.
The conventional way to handle rank deficiencies is the way lm()
does it. The cofficients for all the predictors are returned by coef()
, and the ones that are dropped are given the value NA
. The vcov()
function, on the other hand, just returns the covariances for the included predictors. Note in the confidence intervals for RG
(the reference grid for LM
), we were able to identify the non-estimable case (the cell that was empty). It would seem that if rma()
also returned the NA
s as coefficients for the dropped predictors, that is all it would take, as the qdrg()
function can figure this stuff out too).
One point of common confusion (or at least to me, a number of years ago): The NA
coefficients do not actually represent missing values! Instead, they represent coefficients that were set to zero. This is clear as soon as you make the connection that dropping a predictor is the same thing as setting its coefficient to zero. When there are rank deficiencies, the solution to the normal equations is not unique; and in R, the particular solution returned is the one with those constraints to zero. There are many other valid solutions, and most of them give non-zero weight to those predictors. So we really are not saying that the dropped predictors are unimportant -- just that we didn't use them for this one particular solution.
sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/Chicago
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] metafor_4.2-0 numDeriv_2016.8-1.1 metadat_1.2-0 Matrix_1.5-4.1
[5] emmeans_1.8.7-090005
loaded via a namespace (and not attached):
[1] styler_1.10.0 sandwich_3.0-2 utf8_1.2.3 generics_0.1.3 lattice_0.21-8
[6] digest_0.6.33 magrittr_2.0.3 evaluate_0.21 grid_4.3.1 estimability_1.4.1
[11] mvtnorm_1.1-3 fastmap_1.1.1 R.oo_1.25.0 R.cache_0.16.0 processx_3.8.1
[16] R.utils_2.12.2 survival_3.5-5 ps_1.7.5 multcomp_1.4-23 purrr_1.0.1
[21] fansi_1.0.4 scales_1.2.1 TH.data_1.1-2 codetools_0.2-19 cli_3.6.1
[26] rlang_1.1.1 R.methodsS3_1.8.2 reprex_2.0.2 munsell_0.5.0 splines_4.3.1
[31] yaml_2.3.7 withr_2.5.0 tools_4.3.1 parallel_4.3.1 coda_0.19-4
[36] dplyr_1.1.2 colorspace_2.1-0 ggplot2_3.4.2 mathjaxr_1.6-0 vctrs_0.6.3
[41] R6_2.5.1 zoo_1.8-12 lifecycle_1.0.3 fs_1.6.2 MASS_7.3-60
[46] callr_3.7.3 clipr_0.8.0 pkgconfig_2.0.3 pillar_1.9.0 gtable_0.3.3
[51] glue_1.6.2 xfun_0.39 tibble_3.2.1 tidyselect_1.2.0 highr_0.10
[56] rstudioapi_0.14 knitr_1.43 farver_2.1.1 xtable_1.8-4 htmltools_0.5.5
[61] nlme_3.1-162 rmarkdown_2.21 labeling_0.4.2 compiler_4.3.1
Follow-up: It turns out I already had code to figure out where to fill-in the NA
s. I think you can just take the check on coef.na
and it'll work just fine. See the answer I posted on the SO site I referenced above.
PS -- there was a glitch in the code for qdrg
that made it not always get the V
matrix right in rank-deficient models, but I don't think it affects your models as it only comes into play when the number of rows/columns is too large. But I will be sending an update within a week or two that will have that fixed.
Hi Russell -- thanks for the suggestion. I am just coming back to the office from a 5-week absence due to conferences and holidays. It will take me a while to get back to this, but will respond when I find the time.