wviechtb/metafor

Rank-deficient models

rvlenth opened this issue · 2 comments

Classification:

Enhancement suggestion

Summary

An emmeans user posted a question about emmprep whereby he wasn't able to use emmprep() for a model because it had nested fixed effects, and that caused rank deficiencies in the model.

Reproducible Example (if applicable)

library(metafor)
## Loading required package: Matrix
## Loading required package: metadat
## Loading required package: numDeriv
## 
## Loading the 'metafor' package (version 4.2-0). For an
## introduction to the package please type: help(metafor)
library(emmeans)

dat <- escalc(measure="RR", ai=tpos, bi=tneg, ci=cpos, di=cneg, data=dat.bcg)

dat <- transform(dat,
                 acat = factor(ablat < 40, labels = c("high", "low")),
                 era = factor(1+as.integer((year-1940)/15), 
                              labels = c("early","mid","late")))

with(dat, table(acat, era))
##       era
## acat   early mid late
##   high     3   2    1
##   low      0   2    5

mod <- rma(yi, vi, mods = ~ acat * era, data = dat)
## Warning: Redundant predictors dropped from the model.
mod
## 
## Mixed-Effects Model (k = 13; tau^2 estimator: REML)
## 
## tau^2 (estimated amount of residual heterogeneity):     0.0854 (SE = 0.0810)
## tau (square root of estimated tau^2 value):             0.2923
## I^2 (residual heterogeneity / unaccounted variability): 68.48%
## H^2 (unaccounted variability / sampling variability):   3.17
## R^2 (amount of heterogeneity accounted for):            72.73%
## 
## Test for Residual Heterogeneity:
## QE(df = 8) = 23.1259, p-val = 0.0032
## 
## Test of Moderators (coefficients 2:5):
## QM(df = 4) = 18.5649, p-val = 0.0010
## 
## Model Results:
## 
##                 estimate      se     zval    pval    ci.lb    ci.ub      
## intrcpt          -0.9706  0.2437  -3.9823  <.0001  -1.4483  -0.4929  *** 
## acatlow           1.1731  0.3627   3.2348  0.0012   0.4623   1.8839   ** 
## eramid           -0.3952  0.4240  -0.9320  0.3513  -1.2262   0.4359      
## eralate          -0.4710  0.4060  -1.1600  0.2461  -1.2667   0.3248      
## acatlow:eramid   -0.1059  0.6060  -0.1747  0.8613  -1.2937   1.0819      
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

RG <- emmprep(mod)
## Error in emmprep(mod): Cannot use function when some redundant predictors were dropped from the model.

# compare with lm:
LM <- lm(yi ~ acat * era, data = dat)
coef(LM)
##     (Intercept)         acatlow          eramid         eralate  acatlow:eramid 
##      -1.0869385       1.0049180      -0.2727705      -0.3546127       0.3430389 
## acatlow:eralate 
##              NA

(RG <- ref_grid(LM))
## 'emmGrid' object with variables:
##     acat = high, low
##     era = early, mid, late
## Some things are non-estimable (null space dim = 1)

confint(RG)
##  acat era   prediction    SE df lower.CL upper.CL
##  high early    -1.0869 0.331  8   -1.851   -0.323
##  low  early     nonEst    NA NA       NA       NA
##  high mid      -1.3597 0.406  8   -2.295   -0.424
##  low  mid      -0.0118 0.406  8   -0.947    0.924
##  high late     -1.4416 0.574  8   -2.765   -0.118
##  low  late     -0.4366 0.257  8   -1.028    0.155
## 
## Confidence level used: 0.95

Created on 2023-07-22 with reprex v2.0.2

Created on 2023-07-22 with reprex v2.0.2

Notes

The emmeans package is capable of supporting rank-deficient models. However, it needs information about which predictors were dropped. That information is also useful to some users. With some small changes in the object created it would be possible to enable that support and take advantage of emmeans's ability to handle those cases rather than just issuing an error message.

The conventional way to handle rank deficiencies is the way lm() does it. The cofficients for all the predictors are returned by coef(), and the ones that are dropped are given the value NA. The vcov() function, on the other hand, just returns the covariances for the included predictors. Note in the confidence intervals for RG (the reference grid for LM), we were able to identify the non-estimable case (the cell that was empty). It would seem that if rma() also returned the NAs as coefficients for the dropped predictors, that is all it would take, as the qdrg() function can figure this stuff out too).

One point of common confusion (or at least to me, a number of years ago): The NA coefficients do not actually represent missing values! Instead, they represent coefficients that were set to zero. This is clear as soon as you make the connection that dropping a predictor is the same thing as setting its coefficient to zero. When there are rank deficiencies, the solution to the normal equations is not unique; and in R, the particular solution returned is the one with those constraints to zero. There are many other valid solutions, and most of them give non-zero weight to those predictors. So we really are not saying that the dropped predictors are unimportant -- just that we didn't use them for this one particular solution.

sessionInfo()

R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] metafor_4.2-0        numDeriv_2016.8-1.1  metadat_1.2-0        Matrix_1.5-4.1      
[5] emmeans_1.8.7-090005

loaded via a namespace (and not attached):
 [1] styler_1.10.0      sandwich_3.0-2     utf8_1.2.3         generics_0.1.3     lattice_0.21-8    
 [6] digest_0.6.33      magrittr_2.0.3     evaluate_0.21      grid_4.3.1         estimability_1.4.1
[11] mvtnorm_1.1-3      fastmap_1.1.1      R.oo_1.25.0        R.cache_0.16.0     processx_3.8.1    
[16] R.utils_2.12.2     survival_3.5-5     ps_1.7.5           multcomp_1.4-23    purrr_1.0.1       
[21] fansi_1.0.4        scales_1.2.1       TH.data_1.1-2      codetools_0.2-19   cli_3.6.1         
[26] rlang_1.1.1        R.methodsS3_1.8.2  reprex_2.0.2       munsell_0.5.0      splines_4.3.1     
[31] yaml_2.3.7         withr_2.5.0        tools_4.3.1        parallel_4.3.1     coda_0.19-4       
[36] dplyr_1.1.2        colorspace_2.1-0   ggplot2_3.4.2      mathjaxr_1.6-0     vctrs_0.6.3       
[41] R6_2.5.1           zoo_1.8-12         lifecycle_1.0.3    fs_1.6.2           MASS_7.3-60       
[46] callr_3.7.3        clipr_0.8.0        pkgconfig_2.0.3    pillar_1.9.0       gtable_0.3.3      
[51] glue_1.6.2         xfun_0.39          tibble_3.2.1       tidyselect_1.2.0   highr_0.10        
[56] rstudioapi_0.14    knitr_1.43         farver_2.1.1       xtable_1.8-4       htmltools_0.5.5   
[61] nlme_3.1-162       rmarkdown_2.21     labeling_0.4.2     compiler_4.3.1  

Follow-up: It turns out I already had code to figure out where to fill-in the NAs. I think you can just take the check on coef.na and it'll work just fine. See the answer I posted on the SO site I referenced above.

PS -- there was a glitch in the code for qdrg that made it not always get the V matrix right in rank-deficient models, but I don't think it affects your models as it only comes into play when the number of rows/columns is too large. But I will be sending an update within a week or two that will have that fixed.

Hi Russell -- thanks for the suggestion. I am just coming back to the office from a 5-week absence due to conferences and holidays. It will take me a while to get back to this, but will respond when I find the time.