Bad (non-robust) standard errors using "bootstrap" => CIs and p-values do not match
LGraz opened this issue · 2 comments
Dear Team,

I appreciate this great product, THANKS!

I encountered a case where the confidence interval of `or_fraction_mediated` does not include 0, but the p-value is 0.9. I noticed that the SE estimate from the bootstrap is inflated by outliers among the bootstrap replicates (hence the p-value might be too big). If that's so, I suggest using a more robust estimator (below I compare the SE with the MAD).

Here is my minimal reproducible example using `data_WV_scaled.csv`:
```r
library(lavaan)

D <- read.csv("./data_WV_scaled.csv")

sem_mod_def <- "
  # Mediators
  F ~ df*D + of*O
  A ~ oa*O + da*D
  R ~ fr*F + ar*A
  # Direct effects
  R ~ dr*D + or*O
  # Mediated effects
  dr_mediated := df*fr + da*ar
  or_mediated := of*fr + oa*ar
  # Total effects (direct + mediated)
  dr_total := dr + dr_mediated
  or_total := or + or_mediated
  # Fraction mediated (mediated_effect / total_effect)
  dr_fraction_mediated := dr_mediated / dr_total
  or_fraction_mediated := or_mediated / or_total
"
```
```r
# Fit the SEM with bootstrap standard errors
set.seed(123)
object <- sem(sem_mod_def, D, se = "boot",  # or rather "robust.sem"?
              bootstrap = 2000)

# Extract the bootstrap replicates and evaluate the defined parameters on them
BOOT <- lavaan:::lav_object_inspect_boot(object)
BOOT.def <- apply(BOOT, 1, object@Model@def.function)

or_fraction_mediated <- BOOT.def[6, ]
mad(or_fraction_mediated)
#> [1] 0.38009
sd(or_fraction_mediated)
#> [1] 11.44
# ----> THAT'S A BIG DIFFERENCE!
```
```r
parameterEstimates(object)
#> lhs op rhs label est se z pvalue ci.lower ci.upper
#> 1 F ~ D df 0.411 0.023 17.665 0.000 0.362 0.453
#> 2 F ~ O of -0.101 0.025 -4.021 0.000 -0.150 -0.051
#> 3 A ~ O oa 0.479 0.025 19.148 0.000 0.431 0.528
#> 4 A ~ D da -0.044 0.024 -1.839 0.066 -0.091 0.006
#> 5 R ~ F fr 0.573 0.029 19.600 0.000 0.514 0.632
#> 6 R ~ A ar -0.140 0.025 -5.534 0.000 -0.188 -0.090
#> 7 R ~ D dr -0.024 0.024 -1.015 0.310 -0.069 0.023
#> 8 R ~ O or 0.037 0.025 1.492 0.136 -0.011 0.085
#> 9 F ~~ F 0.803 0.033 24.295 0.000 0.737 0.867
#> 10 A ~~ A 0.759 0.025 30.706 0.000 0.708 0.806
#> 11 R ~~ R 0.624 0.026 23.794 0.000 0.572 0.674
#> 12 D ~~ D 0.999 0.000 NA NA 0.999 0.999
#> 13 D ~~ O -0.206 0.000 NA NA -0.206 -0.206
#> 14 O ~~ O 0.999 0.000 NA NA 0.999 0.999
#> 15 dr_mediated := df*fr+da*ar dr_mediated 0.242 0.019 12.966 0.000 0.204 0.278
#> 16 or_mediated := of*fr+oa*ar or_mediated -0.125 0.020 -6.231 0.000 -0.163 -0.085
#> 17 dr_total := dr+dr_mediated dr_total 0.218 0.025 8.742 0.000 0.168 0.267
#> 18 or_total := or+or_mediated or_total -0.088 0.028 -3.161 0.002 -0.141 -0.033
#> 19 dr_fraction_mediated := dr_mediated/dr_total dr_fraction_mediated 1.110 0.122 9.118 0.000 0.907 1.380
#> 20 or_fraction_mediated := or_mediated/or_total or_fraction_mediated 1.422 11.440 0.124 0.901 0.909 3.317
```
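For what it's worth, the mismatch looks mechanically consistent: the printed p-value is a Wald-type test (est divided by the bootstrap SD), while the interval [0.909, 3.317] is clearly not 1.422 ± 1.96 · 11.44, so it must be quantile-based. A few exploded replicates inflate the SD, and hence the p-value, while barely moving the percentiles. A self-contained sketch with simulated replicates (illustrative numbers only, not the actual lavaan draws):

```r
set.seed(1)
# 1995 well-behaved replicates around 1.4, plus 5 exploded ratios
boot <- c(rnorm(1995, mean = 1.4, sd = 0.4),
          c(-150, 80, 200, -300, 120))

est <- 1.4
se  <- sd(boot)                        # inflated by the 5 outliers
p   <- 2 * pnorm(-abs(est / se))       # Wald p-value: large
ci  <- quantile(boot, c(0.025, 0.975)) # percentile CI: excludes 0

round(c(se = se, p = p, ci), 3)
```

Here `p` comes out near 0.9 while the percentile interval stays strictly positive, reproducing the pattern in the table above.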
Unrelated to this robustness issue: would you recommend using `se = "robust.sem"` instead?
I would not recommend `se = "robust.sem"` here, as you define new variables that are clearly nonlinear functions of the original parameters.
I also hesitate to simply replace `sd()` by `mad()` when calculating the bootstrap-based standard errors. Certainly, `sd()` is very sensitive to outliers, but if an outlier genuinely occurs every now and then, then this should also be reflected in the standard errors. I would have to dive into the literature to better understand the theoretical implications of using `mad()` instead of `sd()` when computing bootstrap-based standard errors.
I would rather try to pinpoint when/why you get outlying parameter estimates so often. Perhaps bounded estimation can prevent this? (Try running this with the argument `bounds = "standard"`.)
`bounds = "standard"` does not have an effect.

Pinpointing why the parameter estimates are outlying: since `or_fraction_mediated := or_mediated / or_total`, if `or_total` is very close to 0 in one bootstrap sample, then `or_fraction_mediated` explodes and hence `sd()` breaks down.
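The mechanism is easy to reproduce in isolation. A toy simulation (the means and SDs loosely mimic the `or_mediated` and `or_total` estimates above; this is not the actual bootstrap) where a single replicate has a denominator essentially at zero:

```r
set.seed(42)
num <- rnorm(2000, mean = -0.125, sd = 0.020)  # mimics draws of or_mediated
den <- rnorm(2000, mean = -0.088, sd = 0.028)  # mimics draws of or_total
den[1] <- 1e-4                                 # one replicate with or_total ~ 0

ratio <- num / den                             # or_fraction_mediated per replicate
c(sd = sd(ratio), mad = mad(ratio))            # sd explodes; mad barely moves
```

One bad replicate out of 2000 is enough to blow up `sd()` by orders of magnitude, while `mad()` still reflects the bulk of the distribution.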
Simply exchanging `sd()` with `mad()` is not advisable. What do you think about issuing a warning when the sd and the mad are far apart from each other? Something like:
```r
# BOOT <- lavaan:::lav_object_inspect_boot(object)
# BOOT.def <- apply(BOOT, 1, object@Model@def.function)
sd_mad_ratio <- apply(BOOT.def, 1, sd) / apply(BOOT.def, 1, mad)
if (any(sd_mad_ratio > 5)) {
  params_w_outliers <- names(sd_mad_ratio)[sd_mad_ratio > 5]
  warning(paste("The following bootstrap parameters have a high",
                "ratio of standard deviation to median absolute deviation:",
                paste(params_w_outliers, collapse = ", "),
                "\nP-values and confidence intervals might not match."))
}
```
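Wrapped as a helper (the name `check_boot_outliers` and the threshold of 5 are just illustrative choices, not anything in lavaan), the check behaves as intended on a toy replicate matrix:

```r
check_boot_outliers <- function(boot_def, threshold = 5) {
  # boot_def: defined parameters in rows, bootstrap replicates in columns
  sd_mad_ratio <- apply(boot_def, 1, sd) / apply(boot_def, 1, mad)
  flagged <- names(sd_mad_ratio)[sd_mad_ratio > threshold]
  if (length(flagged) > 0)
    warning("High sd/mad ratio (possible bootstrap outliers) for: ",
            paste(flagged, collapse = ", "),
            "\nP-values and confidence intervals might not match.")
  invisible(sd_mad_ratio)
}

# Toy matrix: one well-behaved parameter, one with a single exploded replicate
set.seed(7)
toy <- rbind(stable   = rnorm(2000),
             unstable = c(rnorm(1999), 500))
r <- check_boot_outliers(toy)  # warns, flagging only "unstable"
```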