adjBGammaHat > 1 and BRMSEA is NA
littlehifive opened this issue · 12 comments
Hi Ed,
For the following model, I got NAs for BRMSEA and a value over 1 for adjBGammaHat (which should be bounded between 0 and 1, right?). Do you know why that could be the case?
For BTLI, I understand it can be over 1 from the formula, but is larger better, or is closer to 1 better?
Thanks!
> summary(FI_fit_egra_test_2, central.tendency = c("mean","median","mode"), prob = .90)
Posterior summary statistics and highest posterior density (HPD) 90% credible intervals for devm-based fit indices:
EAP Median MAP SD lower upper
BRMSEA NaN NA NA NA NaN NA
BGammaHat 0.783 0.784 0.784 0.002 0.781 0.786
adjBGammaHat 1.468 1.467 1.466 0.003 1.462 1.473
BMc 0.436 0.437 0.438 0.003 0.431 0.442
BCFI 0.632 0.632 0.633 0.004 0.626 0.637
BTLI 1.082 1.082 1.082 0.001 1.081 1.083
BNFI 0.648 0.649 0.649 0.003 0.642 0.654
Warning message:
In (function (..., deparse.level = 1) :
number of columns of result is not a multiple of vector length (arg 1)
blavaan (0.4-2.912) results of 1000 samples after 500 adapt/burnin iterations
Number of observations 611
Number of missing patterns 1
Statistic MargLogLik PPP
Value -9681.216 0.000
Latent Variables:
Estimate Post.SD pi.lower pi.upper Rhat Prior
EGRA_2m1 =~
VOC_2m1 0.106 0.044 0.022 0.195 1.000 normal(0, 2)
LTRC_2m1 0.257 0.062 0.172 0.347 1.000 normal(0, 2)
GRPH_2m1 0.288 0.065 0.207 0.377 1.000 normal(0, 2)
INV_2m1 0.599 0.114 0.530 0.690 1.000 normal(0, 2)
OPR_2m1 0.921 0.168 0.851 1.006 1.000 normal(0, 2)
RDCP_2m1 0.680 0.129 0.587 0.781 1.000 normal(0, 2)
EGRA_3m1 =~
VOC_3m1 0.158 0.045 0.072 0.245 1.000 normal(0, 2)
LTRC_3m1 0.225 0.042 0.144 0.307 1.000 normal(0, 2)
GRPH_3m1 0.243 0.043 0.160 0.325 1.001 normal(0, 2)
INV_3m1 0.738 0.045 0.667 0.809 1.000 normal(0, 2)
OPR_3m1 0.872 0.050 0.793 0.947 1.004 normal(0, 2)
RDCP_3m1 0.790 0.051 0.707 0.871 0.999 normal(0, 2)
Covariances:
Estimate Post.SD pi.lower pi.upper Rhat Prior
.LTRC_2m1 ~~
.GRPH_2m1 0.447 0.042 0.367 0.534 1.000 beta(1,1)
.VOC_2m1 ~~
.RDCP_2m1 0.023 0.038 -0.051 0.096 1.000 beta(1,1)
.OPR_2m1 ~~
.RDCP_2m1 -0.188 0.033 -0.244 -0.113 1.000 beta(1,1)
.LTRC_3m1 ~~
.GRPH_3m1 0.475 0.044 0.394 0.566 1.001 beta(1,1)
.VOC_3m1 ~~
.RDCP_3m1 0.028 0.035 -0.042 0.095 1.000 beta(1,1)
.OPR_3m1 ~~
.RDCP_3m1 -0.241 0.029 -0.290 -0.178 1.002 beta(1,1)
EGRA_2m1 ~~
EGRA_3m1 0.620 0.030 0.563 0.679 1.000 lkj_corr(1)
Intercepts:
Estimate Post.SD pi.lower pi.upper Rhat Prior
.VOC_2m1 -0.000 0.040 -0.079 0.080 0.999 normal(0, 5)
.LTRC_2m1 0.001 0.041 -0.076 0.082 1.001 normal(0, 5)
.GRPH_2m1 -0.000 0.040 -0.080 0.077 1.003 normal(0, 5)
.INV_2m1 -0.001 0.041 -0.084 0.079 1.001 normal(0, 5)
.OPR_2m1 -0.001 0.041 -0.082 0.079 1.001 normal(0, 5)
.RDCP_2m1 -0.001 0.041 -0.079 0.082 1.001 normal(0, 5)
.VOC_3m1 -0.000 0.041 -0.078 0.080 0.999 normal(0, 5)
.LTRC_3m1 -0.001 0.040 -0.080 0.077 1.000 normal(0, 5)
.GRPH_3m1 -0.000 0.040 -0.077 0.078 1.001 normal(0, 5)
.INV_3m1 -0.001 0.041 -0.083 0.077 1.001 normal(0, 5)
.OPR_3m1 -0.002 0.041 -0.082 0.078 1.002 normal(0, 5)
.RDCP_3m1 -0.002 0.043 -0.087 0.079 1.000 normal(0, 5)
EGRA_2m1 0.000
EGRA_3m1 0.000
Variances:
Estimate Post.SD pi.lower pi.upper Rhat Prior
.VOC_2m1 0.995 0.057 0.892 1.111 0.999 gamma(1,.5)[sd]
.LTRC_2m1 0.938 0.053 0.844 1.050 0.999 gamma(1,.5)[sd]
.GRPH_2m1 0.922 0.054 0.822 1.034 1.000 gamma(1,.5)[sd]
.INV_2m1 0.640 0.041 0.561 0.723 1.001 gamma(1,.5)[sd]
.OPR_2m1 0.140 0.043 0.071 0.242 1.003 gamma(1,.5)[sd]
.RDCP_2m1 0.533 0.053 0.432 0.643 1.001 gamma(1,.5)[sd]
.VOC_3m1 0.982 0.056 0.880 1.096 1.000 gamma(1,.5)[sd]
.LTRC_3m1 0.957 0.055 0.857 1.074 1.000 gamma(1,.5)[sd]
.GRPH_3m1 0.949 0.056 0.847 1.065 1.000 gamma(1,.5)[sd]
.INV_3m1 0.467 0.034 0.401 0.534 1.004 gamma(1,.5)[sd]
.OPR_3m1 0.251 0.042 0.177 0.341 1.008 gamma(1,.5)[sd]
.RDCP_3m1 0.384 0.043 0.303 0.470 1.001 gamma(1,.5)[sd]
EGRA_2m1 1.000
EGRA_3m1 1.000
Thanks for the report, I will take a look soon. I am also looping in the two people who know this part of the code best:
@maugavilla @TDJorgensen : in recent months, Zezhen has helpfully found a variety of bugs and unclear issues in blavaan. I am wondering whether you have seen the issues that he describes here. The NAs could possibly be related to recent changes that I made to logl computations, but I am not certain about that.
@littlehifive could you share more information so we can reproduce the error? Something like sample size, variable means, covariance matrix, and full code? With that we could simulate similar data and reproduce it.
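For instance, something along these lines would be enough (just a sketch; obs_means and obs_cov are placeholders for the summary statistics you would share, and n = 611 comes from the summary output above):

library(MASS)
## simulate data with roughly the same means and covariance matrix as the original
set.seed(1)
test_dat <- as.data.frame(mvrnorm(n = 611, mu = obs_means, Sigma = obs_cov))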
I think I know what it might be, but I would like to see more information first.
I created a simulated dataset based on my data, and I was able to reproduce the problems with BRMSEA and adjusted GammaHat using the following model. Let me know if you need anything else from me, thanks!
library(blavaan)
# load test_dat.csv here
# lavaan
model <- '
# measurement model
EGRA_2m1 =~ VOC_2m1 + LTRC_2m1 + GRPH_2m1 + INV_2m1 + OPR_2m1 + RDCP_2m1
GRPH_2m1 ~~ LTRC_2m1
RDCP_2m1 ~~ VOC_2m1
RDCP_2m1 ~~ OPR_2m1
# measurement model
EGRA_3m1 =~ VOC_3m1 + LTRC_3m1 + GRPH_3m1 + INV_3m1 + OPR_3m1 + RDCP_3m1
GRPH_3m1 ~~ LTRC_3m1
RDCP_3m1 ~~ VOC_3m1
RDCP_3m1 ~~ OPR_3m1
# correlation
EGRA_3m1 ~~ EGRA_2m1
'
# priors
mydp <- dpriors(tau = "normal(0, 0.5)",
lambda = "normal(0, 2)",
beta = "normal(0.5,0.2)",
rho = "beta(1,1)",
nu = "normal(0, 5)")
# fit model
fit <- bcfa(model, data = test_dat,
dp = mydp,
std.lv=TRUE,
seed = 1234)
# fit indices
blavFitIndices(fit)
# Posterior mean (EAP) of devm-based fit indices:
#
# BRMSEA BGammaHat adjBGammaHat BMc
# NaN 0.801 1.680 0.475
# Warning messages:
# 1:
# 41 (20.5%) p_waic estimates greater than 0.4. We recommend trying loo instead.
# 2: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
#
# 3: In sqrt(nonc/(dif.ppD * N)) : NaNs produced
Thanks, I see what the problem is. I think this is related to the logl change that @ecmerkle mentioned earlier; that change seems to have affected the pD calculation for loo and waic. As you can see in the fitMeasures() output, p_dic, p_waic, and p_loo are different calculations and do vary, but they are usually close to each other. Now, however, p_dic = 36, p_waic = 126, and p_loo = 121.
So, these differences should not be this large:
fitMeasures(fit)
npar logl ppp bic dic p_dic waic p_waic se_waic
43.000 -3081.635 0.000 6390.883 6235.755 36.242 6343.645 126.500 271.855
looic p_loo se_loo margloglik
6333.824 121.589 264.850 -3217.457
The thing is that pD is used in the fit index calculations. For now, the quick fix is to tell the function to use the DIC pD, like this:
bfits <- blavFitIndices(fit, pD = "dic")
summary(bfits)
With this change it works fine.
@ecmerkle could you check why the p_waic and p_loo might be so much larger now? Curiously, this doesn't happen with the example model, so my guess would be that it is related to the residual correlations?
HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '
fit1 <- bcfa(HS.model, data = HolzingerSwineford1939,
             n.chains = 2, burnin = 1000, sample = 1000)
fitMeasures(fit1)

> fitMeasures(fit1)
      npar       logl        ppp        bic        dic      p_dic       waic     p_waic    se_waic
    30.000  -3738.344      0.000   7647.802   7535.950     29.631   7539.915     33.019     86.116
     looic      p_loo     se_loo margloglik
  7540.114     33.119     86.137  -3862.744
Thanks for looking at it. This is an interesting issue that I haven't yet figured out. I have drastically simplified the model, and I still see inflated p_waic and p_loo:
model <- ' EGRA_2m1 =~ VOC_2m1 + LTRC_2m1 + GRPH_2m1 '
fit <- bcfa(model, data = test_dat, dp = mydp, std.lv = TRUE)
I have also tried it with older versions of blavaan (0.4-1 and 0.3-17) and with an older version of loo (2.4-1). For all versions, I see similar behavior. And, like Mauricio mentions, the usual test models always seem to work as expected.
One thing I noticed is that, for the simplified model above, the estimated factor variance can be close to 0 if you use std.lv = FALSE. I will look at it some more tomorrow.
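For anyone following along, a quick way to check that (a sketch; estimates will vary by run) is to refit with marker-variable identification and inspect the factor variance:

## refit using marker-variable identification instead of std.lv = TRUE
fit_free <- bcfa(model, data = test_dat, dp = mydp, std.lv = FALSE)
summary(fit_free)  # look at the EGRA_2m1 (factor) variance estimate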
For BTLI, I understand it can be over 1 from the formula
In frequentist SEM, the adjusted GFI (and its small-sample improvement, adjusted GammaHat) can be negative for very poorly fitting models. But values > 1 only happen here because we are deriving an analogous (not equivalent) Bayesian quantity.
adjBGammaHat can exceed 1 whenever the quantity analogous to "degrees of freedom" (DF below) is negative:

adjBGammaHat <- 1 - (pStar / DF) * (1 - BGammaHat)
In our paper, we use the canonical number of sufficient statistics minus the estimated number of parameters (pD):
# number of sufficient stats
pStar <- ((nvar * (nvar + 3))/2) # multiplied by number of groups when relevant
# estimated number of parameters
pD # borrowed from DIC, WAIC, or LOO-IC
# analogous to degrees of freedom
DF <- pStar - pD
This negative DF is occurring with your data, and it is difficult to understand why.
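To make the arithmetic concrete, here is an illustrative sketch (the pD and BGammaHat values are hypothetical, chosen only so that pD exceeds pStar, using the 3-indicator simplified model above where pStar = 9). A negative DF is also consistent with the NaN for BRMSEA, given the sqrt(nonc/(dif.ppD * N)) warning shown earlier:

## illustrative only: 3 observed indicators, 1 group
nvar  <- 3
pStar <- (nvar * (nvar + 3)) / 2      # = 9 sufficient statistics
pD    <- 12.2                         # hypothetical effective number of parameters (> pStar)
DF    <- pStar - pD                   # = -3.2, the negative-DF situation
BGammaHat <- 0.80                     # hypothetical value, roughly like the output above
1 - (pStar / DF) * (1 - BGammaHat)    # adjBGammaHat = 1.5625, i.e. > 1 because DF < 0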
- I initially thought it might be the residual correlations, which were handled by adding factors in blavaan's JSS paper, but I think that is no longer relevant with the faster Stan model that marginalizes out the latent variables by using summary stats. Also, as @ecmerkle found with a smaller model without residual correlations, the issue persists.
- I also noticed the indicators are not all highly correlated, and their pattern is not easy to discern. I tried a different set of indicators that were substantially correlated (the last 3 indicators of the 2m1 factor), but the issue persists.
- I noticed some extreme outliers and set them to NA, but there is still a larger pD from WAIC and LOO than pStar (= 9). Looking at their histograms, there is heavy inflation among the last 3 indicators (> 50 on OPR_2m1, > 70% on INV_2m1 and RDCP_2m1). The first 3 indicators look more normal, but the issue persists in all these cases.
- Note from our paper that these indices are not designed to be used in combination with informative priors, and doing so changed their behavior in our applied examples. However, I tried your model (and smaller combinations) without your mydp priors, and the issue persisted.
Some syntax I wrote to try the above:
## try getting rid of extreme outliers
dat <- test_dat
vv <- c("VOC_2m1","LTRC_2m1","GRPH_2m1","INV_2m1","OPR_2m1","RDCP_2m1")
for (i in vv) {
sc <- abs(scale(dat[,i]))
dat[,i] <- ifelse(sc > 3, NA, dat[,i])
}
## model with first 3 indicators (more normally distributed, but low correlations)
model <- ' EGRA_2m1 =~ VOC_2m1 + LTRC_2m1 + GRPH_2m1 '
## or try fitting model to most highly correlated indicators (but highly inflated)
model <- ' EGRA_2m1 =~ INV_2m1 + OPR_2m1 + RDCP_2m1 '
#fit <- cfa(model, data = test_dat, std.lv = TRUE) # quick check with ML
bfit <- bcfa(model, data = dat, dp = mydp, std.lv = TRUE)
blavFitIndices(bfit)
So I can't rule out that there is something pathological in these simulated data, but I can't be sure that is what is causing the pD estimates to exceed pStar in all these cases. I hope some of these explorations might help @ecmerkle or @maugavilla by triggering ideas about where to look, but I am not convinced it is caused by the newer logl calculations. If I interpreted Ed's last message correctly, the issue existed before the logl change.

Instead of a problem with the blavaan (or loo) source code, I expect it is an issue that can arise with particular data characteristics (or their interaction with model characteristics) that we haven't discovered yet. It is possible that the normal likelihood function is inappropriate for your data, which causes more variation across the parameter space, leading to greater pD estimates (especially in the WAIC and LOO cases, which more completely utilize the available posterior information).
Thank you all for your comments! I am looping in Ben Goodrich (@bgoodri) here too.
I am not sure if some background information about my data would be helpful to solving this issue:
- EGRA means early grade reading assessment, and the items are specific domains of reading skills (e.g., VOC is vocabulary, LTRC is letter recognition).
- Because the assessment was done with a quite sensitive child population in a crisis-affected country, many assessed kids had zero scores on many tested domains (e.g., they were tested in French when they did not know any French). The simulated variables here were standardized, but in the original data most variables are very zero-inflated.
- Because of this issue, Ben and I weren't able to build a CFA model that converges for these zero-inflated EGRA measures at each wave of the three-wave study, and he suggested that I try to subtract wave 1 score from wave 2 and wave 3 scores, and model the change scores as if they were continuous.
- But as it turns out, even when we model the change scores, some variables still have quite extreme distributions because many of those who had zero in baseline also had zero in endline (i.e., they didn't learn French at all due to issues like non-schooling or poverty).
Terrence's interpretation is right: because this also happens in older versions of blavaan, I don't think that recent changes broke something. I will double-check the log-likelihood computations to make sure, but I am leaning towards "particular data characteristics" as the explanation here.
Instead of defining outliers as Terrence did, I used the Pareto k values coming out of loo (removing cases where k > .5). This leads to about 5 cases being removed (sorry, I didn't set the seed). But after removing them, the DIC p_d drops to around -9. The old WinBUGS user manual says that negative p_d can happen with prior-data conflict, or when the posterior mean falls in an area of low density (say, when there is a bimodal posterior). I think this is some evidence for the "data characteristics" explanation.
If you compute DIC in the JAGS way (using jags.ic = TRUE), the p_d for DIC seems more reasonable regardless of whether the cases are included.
Code is below:
model_ll <- loo::extract_log_lik(fit@external$mcmcout)
loo_out <- loo(model_ll)
outl <- which(loo_out$diagnostics$pareto_k > .5)
reduced_dat <- test_dat[-outl,]
fitr1 <- bcfa(model, data = reduced_dat, jags.ic = TRUE)
fitMeasures(fitr1)
# npar logl ppp bic dic p_dic dic_jags
# 9.000 -747.657 0.482 1542.678 1477.012 -9.151 1511.448
#p_dic_jags waic p_waic se_waic looic p_loo se_loo
# 8.067 1498.936 12.245 55.656 1499.041 12.298 55.685
#margloglik
# -781.064
Given the explanation about the original data, I find it interesting that this seems to be a good diagnostic of the data mismatching the model's likelihood, and that loo and waic are sensitive to this but dic is not.
When I try excluding outliers with the larger model, it takes several rounds of exclusion based on pareto_k, which doesn't seem like a sustainable solution to me.
If the data are actually zero-inflated, I think you should try to run the indicators as categorical, with one category for 0 and the rest of the answers as ordered categories. This would make it similar to a graded response model (GRM), which is better for these cases than treating the indicators as continuous.
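A rough sketch of what I mean for the blavaan route (purely illustrative: it assumes the raw, untransformed scores are available as raw_dat, that your blavaan version supports ordinal indicators, and the binning into categories is arbitrary):

## bin raw scores into ordered categories, keeping 0 as its own category
ord_vars <- c("VOC_2m1", "LTRC_2m1", "GRPH_2m1")   # hypothetical raw-score versions
ord_dat  <- raw_dat
for (v in ord_vars) {
  cuts <- quantile(ord_dat[[v]][ord_dat[[v]] > 0], probs = c(1/3, 2/3), na.rm = TRUE)
  ord_dat[[v]] <- cut(ord_dat[[v]], breaks = unique(c(-Inf, 0, cuts, Inf)),
                      ordered_result = TRUE)
}

model <- ' EGRA_2m1 =~ VOC_2m1 + LTRC_2m1 + GRPH_2m1 '
bfit_ord <- bcfa(model, data = ord_dat, ordered = ord_vars, std.lv = TRUE)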
In another project we modeled zero-inflation as a mixture IRT model, treating the zero-inflation as a correlated mixture with the GRM; you can find the Stan code here: https://osf.io/hvkfn/
Magnus, B. E., & Garnier-Villarreal, M. (2022). A multidimensional zero-inflated graded response model for ordinal symptom data. Psychological Methods, 27(2), 261–279. https://doi.org/10.1037/met0000395
I agree that excluding people on the basis of pareto_k is not a solution... I was just looking at how sensitive the information criteria were to those observations.
It seems like we are agreeing that there is not a bug here, but that the behavior of these fit metrics can be suboptimal for models that do not fit so well.
So it seems a solution could be a warning from fitMeasures if the effective number of parameters is negative, or if the effective numbers of parameters have large variability across metrics. Also, related to the above post from @TDJorgensen, does it make sense to restrict DF to never go below 0 there?
does it make sense to restrict DF to never go below 0
I would not advise that. It could end up hiding problems that should be brought to attention.
a warning from fitMeasures if the effective number of parameters is negative, or if the effective numbers of parameters have large variability across metrics
That sounds like a good solution. By "different metrics" do you mean DIC vs. WAIC vs. LOO_IC? We don't have a lot to go on here, so I don't think the warning can be very informative about what to do (maybe there is nothing to do) or what to investigate (there might be other issues). Perhaps something like:
warning("The estimated number of model parameters exceeds the number of sample statistics (covariances, etc.), so fit-index calculations may lead to uninterpretable results.")
Yes, I mean that we look at the effective number of parameters under DIC, WAIC, and LOO. If the DIC one is negative, or if they differ by "a lot" (maybe some percentage away from a simple parameter count?), then a warning appears.
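A rough sketch of the kind of check I have in mind (not the actual implementation; check_pD is a hypothetical helper and the 2x-npar threshold is just a placeholder for "a lot"):

## hypothetical helper: flag suspicious effective numbers of parameters
check_pD <- function(fit) {
  fm  <- fitMeasures(fit)
  pDs <- fm[c("p_dic", "p_waic", "p_loo")]
  if (any(pDs < 0))
    warning("Negative effective number of parameters estimated; ",
            "information criteria and derived fit indices may be unreliable.")
  if (max(pDs) > 2 * fm["npar"])   # placeholder threshold for "large variability"
    warning("Effective numbers of parameters differ substantially across ",
            "DIC/WAIC/LOO; pD-based fit indices may be unreliable.")
  invisible(pDs)
}

check_pD(fit)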
I am not sure this is exactly the same as the negative DF issue (definitely related, maybe not exactly the same). Maybe we warn about negative DF (in blavFitIndices) separately from the effective number of parameters warning (from fitMeasures).