ecmerkle/blavaan

adjBGammaHat > 1 and BRMSEA is NA

littlehifive opened this issue · 12 comments

Hi Ed,

For the following model, I got NAs for BRMSEA and a value over 1 for adjBGammaHat (which should be bounded between 0 and 1, right?). Do you know why that could be the case?

For BTLI, I understand it can be over 1 from the formula, but is larger better, or is closer to 1 better?

Thanks!

> summary(FI_fit_egra_test_2, central.tendency = c("mean","median","mode"), prob = .90)

Posterior summary statistics and highest posterior density (HPD) 90% credible intervals for devm-based fit indices:

               EAP Median   MAP    SD lower upper
BRMSEA         NaN     NA    NA    NA   NaN    NA
BGammaHat    0.783  0.784 0.784 0.002 0.781 0.786
adjBGammaHat 1.468  1.467 1.466 0.003 1.462 1.473
BMc          0.436  0.437 0.438 0.003 0.431 0.442
BCFI         0.632  0.632 0.633 0.004 0.626 0.637
BTLI         1.082  1.082 1.082 0.001 1.081 1.083
BNFI         0.648  0.649 0.649 0.003 0.642 0.654
Warning message:
In (function (..., deparse.level = 1)  :
  number of columns of result is not a multiple of vector length (arg 1)
blavaan (0.4-2.912) results of 1000 samples after 500 adapt/burnin iterations

  Number of observations                           611

  Number of missing patterns                         1

  Statistic                                 MargLogLik         PPP
  Value                                      -9681.216       0.000

Latent Variables:
                   Estimate  Post.SD pi.lower pi.upper     Rhat    Prior       
  EGRA_2m1 =~                                                                  
    VOC_2m1           0.106    0.044    0.022    0.195    1.000    normal(0, 2)
    LTRC_2m1          0.257    0.062    0.172    0.347    1.000    normal(0, 2)
    GRPH_2m1          0.288    0.065    0.207    0.377    1.000    normal(0, 2)
    INV_2m1           0.599    0.114    0.530    0.690    1.000    normal(0, 2)
    OPR_2m1           0.921    0.168    0.851    1.006    1.000    normal(0, 2)
    RDCP_2m1          0.680    0.129    0.587    0.781    1.000    normal(0, 2)
  EGRA_3m1 =~                                                                  
    VOC_3m1           0.158    0.045    0.072    0.245    1.000    normal(0, 2)
    LTRC_3m1          0.225    0.042    0.144    0.307    1.000    normal(0, 2)
    GRPH_3m1          0.243    0.043    0.160    0.325    1.001    normal(0, 2)
    INV_3m1           0.738    0.045    0.667    0.809    1.000    normal(0, 2)
    OPR_3m1           0.872    0.050    0.793    0.947    1.004    normal(0, 2)
    RDCP_3m1          0.790    0.051    0.707    0.871    0.999    normal(0, 2)

Covariances:
                   Estimate  Post.SD pi.lower pi.upper     Rhat    Prior       
 .LTRC_2m1 ~~                                                                  
   .GRPH_2m1          0.447    0.042    0.367    0.534    1.000       beta(1,1)
 .VOC_2m1 ~~                                                                   
   .RDCP_2m1          0.023    0.038   -0.051    0.096    1.000       beta(1,1)
 .OPR_2m1 ~~                                                                   
   .RDCP_2m1         -0.188    0.033   -0.244   -0.113    1.000       beta(1,1)
 .LTRC_3m1 ~~                                                                  
   .GRPH_3m1          0.475    0.044    0.394    0.566    1.001       beta(1,1)
 .VOC_3m1 ~~                                                                   
   .RDCP_3m1          0.028    0.035   -0.042    0.095    1.000       beta(1,1)
 .OPR_3m1 ~~                                                                   
   .RDCP_3m1         -0.241    0.029   -0.290   -0.178    1.002       beta(1,1)
  EGRA_2m1 ~~                                                                  
    EGRA_3m1          0.620    0.030    0.563    0.679    1.000     lkj_corr(1)

Intercepts:
                   Estimate  Post.SD pi.lower pi.upper     Rhat    Prior       
   .VOC_2m1          -0.000    0.040   -0.079    0.080    0.999    normal(0, 5)
   .LTRC_2m1          0.001    0.041   -0.076    0.082    1.001    normal(0, 5)
   .GRPH_2m1         -0.000    0.040   -0.080    0.077    1.003    normal(0, 5)
   .INV_2m1          -0.001    0.041   -0.084    0.079    1.001    normal(0, 5)
   .OPR_2m1          -0.001    0.041   -0.082    0.079    1.001    normal(0, 5)
   .RDCP_2m1         -0.001    0.041   -0.079    0.082    1.001    normal(0, 5)
   .VOC_3m1          -0.000    0.041   -0.078    0.080    0.999    normal(0, 5)
   .LTRC_3m1         -0.001    0.040   -0.080    0.077    1.000    normal(0, 5)
   .GRPH_3m1         -0.000    0.040   -0.077    0.078    1.001    normal(0, 5)
   .INV_3m1          -0.001    0.041   -0.083    0.077    1.001    normal(0, 5)
   .OPR_3m1          -0.002    0.041   -0.082    0.078    1.002    normal(0, 5)
   .RDCP_3m1         -0.002    0.043   -0.087    0.079    1.000    normal(0, 5)
    EGRA_2m1          0.000                                                    
    EGRA_3m1          0.000                                                    

Variances:
                   Estimate  Post.SD pi.lower pi.upper     Rhat    Prior       
   .VOC_2m1           0.995    0.057    0.892    1.111    0.999 gamma(1,.5)[sd]
   .LTRC_2m1          0.938    0.053    0.844    1.050    0.999 gamma(1,.5)[sd]
   .GRPH_2m1          0.922    0.054    0.822    1.034    1.000 gamma(1,.5)[sd]
   .INV_2m1           0.640    0.041    0.561    0.723    1.001 gamma(1,.5)[sd]
   .OPR_2m1           0.140    0.043    0.071    0.242    1.003 gamma(1,.5)[sd]
   .RDCP_2m1          0.533    0.053    0.432    0.643    1.001 gamma(1,.5)[sd]
   .VOC_3m1           0.982    0.056    0.880    1.096    1.000 gamma(1,.5)[sd]
   .LTRC_3m1          0.957    0.055    0.857    1.074    1.000 gamma(1,.5)[sd]
   .GRPH_3m1          0.949    0.056    0.847    1.065    1.000 gamma(1,.5)[sd]
   .INV_3m1           0.467    0.034    0.401    0.534    1.004 gamma(1,.5)[sd]
   .OPR_3m1           0.251    0.042    0.177    0.341    1.008 gamma(1,.5)[sd]
   .RDCP_3m1          0.384    0.043    0.303    0.470    1.001 gamma(1,.5)[sd]
    EGRA_2m1          1.000                                                    
    EGRA_3m1          1.000                                                    

Thanks for the report, I will take a look soon. I am also looping in the two people who know this part of the code best:

@maugavilla @TDJorgensen : in recent months, Zezhen has helpfully found a variety of bugs and unclear issues in blavaan. I am wondering whether you have seen the issues that he describes here. The NAs could possibly be related to recent changes that I made to logl computations, but I am not certain about that.

@littlehifive could you share more information so we can reproduce the error? Something like sample size, variable means, covariance matrix, and full code? With this we could simulate similar data and reproduce it.

I think I know what it might be, but I would like to see more information first.

@maugavilla

I created a simulated dataset based on my data, and I was able to reproduce the problems with BRMSEA and adjusted GammaHat using the following model. Let me know if you need anything else from me, thanks!

test_dat.csv

library(blavaan)

# load test_dat.csv here

# lavaan
model <- '
  # measurement model
  EGRA_2m1 =~ VOC_2m1 + LTRC_2m1 + GRPH_2m1 + INV_2m1 + OPR_2m1 + RDCP_2m1
  GRPH_2m1 ~~ LTRC_2m1
  RDCP_2m1 ~~ VOC_2m1
  RDCP_2m1 ~~ OPR_2m1
  
  # measurement model
  EGRA_3m1 =~ VOC_3m1 + LTRC_3m1 + GRPH_3m1 + INV_3m1 + OPR_3m1 + RDCP_3m1
  GRPH_3m1 ~~ LTRC_3m1
  RDCP_3m1 ~~ VOC_3m1
  RDCP_3m1 ~~ OPR_3m1
  
  # correlation
  EGRA_3m1 ~~ EGRA_2m1
'

# priors
mydp <- dpriors(tau = "normal(0, 0.5)",
                lambda = "normal(0, 2)",
                beta = "normal(0.5,0.2)",
                rho = "beta(1,1)",
                nu = "normal(0, 5)")

# fit model
fit <- bcfa(model, data = test_dat,
            dp = mydp,
            std.lv=TRUE,
            seed = 1234)

# fit indices
blavFitIndices(fit)

# Posterior mean (EAP) of devm-based fit indices:
#   
#       BRMSEA    BGammaHat adjBGammaHat          BMc 
#          NaN        0.801        1.680        0.475 
# Warning messages:
# 1: 41 (20.5%) p_waic estimates greater than 0.4. We recommend trying loo instead. 
# 2: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
# 3: In sqrt(nonc/(dif.ppD * N)) : NaNs produced



@littlehifive

Thanks, I see what the problem is. I think this is related to the logl change that @ecmerkle mentioned earlier; that change may have affected the pD calculation for loo and waic. As you can see from the fitMeasures() output, p_dic, p_waic, and p_loo are different calculations and do vary, but they are usually close to each other. Here, however, p_dic = 36 while p_waic = 126 and p_loo = 121.
These differences should not be this large.

fitMeasures(fit)
      npar       logl        ppp        bic        dic      p_dic       waic     p_waic    se_waic 
    43.000  -3081.635      0.000   6390.883   6235.755     36.242   6343.645    126.500    271.855 
     looic      p_loo     se_loo margloglik 
  6333.824    121.589    264.850  -3217.457

The thing is that pD is used in the fit index calculations. For now, the quick fix is to tell the function to use the DIC pD, like this:

bfits <- blavFitIndices(fit, pD = "dic")
summary(bfits)

With this change it works fine.

@ecmerkle could you check why p_waic and p_loo might be so much larger now? Curiously, this doesn't happen with the example model, so my guess would be that it is related to the residual correlations?

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '

fit1 <- bcfa(HS.model, data = HolzingerSwineford1939,
             n.chains = 2, burnin = 1000, sample = 1000)
fitMeasures(fit1)

> fitMeasures(fit1)
      npar       logl        ppp        bic        dic      p_dic       waic     p_waic    se_waic 
    30.000  -3738.344      0.000   7647.802   7535.950     29.631   7539.915     33.019     86.116 
     looic      p_loo     se_loo margloglik 
  7540.114     33.119     86.137  -3862.744

Thanks for looking at it. This is an interesting issue that I haven't yet figured out. I have drastically simplified the model, and I still see inflated p_waic and p_loo:

model <- ' EGRA_2m1 =~ VOC_2m1 + LTRC_2m1 + GRPH_2m1 '

fit <- bcfa(model, data = test_dat, dp = mydp, std.lv = TRUE)

I have also tried it with older versions of blavaan (0.4-1 and 0.3-17) and with an older version of loo (2.4-1). For all versions, I see similar behavior. And, like Mauricio mentions, the usual test models always seem to work as expected.

One thing I noticed is that, for the simplified model above, the estimated factor variance can be close to 0 if you use std.lv = FALSE. I will look at it some more tomorrow.

For BTLI, I understand it can be over 1 from the formula

In frequentist SEM, adjusted GFI (and its small-sample improvement, adjusted GammaHat) can be negative in very poorly fitting models. But values > 1 only happen in this case of deriving a Bayesian analogue (not an equivalent) of that quantity.

adjBGammaHat can exceed 1 whenever the quantity analogous to "degrees of freedom" (DF below) is negative:

adjBGammaHat <- 1 - (pStar / DF) * (1 -  BGammaHat)

In our paper, we use the canonical number of sufficient statistics minus the (estimated) number of estimated parameters (pD):

# number of sufficient stats
pStar <- ((nvar * (nvar + 3))/2) # multiplied by number of groups when relevant
# estimated number of estimated parameters
pD # borrowed from DIC, WAIC, or LOO-IC
# analogous to degrees of freedom
DF <- pStar - pD

The negative DF is occurring with your data, and it is difficult to understand why.
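To make that concrete, here is a minimal numeric sketch using round numbers close to those reported above (a rough illustration only; the exact values differ because pD and the fit indices are computed per posterior sample):

nvar  <- 12                          # 12 indicators in the model above
pStar <- nvar * (nvar + 3) / 2       # 90 sufficient statistics
pD    <- 126                         # roughly the p_waic from fitMeasures(fit)
DF    <- pStar - pD                  # -36: the "degrees of freedom" analogue is negative
BGammaHat <- 0.78                    # close to the posterior mean reported above
1 - (pStar / DF) * (1 - BGammaHat)   # about 1.55, i.e., adjBGammaHat > 1
## a negative DF also ends up inside sqrt() in the BRMSEA calculation,
## which is likely where the "NaNs produced" warning comes from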

  • I initially thought it might be the residual correlations, which were handled by adding factors in blavaan's JSS paper, but I think that is no longer relevant with the faster Stan model that marginalizes out the latent variables by using summary stats. Also, as @ecmerkle found with a smaller model without residual correlations, the issue persists.
  • I also noticed the indicators are not all highly correlated, and their pattern is not easy to discern. I tried a different set of indicators that were substantially correlated (the last 3 indicators of the 2m1 factor), but the issue persists.
  • I noticed some extreme outliers and set them to NA, but there is still a larger pD from WAIC and LOO than pStar (= 9). Looking at their histograms, there is heavy inflation among the last 3 indicators (> 50 on OPR_2m1, > 70% on INV_2m1 and RDCP_2m1). The first 3 indicators look more normal, but the issue persists in all these cases.
  • Note from our paper that these indices are not designed to be used in combination with informative priors, and doing so changed their behavior in our applied examples. However, I tried your model (and smaller combinations) without your mydp priors, and the issue persisted.

Some syntax I wrote to try the above:

## try getting rid of extreme outliers
dat <- test_dat
vv <- c("VOC_2m1","LTRC_2m1","GRPH_2m1","INV_2m1","OPR_2m1","RDCP_2m1")
for (i in vv) {
  sc <- abs(scale(dat[,i]))
  dat[,i] <- ifelse(sc > 3, NA, dat[,i])
}

## model with first 3 indicators (more normally distributed, but low correlations)
model <- ' EGRA_2m1 =~ VOC_2m1 + LTRC_2m1 + GRPH_2m1 '
## or try fitting model to most highly correlated indicators (but highly inflated)
model <- ' EGRA_2m1 =~ INV_2m1 + OPR_2m1 + RDCP_2m1 '

#fit  <-  cfa(model, data = test_dat,            std.lv = TRUE) # quick check with ML
bfit <- bcfa(model, data =      dat, dp = mydp, std.lv = TRUE)
blavFitIndices(bfit)

So I can't rule out that there is something pathological in these simulated data, but I can't be sure that is what is causing the pD estimates to exceed pStar in all these cases. I hope some of these explorations might help @ecmerkle or @maugavilla by triggering ideas about where to look, but I am not convinced it is caused by the newer logl calculations. If I interpreted Ed's last message correctly, the issue existed before the logl change.

Instead of a problem with blavaan (or loo) source code, I expect it is an issue that can arise with particular data characteristics (or their interaction with model characteristics) that we haven't discovered yet. It is possible that the normal likelihood function is inappropriate for your data, which causes more variation across the parameter space, leading to greater pD estimates (especially in the WAIC and LOO cases, which more completely utilize the available posterior information).

Thank you all for your comments! I am looping in Ben Goodrich (@bgoodri) here too.

I am not sure if some background information about my data would be helpful to solving this issue:

  • EGRA means early grade reading assessment, and the items are specific domains of reading skills (e.g., VOC is vocabulary, LTRC is letter recognition).
  • Because the assessment was done with a quite sensitive child population in a crisis-affected country, many assessed kids had zero scores on many tested domains (e.g., they were tested in French when they did not know any French). The simulated variables here were standardized, but in the original data most variables are very zero-inflated.
  • Because of this issue, Ben and I weren't able to build a CFA model that converged for these zero-inflated EGRA measures at each wave of the three-wave study, and he suggested that I try subtracting the wave 1 scores from the wave 2 and wave 3 scores and modeling the change scores as if they were continuous.
  • But as it turns out, even when we model the change scores, some variables still have quite extreme distributions, because many of those who had zero at baseline also had zero at endline (i.e., they didn't learn French at all due to issues like non-schooling or poverty).

Terrence's interpretation is right: because this also happens in older versions of blavaan, I don't think that recent changes broke something. I will double-check the log-likelihood computations to make sure, but I am leaning towards "particular data characteristics" as the explanation here.

Instead of defining outliers as Terrence did, I used the Pareto k values coming out of loo (removing cases where k > .5). This leads to about 5 cases being removed (sorry, I didn't set the seed). But after removing them, the DIC p_d drops to around -9. The old WinBUGS user manual says that negative p_d can happen with prior-data conflict, or when the posterior means fall in an area of low density (say, when there is a bimodal posterior). I think this is some evidence for the "data characteristics" explanation.

If you compute DIC in the JAGS way (using jags.ic = TRUE), the p_d for DIC seems more reasonable regardless of whether the cases are included.

Code is below:

model_ll <- loo::extract_log_lik(fit@external$mcmcout)

loo_out <- loo::loo(model_ll)

outl <- which(loo_out$diagnostics$pareto_k > .5)

reduced_dat <- test_dat[-outl,]
fitr1 <- bcfa(model, data = reduced_dat, jags.ic = TRUE)
fitMeasures(fitr1)
#      npar       logl        ppp        bic        dic      p_dic   dic_jags 
#     9.000   -747.657      0.482   1542.678   1477.012     -9.151   1511.448 
#p_dic_jags       waic     p_waic    se_waic      looic      p_loo     se_loo 
#     8.067   1498.936     12.245     55.656   1499.041     12.298     55.685 
#margloglik 
#  -781.064 

Given the explanation about the original data, I find it interesting that this seems to be a good diagnostic of the data mismatching the model's likelihood, and that loo and waic are sensitive to this but dic is not.

When I try excluding outliers with the larger model, it takes several rounds of excluding them based on pareto_k, which doesn't seem like a sustainable solution to me.

If the data are actually zero-inflated, I think you should try running the indicators as categorical, with one category for 0 and the remaining responses ordered above it. This would make it similar to a graded response model, which is better suited to these cases than treating the indicators as continuous.
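A rough sketch of what that could look like, assuming the raw (unstandardized) EGRA scores are available; the data frame name below is a placeholder, and for simplicity the positive scores are split into just two ordered categories:

library(blavaan)

raw_vars <- c("VOC_2m1", "LTRC_2m1", "GRPH_2m1", "INV_2m1", "OPR_2m1", "RDCP_2m1")
dat_ord  <- test_dat_raw   # hypothetical data frame holding the original (raw) scores

for (v in raw_vars) {
  x <- dat_ord[[v]]
  ## category 1 = the zero scores; the positive scores are split at their median
  dat_ord[[v]] <- cut(x, breaks = c(-Inf, 0, median(x[x > 0], na.rm = TRUE), Inf),
                      labels = FALSE)
}

## ordered = tells blavaan to treat these indicators as ordered categorical
fit_ord <- bcfa(model, data = dat_ord, std.lv = TRUE, ordered = raw_vars)
blavFitIndices(fit_ord)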

In another project we modeled zero-inflation with a mixture IRT model, treating the zero-inflation as a correlated mixture with the GRM; you can find the Stan code here: https://osf.io/hvkfn/
Magnus, B. E., & Garnier-Villarreal, M. (2022). A multidimensional zero-inflated graded response model for ordinal symptom data. Psychological Methods, 27(2), 261–279. https://doi.org/10.1037/met0000395

I agree that excluding people on the basis of pareto_k is not a solution... I was just looking at how sensitive the information criteria were to those observations.

It seems like we are agreeing that there is not a bug here, but that the behavior of these fit metrics can be suboptimal for models that do not fit so well.

So it seems a solution could be a warning from fitMeasures() if the effective number of parameters is negative, or if the effective number of parameters varies widely across metrics. Also, related to the above post from @TDJorgensen: does it make sense to restrict DF so that it never goes below 0 there?

does it make sense to restrict DF to never go below 0

I would not advise that. It could end up hiding problems that should be brought to attention.

a warning from fitMeasures if the effective number of parameters are negative, or if effective number of parameters have large variability across metrics

That sounds like a good solution. By "different metrics" do you mean DIC vs. WAIC vs. LOO_IC? We don't have a lot to go on here, so I don't think the warning can be very informative about what to do (maybe there is nothing to do) or what to investigate (there might be other issues). Perhaps something like:

warning("The estimated number of model parameters exceeds the number of sample statistics (covariances, etc.), so fit-index calculations may lead to uninterpretable results.")

Yes, I mean that we look at the effective number of parameters under DIC, WAIC, and LOO. If the DIC pD is negative, or if they differ by "a lot" (maybe some percentage away from a simple parameter count?), then a warning appears.
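A rough sketch of the kind of check that could sit behind such a warning (the function name and the 25% threshold are placeholders, not actual blavaan internals):

## compare the effective numbers of parameters against the simple parameter count
check_pD <- function(fm, tol = 0.25) {
  pDs  <- fm[c("p_dic", "p_waic", "p_loo")]
  npar <- fm["npar"]
  if (any(pDs < 0) || any(abs(pDs - npar) / npar > tol)) {
    warning("Effective number of parameters is negative or differs markedly ",
            "from the parameter count; information criteria and fit indices ",
            "may be unreliable.")
  }
  invisible(pDs)
}

check_pD(fitMeasures(fit))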

I am not sure this is exactly the same as the negative DF issue (definitely related, maybe not exactly the same). Maybe we warn about negative DF (in blavFitIndices) separately from the effective-number-of-parameters warning (from fitMeasures).