group 2 latent mean has wrong sign?
jpritikin opened this issue · 9 comments
lavaan 0.6-19 (cran), blavaan 0.5-6.1311 (via github)
It seems like the pm group (group 2) is not informed by the data: the posterior for the latent mean looks like the prior distribution. This seems to happen when I simulate data with a very high signal-to-noise ratio, where some of the columns have zero variance.
Estimate Post.SD pi.lower pi.upper Rhat Prior
phenm 12.915 3.409 7.493 20.490 1.001 normal(0,10)
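For reference, here is a quick way to spot the degenerate columns (a minimal sketch, assuming pmData and AbsItemNames are defined as in the script below):

# count observed response categories per item; an item whose responses all
# fall in a single category carries no information about the latent mean
sapply(pmData[AbsItemNames], function(x) length(unique(na.omit(x))))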
Code:
library(rstan)
options(mc.cores = max(1L, parallel::detectCores() %/% 2))  # integer core count
library(blavaan)
future::plan("multicore")
NoYesLevels <- c('No','Not sure','Yes')
EuphoricLevels <- c('Euphoric', 'Mildly euphoric', 'No change', 'Mild discomfort', 'Severe discomfort')
ImpactLevels <- c('Helped','Slightly helped','No impact','Slightly hindered','Hindered')
AbsItemNames <- c('boredom','frustration','anxiety', 'abort','bodyAcute','bodyAfter','sleep')
ImportStudyItems <- function(df) {
if (any(is.na(match(AbsItemNames, colnames(df))))) stop("Items are missing")
df[['boredom']] <- ordered(df[['boredom']], levels=NoYesLevels)
df[['frustration']] <- ordered(df[['frustration']], levels=NoYesLevels)
df[['anxiety']] <- ordered(df[['anxiety']], levels=NoYesLevels)
df[['abort']] <- ordered(df[['abort']], levels=NoYesLevels)
df[['bodyAcute']] <- ordered(df[['bodyAcute']], levels=EuphoricLevels)
df[['bodyAfter']] <- ordered(df[['bodyAfter']], levels=EuphoricLevels)
df[['sleep']] <- ordered(df[['sleep']], levels=ImpactLevels)
df
}
psiData <- ImportStudyItems(read.csv("psi.csv"))
pmData <- ImportStudyItems(read.csv("pm.csv"))
combinedData <- rbind(cbind(psiData, group="psi"),
cbind(pmData, group="pm"))
combinedData$group <- factor(combinedData$group, levels=c('psi','pm'))
spec <- '
phenom =~ boredomL*boredom + frustrationL*frustration + anxietyL*anxiety + abortL*abort + bodyAcuteL*bodyAcute + bodyAfterL*bodyAfter + sleepL*sleep
phenom ~ c(0,NA)*1  # latent mean fixed to 0 in psi, free in pm
boredom ~ boredomM*1
frustration ~ frustrationM*1
anxiety ~ anxietyM*1
abort ~ abortM*1
bodyAcute ~ bodyAcuteM*1
bodyAfter ~ bodyAfterM*1
sleep ~ sleepM*1
phenom ~~ 1*phenom
boredom ~~ 1*boredom
frustration ~~ 1*frustration
anxiety ~~ 1*anxiety
abort ~~ 1*abort
bodyAcute ~~ 1*bodyAcute
bodyAfter ~~ 1*bodyAfter
sleep ~~ 1*sleep
# thresholds fixed at standard normal quantiles, e.g. qnorm(1/3) = -.431
boredom | -.431*t1 + .431*t2
frustration | -.431*t1 + .431*t2
anxiety | -.431*t1 + .431*t2
abort | -.431*t1 + .431*t2
bodyAcute | -.842*t1 + -.253*t2 + .253*t3 + .842*t4
bodyAfter | -.842*t1 + -.253*t2 + .253*t3 + .842*t4
sleep | -.842*t1 + -.253*t2 + .253*t3 + .842*t4
'
fit <- blavaan(spec, data=combinedData,
n.chains = 4,
ordered = AbsItemNames,
dp = dpriors(nu="normal(0,1)", alpha="normal(0,10)", lambda="normal(0,1)"),
group = "group",
std.lv=TRUE, allow.empty.cell = TRUE, test = "none")
summary(fit)
Okay, I'm not sure what's going on. I tried a standard normal prior, alpha="normal(0,1)". I also tried adding this code:
psiData <- ImportStudyItems(read.csv("psi.csv"))
pmData <- ImportStudyItems(read.csv("pm.csv"))
if (TRUE) {
  # drop incomplete rows, then pad the pm group with random responses so
  # that no item column is left with zero variance
  pmData <- pmData[complete.cases(pmData), ]
  extraRows <- 15
  df <- as.data.frame(lapply(pmData, function(x) {
    if (is.factor(x)) ordered(sample(levels(x), size=extraRows, replace=TRUE),
                              levels=levels(x))  # preserve the ordered levels
    else rep(1, extraRows)
  }))
  pmData <- rbind(pmData, df)
}
Why do I get a positive estimate for the phenom latent mean?
Estimate Post.SD pi.lower pi.upper Rhat Prior
phenm 1.934 0.338 1.333 2.638 1.001 normal(0,1)
I'm pretty sure it should be negative. Given the estimated factor loadings, you can easily eyeball that sapply(pmData, unclass) sits lower on the latent trait than sapply(psiData, unclass).
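For instance, a loading-weighted sum score makes the gap visible (a rough sketch that treats the integer response codes as numeric, which is only a heuristic; the loading values here are illustrative posterior means):

# illustrative loadings (rough posterior means from the fit)
L <- c(boredom=.70, frustration=.67, anxiety=.52, abort=-.40,
       bodyAcute=.31, bodyAfter=.36, sleep=.20)
score <- function(df) sapply(df[names(L)], unclass) %*% L
mean(score(pmData), na.rm=TRUE) - mean(score(psiData), na.rm=TRUE)  # expect < 0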
For example, if I fit the same model with OpenMx then the pm group mean is very negative.
library(OpenMx)
mxOption(key='Number of Threads', value=parallel::detectCores())
NoYesLevels <- c('No','Not sure','Yes')
EuphoricLevels <- c('Euphoric', 'Mildly euphoric', 'No change', 'Mild discomfort', 'Severe discomfort')
ImpactLevels <- c('Helped','Slightly helped','No impact','Slightly hindered','Hindered')
AbsItemNames <- c('boredom','frustration','anxiety', 'abort','bodyAcute','bodyAfter','sleep')
ImportStudyItems <- function(df) {
if (any(is.na(match(AbsItemNames, colnames(df))))) stop("Items are missing")
df[['boredom']] <- ordered(df[['boredom']], levels=NoYesLevels)
df[['frustration']] <- ordered(df[['frustration']], levels=NoYesLevels)
df[['anxiety']] <- ordered(df[['anxiety']], levels=NoYesLevels)
df[['abort']] <- ordered(df[['abort']], levels=NoYesLevels)
df[['bodyAcute']] <- ordered(df[['bodyAcute']], levels=EuphoricLevels)
df[['bodyAfter']] <- ordered(df[['bodyAfter']], levels=EuphoricLevels)
df[['sleep']] <- ordered(df[['sleep']], levels=ImpactLevels)
df
}
psiData <- ImportStudyItems(read.csv("psi.csv"))
pmData <- ImportStudyItems(read.csv("pm.csv"))
sg1 <- mxModel(
name="substance", type="RAM",
manifestVars = AbsItemNames, latentVars = "phenom",
# one threshold fewer than the number of levels for each ordinal item
mxThreshold(AbsItemNames, nThresh = sapply(psiData, nlevels)[AbsItemNames] - 1L),
mxPath("one", AbsItemNames, labels=paste0(AbsItemNames, 'M')),
mxPath(AbsItemNames, arrows=2, values=1, free=FALSE),  # residual variances fixed to 1
mxPath('phenom', AbsItemNames, labels=paste0(AbsItemNames, 'L')),
mxPath('phenom', arrows=2, values=1, free=FALSE))  # latent variance fixed to 1
# identify the loading signs by bounding the boredom loading at zero
sg1$A$lbound['boredom','phenom'] <- 0
psiG <- mxModel(sg1, name="psi", mxData(psiData, type="raw"))
pmG <- mxModel(sg1, name="pm", mxData(pmData, type="raw"),
mxPath('one', 'phenom', labels = 'pmM', lbound=-10,ubound=10))
baseModel <- mxModel("base", psiG, pmG,
mxFitFunctionMultigroup(c("psi", "pm")), mxCI('pmM'))
baseModel <- mxOption(baseModel, 'Calculate Hessian','No')
baseModel <- mxOption(baseModel, 'Standard Errors','No')
fit1 <- mxRun(baseModel) # , intervals = TRUE
summary(fit1)
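The pm latent mean can then be read straight off the fitted model, e.g.:

omxGetParameters(fit1)['pmM']  # comes out strongly negative on this data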
I wrote some code to fit the Stan model and OpenMx to the same data. The first three columns (N, insight, rep) are data-simulation parameters. The Rhat and ess columns are Stan diagnostics. The cor column is the correlation between the Stan and OpenMx parameter estimates. The pmL and pmU columns are the 0.95 HPD interval for the pm latent mean from Stan.
As you can see, cor is often above 0.95. This is mysterious: if the Stan model weren't working, we'd expect it to fail consistently, yet the parameter estimates often match very well. Due to how I simulate the data, pmL and pmU should always be negative.
N insight rep d Rhat ess cor pmL pmU
1 40 -2.0 1 0 1.01 760.64 -0.943 3.20 5.48
2 30 -2.0 1 0 1.01 973.78 0.968 -4.80 -2.56
3 40 -1.5 1 0 1.00 1004.68 0.978 -5.05 -2.89
4 30 -1.5 1 0 1.00 828.98 0.973 -5.06 -2.82
5 40 -1.0 1 0 1.53 7.17 0.848 -4.19 3.64
6 30 -1.0 1 0 1.01 1263.20 -0.854 1.58 3.55
7 40 -0.5 1 0 1.00 1221.60 0.925 -3.00 -1.26
8 30 -0.5 1 0 1.53 7.19 0.790 -2.89 2.56
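For what it's worth, the cor column comes from something along these lines (a sketch; the label matching via intersect is hypothetical, since in practice the two packages' parameter labels need manual alignment before correlating):

stanEst <- blavInspect(fit, "postmean")  # posterior means of free parameters
mxEst   <- omxGetParameters(fit1)        # maximum likelihood estimates
# hypothetical alignment: real labels differ between the two packages
shared <- intersect(names(stanEst), names(mxEst))
cor(as.numeric(stanEst[shared]), as.numeric(mxEst[shared]))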
I wonder if there is something wonky about how you constrain the factor loadings to be positive? Is the sign-fixing adjustment applied to both groups?
The thing that constrains a factor loading to be positive for identification is only used if you add the argument std.lv = TRUE. Right now, I believe the signs of the factor loadings become negative, which makes your latent mean positive.
But when I add std.lv = TRUE to your model estimation, the model has trouble converging. This makes me wonder whether the small sample size leads to a model that is weakly identified, which becomes problematic for certain datasets. When I changed the lambda prior to normal(1,.1), I get convergence and the latent mean is in the correct direction. Maybe that prior is too strong, but I would at least use a prior that places most density above 0, because we have sign indeterminacy here.
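Concretely, that would be something like this (your call from above, with only the lambda prior changed):

fit <- blavaan(spec, data=combinedData,
n.chains = 4,
ordered = AbsItemNames,
dp = dpriors(nu="normal(0,1)", alpha="normal(0,10)",
             lambda="normal(1,.1)"),  # most prior mass above 0
group = "group",
std.lv=TRUE, allow.empty.cell = TRUE, test = "none")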
"But when I add std.lv = TRUE to your model estimation, the model has trouble converging."
How so? I consistently get a maximum R-hat of 1.0044 and a minimum ESS of 920.2.
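(I'm checking those with something like:)

max(blavInspect(fit, "psrf"), na.rm=TRUE)  # maximum R-hat
min(blavInspect(fit, "neff"), na.rm=TRUE)  # minimum effective sample size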
"But when I add std.lv = TRUE"
Add it? All the code I posted in this bug already has std.lv=TRUE.
"Right now, I believe the signs of the factor loadings become negative"
Hm?
Group 2 [pm]:
Latent Variables:
Estimate Post.SD pi.lower pi.upper Rhat Prior
phenom =~
bordm (brdL) 0.699 0.174 0.421 1.098 1.002
frstr (frsL) 0.674 0.163 0.409 1.043 1.005
anxty (anxL) 0.520 0.122 0.314 0.789 1.002
abort (abrL) -0.404 0.093 -0.611 -0.245 1.001
bdyAc (bdyAcL) 0.305 0.071 0.177 0.455 0.999
bdyAf (bdyAfL) 0.362 0.079 0.223 0.533 1.000
sleep (slpL) 0.195 0.054 0.095 0.301 1.000
Intercepts:
Estimate Post.SD pi.lower pi.upper Rhat Prior
phenm 4.318 0.572 3.273 5.491 1.001 normal(0,1)
You mean with std.lv=FALSE? Sure, it could happen that most loadings get a negative sign and the group mean becomes positive. Gosh, I'd really prefer to keep std.lv=TRUE and be able to trust that the reported signs are consistent across datasets.
"But when I add std.lv = TRUE to your model estimation, the model has trouble converging."
Why should std.lv=FALSE make any difference to estimation? If I remember your paper accurately, the standardization was purely a post-processing step to make the R-hats work, etc. That's how I implemented it in the pcFactorStan package.
I was talking about std.lv=TRUE as opposed to FALSE; I just got your model from this issue confused with the model from the earlier issue (which did not have std.lv=TRUE). I will need to look at it some more...
Ok, I managed to look some more, and I think the problem is related to wonky factor loading constraints. When I flip the signs of the loadings in generated quantities, I neglected the factor means. So the model is really estimating negative loadings, which leads to a positive factor mean; I then flip the sign of the loadings but forget to flip the factor mean.
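In post-processing terms, the fix amounts to something like this (an R sketch of the generated-quantities logic, not the actual blavaan code):

# per posterior draw: if the loadings come out mostly negative, flip the
# loadings AND the free latent means together so the two stay consistent
flip <- function(lambda, alpha) {
  if (mean(lambda) < 0) {
    lambda <- -lambda
    alpha  <- -alpha  # this is the part that was being forgotten
  }
  list(lambda = lambda, alpha = alpha)
}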
Glad you are getting close to a fix!
As far as I can tell, this might be the last problem with the model.