group 2 latent mean has wrong sign?
jpritikin opened this issue · 9 comments
lavaan 0.6-19 (cran), blavaan 0.5-6.1311 (via github)
It seems like the pm group (group 2) is not informed by the data: the posterior for the latent mean looks like the prior distribution. This seems to happen when I simulate data with a very high signal-to-noise ratio, where some of the columns have zero variance.
Estimate Post.SD pi.lower pi.upper Rhat Prior
phenm 12.915 3.409 7.493 20.490 1.001 normal(0,10)
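For reference, here is a quick way to spot the degenerate columns (a minimal sketch, assuming pmData and AbsItemNames are defined as in the script below):

# count observed response categories per item; an item whose responses all
# fall in a single category carries no information about the latent mean
sapply(pmData[AbsItemNames], function(x) length(unique(na.omit(x))))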
Code:
library(rstan)
options(mc.cores = max(1L, parallel::detectCores() %/% 2))  # integer core count
library(blavaan)
future::plan("multicore")
NoYesLevels <- c('No','Not sure','Yes')
EuphoricLevels <- c('Euphoric', 'Mildly euphoric', 'No change', 'Mild discomfort', 'Severe discomfort')
ImpactLevels <- c('Helped','Slightly helped','No impact','Slightly hindered','Hindered')
AbsItemNames <- c('boredom','frustration','anxiety', 'abort','bodyAcute','bodyAfter','sleep')
ImportStudyItems <- function(df) {
if (any(is.na(match(AbsItemNames, colnames(df))))) stop("Items are missing")
df[['boredom']] <- ordered(df[['boredom']], levels=NoYesLevels)
df[['frustration']] <- ordered(df[['frustration']], levels=NoYesLevels)
df[['anxiety']] <- ordered(df[['anxiety']], levels=NoYesLevels)
df[['abort']] <- ordered(df[['abort']], levels=NoYesLevels)
df[['bodyAcute']] <- ordered(df[['bodyAcute']], levels=EuphoricLevels)
df[['bodyAfter']] <- ordered(df[['bodyAfter']], levels=EuphoricLevels)
df[['sleep']] <- ordered(df[['sleep']], levels=ImpactLevels)
df
}
psiData <- ImportStudyItems(read.csv("psi.csv"))
pmData <- ImportStudyItems(read.csv("pm.csv"))
combinedData <- rbind(cbind(psiData, group="psi"),
cbind(pmData, group="pm"))
combinedData$group <- factor(combinedData$group, levels=c('psi','pm'))
spec <- '
phenom =~ boredomL*boredom + frustrationL*frustration + anxietyL*anxiety + abortL*abort + bodyAcuteL*bodyAcute + bodyAfterL*bodyAfter + sleepL*sleep
phenom ~ c(0,NA)*1  # latent mean fixed to 0 in psi, free in pm
boredom ~ boredomM*1
frustration ~ frustrationM*1
anxiety ~ anxietyM*1
abort ~ abortM*1
bodyAcute ~ bodyAcuteM*1
bodyAfter ~ bodyAfterM*1
sleep ~ sleepM*1
phenom ~~ 1*phenom
boredom ~~ 1*boredom
frustration ~~ 1*frustration
anxiety ~~ 1*anxiety
abort ~~ 1*abort
bodyAcute ~~ 1*bodyAcute
bodyAfter ~~ 1*bodyAfter
sleep ~~ 1*sleep
# thresholds fixed at standard normal quantiles, e.g. qnorm(1/3) = -.431
boredom | -.431*t1 + .431*t2
frustration | -.431*t1 + .431*t2
anxiety | -.431*t1 + .431*t2
abort | -.431*t1 + .431*t2
bodyAcute | -.842*t1 + -.253*t2 + .253*t3 + .842*t4
bodyAfter | -.842*t1 + -.253*t2 + .253*t3 + .842*t4
sleep | -.842*t1 + -.253*t2 + .253*t3 + .842*t4
'
fit <- blavaan(spec, data=combinedData,
n.chains = 4,
ordered = AbsItemNames,
dp = dpriors(nu="normal(0,1)", alpha="normal(0,10)", lambda="normal(0,1)"),
group = "group",
std.lv=TRUE, allow.empty.cell = TRUE, test = "none")
summary(fit)
Okay, I'm not sure what's going on. I tried a standard normal prior, alpha="normal(0,1)". I also tried adding this code:
psiData <- ImportStudyItems(read.csv("psi.csv"))
pmData <- ImportStudyItems(read.csv("pm.csv"))
if (TRUE) {
  # drop incomplete rows, then pad the pm group with random responses so
  # that no item column is left with zero variance
  pmData <- pmData[complete.cases(pmData), ]
  extraRows <- 15
  df <- as.data.frame(lapply(pmData, function(x) {
    if (is.factor(x)) ordered(sample(levels(x), size=extraRows, replace=TRUE),
                              levels=levels(x))  # preserve the ordered levels
    else rep(1, extraRows)
  }))
  pmData <- rbind(pmData, df)
}
Why do I get a positive estimate for the phenom latent mean?
Estimate Post.SD pi.lower pi.upper Rhat Prior
phenm 1.934 0.338 1.333 2.638 1.001 normal(0,1)
I'm pretty sure it should be negative. Given the estimated factor loadings, you can easily eyeball that sapply(pmData, unclass) sits lower on the latent trait than sapply(psiData, unclass).
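For instance, a loading-weighted sum score makes the gap visible (a rough sketch that treats the integer response codes as numeric, which is only a heuristic; the loading values here are illustrative posterior means):

# illustrative loadings (rough posterior means from the fit)
L <- c(boredom=.70, frustration=.67, anxiety=.52, abort=-.40,
       bodyAcute=.31, bodyAfter=.36, sleep=.20)
score <- function(df) sapply(df[names(L)], unclass) %*% L
mean(score(pmData), na.rm=TRUE) - mean(score(psiData), na.rm=TRUE)  # expect < 0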
For example, if I fit the same model with OpenMx then the pm group mean is very negative.
library(OpenMx)
mxOption(key='Number of Threads', value=parallel::detectCores())
NoYesLevels <- c('No','Not sure','Yes')
EuphoricLevels <- c('Euphoric', 'Mildly euphoric', 'No change', 'Mild discomfort', 'Severe discomfort')
ImpactLevels <- c('Helped','Slightly helped','No impact','Slightly hindered','Hindered')
AbsItemNames <- c('boredom','frustration','anxiety', 'abort','bodyAcute','bodyAfter','sleep')
ImportStudyItems <- function(df) {
if (any(is.na(match(AbsItemNames, colnames(df))))) stop("Items are missing")
df[['boredom']] <- ordered(df[['boredom']], levels=NoYesLevels)
df[['frustration']] <- ordered(df[['frustration']], levels=NoYesLevels)
df[['anxiety']] <- ordered(df[['anxiety']], levels=NoYesLevels)
df[['abort']] <- ordered(df[['abort']], levels=NoYesLevels)
df[['bodyAcute']] <- ordered(df[['bodyAcute']], levels=EuphoricLevels)
df[['bodyAfter']] <- ordered(df[['bodyAfter']], levels=EuphoricLevels)
df[['sleep']] <- ordered(df[['sleep']], levels=ImpactLevels)
df
}
psiData <- ImportStudyItems(read.csv("psi.csv"))
pmData <- ImportStudyItems(read.csv("pm.csv"))
sg1 <- mxModel(
name="substance", type="RAM",
manifestVars = AbsItemNames, latentVars = "phenom",
# one threshold fewer than the number of levels for each ordinal item
mxThreshold(AbsItemNames, nThresh = sapply(psiData, nlevels)[AbsItemNames] - 1L),
mxPath("one", AbsItemNames, labels=paste0(AbsItemNames, 'M')),
mxPath(AbsItemNames, arrows=2, values=1, free=FALSE),  # residual variances fixed to 1
mxPath('phenom', AbsItemNames, labels=paste0(AbsItemNames, 'L')),
mxPath('phenom', arrows=2, values=1, free=FALSE))  # latent variance fixed to 1
# identify the loading signs by bounding the boredom loading at zero
sg1$A$lbound['boredom','phenom'] <- 0
psiG <- mxModel(sg1, name="psi", mxData(psiData, type="raw"))
pmG <- mxModel(sg1, name="pm", mxData(pmData, type="raw"),
mxPath('one', 'phenom', labels = 'pmM', lbound=-10,ubound=10))
baseModel <- mxModel("base", psiG, pmG,
mxFitFunctionMultigroup(c("psi", "pm")), mxCI('pmM'))
baseModel <- mxOption(baseModel, 'Calculate Hessian','No')
baseModel <- mxOption(baseModel, 'Standard Errors','No')
fit1 <- mxRun(baseModel) # , intervals = TRUE
summary(fit1)
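The pm latent mean can then be read straight off the fitted model, e.g.:

omxGetParameters(fit1)['pmM']  # comes out strongly negative on this data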
I wrote some code to fit the Stan model and OpenMx to the same data. The first three columns (N, insight, rep) are data-simulation parameters. The Rhat and ess columns are Stan diagnostics. The cor column is the correlation between the Stan and OpenMx parameter estimates. The pmL and pmU columns are the 0.95 HPD interval for the pm latent mean from Stan.
As you can see, cor is often above 0.95. This is mysterious: if the Stan model weren't working, we'd expect it to fail consistently, yet the parameter estimates often match very well. Due to how I simulate the data, pmL and pmU should always be negative.
N insight rep d Rhat ess cor pmL pmU
1 40 -2.0 1 0 1.01 760.64 -0.943 3.20 5.48
2 30 -2.0 1 0 1.01 973.78 0.968 -4.80 -2.56
3 40 -1.5 1 0 1.00 1004.68 0.978 -5.05 -2.89
4 30 -1.5 1 0 1.00 828.98 0.973 -5.06 -2.82
5 40 -1.0 1 0 1.53 7.17 0.848 -4.19 3.64
6 30 -1.0 1 0 1.01 1263.20 -0.854 1.58 3.55
7 40 -0.5 1 0 1.00 1221.60 0.925 -3.00 -1.26
8 30 -0.5 1 0 1.53 7.19 0.790 -2.89 2.56
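For what it's worth, the cor column comes from something along these lines (a sketch; the label matching via intersect is hypothetical, since in practice the two packages' parameter labels need manual alignment before correlating):

stanEst <- blavInspect(fit, "postmean")  # posterior means of free parameters
mxEst   <- omxGetParameters(fit1)        # maximum likelihood estimates
# hypothetical alignment: real labels differ between the two packages
shared <- intersect(names(stanEst), names(mxEst))
cor(as.numeric(stanEst[shared]), as.numeric(mxEst[shared]))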
I wonder if there is something wonky about how you constrain the factor loadings to be positive? Is the sign-fixing adjustment applied to both groups?
The thing that constrains a factor loading to be positive for identification is only used if you add the argument std.lv = TRUE. Right now, I believe the signs of the factor loadings become negative, which makes your latent mean positive.
But when I add std.lv = TRUE to your model estimation, the model has trouble converging. This makes me wonder whether the small sample size leads to a model that is weakly identified, which becomes problematic for certain datasets. When I changed the lambda prior to normal(1,.1), I get convergence and the latent mean is in the correct direction. Maybe that prior is too strong, but I would at least use a prior that places most density above 0, because we have sign indeterminacy here.
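Concretely, that would be something like this (your call from above, with only the lambda prior changed):

fit <- blavaan(spec, data=combinedData,
n.chains = 4,
ordered = AbsItemNames,
dp = dpriors(nu="normal(0,1)", alpha="normal(0,10)",
             lambda="normal(1,.1)"),  # most prior mass above 0
group = "group",
std.lv=TRUE, allow.empty.cell = TRUE, test = "none")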
"But when I add std.lv = TRUE to your model estimation, the model has trouble converging."
How so? I consistently get a maximum R-hat of 1.0044 and a minimum ESS of 920.2.
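(I'm checking those with something like:)

max(blavInspect(fit, "psrf"), na.rm=TRUE)  # maximum R-hat
min(blavInspect(fit, "neff"), na.rm=TRUE)  # minimum effective sample size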
"But when I add std.lv = TRUE"
Add it? All the code I posted in this bug already has std.lv=TRUE.
"Right now, I believe the signs of the factor loadings become negative"
Hm?
Group 2 [pm]:
Latent Variables:
Estimate Post.SD pi.lower pi.upper Rhat Prior
phenom =~
bordm (brdL) 0.699 0.174 0.421 1.098 1.002
frstr (frsL) 0.674 0.163 0.409 1.043 1.005
anxty (anxL) 0.520 0.122 0.314 0.789 1.002
abort (abrL) -0.404 0.093 -0.611 -0.245 1.001
bdyAc (bdyAcL) 0.305 0.071 0.177 0.455 0.999
bdyAf (bdyAfL) 0.362 0.079 0.223 0.533 1.000
sleep (slpL) 0.195 0.054 0.095 0.301 1.000
Intercepts:
Estimate Post.SD pi.lower pi.upper Rhat Prior
phenm 4.318 0.572 3.273 5.491 1.001 normal(0,1)
You mean with std.lv=FALSE? Sure, it could happen that most loadings get a negative sign and the group mean becomes positive. Gosh, I'd really prefer to keep std.lv=TRUE and be able to trust that the reported signs are consistent across datasets.
"But when I add std.lv = TRUE to your model estimation, the model has trouble converging."
Why should std.lv=FALSE make any difference to estimation? If I remember your paper accurately, the standardization was purely a post-processing step to make the R-hats work, etc. That's how I implemented it in the pcFactorStan package.
I was talking about std.lv=TRUE as opposed to FALSE; I just got your model from this issue confused with the model from the earlier issue (which did not have std.lv=TRUE). I will need to look at it some more...
Ok, I managed to look some more, and I think the problem is related to wonky factor loading constraints. When I flip the signs of the loadings in generated quantities, I neglected the factor means. So the model is really estimating negative loadings, which leads to a positive factor mean; I then flip the sign of the loadings but forget to flip the factor mean.
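In post-processing terms, the fix amounts to something like this (an R sketch of the generated-quantities logic, not the actual blavaan code):

# per posterior draw: if the loadings come out mostly negative, flip the
# loadings AND the free latent means together so the two stay consistent
flip <- function(lambda, alpha) {
  if (mean(lambda) < 0) {
    lambda <- -lambda
    alpha  <- -alpha  # this is the part that was being forgotten
  }
  list(lambda = lambda, alpha = alpha)
}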
Glad you are getting close to a fix!
As far as I can tell, this might be the last problem with the model.