Two-outcome dependent variables
Closed this issue · 2 comments
Hi Manuel,
Thanks for putting this package together! I had a question about an error I received when using the mnl_pred_ova function: "Please supply a dataset with a dependent variable that has a sufficient number of outcomes (> 2)."
I'm using this package to work with covariate predictors of class membership in latent class analysis, so the dependent variable in the multinomial logistic regression is the predicted latent class for each observation. For my purposes, the number of classes is generally low (between 2 and 4), so the dependent variable fairly often has only two outcomes.
I've been able to get mnl_pred_ova to work nicely for 3-class models, but I get the above error with a 2-class model, which makes sense, since there are indeed only two outcomes for the DV. My question is: is there a way for me to use this package's methods with a two-outcome dependent variable, or is that a structural impossibility? I'm a programmer by trade and don't have the strongest grasp of the underlying statistical methods – I apologize if this is a stupid question!
Best,
Alex
Hi Alex,
no worries: not a stupid question :)
In short: if there are just two outcomes in the dependent variable, it is a simple logistic regression model. You can use glm() with a binomial link function and other visualization packages (for example, sjPlot).
A multinomial logit estimates the odds of choosing each of a set of options over a reference category. If there is only one other option to choose from, the "multinomial" part is no longer necessary, so a model that estimates the choice of one option over the other is sufficient. That is exactly what any binomial model, such as a logit regression, does.
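To see the equivalence concretely, here is a tiny sketch (the value of eta is just an arbitrary example) showing that the multinomial (softmax) probability with a single non-reference outcome reduces to the plain logistic function:

```r
# With two outcomes, the softmax probability of outcome "b" over
# reference "a" (whose linear predictor is fixed at 0) is exactly
# the logistic function. eta is an arbitrary example value.
eta <- 0.5
p_softmax  <- exp(eta) / (exp(0) + exp(eta))  # multinomial form
p_logistic <- plogis(eta)                     # logistic form, 1 / (1 + exp(-eta))
all.equal(p_softmax, p_logistic)  # TRUE
```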
(Tbh, I didn't know that multinom() allows fewer than three outcomes.)
But here is a short MWE to demonstrate what I mean. This is in essence what a multinomial model does:
library(nnet)
# Three-outcome dependent variable and two covariates
ya <- rep(c("a", "b", "c"), each = 100)
x1a <- rnorm(300)
x2a <- rnorm(300, mean = 2)
data_a <- data.frame(ya, x1a, x2a)
mod1 <- multinom(ya ~ x1a + x2a, data = data_a)
summary(mod1)
This yields:
> summary(mod1)
Call:
multinom(formula = ya ~ x1a + x2a, data = data_a)
Coefficients:
Values Std. Err.
(Intercept) -4.972273e-02 0.2759537
x1a -4.010516e-07 0.1099197
x2a 2.453166e-02 0.1236715
Residual Deviance: 415.8488
AIC: 421.8488
Now, let's try it with two outcomes:
yb <- rep(c("a", "b"), each = 150) # Once as a string
y_log <- rep(c(0, 1), each = 150) # Once as a numeric for the logit-model
x1b <- rnorm(300)
x2b <- rnorm(300, mean = 2)
data_b <- data.frame(yb, y_log, x1b, x2b)
mod2 <- multinom(yb ~ x1b + x2b, data = data_b)
summary(mod2)
This yields the following results:
> summary(mod2)
Call:
multinom(formula = yb ~ x1b + x2b, data = data_b)
Coefficients:
Values Std. Err.
(Intercept) 0.05457562 0.2714678
x1b 0.28130618 0.1235326
x2b -0.02662859 0.1219669
Residual Deviance: 410.461
AIC: 416.461
If we now estimate it with glm() and a binomial link function...
mod2_log <- glm(y_log ~ x1b + x2b,
data = data_b,
family = binomial(link = "logit"))
summary(mod2_log)
... the results are identical:
> summary(mod2_log)
Call:
glm(formula = y_log ~ x1b + x2b, family = binomial(link = "logit"),
data = data_b)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.41358 -1.16156 -0.02177 1.17331 1.45432
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.05458 0.27147 0.201 0.8407
x1b 0.28131 0.12353 2.277 0.0228 *
x2b -0.02663 0.12197 -0.218 0.8272
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 415.89 on 299 degrees of freedom
Residual deviance: 410.46 on 297 degrees of freedom
AIC: 416.46
Number of Fisher Scoring iterations: 4
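And if you need predicted-probability curves from the binary model (analogous in spirit to what mnl_pred_ova() produces), base R's predict() is enough. A self-contained sketch with simulated data along the lines of the MWE above (variable names and the data-generating process are just illustrative):

```r
# Observed-value predicted probabilities from a binary logit,
# roughly analogous in spirit to mnl_pred_ova(); data are simulated.
set.seed(42)
x1  <- rnorm(300)
x2  <- rnorm(300, mean = 2)
y   <- rbinom(300, 1, plogis(0.5 * x1))
dat <- data.frame(y, x1, x2)
mod <- glm(y ~ x1 + x2, data = dat, family = binomial(link = "logit"))

# Average the predicted probability over all observations while
# fixing x1 at each value of a sequence across its observed range.
x1_seq <- seq(min(dat$x1), max(dat$x1), length.out = 25)
pp <- vapply(x1_seq, function(v) {
  nd <- dat
  nd$x1 <- v
  mean(predict(mod, newdata = nd, type = "response"))
}, numeric(1))

plot(x1_seq, pp, type = "l",
     xlab = "x1", ylab = "Mean predicted probability of y = 1")
```

This averages over the observed values of the other covariates instead of holding them at their means, which is the same general idea the package implements for the multinomial case (without the simulation-based confidence intervals).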
Ahh, of course, that makes sense! Thanks so much for the detailed response, and the package suggestion. Combining the two approaches (glm() for 2-class models, mnl_pred_ova() for models with 3 or more classes) should yield all the covariate data I need!