ManuelNeumann/MNLpred

Two-outcome dependent variables


Hi Manuel,

Thanks for putting this package together! I had a question about an error I received when using the mnl_pred_ova function: "Please supply a dataset with a dependent variable that has a sufficient number of outcomes (> 2)."

I'm using this package to work with covariate predictors of class membership in latent class analysis, so the dependent variable in the multinomial logistic regression is the predicted latent class for each observation. For my purposes, the number of classes is generally low (between 2 and 4), so there are relatively frequent cases where the dependent variable has only two outcomes.

I've been able to get mnl_pred_ova to work nicely for 3-class models, but I get the above error with a 2-class model, which makes sense, since there are indeed only two outcomes for the DV. My question is: is there a way for me to use this package's methods with a two-outcome dependent variable, or is that a structural impossibility? I'm a programmer by trade, so I don't have the strongest grasp of the underlying statistical methods. I apologize if this is a stupid question!

Best,
Alex

Hi Alex,
no worries: not a stupid question :)

In short: if the dependent variable has just two outcomes, it is a simple logistic regression model. You can estimate it with glm() and a binomial link function, and visualize the predicted probabilities with other packages (for example, sjPlot).
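As a minimal sketch of that route (y, x1, x2, and my_data are placeholders for your own variables; plot_model() from sjPlot plots the predicted probabilities):

library(sjPlot)

# A plain binary logit instead of multinom()
mod_log <- glm(y ~ x1 + x2,
               data = my_data,
               family = binomial(link = "logit"))

# Predicted probabilities of class membership across the range of x1
plot_model(mod_log, type = "pred", terms = "x1")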

A multinomial logit estimates the odds of choosing each of a set of options over a reference category. If there is only one other option to choose from, the "multinomial" part is not necessary anymore: a model that estimates the choice of one option over the other is sufficient. This is exactly what any binomial model, such as a logistic regression, does.
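Written out, with J outcome categories and the first category as the reference, the multinomial logit is

\Pr(y = j \mid x) = \frac{\exp(x'\beta_j)}{1 + \sum_{k=2}^{J} \exp(x'\beta_k)}, \quad j = 2, \dots, J

With J = 2 there is only one non-reference category left, and this collapses to \Pr(y = 2 \mid x) = \exp(x'\beta_2) / (1 + \exp(x'\beta_2)), which is exactly the logistic regression model.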

(Tbh, I didn't know that multinom() allows fewer than three outcomes.)

But here is a short MWE to demonstrate what I mean. This is in essence what a multinomial model does:

library(nnet)

ya <- rep(c("a", "b", "c"), each = 100) # Three outcomes
x1a <- rnorm(300)
x2a <- rnorm(300, mean = 2)

data_a <- data.frame(ya, x1a, x2a)

mod1 <- multinom(ya ~ x1a + x2a,
                 data = data_a)

summary(mod1)

This yields one coefficient vector (plus standard errors) per non-reference outcome, here "b" and "c" relative to the reference "a". The exact numbers will differ from run to run, since the data are drawn randomly:

> summary(mod1)
Call:
multinom(formula = ya ~ x1a + x2a, data = data_a)

Coefficients:
  (Intercept) x1a x2a
b         ... ... ...
c         ... ... ...

Std. Errors:
  (Intercept) x1a x2a
b         ... ... ...
c         ... ... ...

Residual Deviance: ...
AIC: ...
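(As an aside, feeding this 3-class model into MNLpred should look something like the sketch below. Note that multinom() needs to be fitted with Hess = TRUE so that the simulation step has a Hessian to draw from; check ?mnl_pred_ova for the exact arguments in your installed version.)

library(MNLpred)

# Refit with the Hessian, which MNLpred needs to simulate coefficients
mod1_hess <- multinom(ya ~ x1a + x2a,
                      data = data_a,
                      Hess = TRUE)

# Observed-value predictions across the range of x1a
pred1 <- mnl_pred_ova(model = mod1_hess,
                      data = data_a,
                      x = "x1a",
                      by = 0.5,
                      nsim = 100)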

Now, let's try it with two outcomes:

yb <- rep(c("a", "b"), each = 150) # Once as a string
y_log <- rep(c(0, 1), each = 150) # Once as a numeric for the logit-model
x1b <- rnorm(300)
x2b <- rnorm(300, mean = 2)

data_b <- data.frame(yb, y_log, x1b, x2b)


mod2 <- multinom(yb ~ x1b + x2b,
                 data = data_b)

summary(mod2)

This yields the following results:

> summary(mod2)
Call:
multinom(formula = yb ~ x1b + x2b, data = data_b)

Coefficients:
                 Values Std. Err.
(Intercept)  0.05457562 0.2714678
x1b          0.28130618 0.1235326
x2b         -0.02662859 0.1219669

Residual Deviance: 410.461 
AIC: 416.461 

If we now estimate it with glm() and a binomial link function...

mod2_log <- glm(y_log ~ x1b + x2b,
                data = data_b,
                family = binomial(link = "logit"))

summary(mod2_log)

... the results are identical (up to the numerical precision of the two optimizers):

> summary(mod2_log)

Call:
glm(formula = y_log ~ x1b + x2b, family = binomial(link = "logit"), 
    data = data_b)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.41358  -1.16156  -0.02177   1.17331   1.45432  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  0.05458    0.27147   0.201   0.8407  
x1b          0.28131    0.12353   2.277   0.0228 *
x2b         -0.02663    0.12197  -0.218   0.8272  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 415.89  on 299  degrees of freedom
Residual deviance: 410.46  on 297  degrees of freedom
AIC: 416.46

Number of Fisher Scoring iterations: 4
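If you want to verify the equivalence programmatically rather than by eye, a quick check (the two functions use different optimizers, so allow a small numerical tolerance; this should return TRUE):

all.equal(coef(mod2), coef(mod2_log), tolerance = 1e-4)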

Ahh, of course, that makes sense! Thanks so much for the detailed response, and the package suggestion. Combining these two approaches for 2-class and 3-class or higher latent class models should yield all the covariate data I need!
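For anyone who finds this later, the dispatch I ended up with looks roughly like this (fit_class_model is just my own sketch of a helper, not part of MNLpred):

# Pick the model family based on the number of observed classes
fit_class_model <- function(df, formula, dv) {
  k <- length(unique(df[[dv]]))
  if (k == 2) {
    # Two classes: plain binary logit
    glm(formula, data = df, family = binomial(link = "logit"))
  } else {
    # Three or more classes: multinomial logit (Hess = TRUE for MNLpred)
    nnet::multinom(formula, data = df, Hess = TRUE)
  }
}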