easystats/parameters

Wrong header when using identity-link with GLMs

snhansen opened this issue · 11 comments

When using the identity-link with glm the header says "Log-Risk" but should just be "Risk".

glm(am ~ vs, data = mtcars, family = binomial(link = "identity")) |>
  parameters()

# Parameter   | Log-Risk |   SE |        95% CI |    z |     p
# ------------------------------------------------------------
# (Intercept) |     0.33 | 0.11 | [ 0.15, 0.56] | 3.00 | 0.003
# vs          |     0.17 | 0.17 | [-0.17, 0.49] | 0.96 | 0.338

I'm not sure, is "Risk" or "Probability" the better term here? See chapter 11.2.3.4.1 in https://bookdown.org/marklhc/notes_bookdown/generalized-linear-models.html.

I'm not sure, is "Risk" or "Probability" the better term here?

"Probability" would definitely apply in all situations, whereas "risk" only applies to outcomes we'd consider adverse. But you'd then also have to rename "Log-Risk" and any other instances of "risk" to stay consistent.

binomial(link = "log") returns "Log-Risk", or "Risk Ratio" when exponentiate = TRUE, because using a log-link indeed returns risk ratios (and not probabilities, as binomial(link = "identity") does). BTW, I didn't know of that option before; it's also not documented in ?family.

All GLMs of the binomial family model probabilities. So binomial(link = "identity") returns a (baseline) probability for the intercept and differences in probabilities for all other parameters. We can call these risks and risk differences if we'd like. In the same spirit, binomial(link = "log") returns a (baseline) log-probability for the intercept and differences in log-probabilities for all other parameters. We can call these log-probabilities or log-risks if we'd like. If we exponentiate the coefficients, we get a probability/risk for the intercept and a probability ratio/risk ratio for the other parameters.

So if you want consistency you'd use probability everywhere or risk everywhere.
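A minimal sketch to illustrate the point (using the mtcars model from the original report; variable names p0, p1, r0, rr are just illustrative): both links describe the same two group probabilities, the identity link as a baseline probability plus a difference, the log link as a log-probability plus a log-ratio.

```r
# Fit the same saturated model with both links
m_ident <- glm(am ~ vs, data = mtcars, family = binomial(link = "identity"))
m_log   <- glm(am ~ vs, data = mtcars, family = binomial(link = "log"))

# Identity link: intercept is the baseline probability, slope the difference
p0 <- coef(m_ident)[["(Intercept)"]]  # P(am = 1 | vs = 0)
p1 <- p0 + coef(m_ident)[["vs"]]      # P(am = 1 | vs = 1)

# Log link: exponentiated intercept is the baseline probability (risk),
# exponentiated slope is the risk ratio P(am = 1 | vs = 1) / P(am = 1 | vs = 0)
r0 <- exp(coef(m_log)[["(Intercept)"]])
rr <- exp(coef(m_log)[["vs"]])

all.equal(p1 / p0, rr)  # same risk ratio recovered from both models
```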

The identity link is the default for the Gaussian family which models mean differences. It's documented here: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/family

The identity link is the default for the Gaussian family

Yes, but it's not documented for the binomial family:

the binomial family the links logit, probit, cauchit, (corresponding to logistic, normal and Cauchy CDFs respectively) log and cloglog (complementary log-log);

All GLMs of the binomial family model probabilities.

Yes, and the predictions on the response-scale are probabilities, but the coefficients are not. binomial(link = "identity") seems to return differences-in-probabilities, while binomial(link = "log") returns log-risks (or risk ratios, when exponentiated). You can see this from the coefficients: in m1, the coefficients sum up to the predicted probabilities; in m2, -1.6275 cannot be a difference in probabilities, because it is (in absolute terms) larger than 1 (100%).

m1 <- glm(am ~ as.factor(cyl), data = mtcars, family = binomial(link = "identity"))
m2 <- glm(am ~ as.factor(cyl), data = mtcars, family = binomial(link = "log"))

summary(m1)
#> 
#> Call:
#> glm(formula = am ~ as.factor(cyl), family = binomial(link = "identity"), 
#>     data = mtcars)
#> 
#> Coefficients:
#>                 Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)       0.7273     0.1343   5.416 6.09e-08 ***
#> as.factor(cyl)6  -0.2987     0.2303  -1.297 0.194539    
#> as.factor(cyl)8  -0.5844     0.1636  -3.571 0.000355 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 43.230  on 31  degrees of freedom
#> Residual deviance: 33.935  on 29  degrees of freedom
#> AIC: 39.935
#> 
#> Number of Fisher Scoring iterations: 2
summary(m2)
#> 
#> Call:
#> glm(formula = am ~ as.factor(cyl), family = binomial(link = "log"), 
#>     data = mtcars)
#> 
#> Coefficients:
#>                 Estimate Std. Error z value Pr(>|z|)  
#> (Intercept)      -0.3185     0.1846  -1.725   0.0846 .
#> as.factor(cyl)6  -0.5288     0.4739  -1.116   0.2644  
#> as.factor(cyl)8  -1.6275     0.6802  -2.393   0.0167 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 43.230  on 31  degrees of freedom
#> Residual deviance: 33.935  on 29  degrees of freedom
#> AIC: 39.935
#> 
#> Number of Fisher Scoring iterations: 6

d <- ggeffects::data_grid(m1, "cyl")

predict(m1, newdata = d, type = "response")
#>         1         2         3 
#> 0.7272727 0.4285714 0.1428571
predict(m2, newdata = d, type = "response")
#>         1         2         3 
#> 0.7272727 0.4285714 0.1428571

Created on 2023-09-10 with reprex v2.0.2

but the coefficients are not. binomial(link = "identity") seems to return differences-in-probabilities, while binomial(link = "log") returns log-risks (or risk ratios, when exponentiated).

binomial(link = "log") returns differences-in-log-risks or, equivalently, the log of the risk ratio (except for the intercept, which is a log-risk). So if you were to be consistent with the current naming scheme, it should be

# parameters(m1): Risk difference
# parameters(m2): Log-risk difference / Log risk ratio
# parameters(m2, exponentiate = TRUE): Risk ratio

with log-risk difference meaning a difference in log-risks, and not the log of a risk difference.
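A quick sketch to verify that naming (reusing the m2 model from the reprex above): exponentiating the cyl-8 coefficient yields the ratio of the two group probabilities, i.e. a risk ratio, not the exponential of a risk difference.

```r
# Log-link model from the thread
m2 <- glm(am ~ as.factor(cyl), data = mtcars, family = binomial(link = "log"))

# Group probabilities on the response scale
d <- data.frame(cyl = c(4, 6, 8))
p <- predict(m2, newdata = d, type = "response")

exp(coef(m2)[["as.factor(cyl)8"]])  # exponentiated log-risk difference ...
p[[3]] / p[[1]]                     # ... equals the risk ratio P(cyl = 8) / P(cyl = 4)
```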

And very odd indeed that the documentation doesn't mention the identity-link for the binomial family!

Sounds reasonable. I'll go with "Risk" then.

One problem, in general, seems to be that the SEs are also on the probability-scale, so the CIs are not accurate. Would you suggest transforming the SEs back onto the link-scale, calculating the CIs there, and then transforming those back to the response-scale?

Yes, CIs should be calculated on the link-scale and then transformed to the response-scale. Whether to report the SE on the link-scale or the response-scale, I'm not sure. I know that Stata's lincom function reports the SE on the response-scale, although tests and confidence intervals are based on the SE on the link-scale. That's probably the most reasonable approach, although I don't see many use cases for the SE on the response-scale.
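For reference, a minimal sketch of that approach for the log-link model above (plain Wald intervals; note that model_parameters may use different CI methods, e.g. profile-likelihood, by default):

```r
m2 <- glm(am ~ as.factor(cyl), data = mtcars, family = binomial(link = "log"))

est <- coef(m2)
se  <- sqrt(diag(vcov(m2)))  # SEs on the link (log) scale

# Wald CIs computed on the link scale ...
ci_link <- cbind(low  = est - qnorm(0.975) * se,
                 high = est + qnorm(0.975) * se)

# ... then back-transformed: bounds for the baseline risk and the risk ratios,
# guaranteed to be positive
exp(ci_link)
```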

The "linear"-like models have negative lower CI bounds for the predicted probabilities, but maybe that's an inaccuracy that can't be handled inside the parameters package.

m1 <- glm(am ~ as.factor(cyl), data = mtcars, family = binomial(link = "identity"))
m2 <- glm(am ~ as.factor(cyl), data = mtcars, family = binomial(link = "log"))
m3 <- lm(am ~ as.factor(cyl), data = mtcars)

parameters::model_parameters(m1)
#> Parameter   | Log-Risk |   SE |         95% CI |     z |      p
#> ---------------------------------------------------------------
#> (Intercept) |     0.73 | 0.13 | [ 0.46,  0.99] |  5.42 | < .001
#> cyl [6]     |    -0.30 | 0.23 | [-0.75,  0.15] | -1.30 | 0.195 
#> cyl [8]     |    -0.58 | 0.16 | [-0.91, -0.26] | -3.57 | < .001
#> 
#> Uncertainty intervals (profile-likelihood) and p-values (two-tailed)
#>   computed using a Wald z-distribution approximation.

ggeffects::ggpredict(m1, "cyl")
#> # Predicted probabilities of am
#> 
#> cyl | Predicted |        95% CI
#> -------------------------------
#>   4 |      0.73 | [ 0.46, 0.99]
#>   6 |      0.43 | [ 0.06, 0.80]
#>   8 |      0.14 | [-0.04, 0.33]

parameters::model_parameters(m2)
#> Parameter   | Log-Risk |   SE |         95% CI |     z |     p
#> --------------------------------------------------------------
#> (Intercept) |    -0.32 | 0.18 | [-0.83, -0.08] | -1.72 | 0.085
#> cyl [6]     |    -0.53 | 0.47 | [-1.77,  0.27] | -1.12 | 0.264
#> cyl [8]     |    -1.63 | 0.68 | [-3.38, -0.53] | -2.39 | 0.017
#> 
#> Uncertainty intervals (profile-likelihood) and p-values (two-tailed)
#>   computed using a Wald z-distribution approximation.
#> 
#> The model has a log- or logit-link. Consider using `exponentiate =
#>   TRUE` to interpret coefficients as ratios.

ggeffects::ggpredict(m2, "cyl")
#> # Predicted probabilities of am
#> 
#> cyl | Predicted |       95% CI
#> ------------------------------
#>   4 |      0.73 | [0.51, 1.04]
#>   6 |      0.43 | [0.18, 1.01]
#>   8 |      0.14 | [0.04, 0.52]

parameters::model_parameters(m3)
#> Parameter   | Coefficient |   SE |         95% CI | t(29) |      p
#> ------------------------------------------------------------------
#> (Intercept) |        0.73 | 0.13 | [ 0.46,  1.00] |  5.48 | < .001
#> cyl [6]     |       -0.30 | 0.21 | [-0.73,  0.14] | -1.40 | 0.171 
#> cyl [8]     |       -0.58 | 0.18 | [-0.95, -0.22] | -3.30 | 0.003
#> 
#> Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
#>   using a Wald t-distribution approximation.

ggeffects::ggpredict(m3, "cyl")
#> # Predicted values of am
#> 
#> cyl | Predicted |        95% CI
#> -------------------------------
#>   4 |      0.73 | [ 0.46, 1.00]
#>   6 |      0.43 | [ 0.09, 0.77]
#>   8 |      0.14 | [-0.10, 0.38]

Created on 2023-09-11 with reprex v2.0.2

Yeah, that's because the confidence intervals are based on asymptotic theory, so small datasets can result in this behavior. It's the same issue as predicted probabilities exceeding 1 for family = binomial(link = "log"). This is not something parameters should deal with in any way, imo.