easystats/parameters

Wrong header when using identity-link with GLMs

snhansen opened this issue · 11 comments

When using the identity-link with glm the header says "Log-Risk" but should just be "Risk".

glm(am ~ vs, data = mtcars, family = binomial(link = "identity")) |>
  parameters()

# Parameter   | Log-Risk |   SE |        95% CI |    z |     p
# ------------------------------------------------------------
# (Intercept) |     0.33 | 0.11 | [ 0.15, 0.56] | 3.00 | 0.003
# vs          |     0.17 | 0.17 | [-0.17, 0.49] | 0.96 | 0.338

I'm not sure, is "Risk" or "Probability" the better term here? See chapter 11.2.3.4.1 in https://bookdown.org/marklhc/notes_bookdown/generalized-linear-models.html.

I'm not sure, is "Risk" or "Probability" the better term here?

"Probability" would definitely apply in all situations, whereas "risk" only applies to outcomes we'd consider adverse. But you'd then also have to rename "Log-Risk" and any other instances of "risk" to stay consistent.

binomial(link = "log") returns "Log-Risk", or "Risk Ratio" when exponentiate = TRUE, because using a log-link indeed returns risk ratios (and not probabilities, as binomial(link = "identity") does). BTW, I didn't know of that option before; it's also not documented in ?family.

All GLMs of the binomial family model probabilities. So binomial(link = "identity") returns a (baseline) probability for the intercept and differences in probabilities for all other parameters. We can call these risks and risk differences if we'd like. In the same spirit, binomial(link = "log") returns a (baseline) log-probability for the intercept and differences in log-probabilities for all other parameters. We can call these log-probabilities or log-risks if we'd like. If we exponentiate the coefficients, we get a probability/risk for the intercept and a probability ratio/risk ratio for the other parameters.

So if you want consistency you'd use probability everywhere or risk everywhere.
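A minimal sketch to illustrate the point (using the mtcars model from the original report; variable names p0, p1, r0, rr are just illustrative): both links describe the same two group probabilities, the identity link as a baseline probability plus a difference, the log link as a log-probability plus a log-ratio.

```r
# Fit the same saturated model with both links
m_ident <- glm(am ~ vs, data = mtcars, family = binomial(link = "identity"))
m_log   <- glm(am ~ vs, data = mtcars, family = binomial(link = "log"))

# Identity link: intercept is the baseline probability, slope the difference
p0 <- coef(m_ident)[["(Intercept)"]]  # P(am = 1 | vs = 0)
p1 <- p0 + coef(m_ident)[["vs"]]      # P(am = 1 | vs = 1)

# Log link: exponentiated intercept is the baseline probability (risk),
# exponentiated slope is the risk ratio P(am = 1 | vs = 1) / P(am = 1 | vs = 0)
r0 <- exp(coef(m_log)[["(Intercept)"]])
rr <- exp(coef(m_log)[["vs"]])

all.equal(p1 / p0, rr)  # same risk ratio recovered from both models
```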

The identity link is the default for the Gaussian family which models mean differences. It's documented here: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/family

The identity link is the default for the Gaussian family

Yes, but it's not documented for the binomial family:

the binomial family the links logit, probit, cauchit, (corresponding to logistic, normal and Cauchy CDFs respectively) log and cloglog (complementary log-log);

All GLMs of the binomial family model probabilities.

Yes, and the predictions on the response-scale are probabilities, but the coefficients are not. binomial(link = "identity") seems to return differences-in-probabilities, while binomial(link = "log") returns log-risks (or risk ratios, when exponentiated). You can see this from the coefficients: in m1, the coefficients sum up to the predicted probabilities; in m2, -1.6275 cannot be a difference in probabilities, because it is (in absolute terms) larger than 1 (100%).

m1 <- glm(am ~ as.factor(cyl), data = mtcars, family = binomial(link = "identity"))
m2 <- glm(am ~ as.factor(cyl), data = mtcars, family = binomial(link = "log"))

summary(m1)
#> 
#> Call:
#> glm(formula = am ~ as.factor(cyl), family = binomial(link = "identity"), 
#>     data = mtcars)
#> 
#> Coefficients:
#>                 Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)       0.7273     0.1343   5.416 6.09e-08 ***
#> as.factor(cyl)6  -0.2987     0.2303  -1.297 0.194539    
#> as.factor(cyl)8  -0.5844     0.1636  -3.571 0.000355 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 43.230  on 31  degrees of freedom
#> Residual deviance: 33.935  on 29  degrees of freedom
#> AIC: 39.935
#> 
#> Number of Fisher Scoring iterations: 2
summary(m2)
#> 
#> Call:
#> glm(formula = am ~ as.factor(cyl), family = binomial(link = "log"), 
#>     data = mtcars)
#> 
#> Coefficients:
#>                 Estimate Std. Error z value Pr(>|z|)  
#> (Intercept)      -0.3185     0.1846  -1.725   0.0846 .
#> as.factor(cyl)6  -0.5288     0.4739  -1.116   0.2644  
#> as.factor(cyl)8  -1.6275     0.6802  -2.393   0.0167 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 43.230  on 31  degrees of freedom
#> Residual deviance: 33.935  on 29  degrees of freedom
#> AIC: 39.935
#> 
#> Number of Fisher Scoring iterations: 6

d <- ggeffects::data_grid(m1, "cyl")

predict(m1, newdata = d, type = "response")
#>         1         2         3 
#> 0.7272727 0.4285714 0.1428571
predict(m2, newdata = d, type = "response")
#>         1         2         3 
#> 0.7272727 0.4285714 0.1428571

Created on 2023-09-10 with reprex v2.0.2

but the coefficients are not. binomial(link = "identity") seems to return differences-in-probabilities, while binomial(link = "log") returns log-risks (or risk ratios, when exponentiated).

binomial(link = "log") returns differences-in-log-risks or, equivalently, the log of the risk ratio (except for the intercept, which is a log-risk). So if you were to be consistent with the current naming scheme, it should be

# parameters(m1): Risk difference
# parameters(m2): Log-risk difference / Log risk ratio
# parameters(m2, exponentiate = TRUE): Risk ratio

with log-risk difference meaning a difference in log-risks, and not the log of a risk difference.
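A quick sketch to verify that naming (reusing the m2 model from the reprex above): exponentiating the cyl-8 coefficient yields the ratio of the two group probabilities, i.e. a risk ratio, not the exponential of a risk difference.

```r
# Log-link model from the thread
m2 <- glm(am ~ as.factor(cyl), data = mtcars, family = binomial(link = "log"))

# Group probabilities on the response scale
d <- data.frame(cyl = c(4, 6, 8))
p <- predict(m2, newdata = d, type = "response")

exp(coef(m2)[["as.factor(cyl)8"]])  # exponentiated log-risk difference ...
p[[3]] / p[[1]]                     # ... equals the risk ratio P(cyl = 8) / P(cyl = 4)
```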

And very odd indeed that the documentation doesn't mention the identity-link for the binomial family!

Sounds reasonable. I'll go with "Risk" then.

One problem, in general, seems to be that the SEs are also on the probability-scale, so the CIs are not accurate. Would you suggest transforming the SEs back onto the link-scale, calculating the CIs there, and then transforming those back to the response-scale?

Yes, CIs should be calculated on the link-scale and then transformed to the response-scale. Whether to report the SE on the link-scale or the response-scale, I'm not sure. I know that Stata's lincom function reports the SE on the response-scale, although tests and confidence intervals are based on the SE on the link-scale. That's probably the most reasonable approach, although I don't see many use cases for the SE on the response-scale.
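For reference, a minimal sketch of that approach for the log-link model above (plain Wald intervals; note that model_parameters may use different CI methods, e.g. profile-likelihood, by default):

```r
m2 <- glm(am ~ as.factor(cyl), data = mtcars, family = binomial(link = "log"))

est <- coef(m2)
se  <- sqrt(diag(vcov(m2)))  # SEs on the link (log) scale

# Wald CIs computed on the link scale ...
ci_link <- cbind(low  = est - qnorm(0.975) * se,
                 high = est + qnorm(0.975) * se)

# ... then back-transformed: bounds for the baseline risk and the risk ratios,
# guaranteed to be positive
exp(ci_link)
```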

The "linear"-like models have negative lower CI bounds for the predicted probabilities, but maybe that's an inaccuracy that can't be handled inside the parameters package.

m1 <- glm(am ~ as.factor(cyl), data = mtcars, family = binomial(link = "identity"))
m2 <- glm(am ~ as.factor(cyl), data = mtcars, family = binomial(link = "log"))
m3 <- lm(am ~ as.factor(cyl), data = mtcars)

parameters::model_parameters(m1)
#> Parameter   | Log-Risk |   SE |         95% CI |     z |      p
#> ---------------------------------------------------------------
#> (Intercept) |     0.73 | 0.13 | [ 0.46,  0.99] |  5.42 | < .001
#> cyl [6]     |    -0.30 | 0.23 | [-0.75,  0.15] | -1.30 | 0.195 
#> cyl [8]     |    -0.58 | 0.16 | [-0.91, -0.26] | -3.57 | < .001
#> 
#> Uncertainty intervals (profile-likelihood) and p-values (two-tailed)
#>   computed using a Wald z-distribution approximation.

ggeffects::ggpredict(m1, "cyl")
#> # Predicted probabilities of am
#> 
#> cyl | Predicted |        95% CI
#> -------------------------------
#>   4 |      0.73 | [ 0.46, 0.99]
#>   6 |      0.43 | [ 0.06, 0.80]
#>   8 |      0.14 | [-0.04, 0.33]

parameters::model_parameters(m2)
#> Parameter   | Log-Risk |   SE |         95% CI |     z |     p
#> --------------------------------------------------------------
#> (Intercept) |    -0.32 | 0.18 | [-0.83, -0.08] | -1.72 | 0.085
#> cyl [6]     |    -0.53 | 0.47 | [-1.77,  0.27] | -1.12 | 0.264
#> cyl [8]     |    -1.63 | 0.68 | [-3.38, -0.53] | -2.39 | 0.017
#> 
#> Uncertainty intervals (profile-likelihood) and p-values (two-tailed)
#>   computed using a Wald z-distribution approximation.
#> 
#> The model has a log- or logit-link. Consider using `exponentiate =
#>   TRUE` to interpret coefficients as ratios.

ggeffects::ggpredict(m2, "cyl")
#> # Predicted probabilities of am
#> 
#> cyl | Predicted |       95% CI
#> ------------------------------
#>   4 |      0.73 | [0.51, 1.04]
#>   6 |      0.43 | [0.18, 1.01]
#>   8 |      0.14 | [0.04, 0.52]

parameters::model_parameters(m3)
#> Parameter   | Coefficient |   SE |         95% CI | t(29) |      p
#> ------------------------------------------------------------------
#> (Intercept) |        0.73 | 0.13 | [ 0.46,  1.00] |  5.48 | < .001
#> cyl [6]     |       -0.30 | 0.21 | [-0.73,  0.14] | -1.40 | 0.171 
#> cyl [8]     |       -0.58 | 0.18 | [-0.95, -0.22] | -3.30 | 0.003
#> 
#> Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
#>   using a Wald t-distribution approximation.

ggeffects::ggpredict(m3, "cyl")
#> # Predicted values of am
#> 
#> cyl | Predicted |        95% CI
#> -------------------------------
#>   4 |      0.73 | [ 0.46, 1.00]
#>   6 |      0.43 | [ 0.09, 0.77]
#>   8 |      0.14 | [-0.10, 0.38]

Created on 2023-09-11 with reprex v2.0.2

Yeah, that's because the confidence intervals are based on asymptotic theory, so small datasets can result in this behavior. It's the same issue as predicted probabilities exceeding 1 for family = binomial(link = "log"). This is not something parameters should deal with in any way, imo.