Wrong header when using identity-link with GLMs
snhansen opened this issue · 11 comments
When using the identity-link with glm
the header says "Log-Risk" but should just be "Risk".
glm(am ~ vs, data = mtcars, family = binomial(link = "identity")) |>
parameters()
# Parameter | Log-Risk | SE | 95% CI | z | p
# ------------------------------------------------------------
# (Intercept) | 0.33 | 0.11 | [ 0.15, 0.56] | 3.00 | 0.003
# vs | 0.17 | 0.17 | [-0.17, 0.49] | 0.96 | 0.338
I'm not sure, is "Risk" or "Probability" the better term here? See chapter 11.2.3.4.1 in https://bookdown.org/marklhc/notes_bookdown/generalized-linear-models.html.
I'm not sure, is "Risk" or "Probability" the better term here?
Probability would definitely apply to all situations whereas "risk" would only apply to something we'd consider adverse. But you'd have to rename "Log-Risk" as well and any other instances of risk to be consistent then.
binomial(link = "log")
returns Log-Risk resp. Risk Ratio when exponentiate = TRUE
, because using a log-link indeed returns risk ratios (and not probabilities, as binomial(link = "identity")
does). BTW, I didn't know of that option before, it's also not documented in ?family
.
All GLMs of the binomial family model probabilities. So binomial(link = "identity")
returns a (baseline) probability for the intercept and differences in probabilities for all other parameters. We can call these risks and risk differences if we'd like. In the same spirit, binomial(link = "log")
returns a (baseline) log-probability for the intercept and differences in log-probabilities for all other parameters. We can call these log-probabilities for log-risks if we'd like. If we exponentiate the coefficients, we get a probability/risk for the intercept and a probability ratio/risk ratio for the other parameters.
So if you want consistency you'd use probability everywhere or risk everywhere.
The identity link is the default for the Gaussian family which models mean differences. It's documented here: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/family
The identity link is the default for the Gaussian family
Yes, but it's not documented for the binomial family:
the binomial family the links logit, probit, cauchit, (corresponding to logistic, normal and Cauchy CDFs respectively) log and cloglog (complementary log-log);
All GLMs of the binomial family model probabilities.
Yes, and the predictions on the response-scale are probabilities, but the coefficients are not. binomial(link = "identity")
seems to return differences-in-probabilities, while binomial(link = "log")
returns log-risks (or risk ratios, when exponentiated). You see this from the coefficients. In m1
, the coefficients sum up to the predicted probabilities, in m2
, -1.6275
is no difference in probability, because it's (in asbolute terms) larger than 1 (100%).
m1 <- glm(am ~ as.factor(cyl), data = mtcars, family = binomial(link = "identity"))
m2 <- glm(am ~ as.factor(cyl), data = mtcars, family = binomial(link = "log"))
summary(m1)
#>
#> Call:
#> glm(formula = am ~ as.factor(cyl), family = binomial(link = "identity"),
#> data = mtcars)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.7273 0.1343 5.416 6.09e-08 ***
#> as.factor(cyl)6 -0.2987 0.2303 -1.297 0.194539
#> as.factor(cyl)8 -0.5844 0.1636 -3.571 0.000355 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 43.230 on 31 degrees of freedom
#> Residual deviance: 33.935 on 29 degrees of freedom
#> AIC: 39.935
#>
#> Number of Fisher Scoring iterations: 2
summary(m2)
#>
#> Call:
#> glm(formula = am ~ as.factor(cyl), family = binomial(link = "log"),
#> data = mtcars)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.3185 0.1846 -1.725 0.0846 .
#> as.factor(cyl)6 -0.5288 0.4739 -1.116 0.2644
#> as.factor(cyl)8 -1.6275 0.6802 -2.393 0.0167 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 43.230 on 31 degrees of freedom
#> Residual deviance: 33.935 on 29 degrees of freedom
#> AIC: 39.935
#>
#> Number of Fisher Scoring iterations: 6
d <- ggeffects::data_grid(m1, "cyl")
predict(m1, newdata = d, type = "response")
#> 1 2 3
#> 0.7272727 0.4285714 0.1428571
predict(m2, newdata = d, type = "response")
#> 1 2 3
#> 0.7272727 0.4285714 0.1428571
Created on 2023-09-10 with reprex v2.0.2
but the coefficients are not. binomial(link = "identity") seems to return differences-in-probabilities, while binomial(link = "log") returns log-risks (or risk ratios, when exponentiated).
binomial(link = "log")
returns differences-in-log-risks or equivalently log to the risk-ratio (except for the intercept which is a log-risk). So if you were to be consistent with the current naming scheme, it should be
# parameters(m1): Risk difference
# parameters(m2): Log-risk difference / Log risk ratio
# parameters(m2, exponentiate = TRUE): Risk ratio
with log-risk difference meaning a difference in log-risks, and not the log of a risk difference.
And very odd indeed that the documentation doesn't mention the identity-link for the binomial family!
Sounds reasonable. I'll go with Risk then
One problem, in general, seems to me that SEs are also on a probability-scale, thus the CI are not accurate. Would you suggest back-transforming the SE onto the link-scale, calculate CI and then back to response scale?
Yes, CIs should be calculated on the link-scale and transformed to the response scale. Whether to report the SE on the link-scale or response-scale, I'm not sure about. I know that Stata's lincom
function would report the SE on the response scale although tests and confidence intervals are based on the SE on the link-scale. I think this would probably be the most reasonable to do although I don't see many use cases for the SE on the response-scale though.
The "linear"-alike models have negative CIs for the predicted probabilities, but maybe that's an inaccuracy that can't be handled inside the parameters package.
m1 <- glm(am ~ as.factor(cyl), data = mtcars, family = binomial(link = "identity"))
m2 <- glm(am ~ as.factor(cyl), data = mtcars, family = binomial(link = "log"))
m3 <- lm(am ~ as.factor(cyl), data = mtcars)
parameters::model_parameters(m1)
#> Parameter | Log-Risk | SE | 95% CI | z | p
#> ---------------------------------------------------------------
#> (Intercept) | 0.73 | 0.13 | [ 0.46, 0.99] | 5.42 | < .001
#> cyl [6] | -0.30 | 0.23 | [-0.75, 0.15] | -1.30 | 0.195
#> cyl [8] | -0.58 | 0.16 | [-0.91, -0.26] | -3.57 | < .001
#>
#> Uncertainty intervals (profile-likelihood) and p-values (two-tailed)
#> computed using a Wald z-distribution approximation.
ggeffects::ggpredict(m1, "cyl")
#> # Predicted probabilities of am
#>
#> cyl | Predicted | 95% CI
#> -------------------------------
#> 4 | 0.73 | [ 0.46, 0.99]
#> 6 | 0.43 | [ 0.06, 0.80]
#> 8 | 0.14 | [-0.04, 0.33]
parameters::model_parameters(m2)
#> Parameter | Log-Risk | SE | 95% CI | z | p
#> --------------------------------------------------------------
#> (Intercept) | -0.32 | 0.18 | [-0.83, -0.08] | -1.72 | 0.085
#> cyl [6] | -0.53 | 0.47 | [-1.77, 0.27] | -1.12 | 0.264
#> cyl [8] | -1.63 | 0.68 | [-3.38, -0.53] | -2.39 | 0.017
#>
#> Uncertainty intervals (profile-likelihood) and p-values (two-tailed)
#> computed using a Wald z-distribution approximation.
#>
#> The model has a log- or logit-link. Consider using `exponentiate =
#> TRUE` to interpret coefficients as ratios.
ggeffects::ggpredict(m2, "cyl")
#> # Predicted probabilities of am
#>
#> cyl | Predicted | 95% CI
#> ------------------------------
#> 4 | 0.73 | [0.51, 1.04]
#> 6 | 0.43 | [0.18, 1.01]
#> 8 | 0.14 | [0.04, 0.52]
parameters::model_parameters(m3)
#> Parameter | Coefficient | SE | 95% CI | t(29) | p
#> ------------------------------------------------------------------
#> (Intercept) | 0.73 | 0.13 | [ 0.46, 1.00] | 5.48 | < .001
#> cyl [6] | -0.30 | 0.21 | [-0.73, 0.14] | -1.40 | 0.171
#> cyl [8] | -0.58 | 0.18 | [-0.95, -0.22] | -3.30 | 0.003
#>
#> Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
#> using a Wald t-distribution approximation.
ggeffects::ggpredict(m3, "cyl")
#> # Predicted values of am
#>
#> cyl | Predicted | 95% CI
#> -------------------------------
#> 4 | 0.73 | [ 0.46, 1.00]
#> 6 | 0.43 | [ 0.09, 0.77]
#> 8 | 0.14 | [-0.10, 0.38]
Created on 2023-09-11 with reprex v2.0.2
Yeah, that's because the confidence intervals are based on asymptotic theory, so small datasets can result in this behavior. This is the same issue with predicted probabilities being above 1 for family = binomial(link = "log")
. This is not something parameters should deal with in any way, imo.