/regressionmodels

Overview of modelling strategies and packages. Which model do I need for my data?

The Unlicense

Overview of R Modelling Packages

This is an overview of R packages and functions for fitting different types of regression models. For each type of response, the packages-and-functions column lists the functions for “simple” models first, followed by their mixed-model counterparts (if available and known).

This overview makes no claim to completeness. Rather, it shows the more commonly used packages; there are plenty of others, some of which might even perform better at the tasks mentioned here. If you are aware of such packages, or think that an important package or function is missing, please file an issue.

Modelling Packages

| Nature of Response | Example | Type of Regression | R package or function | Example Webpage | Bayesian with brms |
|---|---|---|---|---|---|
| Continuous | Quality of Life, linear scales | linear | `lm()` | | `brm(family = gaussian())` |
| | | | mixed: `lmer()`, `glmmTMB()` | | |
| Binary | Success yes/no | binary logistic | `glm(family=binomial)` | UCLA | `brm(family = binomial())` |
| | | | mixed: `glmer(*)`, `glmmTMB(*)` | | |
| Binary, weighted | Success yes/no, with weights | quasi-binary logistic | `glm(family=quasibinomial)` | | |
| | | | mixed: `glmmPQL(family="quasibinomial")` | | |
| Trials (or proportions of counts) | 20 successes out of 30 trials | logistic | `glm(cbind(successes, failures), family=binomial)` | Hadley’s notes | `brm(successes \| trials(total), family = binomial())` |
| | | | mixed: `glmer(*)`, `glmmTMB(*)` | | |
| Count data | Number of usage, counts of events | Poisson | `glm(family=poisson)` | UCLA | `brm(family = poisson())` |
| | | | mixed: `glmer(*)`, `glmmTMB(*)` | | |
| Count data, with excess zeros or overdispersion | Number of usage, counts of events (with higher variance than mean of response) | negative binomial | `glm.nb()` | UCLA | `brm(family = negbinomial())` |
| | | | mixed: `glmer.nb()`, `glmmTMB(family=nbinom)` | | |
| Count data with very many zeros (inflation) | see count data, but response is modelled as mixture of Bernoulli & Poisson distribution (two sources of zeros) | zero-inflated | `zeroinfl()` | UCLA | `brm(family = zero_inflated_poisson())` |
| | | | mixed: `glmmTMB(ziformula, family=poisson)` | | |
| Count data, with very many zeros (inflation) and overdispersion | Number of usage, counts of events (with higher variance than mean of response) | zero-inflated negative binomial | `zeroinfl(dist="negbin")` | UCLA | `brm(family = zero_inflated_negbinomial())` |
| | | | mixed: `glmmTMB(ziformula, family=nbinom)` | | |
| Count data, zero-truncated | see count data, but only for positive counts (hurdle component models zero-counts) | hurdle (Poisson) | `hurdle()` | UCLA | `brm(family = hurdle_poisson())` |
| | | | mixed: `glmmTMB(family=truncated_poisson)` | | |
| Count data, zero-truncated and overdispersion | see “Count data, zero-truncated”, but with higher variance than mean of response | hurdle (neg. binomial) | `vglm(family=posnegbinomial)` | UCLA | `brm(family = hurdle_negbinomial())` |
| | | | mixed: `glmmTMB(family=truncated_nbinom)` | | |
| Proportion / Ratio (without zero and one) | Percentages, proportion of continuous data | Beta (see note below) | `betareg()` | ouR data generation | `brm(family = Beta())` |
| | | | mixed: `glmmTMB(family=beta_family)` | | |
| Proportion / Ratio (including zero and one) | Percentages, proportions of continuous data | Beta-Binomial, zero-inflated Beta, ordered Beta (see note below) | `BBreg()`, `betabin()`, `vglm(family=betabinomial)`, `ordbetareg()` | ouR data generation | `brm(family = zero_one_inflated_beta())` |
| | | | mixed: `glmmTMB(ziformula, family=beta_family)`, `glmmTMB(ziformula, family=betabinomial)`, `glmmTMB(ziformula, family=ordbeta)`, `ordbetareg()` | | |
| Ordinal | Likert scale, worse/ok/better | ordinal, proportional odds, cumulative | `polr()`, `clm()`, `bracl()` | UCLA | `brm(family = cumulative())` |
| | | | mixed: `clmm()`, `mixor()`, `MCMCglmm(family = "ordinal")` | | |
| Multinomial | No natural order of categories, like red/green/blue | multinomial | `multinom()`, `brmultinom()` | UCLA | `brm(family = multinomial())` |
| | | | mixed: `MCMCglmm(family = "multinomial")` | | |
| Continuous, right-skewed | Financial data, reaction times | Gamma | `glm(family=Gamma)` | Sean Anderson | `brm(family = Gamma())`, but see also Reaction time distributions in brms |
| | | | mixed: `glmer(*)`, `glmmTMB(*)` | | |
| (Semi-)Continuous, (right) skewed, probably with spike at zero (zero-inflated) | Financial data, probably exponential dispersion of variance | Tweedie | `glm(family=tweedie)`, `cpglm()` | Revolutions | |
| | | | mixed: `cpglmm()`, `glmmTMB(*)` | | |
| (Semi-)Continuous, (right) skewed, probably with spike at zero (zero-inflated) | Normal distribution, but negative values are censored and stacked on zero | Tobit | `tobit()`, `censReg()` | | `brm(y \| cens(), family = gaussian())` |
| | | | mixed: `semLme()` | | |
| Continuous, but truncated or outliers | | truncated | `censReg()`, `tobit()`, `vglm(family=tobit)` | UCLA-1, UCLA-2 | `brm(y \| trunc(), family = gaussian())` |
| Continuous, but exponential growth | | log-transformed, non-linear | `glm(family=gaussian("log"))`, `nls()` | Some useful equations, linear vs. non-linear regression | |
| | | | mixed: `glmmTMB(*)`, `nlmer()`, `nlme()` | | |
| Proportion / Ratio with more than 2 categories | Biomass partitioning in plants (ratio of leaf, stem and root mass) | Dirichlet | `DirichReg()` | | `brm(family = dirichlet())` |
| Time-to-Event | Survival-analysis, time until event/death occurs | Cox (proportional hazards) | `coxph()` | UCLA | `brm(family = cox())` |
| | | | mixed: `coxme()` | | |

  • * indicates that the mixed-model functions should be given the same response type and family as their glm() counterpart.

  • Note that ratios or proportions from count data, like cbind(successes, failures), are modelled as logistic regression with glm(cbind(successes, failures), family=binomial()), while ratios from continuous data (where the response ranges from zero to one) are modelled using beta-regression.

  • Usually, zero-inflated models are used when the 0 or 1 values come from a separate process or category. However, when the 0/1 values are more consistent with censoring than with a separate category or process (i.e., 0 means “below detection”, not “something qualitatively different happened”), ordered beta regression is probably the better choice (Source: https://twitter.com/bolkerb/status/1577755600808775680)
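
The note above on ratios from count data can be illustrated with a minimal sketch in base R. The data are simulated and all variable names (`x`, `total`, `successes`, `failures`, `d`) are made up for illustration; they do not come from any of the linked examples. The sketch fits the `cbind(successes, failures)` form and, for comparison, the equivalent form with the observed proportion as response and the number of trials as weights. Ratios from continuous data would instead go to a beta regression such as `betareg()`, which is not shown here because it needs an additional package.

```r
# Simulated trial data; every name below is illustrative only.
set.seed(123)
n_obs <- 50
x <- rnorm(n_obs)
total <- rep(30, n_obs)  # 30 trials per observation
successes <- rbinom(n_obs, size = total, prob = plogis(-0.5 + 0.8 * x))
failures <- total - successes
d <- data.frame(x, total, successes, failures)

# Ratios from count data: logistic regression on (successes, failures)
m_counts <- glm(cbind(successes, failures) ~ x, family = binomial(), data = d)

# Same model, written with the proportion as response and trials as weights
m_prop <- glm(successes / total ~ x, weights = total,
              family = binomial(), data = d)

# Both specifications should give the same coefficient estimates
coef(m_counts)
coef(m_prop)
```

The `cbind()` form keeps the number of trials attached to each observation, so an observation with 30 trials carries more information than one with 3. That information is lost if only the bare proportion is modelled.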

Included packages for non-mixed models:

Included packages for mixed models:

Included packages for Bayesian models (mixed and non-mixed):

Handout

There is a handout in PDF format.