Comparison of Mixed Models Packages
billdenney opened this issue · 40 comments
Thank you so much for all your work on mixed models throughout the R ecosystem!
I found these pages as a great summary of how mixed models are available within R. One thing that I have not found is a resource comparing the features of the various R packages for fitting mixed models. I wonder if I'm missing it somewhere that you're aware of or if it may be a good fit here?
As a specific example, I need to fit a linear mixed model with different factor levels and heteroskedastic residual error by crossing the factor levels. (To ensure that I'm being clear here, I have treatment effects for treatments A and B measured daily on days 1 to 5 and treatment effects for treatments C and D measured daily on days 1 to 7; days are treated categorically; and I need to estimate unstructured covariance with treatment by day and a random effect on the intercept.)
I know that I can get unstructured residual error with `nlme::lme()`, and that I can have a `treatment*day` interaction in the random effects with `lme4::lmer()`, but I don't see an easy way to do both. Looking at various resources that you have helpfully compiled over time, I see that I can beat `lme4::lmer()` into submission for an unstructured residual covariance matrix (though it's not immediately clear how to use the result) following the notes you wrote at https://bbolker.github.io/mixedmodels-misc/notes/varmats.html.
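To make the two halves of the problem concrete, here is an untested sketch, assuming a hypothetical data frame `dat` with columns `response`, `treatment` (factor), `day` (factor), and `subject` (grouping factor):

```r
library(nlme)
library(lme4)

# nlme::lme() can handle heteroskedastic residuals: varIdent with a
# crossed stratification gives a separate residual variance for each
# treatment:day cell, and corSymm (assuming observations are ordered by
# day within subject) adds an unstructured within-subject correlation.
fit_nlme <- lme(
  response ~ treatment * day,
  random      = ~ 1 | subject,
  weights     = varIdent(form = ~ 1 | treatment * day),
  correlation = corSymm(form = ~ 1 | subject),
  data        = dat
)

# lme4::lmer() can handle the interacted random effect instead: an
# unstructured covariance among day effects within subject, but with a
# single homoskedastic residual variance.
fit_lme4 <- lmer(response ~ treatment * day + (0 + day | subject), data = dat)
```

Neither call does both at once, which is exactly the gap described above.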
So, the question/ask is: Do you know of a table that identifies the features of mixed effects modeling packages for R?
No, although I've had a student work on something like this in the past (having trouble locating it at the moment). Maybe I should set up a Google sheet so this can be a living/crowdsourced document? There is a comparison table in the `glmmTMB` JSS paper, but it focuses on zero-inflation.
It would be very helpful for me, and I assume others, too. Crowdsourcing it seems like a good solution, if it doesn't already exist.
I would contribute! And I bet the #rstats twitterverse would be all over this.
OK, next question. (1) format? (github + pull requests would be fun, but a google sheet would be much more accessible and probably easier to edit - GH diffs on a CSV file might not be all that useful ...) (2) Once that's settled, I might put up a preliminary list of categories for review. There's a very out-of-date table from Bolker et al. 2009 MEE and a zero-inflation-focused one in Brooks et al 2017:
The possibilities are endless: LMM vs GLMM, REML or not, approximate integration (PQL/Laplace/AGQ/Monte Carlo), deterministic or stochastic, Bayesian vs frequentist, choice of response distributions, choice of link functions, zero (and zero-one) inflation, covariance structures, crossed REs, dispersion models, multi-membership, ... (maybe we should just look at `brms` and list all of the things it can do ...)
By the way @billdenney, the dispersion-modeling feature of `glmmTMB` (`dispformula=...`) might do what you want. It's less flexible than `nlme`'s heteroscedasticity handling via the `weights` argument/`varFunc` classes (which allow you to fit models for dispersion as functions of the predicted mean), but it can handle any linear model with continuous or categorical predictors.
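As an illustrative, untested sketch of that suggestion (reusing the hypothetical `dat`, `response`, `treatment`, `day`, and `subject` names from the original question):

```r
library(glmmTMB)

# dispformula models the (log) residual standard deviation as a linear
# function of predictors; here each treatment:day cell gets its own
# dispersion, alongside a random intercept per subject.
fit <- glmmTMB(
  response    ~ treatment * day + (1 | subject),
  dispformula = ~ treatment * day,
  data        = dat
)
```

This covers the cell-specific residual variances, though not an unstructured within-subject correlation.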
I do think of `brms` as a pretty inclusive list. My thought is to start simple-ish with the list we can readily come up with (and I'm being generous to include myself in "we"), and then encourage others to extend it, adding new categories where they make sense and filling in everything they know across all of the packages. I would also assume that reaching out to the package authors would be productive.
But, having been party to similar efforts in the past, I think that the most important parts are to make it simple and low-effort to contribute. So, I think that pull requests will likely get complicated quickly because the contributions won't be rows but will be spread throughout.
I like the Google sheets idea (though I've not administered one with open contributions before).
Heck, this is an amazing start! Google doc-ing this and letting people add more packages, features, and something more extensive than a check, maybe, would be a great start! I'm sure the community will take it from there in wild ways.
I agree that adding more packages, features, and details would be useful. My initial thought of how to sketch it out would be:
- Take some of the packages that we know better (again, including myself in "we" is very generous to myself)
- Link to CRAN
- Link to the website for the package (e.g. GitHub pages)
- Make a list of the features by group
- Estimation method: Frequentist or Bayesian, and allowing both of those, and maybe something else?
- Linear, nonlinear (and these would be separate since some packages allow only linear, some allow both, and I guess there may be a package that only focuses on nonlinear)
- Supported response variable distributions
- Multivariate response variable support
- (To me, this seems like it would be a subset of heteroskedastic residual variable support, but it may be important enough to include on its own.)
- Residual variable distribution support
- Extensible response variable distribution support (i.e. can the user easily write their own distribution function for the response variable)
For all of these, either indicate "no" or indicate how to use it (e.g. "See the 'foo' argument in the function 'bar()'" or "See ?foo" or "See the page https://example.com/details"), and blank would indicate that the answer is unknown.
Contributors should be able to add new packages, new feature groups (but hopefully that would be rare), and new features. The feature list would likely get pretty long as there will likely be a lot of features that are important but only supported by a single package.
And (getting way ahead of myself), if we really take this all the way, then it eventually becomes a JSS article and a "Mixed/Hierarchical Models" CRAN task view (edit: and I see that Ben has a ~6-year-old version of the CRAN task view here: https://bbolker.github.io/mixedmodels-misc/MixedModels.html).
OK, one more (last?) question: should the primary format (for now) be
- a table-like format (i.e. like the examples above: a Google sheet where rows represent packages and columns represent features/feature sets). Cells could contain either yes/no check marks (possibly with footnotes) or lists of values from a semi-controlled vocabulary
- a tag-list like (YAMLish or JSONish or something) format, e.g.
```yaml
package: lme4
links:
  CRAN: ...
  GitHub: ...
model_types: LMM, GLMM
estimation:
  framework: frequentist
  approach: deterministic
  GLMM_approx: Laplace, AGQ
response_distributions:
  family: Poisson, binomial, NB2, Gamma
  links: identity, cloglog, logit, probit
```
(I'm just a little bit scared/paralyzed by the thought of developing this hierarchy/vocabulary completely, and I'm not crazy about YAML)
- something more free-form?
To me, a table is the easiest way to receive contributions from others, so I pretty strongly lean that way.
And it is readily extensible when new kinds of input are needed.
Related to glmmTMB/glmmTMB#522, it would be useful if a category were "Automatically handles rank-deficient fixed effects?"
I made some tweaks, most notably adding a "Column Descriptions" worksheet so that what the column contains is clear to people entering data.
This is great! Let me know when you want me to unleash the Twitterverse on it!
I added an instructions worksheet. Is this ready for twitterverse unleashing?
Thanks for the bump/reminder. I'm feeling a little chicken/that a little bit more time spent fleshing this out/organizing this ourselves, or with a more limited group of contributors, would be good before we open it to the whole world. I will try to spend a little time on this today ...
What do you think about caution vs boldness?
I tend to be of the opinion that steps taken are better than steps planned.
If it will get out there with a bit more planning, then that is good. If the planning will get a better response, then that is also good. (For instance, I think that providing a completely blank worksheet and saying "twitter, go for it!" probably would not succeed.) And, if getting a few contributors on board would help flesh it out and get some initial momentum, then go for it.
But, if the outcome is likely the same, then I'd say push the baby out of the nest earlier so that it is a step taken.
Addendum: I'm also not likely the right person to make the push which means my opinion is less important. I've actually used twitter about 5 times ever, and I don't have a notable name in mixed models like you two do.
I said I would get this today, but I didn't. Tomorrow! (I'll push it by the end of the day even if I don't do any more. I take your points about getting it done rather than getting it perfect.)
@billdenney , is there a difference between my "dispersion_model" column and your "variance_heteroskedasticity" model???
The more I look at this the more I realize why I'm daunted/have never gotten around to it ...
@bbolker , now that you ask... probably not. Fair to remove that column.
It is definitely not a simple thing to do.
One more question (I'm not stalling or anything :-) ) What do you think about adding the column descriptions/dictionary as a big (wrapped) cell at the top of the column? I know this violates the principle of keeping the column types homogeneous, but it seems it might be more accessible and convenient than having the definitions on a separate sheet (this occurred to me when I was adding & subtracting columns and having to go to two different sheets to keep everything synchronized ...)
Adding the description seems simpler to use, so I like it!
still working ... ugh
OK, I'm finally going to go for it. Feel free to jump in and start populating.
Thanks!!!
yeah, f***ing finally. Having opened a link for comments only (and telling people to request edit access), I'm now second-guessing myself and thinking I should just open it all the way up (anyone can edit with link) ... It would be nice to have people edit in the "suggestion mode" from Google docs but AFAICT that's not possible.
YES!
Hi @bbolker et al., just came across this twitter thread. I would be keen to contribute to this and will send an edit request for the document. Just wondering what the criteria are for whether a package should be included or not? I.e. it's focused on R packages, so that obviously excludes things like SAS - but what of packages not on CRAN? What of R packages that require a licence fee? Is this limited to FOSS R packages?
I don't think it should be limited to FOSS (we already have AS-REML in there, I think). We might want to add a column for license/FOSS-ness.
link to google sheet here so I can find it: https://docs.google.com/spreadsheets/d/19itelYaVW0U0gtNtRfqh76ZGt1awlamNcJwT71u_5Uk/edit#gid=0
Thank you. Asreml-R was exactly what I had in mind as that is what I'm most familiar with. It wasn't in the list that I could see, and so I was going to add it along with a couple of others. I think I have sent an edit request.
I think a "licence" or even just "FOSS (Y/N)" column would be really handy as it can be an important consideration. I agree with your comment about not wanting 200 columns, but what's the best way to go about suggesting/requesting extra columns in general? Discuss here or just go for it?
I like the FOSS (Y/N) options as I think that is what will matter to most users. If they want more detail, I think it's reasonable that they would go to the specific package's website to understand specifics of the license.
I also wonder if a "Last updated" column would be useful, mainly to indicate projects/packages that are stagnant or dead, but the downside is that it would obviously be fairly quickly out of date. Perhaps some indication of life rather than last updated? E.g. "project currently active" or similar?
maybe last updated → "maturity" (I know there's standard terminology/badges for that: experimental, maturing, ???) but don't know if that's exactly the same thing ...
Yeah, the lifecycle package is commonly used within tidyverse and those packages following tv style. This article mentions that the stages can be used to describe packages, functions or even arguments, however they are most commonly applied to functions.
Using that maturity terminology would be familiar to many users, but would probably require regular updating. I was trying to think of an objective way to indicate the life of a package without requiring regular checking and updating. E.g. `pedigreemm` is one I've looked at in the past and it hasn't been updated since 2014. It works, but should it be classed as stable, deprecated or superseded?
I wonder if there's a way to grab the last published date on CRAN automatically in the sheet? Would that be useful if doable?
Edit: FWIW I managed to get a Google Sheets formula to grab this info from CRAN packages if that's useful:
This is done in a test version of my own, not the master copy.
Thanks for making this happen! I used it today to find a package that supported censoring.
@rogerssam "last published" sounds like a good idea. How does it work? Seems fancy for a google sheets formula (web lookup & scrape/text processing)?
@bbolker yeah, it grabs the URL, and the CRAN details are just a table, so it can be imported; I realised it could be parsed using XPath to return just the published date, or an Excel `#N/A` when not a CRAN URL.
The formula is just `=INDEX(IMPORTXML(C25,"//table[1]//tr[td//text()[contains(., 'Published:')]]"), 2)`, where C25 is the URL.
I don't know if it's dynamic (i.e. it updates automatically on sheet load), or if it requires some intervention, but I will investigate some more, and see what I can do to make it dynamic if needed.
I'll add it to the sheet, but feel free to move as appropriate.
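For anyone maintaining the sheet from R instead, a possible alternative (untested sketch; requires internet access) is to pull the same "Published" field from the CRAN metadata database rather than scraping each package page:

```r
# tools::CRAN_package_db() returns a data frame of current CRAN metadata,
# including a "Published" column with each package's last release date.
db   <- tools::CRAN_package_db()
pkgs <- c("lme4", "nlme", "glmmTMB")  # hypothetical subset of the sheet's packages
db[match(pkgs, db$Package), c("Package", "Published")]
```

Packages not on CRAN would come back as `NA` rows, analogous to the `#N/A` behaviour of the sheet formula.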
Note that this is way out of date now. There is now a mixed models task view which provides much of this information, although not in a tabular format and not necessarily with quite as much detail ...