mikemahoney218/mm218.dev

Mike Mahoney: Model averaging methods: how and why to build ensemble models

https://www.mm218.dev/posts/2021/01/model-averaging/

Thank you very much Mike, excellent article. I learned a lot from you.
One question: would the idea be the same for combining classification models?

Hey @ronandsa ! Glad it's useful.

The short answer is "yes". You'd want to weight the probabilities each component model assigns to each class, not just the assigned classes, and then take the most likely classification. I personally haven't come across the same sort of massive review for classifiers that Dormann et al. did for regressors, but people most certainly do it:

https://web.cs.ucdavis.edu/~davidson/Publications/ModelAve.pdf
https://www.dbmi.pitt.edu/sites/default/files/Visweswaran%20Cooper.pdf

(Of course, how effective any given averaging method is will depend on the specifics of your data and models!)
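If it helps, here's a rough sketch of what that probability-weighted averaging could look like -- the probabilities and weights below are entirely made up, and in practice they'd come from your fitted component models:

```python
import numpy as np

# Made-up example: three component models, each predicting class probabilities
# for the same four observations across three classes. In practice these would
# be each fitted model's predicted probabilities.
probs = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.5, 0.4, 0.1]],  # model 1
    [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2], [0.2, 0.5, 0.3], [0.4, 0.4, 0.2]],  # model 2
    [[0.8, 0.1, 0.1], [0.1, 0.7, 0.2], [0.4, 0.3, 0.3], [0.6, 0.3, 0.1]],  # model 3
])

# Model weights -- equal here, but they could just as well be skill-based or
# information-criterion-based.
weights = np.array([1 / 3, 1 / 3, 1 / 3])

# Weight each model's class probabilities and sum across models, giving an
# (n_observations, n_classes) matrix of ensemble probabilities.
ensemble_probs = (weights[:, None, None] * probs).sum(axis=0)

# The ensemble prediction is the most likely class under the averaged probabilities.
ensemble_class = ensemble_probs.argmax(axis=1)
```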

Thank you SO MUCH - this is the most concise explanation of model averaging I have been able to locate. Wonderfully helpful, and I'm more confident in my approach now.
Follow-up question - if you use model averaging to report a model estimation result, what is the convention for reporting results and for estimating model significance? E.g., I have 5 models to average, and I'll be reporting on two IVs predicting 1 DV. My output is odds ratios (logistic regression). One of those IVs is significant in all 5 models; the other is significant in 3 of the 5. Using equal weights to estimate the model averages for both IVs, do I use the ORs from all 5 models in both calculations, or only the ORs from the significant models (i.e., 5 for the first and 3 for the second)? Additionally, how do I report the overall / estimated significance (what happens to my estimator p-values)?

Glad it was useful @gochezkerrth !

A disclaimer before this answer: I'm an applied researcher, not a statistician by training, so I'm going to answer more with what I have done rather than what's the right thing to do.

So, in this situation, it sounds like you're doing exploratory/descriptive modeling work -- you've got a set of X potential IVs, and a number of different plausible models combining them to estimate your DV. That means you're not doing confirmatory/attribution modeling work: your conclusions can be "it seems like X, Y, and Z are important variables here" but not "we're very confident that X has an impact of Y%".

For that type of project, I'm a fan of ranking models via AIC (or, well, AICc). You fit all possible models, rank them by AICc, and then "select" any model whose AICc is within a set distance of the lowest AICc (usually 4, sometimes 2) as a well-supported model. Your results section can then focus on which IVs are generally included in your well-supported models, and on the size and direction of their coefficients. I did this, for instance, in Mahoney and Stella 2020 ( https://www.mm218.dev/papers/mahoney_stella_2020.pdf ) -- see table 4 on page 9, section 3.4 for an example. This approach doesn't really use p-values; a model having an AICc within your threshold means the model is well-supported, regardless of the p-values of its terms.
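In case it's easier to see in code, here's a rough sketch of that fit-everything-and-rank workflow -- Python with statsmodels, entirely made-up data, and placeholder variable names:

```python
import itertools
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data: three candidate IVs and one binary DV.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
df["y"] = (rng.random(200) < 1 / (1 + np.exp(-(0.8 * df["x1"] - 0.5 * df["x2"])))).astype(int)

def aicc(fit):
    """Small-sample corrected AIC: AIC + 2k(k + 1) / (n - k - 1)."""
    k = fit.df_model + 1  # estimated parameters, including the intercept
    n = fit.nobs
    return fit.aic + (2 * k * (k + 1)) / (n - k - 1)

# Fit every non-empty combination of the candidate IVs.
candidate_ivs = ["x1", "x2", "x3"]
fits = {}
for r in range(1, len(candidate_ivs) + 1):
    for combo in itertools.combinations(candidate_ivs, r):
        formula = "y ~ " + " + ".join(combo)
        fits[formula] = smf.logit(formula, data=df).fit(disp=0)

# Rank by AICc and treat anything within 4 units of the lowest AICc as well-supported.
scores = {f: aicc(m) for f, m in fits.items()}
best = min(scores.values())
well_supported = {f: m for f, m in fits.items() if scores[f] - best <= 4}
```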

You'd only then average the models (maybe using AICc!) if you were using these models to generate predictions -- generating predictions from each model and then averaging those together.
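A toy sketch of what that looks like mechanically -- the AICc values and predicted probabilities here are invented purely to show the arithmetic:

```python
import numpy as np

# Invented AICc values (one per well-supported model) and each model's
# predicted probabilities for the same three new observations.
aiccs = np.array([210.3, 211.1, 213.9])
preds = np.array([
    [0.62, 0.10, 0.45],  # model 1's predictions
    [0.58, 0.14, 0.41],  # model 2's predictions
    [0.70, 0.08, 0.52],  # model 3's predictions
])

# Akaike weights: rescale the AICc differences so the weights sum to 1.
deltas = aiccs - aiccs.min()
weights = np.exp(-0.5 * deltas) / np.exp(-0.5 * deltas).sum()

# The model-averaged prediction is the weighted mean of the component predictions.
ensemble_pred = weights @ preds
```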

Hope that makes sense!

I'm an applied researcher as well; I appreciate you underlining the distinction. I am also reaching out to my local university math department to see if there are any theoretically correct methods I'm missing.

Correct, this is exploratory and descriptive with some predictive results (mostly predicting who will do really well and why, and who will do really poorly and why, to generate ideas for supports going forward), and I have ranked them by AICc (so glad to be on track!). My relationships are generally odds ratios, not coefficients, as most are not linear. Ideally, I would like to report something like "clients with characteristic X by year 4 were on average 3.2x more likely to achieve model status (the DV) in year 5", and have at least my documentation note that a relationship is reported only if consistently significant, etc., and how I defined that.

I'm also employing a backwards-selection approach to maximize the significant IV relationships ("Caveats to model selection" from https://uoftcoders.github.io/rcourse/lec09-model-selection.html) before ranking the models and selecting the lowest AICc plus anything with delta AICc <= 2. I noticed your 2020 paper used delta <= 4, and I was wondering if there's a specific reason?

In sum, it sounds like I'm on the right track?

Thank you again.

Sounds like you're on the right track!

I believe the 4 threshold is from Burnham and Anderson 2004 -- https://doi.org/10.1177/0049124104268644 -- but I can't get a copy of that paper right now, and I completely forget why 4 is preferable to 2.

My only other suggestion is that you might consider reporting the range of odds ratios across your well-supported models -- "across all well-supported models, clients with characteristic X were between 1.8x (1.6-2.0) and 3.2x (3.0-3.4) more likely to achieve model status." But otherwise, sounds like a good approach to me!
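In case it's handy, pulling that range out of a set of well-supported logistic models might look something like this -- the data and formulas are invented, just to show the shape of the output:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented data and formulas standing in for your well-supported models.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(300, 3)), columns=["x1", "x2", "x3"])
df["y"] = (rng.random(300) < 1 / (1 + np.exp(-1.2 * df["x1"]))).astype(int)
formulas = ["y ~ x1", "y ~ x1 + x2", "y ~ x1 + x2 + x3"]

# Odds ratio (and 95% CI) for x1 from each model: exponentiate the coefficient
# and its confidence interval.
for f in formulas:
    fit = smf.logit(f, data=df).fit(disp=0)
    or_x1 = np.exp(fit.params["x1"])
    lo, hi = np.exp(fit.conf_int().loc["x1"])
    print(f"{f}: OR for x1 = {or_x1:.2f} ({lo:.2f}-{hi:.2f})")

# Reporting the smallest and largest of those ORs (with their CIs) gives the
# "between a and b times more likely" range.
```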

Aha, bless you - sometimes I forget the simple things. A range works perfectly. Thank you!