Support new models
Opened this issue · 48 comments
IF THE MODEL YOU WOULD LIKE TO SUPPORT IS NOT LISTED BELOW, PLEASE OPEN A NEW ISSUE.
It is often very easy to add support for new models. If you would like to help us do it (thanks!!!), please read this vignette on marginaleffects extensions: https://vincentarelbundock.github.io/marginaleffects/articles/extensions.html

marginaleffects only supports modeling packages released on CRAN. If you would like to add support for a non-CRAN package, your code can be hosted in the user-submitted (but unsupported) code library in the vignette above.
If you cannot contribute code to support a new model, and if that model is not listed below, please open a new issue with a working example using a small publicly available dataset.
Seems possible
- `ordinal::clmm2`: #1003
Existing issues (with some notes)
- `bife`: #949
- `earth` (MARS): #1186
- `flexsurv`: #320
- `gamm4`: #947
- `logitr`: See code in `sandbox/` and #516
- `mrmm`: #963 #1000
- `MuMIn`: #1154
- `panelr`: #210
- `stats::nls`: #326
- `systemfit`: #836
`predict` problem
- `alpaca`: #786
- `bayesX`
- `censReg`
- `coxme`: #645
- `coxphw`: #467
- `coxrobust`
- `cplm`
- `endorse`: #1144
- `gee`
- `glmm`
- `glmnl`
- `ivprobit`
- `MASS::ridgelm`
- `mgcv::gamm`: #187
- `ordinal::clm2`: no prediction for each level
- `ordinal::clmm`: no predict method at all
- `pglm`
- `plm::pgmm`
- `robust::glmRob`: breaks with `newdata`
- `timereg`: #467
`vcov` problem
- `complmrob`
- `bfsl`: no vcov method at all
- `lqmm`: breaks with `newdata`. #1072
Categorical / Multinomial DV
- `VGAM::vglm`: not just multinomial
- `VGAM::vgam`
Not on CRAN
- `bigmlogit`: checked on 2022-06-02
- `mixor`: checked on 2022-03-10
- `mnlogit`: checked on 2023-01-11
- `tobit1`: checked on 2023-01-11
- `ZBZIMM`: #541
No formula interface
- `glinternet`: #934
Not currently planned
- Spatial regression models typically rely on very different data structures, and are currently considered out of scope for `marginaleffects`: #1052
  - `spaMM::HLfit`: #1052 (comment), #1067
  - `sdmTMB`: #1136
- `metafor`: Very different input data. Likely very difficult to support. Would need a code submission from an expert contributor. #604
- `lavaan`: Vincent has no experience with this package. We could consider this, but it would require serious code contributions. #499
- `R-INLA`: Very different and unlikely. #342
- `Apollo` for choice models: Seems extremely idiosyncratic, and does not appear to supply any of the standard methods such as `vcov()`, `coef()`, or `predict()`. Supporting this is likely to be a major challenge. Vincent is unlikely to do this himself without significant help (or, more likely, a quasi-complete PR) from a contributor. https://cran.r-project.org/web/packages/apollo/index.html
- `cmprsk::crr`: Seems very different. Vincent will not implement this unless someone is willing to help.
Supported via tidymodels or mlr3
- `glmnet`: no formula interface, and already supported through `tidymodels` and `mlr3`. #923
- `SuperLearner`: #834
Supported by other packages
- Source: emmeans https://cran.r-project.org/web/packages/emmeans/vignettes/models.html
- Source: easystats
- Source: ggeffects
  - bamlss, cgam, cgamm, gamm, gamm4, MixMod, nlmer, survreg, sjstats::svyglm.nb, vgam.
quantreg::rq

You may take a look at `insight::get_varcov.rq()`. Maybe you could even rely on `insight::get_varcov()`, since it automatically does some "fixing" (like fixing negative matrices or rank deficiency).
> quantreg::rq
> You may take a look at `insight::get_varcov.rq()`. Maybe you could even rely on `insight::get_varcov()`, since it automatically does some "fixing" (like fixing negative matrices or rank deficiency).

That's fantastic! I didn't know about this and will definitely take a look, and almost certainly adopt it.
> quantreg::rq
> You may take a look at `insight::get_varcov.rq()`. Maybe you could even rely on `insight::get_varcov()`, since it automatically does some "fixing" (like fixing negative matrices or rank deficiency).

Worked like a charm: add907e. Thanks @strengejacke!
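For reference, a minimal illustration of the adopted suggestion (a sketch; assumes quantreg and insight are installed):

```r
# insight::get_varcov() dispatches to the rq method discussed above
mod <- quantreg::rq(mpg ~ hp + wt, data = mtcars, tau = 0.5)
insight::get_varcov(mod)  # variance-covariance matrix, with insight's "fixing"
```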
If requests are open, I'd love for svyolr and svymultinom to be supported!
> If requests are open, I'd love for svyolr and svymultinom to be supported!
Ooops, sorry I missed this post @dustingilbreath!

I am interested in adding support for this. Unfortunately, the svyolr object does not have a `predict()` method, so it cannot be supported at the moment. I contacted the package maintainer a while ago, and he seemed positive about the chances that this feature could eventually be published. See here: #157
I am not familiar with the svymultinom model; it does not seem to be part of the survey package. Where is it from? Do you know if it has a `predict()` method with a `newdata` argument?
No worries and thanks for getting back to me!
First, that's great on svyolr! It would make my life much much easier :)
Second, svymultinom isn't integrated into the survey package, but is rather in the svyrepmisc package:
https://rdrr.io/github/carlganz/svrepmisc/man/svymultinom.html
On whether it has a predict method with a newdata argument: it passes arguments to multinom, so maybe? I'm a terrible programmer, if I'm being honest.
@dustingilbreath is svymultinom supplied by a package on CRAN, or just GitHub?

FWIW, I only plan to support packages which have officially been released on CRAN. A lot of the GitHub packages are experimental, and it can be challenging to support packages with unstable user interfaces. Also, being released on CRAN means there is at least a minimal level of testing, and suggests that there may be enough of a user base to be worth the work.
Unfortunately, just GitHub. It may be my ignorance, but I haven't seen a multinomial model that adjusts for complex survey design anywhere else, which is where my question stems from at the end of the day.

In any case, many thanks for getting back to me, for the great package, and for the hopefully forthcoming svyolr addition!
Love the new stuff that keeps coming out here. I'm curious if there are plans to support mice (multiple imputation) models in some way. I wouldn't mind thinking through this, but wondered if it was already in the works.
> Love the new stuff that keeps coming out here. I'm curious if there are plans to support mice (multiple imputation) models in some way. I wouldn't mind thinking through this, but wondered if it was already in the works.
No current plans, but I'm not sure how this would work. Shouldn't we estimate marginal effects in each of the imputed datasets, and then combine the results using Rubin's rules? I anticipate challenges in trying to make marginaleffects "support" mice objects. For instance, do they even have the methods we need, such as `predict()`?
> No current plans, but I'm not sure how this would work. Shouldn't we estimate marginal effects in each of the imputed datasets, and then combine the results using Rubin's rules? I anticipate challenges in trying to make marginaleffects "support" mice objects. For instance, do they even have the methods we need, such as `predict()`?
As far as I've seen, they've discussed it in depth (amices/mice#82), but I'm not sure it will be formally implemented in the package...
Possibly relevant: http://www.haghish.com/statistics/stata-blog/stata-programming/download/mimrgns.html
Yeah, definitely relevant.
It probably comes down to a choice between two orders of operation.
Order A:
- Impute 20 datasets
- Fit in each of the 20
- Marginal effects in each of the 20
- Pool
Order B:
- Impute 20 datasets
- Fit in each of the 20
- Pool
- Marginal effects in the pooled estimates
My guess is that Order B is either impossible or very difficult to implement.
Order A, in contrast, is probably very easy, or (nearly) possible already. However, I'm not sure about the statistical theory here, and there are some vague warnings at the link I posted above.
Right, I think "Order A" makes the most sense. But you are right that the link waves its hands and says it could be cursed in some situations. Honestly, though, I don't see how it could be a serious problem if we are already getting marginal effects for each of the individual models (just like normal) and then pooling. What could the pooling do to cause serious issues? Will need to look into this more.
See also https://strengejacke.github.io/ggeffects/reference/pool_predictions.html (which uses Order A)
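To make Order A concrete, here is a minimal sketch with hand-rolled Rubin's rules, assuming a simple glm on mice's built-in nhanes2 data (untested against the marginaleffects internals; for illustration only):

```r
library(mice)
library(marginaleffects)

m <- 20
imp <- mice(nhanes2, m = m, printFlag = FALSE)   # 1. impute m datasets
fits <- lapply(seq_len(m), function(i) {         # 2. fit in each
    glm(hyp ~ age + bmi, family = binomial, data = complete(imp, i))
})
mfx <- lapply(fits, avg_slopes)                  # 3. marginal effects in each

# 4. pool with Rubin's rules
est_mat <- sapply(mfx, function(x) x$estimate)
W <- rowMeans(sapply(mfx, function(x) x$std.error^2))  # within-imputation variance
B <- apply(est_mat, 1, var)                            # between-imputation variance
data.frame(term = mfx[[1]]$term,
           estimate = rowMeans(est_mat),
           std.error = sqrt(W + (1 + 1 / m) * B))
```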
Thanks Vincent and contributors to this thread. As food for thought, consider Westreich and colleagues' 2015 paper:
https://academic.oup.com/ije/article/44/5/1731/2594566?login=false
Ignore that the authors compare a very specific multiple-imputation approach for counterfactual recovery with a standard, no-frills G-computation approach. Nevertheless, the authors make the following comments, which would seem to apply quite generally to causal inference over multiply imputed data:

"... note that we are only estimating a mean, so accounting for between-imputation variance is unnecessary; and related ... Rubin's formula for the variance cannot be used because every observation contributes to both exposed and unexposed calculations, and therefore we bootstrap. A closed-form variance estimator is likely possible though not explored here..."

Following Westreich et al.'s logic, it is not entirely clear to me whether and how Rubin's rules apply when we are estimating counterfactual contrasts using G-computation. I am curious to follow this issue; it will certainly interest many others. Thanks again all.
Thanks for the link. I only had time to skim, but I don't understand why this would apply "generally" to procedures other than the one they propose. Is there an actual argument somewhere?
BTW, their procedure seems pretty weird to me. Is it a good summary to say that, basically, they do parametric g-estimation, but instead of specifying a logit (or other) model explicitly, their "model" is the default algorithm used by SAS's multiple imputation routine? The "multiple" seems entirely incidental to them, so it sounds like this is all it boils down to...
Thanks for replying Vincent. I share your intuitions! However, when it comes to causal inference, I find my intuitions sometimes lead me astray. Additionally, Westreich is one of the most respected researchers in Epi, so perhaps there is something to his comments. Speculating, perhaps his motivation stems from something like the following.
Recall, G-computation recovers a standardised mean:

$$E[Y(a)] = \sum_z E[Y \mid A = a, Z = z] \Pr(Z = z)$$

Similarly, we may set everyone's level of A to a*:

$$E[Y(a^*)] = \sum_z E[Y \mid A = a^*, Z = z] \Pr(Z = z)$$

(We may need to compute such expectations as functions...)

This is G-computation: we standardise to the mean distribution of the confounders Z in a population by obtaining the average of the predicted responses at some setting of A = a for an entire population, which we may contrast with another setting A = a* for the same population. (Note we use "standardise" in the epi sense of the term, as balancing the distribution of confounding covariates for the outcome -- there is much scope for terminological muddles here.)

Notably, because the difference in the expectations of the means is the same as the expectation of their difference, we may obtain a causal contrast on the difference scale as:

$$E[Y(1)] - E[Y(0)] = \sum_z \big( E[Y \mid A = 1, Z = z] - E[Y \mid A = 0, Z = z] \big) \Pr(Z = z)$$

This difference is an average causal effect of the exposure A = 1 contrasted with the exposure A = 0 on the difference scale. Again, we have computed this causal effect from the differences in the standardised means of the predicted counterfactual outcomes across a population in which, contrary to fact, everyone receives both exposures.
In any case, I find that considerations such as those I have just mentioned help me build an intuition for why Westreich wants to focus on contrasts of simulated means. Now, what of computing the variances of the counterfactual responses?
Here I am less confident. The reasoning might be something as follows. It is clear that we cannot apply G-computation directly to model counterfactual variances. This is because the variance of the individual-level contrast,

$$\mathrm{Var}[Y(1) - Y(0)] = \mathrm{Var}[Y(1)] + \mathrm{Var}[Y(0)] - 2\,\mathrm{Cov}[Y(1), Y(0)],$$

depends on a covariance between potential outcomes that can never be observed for the same unit.

The fact that G-computation and its extensions require special handling for obtaining the variances of potential outcomes might underlie Westreich's quick dismissal of Rubin's rules for partial pooling. Channelling Westreich's line of thought as I have imagined it -- perhaps wrongly -- it is because we only recover contrasts for the means of the simulated potential outcomes that we must turn to bootstrapping for our standard errors (although he mentions, opaquely, that closed-form solutions might be possible...).
I will admit that my attempts to build intuition here are something of a stab in the dark. Again, I may be off the mark! I want to underscore I do not have any specific advice. I too am curious. Hopefully, my comments prompt you and others to more considered thoughts.
Thanks again for your time and your wonderful package Vincent!
This issue is about supporting new models, so I am now hiding these comments. I opened a new issue #444 to keep the conversation going.
Hi @vincentarelbundock, are there updates on the support for nlme::lme() models?

You indicated in the closed issue #99 that, previously, insight::get_data(), or rather the insight:::get_data.clm() method, did not extract cluster variables. The latest insight package makes get_data(lme()) correctly extract cluster variables. However, predictions() in marginaleffects only calculates point estimates and leaves the SE and other inference statistics as NA, although no error or warning messages are printed.

Do you know what causes the NA? Does it require any "Model-Specific Arguments", as for lme4::lmer() models, to correctly calculate standard errors that may or may not account for random effects? So far, nlme::lme() is the only tool I have found that models heteroscedasticity while accounting for random effects and error correlation, so it solves certain problems that lme4::lmer() does not. It would be great if we could also obtain marginal effects from such models through your package.
Hi! Congratulations on the package! :) If this issue is still open for requests, I would love to calculate marginal means on models fitted with the phylolm package. Since many of us have to take phylogenetic relationships into account for our analyses, I think it could be quite useful!
Hi @vincentarelbundock, are you still interested in supporting {flexsurv} models? I am the author of the predict.flexsurvreg() method in the flexsurv package and would be interested in contributing to this package.
@mattwarkentin yes, absolutely!
Adding new models is usually pretty easy. I wrote a vignette describing all of the steps here: https://vincentarelbundock.github.io/marginaleffects/articles/extensions.html#marginaleffects-extension
Let me know what you think.
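For anyone following along, the vignette boils down to defining a handful of S3 methods. A hedged skeleton for a hypothetical model class "myclass" (the method names and the rowid/estimate columns follow the vignette; every body is a placeholder to adapt):

```r
# Skeleton of the methods described in the extensions vignette. "myclass" is
# a hypothetical class; replace each body with model-appropriate code.
get_coef.myclass <- function(model, ...) {
    stats::coef(model)                       # named numeric vector of parameters
}
set_coef.myclass <- function(model, coefs, ...) {
    model$coefficients <- coefs              # write perturbed values back into the model
    model
}
get_vcov.myclass <- function(model, ...) {
    stats::vcov(model)                       # variance-covariance matrix
}
get_predict.myclass <- function(model, newdata = NULL, ...) {
    pred <- stats::predict(model, newdata = newdata, type = "response")
    data.frame(rowid = seq_along(pred), estimate = as.numeric(pred))
}
```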
Hi, thank you very much for this package; I find it extremely useful. I am currently trying to use it with a 'feglm' class object from the alpaca package. I need this package because I am estimating a nonlinear probability model with high-dimensional fixed effects and need to perform the bias correction they implement. The bife package is a supported alternative, but I believe the results I am obtaining from bife are not okay. Is there any way that feglm objects could be supported? I have tried the extension suggestions but have not succeeded.
@jarmasmuguerza thanks for the suggestion. I have opened a separate issue for this package.
What is the problem with "bife"? Is there a bug in "marginaleffects"? If so, can you show me a reproducible example?
I am going to do my best to explain the issue because I am not quite sure what is going on.
- I am estimating a probit with high dimension fixed effects using "bife".
- Then I estimate this same model with feglm. Because I had previously loaded "fixest" too, the "feglm" estimation's output is of class "fixest" instead of "feglm". This lets me use the "marginaleffects" package (most importantly, the avg_slopes function), but prevents me from using biasCorr from "alpaca", which needs a "feglm" class object (and that correction is ultimately what I want before computing marginal effects).

The coefficients from 1 and 2 are very similar (as far as I can tell without seeing the estimated fixed effects); however, the marginal effects differ largely (two-fold). The ones from 2 very closely resemble the OLS ones. It is this large difference that makes me doubt the marginal effects computed after "bife", but it could well be an issue unrelated to "marginaleffects".
I don't really know my way around GitHub yet, but if needed I could somehow show a reproducible example.
Hi, I went back to the bife estimations. When fitting a probit, the average marginal effect of variable x can also be computed manually by taking the average across observations of: normal density(estimated index value) * (derivative of the linear index function with respect to x). This doesn't seem to be what avg_slopes returns after a bife estimation.
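For reference, the manual computation described above looks roughly like this on a plain glm() probit (a sketch on mtcars; the actual model uses bife with panel fixed effects, so this illustrates only the formula, not the disputed estimates):

```r
# Manual average marginal effect for a probit: the average over observations
# of dnorm(estimated linear index) * coefficient of x.
mod <- glm(am ~ hp + wt, family = binomial(link = "probit"), data = mtcars)
eta <- predict(mod, type = "link")    # estimated linear index
mean(dnorm(eta) * coef(mod)["hp"])    # manual AME of hp
# Compare: marginaleffects::avg_slopes(mod, variables = "hp")
```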
@jarmasmuguerza Can you open a separate issue and supply a minimal replicable example using public data?
> @jarmasmuguerza Can you open a separate issue and supply a minimal replicable example using public data?
Sure, I'll give it a try.
Hello. As I wrote on Twitter, I'd like to add support for the SuperLearner R package, available on CRAN. I am looking at the examples provided, and one thing that differentiates it from the other supported classes is that the result of a call to the SuperLearner functions is essentially a list of fitted models, not just one. It currently supports the following estimators:
```r
> listWrappers()
All prediction algorithm wrappers in SuperLearner:
 [1] "SL.bartMachine"      "SL.bayesglm"         "SL.biglasso"
 [4] "SL.caret"            "SL.caret.rpart"      "SL.cforest"
 [7] "SL.earth"            "SL.extraTrees"       "SL.gam"
[10] "SL.gbm"              "SL.glm"              "SL.glm.interaction"
[13] "SL.glmnet"           "SL.ipredbagg"        "SL.kernelKnn"
[16] "SL.knn"              "SL.ksvm"             "SL.lda"
[19] "SL.leekasso"         "SL.lm"               "SL.loess"
[22] "SL.logreg"           "SL.mean"             "SL.nnet"
[25] "SL.nnls"             "SL.polymars"         "SL.qda"
[28] "SL.randomForest"     "SL.ranger"           "SL.ridge"
[31] "SL.rpart"            "SL.rpartPrune"       "SL.speedglm"
[34] "SL.speedlm"          "SL.step"             "SL.step.forward"
[37] "SL.step.interaction" "SL.stepAIC"          "SL.svm"
[40] "SL.template"         "SL.xgboost"
```
It is therefore necessary to take care of, e.g., extracting the coefficients from each estimator and somehow combining them (e.g., by averaging). One question: taking the get_coef function as an example, should I manually extract the coefficients from the individual estimators in the library?
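For what it's worth, one purely hypothetical starting point, assuming we treat the meta-learner weights stored in the fit's coef element as the model's "coefficients" (whether that is sensible for delta-method standard errors is exactly the open question):

```r
# Hypothetical sketch, not a tested implementation: SuperLearner fit objects
# store one ensemble weight per wrapper in their $coef element.
get_coef.SuperLearner <- function(model, ...) {
    model$coef  # named numeric vector of ensemble weights
}
```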
Hi @vincentarelbundock, would you consider supporting the systemfit package? systemfit() models should be very similar to the ivreg() objects that marginaleffects already supports, and to sem() models from the lavaan package, since all of them allow instrumental variables. However, systemfit() also allows one or multiple equations to be estimated at the same time and accounts for cross-equation correlations, so it works for seemingly unrelated regression without instrumental variables and for simple structural equation modeling without latent factors or measurement-model components. Therefore, there can be multiple outcome variables in systemfit(), and effects are usually partitioned into direct, indirect, and total effects.
Hi, I am requesting support for marginal odds ratios for fixest glm models.
Marginal odds ratios are defined here: https://sociologicalscience.com/articles-v10-10-332/
The latest copy of the book shows no testing has been done on fixest's GLMs.
https://marginaleffects.com/articles/supported_models.html
Thanks!
```r
comparisons(model, comparison = "lnoravg", transform = exp)
```
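For concreteness, an untested sketch of that call on a toy logit (swap in your fixest model; comparison = "lnoravg" averages the log odds ratios, and transform = exp exponentiates the result into a marginal odds ratio):

```r
library(marginaleffects)
mod <- glm(am ~ hp + wt, family = binomial, data = mtcars)
comparisons(mod, variables = "hp", comparison = "lnoravg", transform = exp)
```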
@vincentarelbundock yes, in fact I realize that the reason this did not work for my model is not that you do not support it, but that fixest-estimated models don't play nicely with comparisons or avg_comparisons when using interacted fixed effects.

For some reason, a model estimated with ^ in fixest (to generate a fixed-effects interaction, e.g., period x census region, with states as the unit of analysis) fails with the error:

```
Error: The function supplied to the `comparison` argument must accept two numeric vectors of
predicted probabilities of length 0, and return a single numeric value or a numeric vector of
length 0, with no missing value.
```

By contrast, when the interacted fixed effect is incorporated into the fixest estimation as its own variable, the comparisons and avg_comparisons functions work fine. I suppose this is a bug rather than a missing feature. I can generate a reproducible example if it would help.
Can you supply a reproducible example with public data? (Like mtcars)
In fact, I am unable to replicate the issue with mtcars, so I think this problem arises from missing data or another idiosyncrasy in my own dataset. Feel free to mark this closed, and I can open a separate issue once I've replicated the issue on a public dataset.
I am now certain that this problem arose from the following feature in fixest's procedure for constructing combined fixed effects quickly:

> Combining the fixed-effects
>
> You can combine two variables to make it a new fixed-effect using ^. The syntax is as follows: fe_1^fe_2. Here you created a new variable which is the combination of the two variables fe_1 and fe_2. This is identical to doing paste0(fe_1, "_", fe_2) but more convenient. Note that pasting is a costly operation, especially for large data sets. Thus, the internal algorithm uses a numerical trick which is fast, but the drawback is that the identity of each observation is lost (i.e. they are now equal to a meaningless number instead of being equal to paste0(fe_1, "_", fe_2)). These "identities" are useful only if you're interested in the value of the fixed-effects (that you can extract with fixef.fixest). If you're only interested in coefficients of the variables, it doesn't matter. Anyway, you can use combine.quick = FALSE to tell the internal algorithm to use paste instead of the numerical trick. By default, the numerical trick is performed only for large data sets.
Setting combine.quick = FALSE solved the problem in all the cases I tested. Thanks for making a great package.
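For anyone hitting the same problem, the workaround looks like this (an illustrative formula on mtcars, not the original model):

```r
library(fixest)
# combine.quick = FALSE keeps the pasted identities of the combined fixed
# effect, instead of the fast numerical trick that discards them
mod <- feols(mpg ~ hp | cyl^gear, data = mtcars, combine.quick = FALSE)
```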
Hello,

Would it be possible to add the glmnet package? It's not supported, yet it has a predict method (and it's mentioned in Misc at the beginning).

I could work on it if needed (with some help or guidance).

Thanks!
Thanks for the suggestion @scfmolina
Is this package only "matrix-based" or is there a formula interface to fit models?
Behind the scenes marginaleffects builds data frames with different predictor values, and then uses the formula in the model to make predictions. If there is no formula interface, it would be a lot of work to add support.
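To illustrate that mechanism with a toy sketch (unrelated to glmnet itself):

```r
# marginaleffects-style logic: perturb a predictor in a data frame, then let
# the model formula rebuild the design matrix via predict(newdata = ...)
mod <- lm(mpg ~ hp + factor(cyl), data = mtcars)
nd <- transform(mtcars, hp = hp + 1)
head(predict(mod, newdata = nd) - predict(mod))  # unit-level effect of +1 hp
```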
It's matrix-based: you have to give X and y. There is no formula argument.

But I see that tidymodels has an interface for it, using both generics::fit, which receives the formula, and generics::fit_xy, which receives X and y.

The example is here.
@scfmolina That's great news! I just added support for tidymodels, so my guess is that this is going to work in the development version of marginaleffects. See here: https://marginaleffects.com/articles/machine_learning.html
I've been testing for the last few hours. It works perfectly if we use the fit function with the formula interface, but it does not work using fit_xy with X, y matrices/vectors. Amazing how well it works!

And with respect to this:

> When the underlying engine that tidymodels uses to fit the model is not supported by marginaleffects as a standalone model, we can also obtain correct results, but no uncertainty estimates....

I understand that getting confidence intervals from biased estimates, as from glmnet, is a little strange, at least theoretically. This post gives a good answer with useful resources about that.

So, do you plan to add uncertainty estimates for ML models or regularized regressions? I could help (or at least try to help) with that.
I do not currently have plans to do this, but I really appreciate your offer to help.
If you want to work on this, I strongly recommend you first propose a detailed implementation plan --- with scientific references and pointers to relevant bits of the marginaleffects code base.

The worst outcome would be for you to spend a lot of time writing code that I can't or won't merge (e.g., on statistical grounds, because it adds too much complexity, or because it hinders maintainability).
I may be wrong, but ordinal::clmm2 and ordinal::clmm now seem to have predict methods (maybe the package has been updated). Would you mind checking whether you are able to include these models? Thank you.
I'm not sure whether conversation on the rstpm2 implementation should be here or in the other issue... I guess here, as this is the open one?

In stpm2 models, the coefficients seem to be stored in many places. Using the following at least produces the SEs etc. in the predictions output (I don't know which slots are important, so I changed them all):
```r
set_coef.stpm2 <- function(model, coefs, ...) {
    insight::check_if_installed("rstpm2")
    model@coef <- coefs
    model@fullcoef <- coefs
    model@lm$coefficients <- coefs
    model@args$init <- unlist(coefs)
    model@details$par <- coefs
    model
}
```
```r
> predictions(m2, newdata = nd)

 Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 % rectime hormon x3
    0.535     0.0387 13.8   <0.001 142.1 0.459  0.611    1000      0 50
    0.647     0.0411 15.8   <0.001 183.3 0.567  0.728    1000      1 50
    0.739     0.0689 10.7   <0.001  86.7 0.604  0.874    1000      2 50

Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, rectime, hormon, x3
Type: response
```
avg_predictions doesn't work, though, for some reason... If the missing variables were the spline terms, I might have kind of understood, but that hormon is missing confuses me. It seems to give the same error whether newdata is passed or not:

```r
> avg_predictions(m2, var = "hormon")
Error: Unable to compute predicted values with this model. You can try to supply a different
dataset to the `newdata` argument. This error was also raised:

undefined columns selected

Bug Tracker: https://github.com/vincentarelbundock/marginaleffects/issues
In addition: Warning message:
Some of the variable names are missing from the model data: hormon
```
To note: rstpm2 can predict a whole load of effect measures, so the type_dictionary would need updating to include some of them (I would be most interested in rmst and rmstdiff, but others are certainly also of interest).
Thanks. Let's talk in the model-specific issue.