Posterior predictive simulations
bwiernik opened this issue · 5 comments
Classification:
Feature Request
Summary
I'd like to be able to simulate posterior draws and posterior predictive draws à la arm::sim() and bayesplot::pp_check(). I wrote a quick-and-dirty function to draw samples of the coefficients/tau or of yi, assuming that REML estimation and the Knapp-Hartung (knha) adjustment are used. I can generalize it, but first I'd like to verify that you'd be interested and get your input on the API and coding style.
https://gist.github.com/bwiernik/be04c8817046ea0ffea9e9838928ea96
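For context, here is a minimal sketch of the kind of coefficient draws I have in mind (illustrative only, not the gist's actual code; it uses a multivariate normal approximation for the coefficients, which is itself an assumption):

```r
library(metafor)
library(MASS) # for mvrnorm()

# BCG vaccine data; random-effects model with the Knapp-Hartung adjustment
dat <- escalc(measure="RR", ai=tpos, bi=tneg, ci=cpos, di=cneg, data=dat.bcg)
res <- rma(yi, vi, data=dat, test="knha")

# draw coefficient samples from a multivariate normal approximation,
# analogous to what arm::sim() does for lm() coefficients
beta_draws <- mvrnorm(1000, mu = coef(res), Sigma = vcov(res))
```

Drawing tau2 values is the harder part, since its sampling distribution has to be approximated as well.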
Hi Brenton. Not sure if you have seen the simulate.rma() method. Is this what you are trying to do via your function?
library(metafor)

# BCG vaccine data; fit a random-effects model
dat <- escalc(measure="RR", ai=tpos, bi=tneg, ci=cpos, di=cneg, data=dat.bcg)
res <- rma(yi, vi, data=dat)
res

# simulate new outcomes and refit the model to each simulated dataset
sim <- simulate(res, nsim=1000)
sav <- lapply(sim, function(x) rma(x, vi, data=dat))
sapply(sav, function(x) x$beta)
t(sapply(sav, function(x) c(x$beta, x$tau2)))
As far as I can tell, the simulate.rma() function doesn't incorporate any uncertainty in either the mean function or the tau2 estimate, similar to the stats::simulate.lm() function?
Correct, it just simulates new data based on the estimated fixed effects and the marginal var-cov matrix of the estimates. Can you point me to a paper describing the theory as to what you are trying to accomplish?
To add to this: I took a look at getMethod("sim", "lm") from arm. I see what they are doing there. But doing something similar is going to be very difficult for meta-analytic models. In simple regression models, the distribution of sigma^2 is a scaled chi-square distribution, but this doesn't apply to tau^2 in random-effects models. For iterative estimators, such as REML, the exact distribution is not even known.
This is the original paper on posterior predictive checks: http://www.stat.columbia.edu/~gelman/research/published/A6n41.pdf
See also https://cran.r-project.org/web/packages/bayesplot/vignettes/graphical-ppcs.html and Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
So, the function I linked to does two things:
- Draws samples of model parameters based on the fitted model and its uncertainty (a parametric bootstrap), analogous to posterior draws in Bayesian estimation. Implemented as in arm::sim(), using the sampling distributions of beta and tau2.
- Optionally uses those sampled parameters to draw samples of individual effect sizes (à la predict()). Compared to simulate() or predict(), each row of the sample matrix has its own mean, computed from a draw of the simulated betas, and its own SD, computed from a draw of the simulated tau2s and the sample vi.
The two main applications here are:
- Model checking: do the simulated effect size distributions resemble the observed distribution, or are there major assumption violations?
- Uncertainty visualization that incorporates uncertainty in the tau2 value (e.g., as we might use for power analysis for complex models).
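To make the second step concrete, here is a rough sketch of what the predictive draws look like (again illustrative, not the gist's code; for brevity it holds tau2 fixed at its estimate, whereas the full version would also sample tau2):

```r
library(metafor)
library(MASS) # for mvrnorm()

dat <- escalc(measure="RR", ai=tpos, bi=tneg, ci=cpos, di=cneg, data=dat.bcg)
res <- rma(yi, vi, data=dat)

nsim <- 1000
X <- model.matrix(res) # design matrix (here just an intercept)
beta_draws <- mvrnorm(nsim, mu = coef(res), Sigma = vcov(res))

# one simulated set of effect sizes per coefficient draw: each row of ppd
# has its own mean (from the drawn betas) and its own SD (from tau2 plus
# the sampling variances vi)
ppd <- t(sapply(seq_len(nsim), function(i)
   rnorm(nrow(dat), mean = X %*% beta_draws[i, ], sd = sqrt(res$tau2 + dat$vi))))

# e.g., overlay densities of the rows of ppd on the density of dat$yi
```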
If you know the distribution of the parameter estimates, then you can indeed use this information to sample them directly. However, in random-effects models, this is not the case (except for some very special cases). Once you move to rma.mv() models, it gets even trickier.
However, if you simulate new data from the estimated parameters and then refit the same model to each simulated dataset, you should accomplish the same thing, or something close to it. The former is of course faster, but the latter is completely general, as it does not require that we figure out the distribution of the parameter estimates for every possible model.
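A sketch along these lines for a plain random-effects model (hedged: this is just an illustration of the idea, not code from the package):

```r
library(metafor)

dat <- escalc(measure="RR", ai=tpos, bi=tneg, ci=cpos, di=cneg, data=dat.bcg)
res <- rma(yi, vi, data=dat)

# approximate parameter 'draws' via simulate-and-refit (parametric bootstrap)
sim  <- simulate(res, nsim=200)
fits <- lapply(sim, function(y) rma(y, vi, data=dat))
draws <- t(sapply(fits, function(f) c(beta = c(f$beta), tau2 = f$tau2)))

# posterior-predictive-style draws: one simulated dataset per refit,
# incorporating uncertainty in both beta and tau2
ppd <- apply(draws, 1, function(p)
   rnorm(nrow(dat), mean = p["beta"], sd = sqrt(p["tau2"] + dat$vi)))
```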
I examined this with an lm() model:
library(arm)
library(RcppEigen)

res <- lm(mpg ~ cyl + hp + wt, data = mtcars)

# approach 1: draw parameters directly from their sampling distributions
sim1 <- sim(res, n.sims = 1000000)

# approach 2: simulate new outcomes and refit the model to each dataset
X <- model.matrix(res)
tmp <- simulate(res, nsim = 1000000)
sav <- lapply(tmp, function(y) fastLmPure(X, y)) # much faster than lm()
sim2 <- data.frame(t(sapply(sav, coef)))
sim2$sigma <- sapply(sav, function(x) x$s)

# compare the empirical CDFs of the two sets of draws
par(mfrow=c(5,1))
for (i in 1:4) {
   plot(ecdf(sim1@coef[,i]), col="red")
   plot(ecdf(sim2[,i]), add=TRUE, col="blue")
}
plot(ecdf(sim1@sigma), col="red")
plot(ecdf(sim2$sigma), add=TRUE, col="blue")

summary(sim1@sigma)
summary(sim2$sigma)
sd(sim1@sigma)
sd(sim2$sigma)
For the fixed effects, the distributions appear to be essentially the same. For sigma, they are not, but they are similar. I don't know the theory that would say under what conditions these two approaches converge to the same distributions (maybe at least asymptotically?). But based on this, I would be hesitant to embark on an attempt to do something like arm::sim() for meta-analytic models, when the machinery to do something analogous is in essence already available (and again, right now we don't even have the theory to implement something like arm::sim() for random-effects models anyway).