mitchelloharawild/distributional

More reliable mean() / variance() for dist_default

mjskay opened this issue · 2 comments

Currently for distributions without an implementation of mean(), mean.dist_default() takes the mean of 1000 draws from the distribution:

mean.dist_default <- function(x, ...){
mean(generate(x, times = 1000), na.rm = TRUE)
}

Naturally this causes some sampling error:

> mean(dist_wrap("lnorm", 0, 1))
[1] 1.76567
> mean(dist_wrap("lnorm", 0, 1))
[1] 1.587978
> mean(dist_wrap("lnorm", 0, 1))
[1] 1.646674
> mean(dist_wrap("lnorm", 0, 1))
[1] 1.706244

I feel like there should be a less noisy alternative. One such might be to take the mean of a large number of quantiles, a sort of quasi-monte carlo approach:

mean.dist_default <- function(x, ...){
  mean(quantile(x, stats::ppoints(1000)), na.rm = TRUE)
}

This yields decent results on a sanity check:

> mean(dist_wrap("lnorm", 0, 1))
[1] 1.645156     # should be exp(0.5)
> exp(0.5)
[1] 1.648721

The approximation might also be improved by using integrate() on continuous distributions, though I'm not sure how reliable this would be across arbitrary distributions:

mean.dist_default <- function(x, ...){
  if (is.double(support(x)[[1]])) {
    limits = quantile(x, c(0, 1))
    tryCatch(
      integrate(function(y) density(x, y) * y, limits[1], limits[2])$value,
      error = function(e) NA_real_
    )
  } else {
    mean(quantile(x, stats::ppoints(1000)), na.rm = TRUE)
  }
}

Yeilding a result quite close to the correct result:

> mean(dist_wrap("lnorm", 0, 1))
[1] 1.648721

And something similar could probably also be used to improve the calculations in variance.dist_default().

Added, thanks!

Lovely, thank you!