Improve accuracy of summary statistic methods for transformed distributions
mjskay opened this issue · 11 comments
I'm not sure of the exact problem, but it seems that the mean() of transformed distributions is not always correct. Here's an example of exp() applied to dist_normal(), which should yield a log-normal distribution but gives incorrect means:
> x = dist_normal(0:1, 1:2)
> mean(x)
[1] 0 1
> mean(exp(x)) # should be roughly exp(c(0.5, 3))
[1] 1.500000 8.154845
> exp(c(0.5, 3))
[1] 1.648721 20.085537
I'm not sure what the issue is. It does appear to work in some cases:
> mean(log(exp(x))) # should be roughly c(0, 1)
[1] -4.732208e-07 1.000000e+00
The Taylor series approximation used to compute the mean of log-Normal distributions is appropriate for small variances.
Some relevant discussion can be found here: https://robjhyndman.com/hyndsight/backtransforming/#comment-3043970592
In most forecasting applications this approximation results in a relatively small inaccuracy in the forecast distribution's mean.
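For concreteness, here is a minimal sketch of that second-order approximation (not necessarily the exact code used in the package): for a transformation f applied to a base distribution with mean m and variance v, the approximation is E[f(X)] ≈ f(m) + f''(m) v / 2. With f = exp this reproduces the values in the reprex above:
# Second-order Taylor approximation of E[f(X)], with f(m) and f''(m) supplied directly
taylor2_mean <- function(f_m, f2_m, v) f_m + f2_m * v / 2
# For f = exp, f(m) = f''(m) = exp(m)
taylor2_mean(exp(0), exp(0), 1)  # 1.5      vs exact exp(0.5) ~ 1.648721
taylor2_mean(exp(1), exp(1), 4)  # 8.154845 vs exact exp(3)   ~ 20.085537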
The log normal distribution is sufficiently common to justify a specific distribution, but I'm concerned about how this inaccuracy might impact non-log transformations.
One option could be to sample the mean, since the quantile() and generate() methods are accurate. But this would substantially degrade performance for a task commonly done at scale in forecasting.
@robjhyndman - do you have any suggestions for improving this approximation?
The Taylor series approximation up to 2nd derivatives has the nice property that it is very general provided the transformation is differentiable, but it is not necessarily very accurate. I can think of three possible approaches:
- Use a higher order Taylor series approximation. That would involve computing up to the 4th order derivative and moment (a rough sketch follows this list).
- Use better approximations (or exact results) for specific transformations such as log, exp, and power transformations; and Taylor series if the transformation is not recognized.
- Use an empirical approach by randomly sampling a few thousand observations from the transformed distribution.
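To make the first option concrete, here is a rough sketch, assuming a normal base distribution (so the third central moment is 0 and the fourth is 3v^2); this is illustrative only, not a proposed implementation:
# Fourth-order Taylor approximation of E[exp(X)] for X ~ N(m, v):
# E[f(X)] ~ f(m) + f''(m) v / 2 + f''''(m) (3 v^2) / 24, with f = exp
taylor4_exp_mean <- function(m, v) exp(m) * (1 + v / 2 + v^2 / 8)
taylor4_exp_mean(0, 1)  # 1.625  (2nd order: 1.5,  exact: exp(0.5) ~ 1.649)
taylor4_exp_mean(1, 4)  # 13.59  (2nd order: 8.15, exact: exp(3)   ~ 20.09)
The extra term helps, but for larger variances the truncation error is still substantial, which is an argument for the second and third options.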
Great, thanks for the suggestions.
I think a good start would be to add better methods for common transformations and fall back to Taylor series approximations otherwise.
I think sampling for the mean will be too slow for most applications, but would be useful to provide as an option for users.
OK. To start, a more accurate version for exp() has mean exp(m + v/2) and variance [exp(v)-1]exp(2m+v) where the original distribution has mean m and variance v. But if we are going to allow special cases, then perhaps identify when exp is applied to a normal distribution, and save it as logNormal rather than t(N).
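As a quick sanity check on those closed forms (base R only, not package code), they can be compared against a Monte Carlo estimate from rlnorm():
m <- 0; v <- 1
exp(m + v / 2)                 # exact mean: 1.648721
(exp(v) - 1) * exp(2 * m + v)  # exact variance: 4.670774
draws <- rlnorm(1e6, meanlog = m, sdlog = sqrt(v))
mean(draws); var(draws)        # Monte Carlo estimates, close to the above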
I've added the log-normal distribution, and a shortcut between it and the normal with exp() and log().
library(distributional)
exp(dist_normal())
#> <distribution[1]>
#> [1] lN(0, 1)
log(dist_lognormal())
#> <distribution[1]>
#> [1] N(0, 1)
Created on 2021-11-12 by the reprex package (v2.0.0)
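If I'm reading the shortcut correctly, the original example should now pick up the closed-form log-normal mean rather than the Taylor approximation. A quick check (expected values, assuming mean() of dist_lognormal() uses the closed form):
x <- dist_normal(0:1, 1:2)
mean(exp(x))  # expected to be roughly exp(c(0.5, 3)) = c(1.648721, 20.085537)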
Following up on the discussion started in #102
Currently these are calculated with Taylor series approximations, which are faster than numerical integration and I think work fine in most cases. Maybe numerical integration would be fast enough with a higher tolerance, while still giving comparable accuracy but with improved consistency.
Yes, I did notice they use Taylor series approximations. But I'm not sure that they do work fine in most cases. The examples I posted use fairly typical parameter ranges, and the error is huge. When you say they work fine in most cases, do you have any results on the expected error under different transformations and distributions?
The way it is now, I personally wouldn't use this functionality. Maybe the simplest fix is to have an argument to mean() and variance() that allows the user to choose which method (Taylor series vs. numerical integration) to use. That way anyone can decide what the acceptable speed-accuracy trade-off is for them.
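Purely as an illustration, a hypothetical interface for that could look like the sketch below; the function name, argument names and defaults are made up here (not an existing distributional API), and the integration and sampling branches mirror the approaches discussed in this thread:
# Hypothetical wrapper, not part of distributional; names and defaults are made up
library(distributional)
mean_by_method <- function(x, method = c("taylor", "integrate", "sample"),
                           rel.tol = 1e-3, times = 1e4) {
  method <- match.arg(method)
  switch(method,
    taylor    = mean(x),  # the current fast approximation
    integrate = integrate(function(at) unlist(density(x, at)) * at,
                          -Inf, Inf, rel.tol = rel.tol)$value,
    sample    = mean(unlist(generate(x, times)))
  )
}
mean_by_method(exp(dist_wrap('norm', mean = 2, sd = 1.5)), "integrate")  # ~22.76, the closed-form log-normal mean
As written this only handles a single univariate distribution, but it shows the shape of the API.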
Btw, the speed issue is greatly improved by now having the symbolic derivatives. For example, on the master branch:
# Mean via numerical integration of at * density(at) over the real line
mean_new <- function(x, rel.tol = .Machine$double.eps^0.5, ...) {
  fun <- function(at, dist) {
    unlist(density(dist, at)) * at
  }
  integrate(fun, -Inf, Inf, dist = x, rel.tol = rel.tol)$value
}
trans_lnorm <- exp(dist_wrap('norm', mean = 2, sd = 1.5))
microbenchmark::microbenchmark(mean(trans_lnorm),
                               mean_new(trans_lnorm))
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> mean(trans_lnorm) 514.468 542.4095 581.0446 575.1685 601.8595 902.205 100
#> mean_new(trans_lnorm) 103720.283 105202.5765 109171.2150 106897.0450 109197.6370 175162.168 100
And on the symbolic derivatives branch:
Unit: microseconds
expr min lq mean median
mean(trans_lnorm) 540.093 553.828 606.679 568.629
mean_new(trans_lnorm) 10147.090 10281.221 10995.819 10466.890
and using a loose rel.tol in integrate, such as 0.1, is only about 7 times slower than the Taylor series, but produces ~2000 times smaller error:
                                           type     mean microseconds
1                                    analytical 22.75990      0.58548
2                                 taylor series 15.70174    627.20898
3 integrate(rel.tol = .Machine$double.eps^0.5) 22.75990  11238.07048
4                    integrate (rel.tol = 1e-5) 22.75990   8340.58449
5                    integrate (rel.tol = 1e-3) 22.75982   6154.55551
6                    integrate (rel.tol = 1e-1) 22.75512   4015.56993
I'm open to changing it; there are many options here. For example, mean.dist_default() uses numerical integration. Originally it took the average from a sufficiently large sample with generate(). The current behaviour is for both speed and consistency with the forecast/fable packages, but I agree that accuracy is a more important consideration (@robjhyndman can comment on the accuracy of the TS approximation under different circumstances).
I also experimented with rel.tol, and I think that something high like 0.1 by default is ok; of course this should be parameterised.
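For example, with a loose tolerance, mean_new() from above gets very close to the exact value (exp(2 + 1.5^2 / 2) ~ 22.7599) on the same transformed distribution:
trans_lnorm <- exp(dist_wrap('norm', mean = 2, sd = 1.5))
mean_new(trans_lnorm, rel.tol = 0.1)  # ~22.755, vs 15.70 from the Taylor approximation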
Here are some other examples of the Taylor series approach with different transformations:
library(distributional)
mu = 15
sd = 2
x <- dist_normal(mu, sd)^2
vapply(generate(x, 1000000), mean, numeric(1L))
#> [1] 229.0715
mean(x)
#> [1] 229
x <- sqrt(dist_normal(mu, sd))
vapply(generate(x, 1000000), mean, numeric(1L))
#> [1] 3.864325
mean(x)
#> [1] 3.864377
bc <- scales::boxcox_trans(0.5)
x <- dist_transformed(dist_normal(mu, sd), bc$transform, bc$inverse)
mean(x)
#> [1] 5.728753
vapply(generate(x, 1000000), mean, numeric(1L))
#> [1] 5.728844
Created on 2024-04-19 with reprex v2.0.2
I guess the issue is much more pronounced when the inverse function is log, because the Taylor series approximation for log has a very small radius of convergence, while the transformations you posted involve power and exp transformations that are well approximated by Taylor series.
Yes, that's correct. This is why we now have exp(dist_normal()) -> dist_lognormal(), to catch this common case.