Improve accuracy of summary statistic methods for transformed distributions
mjskay opened this issue · 11 comments
I'm not sure of the exact problem, but it seems that the mean() of transformed distributions is not always correct. Here's an example of exp() applied to dist_normal(), which should yield a log-normal distribution but gives incorrect means:
> x = dist_normal(0:1, 1:2)
> mean(x)
[1] 0 1
> mean(exp(x)) # should be roughly exp(c(0.5, 3))
[1] 1.500000 8.154845
> exp(c(0.5, 3))
[1] 1.648721 20.085537
I'm not sure what the issue is. It does appear to work in some cases:
> mean(log(exp(x))) # should be roughly c(0, 1)
[1] -4.732208e-07 1.000000e+00
The Taylor series approximation used to compute the mean of log-Normal distributions is appropriate for small variances.
Some relevant discussion can be found here: https://robjhyndman.com/hyndsight/backtransforming/#comment-3043970592
In most forecasting applications this approximation results in a relatively small inaccuracy in the forecast distribution's mean.
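For concreteness, here is a minimal sketch of that second-order approximation (not necessarily the exact code used in the package): for a transformation f applied to a base distribution with mean m and variance v, the approximation is E[f(X)] ≈ f(m) + f''(m) v / 2. With f = exp this reproduces the values in the reprex above:
# Second-order Taylor approximation of E[f(X)], with f(m) and f''(m) supplied directly
taylor2_mean <- function(f_m, f2_m, v) f_m + f2_m * v / 2
# For f = exp, f(m) = f''(m) = exp(m)
taylor2_mean(exp(0), exp(0), 1)  # 1.5      vs exact exp(0.5) ~ 1.648721
taylor2_mean(exp(1), exp(1), 4)  # 8.154845 vs exact exp(3)   ~ 20.085537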
The log normal distribution is sufficiently common to justify a specific distribution, but I'm concerned about how this inaccuracy might impact non-log transformations.
One option could be to sample the mean, since the quantile() and generate() methods are accurate. But this would substantially degrade performance for a task commonly done at scale in forecasting.
@robjhyndman - do you have any suggestions for improving this approximation?
The Taylor series approximation up to 2nd derivatives has the nice property that it is very general provided the transformation is differentiable, but it is not necessarily very accurate. I can think of three possible approaches:
- Use a higher order Taylor series approximation. That would involve computing up to the 4th order derivative and moment (a rough sketch follows this list).
- Use better approximations (or exact results) for specific transformations such as log, exp, and power transformations; and Taylor series if the transformation is not recognized.
- Use an empirical approach by randomly sampling a few thousand observations from the transformed distribution.
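To make the first option concrete, here is a rough sketch, assuming a normal base distribution (so the third central moment is 0 and the fourth is 3v^2); this is illustrative only, not a proposed implementation:
# Fourth-order Taylor approximation of E[exp(X)] for X ~ N(m, v):
# E[f(X)] ~ f(m) + f''(m) v / 2 + f''''(m) (3 v^2) / 24, with f = exp
taylor4_exp_mean <- function(m, v) exp(m) * (1 + v / 2 + v^2 / 8)
taylor4_exp_mean(0, 1)  # 1.625  (2nd order: 1.5,  exact: exp(0.5) ~ 1.649)
taylor4_exp_mean(1, 4)  # 13.59  (2nd order: 8.15, exact: exp(3)   ~ 20.09)
The extra term helps, but for larger variances the truncation error is still substantial, which is an argument for the second and third options.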
Great, thanks for the suggestions.
I think a good start would be to add better methods for common transformations and fall back to Taylor series approximations otherwise.
I think sampling for the mean will be too slow for most applications, but would be useful to provide as an option for users.
OK. To start, a more accurate version for exp() has mean exp(m + v/2) and variance [exp(v)-1]exp(2m+v) where the original distribution has mean m and variance v. But if we are going to allow special cases, then perhaps identify when exp is applied to a normal distribution, and save it as logNormal rather than t(N).
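As a quick sanity check on those closed forms (base R only, not package code), they can be compared against a Monte Carlo estimate from rlnorm():
m <- 0; v <- 1
exp(m + v / 2)                 # exact mean: 1.648721
(exp(v) - 1) * exp(2 * m + v)  # exact variance: 4.670774
draws <- rlnorm(1e6, meanlog = m, sdlog = sqrt(v))
mean(draws); var(draws)        # Monte Carlo estimates, close to the above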
I've added the log-normal distribution, and a shortcut between it and the normal with exp() and log().
library(distributional)
exp(dist_normal())
#> <distribution[1]>
#> [1] lN(0, 1)
log(dist_lognormal())
#> <distribution[1]>
#> [1] N(0, 1)
Created on 2021-11-12 by the reprex package (v2.0.0)
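If I'm reading the shortcut correctly, the original example should now pick up the closed-form log-normal mean rather than the Taylor approximation. A quick check (expected values, assuming mean() of dist_lognormal() uses the closed form):
x <- dist_normal(0:1, 1:2)
mean(exp(x))  # expected to be roughly exp(c(0.5, 3)) = c(1.648721, 20.085537)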
Following up on the discussion started in #102
Currently these are calculated with Taylor series approximations, which are faster than numerical integration and I think work fine in most cases. Maybe numerical integration would be fast enough with a higher tolerance, while still giving comparable accuracy but with improved consistency.
Yes, I did notice they use Taylor series approximations. But I'm not sure that they do work fine in most cases. The examples I posted use fairly typical parameter ranges, and the error is huge. When you say they work fine in most cases, do you have any results on the expected error under different transformations and distributions?
The way it is now, I personally wouldn't use this functionality. Maybe the simplest fix is to have an argument to mean() and variance() that allows the user to choose which method (Taylor series vs. numerical integration) to use. That way anyone can decide what the acceptable speed-accuracy trade-off is for them.
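Purely as an illustration, a hypothetical interface for that could look like the sketch below; the function name, argument names and defaults are made up here (not an existing distributional API), and the integration and sampling branches mirror the approaches discussed in this thread:
# Hypothetical wrapper, not part of distributional; names and defaults are made up
library(distributional)
mean_by_method <- function(x, method = c("taylor", "integrate", "sample"),
                           rel.tol = 1e-3, times = 1e4) {
  method <- match.arg(method)
  switch(method,
    taylor    = mean(x),  # the current fast approximation
    integrate = integrate(function(at) unlist(density(x, at)) * at,
                          -Inf, Inf, rel.tol = rel.tol)$value,
    sample    = mean(unlist(generate(x, times)))
  )
}
mean_by_method(exp(dist_wrap('norm', mean = 2, sd = 1.5)), "integrate")  # ~22.76, the closed-form log-normal mean
As written this only handles a single univariate distribution, but it shows the shape of the API.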
Btw, the speed issue is greatly improved by now having the symbolic derivatives. For example, on the master branch:
# Mean via numerical integration of at * density(at) over the real line
mean_new <- function(x, rel.tol = .Machine$double.eps^0.5, ...) {
  fun <- function(at, dist) {
    unlist(density(dist, at)) * at
  }
  integrate(fun, -Inf, Inf, dist = x, rel.tol = rel.tol)$value
}
trans_lnorm <- exp(dist_wrap('norm', mean = 2, sd = 1.5))
microbenchmark::microbenchmark(mean(trans_lnorm),
                               mean_new(trans_lnorm))
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> mean(trans_lnorm) 514.468 542.4095 581.0446 575.1685 601.8595 902.205 100
#> mean_new(trans_lnorm) 103720.283 105202.5765 109171.2150 106897.0450 109197.6370 175162.168 100
And on the symbolic derivatives branch:
Unit: microseconds
expr min lq mean median
mean(trans_lnorm) 540.093 553.828 606.679 568.629
mean_new(trans_lnorm) 10147.090 10281.221 10995.819 10466.890
and using a loose rel.tol in integrate, such as 0.1, is only about 7 times slower than the Taylor series, but produces ~2000 times smaller error:
                                           type     mean microseconds
1                                    analytical 22.75990      0.58548
2                                 taylor series 15.70174    627.20898
3 integrate(rel.tol = .Machine$double.eps^0.5) 22.75990  11238.07048
4                    integrate (rel.tol = 1e-5) 22.75990   8340.58449
5                    integrate (rel.tol = 1e-3) 22.75982   6154.55551
6                    integrate (rel.tol = 1e-1) 22.75512   4015.56993
I'm open to changing it; there are many options here. For example, mean.dist_default() uses numerical integration. Originally it took the average from a sufficiently large sample with generate(). The current behaviour is for both speed and consistency with the forecast/fable packages, but I agree that accuracy is a more important consideration (@robjhyndman can comment on the accuracy of the TS approximation under different circumstances).
I also experimented with rel.tol, and I think that something high like 0.1 by default is ok; of course this should be parameterised.
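For example, with a loose tolerance, mean_new() from above gets very close to the exact value (exp(2 + 1.5^2 / 2) ~ 22.7599) on the same transformed distribution:
trans_lnorm <- exp(dist_wrap('norm', mean = 2, sd = 1.5))
mean_new(trans_lnorm, rel.tol = 0.1)  # ~22.755, vs 15.70 from the Taylor approximation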
Here are some other examples of the Taylor series approach with different transformations:
library(distributional)
mu = 15
sd = 2
x <- dist_normal(mu, sd)^2
vapply(generate(x, 1000000), mean, numeric(1L))
#> [1] 229.0715
mean(x)
#> [1] 229
x <- sqrt(dist_normal(mu, sd))
vapply(generate(x, 1000000), mean, numeric(1L))
#> [1] 3.864325
mean(x)
#> [1] 3.864377
bc <- scales::boxcox_trans(0.5)
x <- dist_transformed(dist_normal(mu, sd), bc$transform, bc$inverse)
mean(x)
#> [1] 5.728753
vapply(generate(x, 1000000), mean, numeric(1L))
#> [1] 5.728844
Created on 2024-04-19 with reprex v2.0.2
I guess the issue is much more pronounced when the inverse function is log, because the Taylor series approximation for log has a very small radius of convergence, while the transformations you posted involve power and exp transformations that are well approximated by Taylor series.
Yes, that's correct. This is why we now have exp(dist_normal()) -> dist_lognormal(), to catch this common case.