mitchelloharawild/distributional

Inaccurate HDRs

robjhyndman opened this issue · 5 comments

library(distributional)
hdr(dist_normal())
#> <hdr[1]>
#> [1] [-1.95194, 1.95194]95
qnorm(0.975)
#> [1] 1.959964

Created on 2024-08-12 with reprex v2.1.1

This is largely based on the grid size of the quantiles (same as is done in hdrcde).

Gradient descent based methods would produce more accurate solutions starting from suitable hdrs from the grid method.

library(distributional)
hdr(dist_normal(), n = 1e6)
#> <hdr[1]>
#> [1] [-1.959956, 1.959956]95
qnorm(0.975)
#> [1] 1.959964

Created on 2024-08-12 with reprex v2.1.0

Maybe set n a little higher. quantiles should be relatively quick to compute for most distributions. Setting n=1e5 is still relatively quick for dist_normal() I think 4 accurate significant figures is a reasonable objective.

What do you think about root finding algorithms for this, starting from a suitable location chosen from a course grid?

Maybe. Perhaps try a speed comparison on dist_normal(). I've tested the effect of grid size on accuracy, and it has some really weird numerical behaviour.

library(distributional)
library(ggplot2)
df <- data.frame(
  n = exp(seq(log(50), log(5e3), length = 1e3)),
  upper = NA_real_
)
for (i in seq(NROW(df))) {
  df$upper[i] <- unlist(hdr(dist_normal(), n = df$n[i]))["upper"]
}
df |>
  ggplot(aes(x = n, y = upper)) +
  geom_line() +
  geom_hline(yintercept = qnorm(0.975), col = "red") +
  scale_x_log10()

Created on 2024-09-17 with reprex v2.1.1

I started writing a root finding function, and then realised why I didn't do it this way in the first place. Every time you do a calculation of the HDR coverage for a generic continuous distribution, you need to compute the density at a large number of observations, because the algorithm uses Monte Carlo integration. So with root finding, it would require far more computations than having a fine grid.

Of course, we could have hdr.dist_xxx() functions for specific distributions that would be much faster. For symmetric distributions the computation is easily done from the cdf. So perhaps we could at least add hdr.dist_normal() and similar for other symmetric distributions, and change the value of n in hdr.dist_default() to about 5000?

I can do a PR along these lines if you want.