`geom_density()` for bounded data

Question

`geom_density()` for bounded data

Closed this issue 2 years ago · 8 comments

Currently geom_density() directly uses density() to compute kernel density estimation. Due to its nature, it has "extending property": output density can imply that values outside the range of input data are possible (with default "gaussian" kernel). This is a realistic practical setup, but there are cases when this is not true and data is bounded (for example, when only positive values are possible). It can be a good idea to support kernel density estimation on this type of bounded data with new bounds argument of geom_density().

In my opinion, one of the methods most easiest to implement, understand, and teach, is "reflection" method. There is one possible description on this page. Basically, boundary correction is done by doing "standard" kernel density estimation first, and then "reflecting" tails outside of desired interval to be inside. Densities inside and outside of desired interval are added together in "symmetric fashion": d(x) = d_f(x) + d_f(l - (x-l)) + d_f(r + (r-x)), where d_f is density of input, d is density of output, l and r are left and right edges of desired interval.

I made some quick and dirty changes to ggplot2 for demonstration. stat_density() gets bounds argument with default value of c(-Inf, Inf). Here are some examples of proposed functionality:

library(tibble)

set.seed(101)

ggplot(tibble(x = runif(100)), aes(x)) +
  geom_density() +
  geom_density(bounds = c(0, 1), color = "blue", ) +
  stat_function(data = tibble(x = c(0, 1)), fun = dunif, color = "red")

ggplot(tibble(x = rexp(100)), aes(x)) +
  geom_density() +
  geom_density(bounds = c(0, Inf), color = "blue") +
  stat_function(data = tibble(x = c(0, 5)), fun = dexp, color = "red")

Answer 1 · 2019-07-03T08:50:54.000Z

Here are more appropriate plots with plot limits completely representing current behavior:

library(ggplot2)

set.seed(101)

ggplot(data.frame(x = runif(100)), aes(x)) +
  geom_density() +
  geom_density(color = "blue", bounds = c(0, 1)) +
  stat_function(data = data.frame(x = c(0, 1)), fun = dunif, color = "red")

ggplot(data.frame(x = rexp(100)), aes(x)) +
  geom_density() +
  geom_density(color = "blue", bounds = c(0, Inf)) +
  stat_function(data = data.frame(x = c(0, 5)), fun = dexp, color = "red")

^{Created on 2019-07-03 by the reprex package (v0.3.0)}

Answer 2 · 2020-04-30T18:04:20.000Z

Sorry, a naive question: if geom_density is using stats::density(), why not simply allow to passing arguments from and to in stats::density()?

Answer 3 · 2020-04-30T18:24:29.000Z

Not a naive question at all!

Arguments from and to represent "the left and right-most points of the grid at which the density is to be estimated;...". They don't alter the density estimation process, while this suggestion modifies output so that it describes density better in case there are known constraints on data generating process.

Here is an example. Black line demonstrates "raw" estimation, that goes outside of [0; 1] segment (which, we know should never happen). Blue line is a result with from = 0 and to = 1, and it just clips the portion of original curve (and as result is not a density estimate itself, because total square is less than 1).

library(ggplot2)

set.seed(42)
x <- runif(100)

den_default <- as.data.frame(density(x)[c("x", "y")])
den_cut <- as.data.frame(density(x, from = 0, to = 1)[c("x", "y")])

ggplot(mapping = aes(x, y)) +
  geom_line(data = den_cut, color = "blue", size = 2) +
  geom_line(data = den_default, color = "black")

^{Created on 2020-04-30 by the reprex package (v0.3.0)}

Answer 4 · 2020-04-30T20:38:36.000Z

ok, thanks for the explanation!!

Answer 5 · 2020-05-17T19:16:35.000Z

Hello @echasnovski,

This proposed function looks really cool and useful. If you don't mind me asking, could you share your code that created it? Thanks.

Edit: Fantastic. Thank you so much.

Answer 6 · 2020-05-19T16:46:23.000Z

Thanks, @spspitze.

I decided to make PR (#4013) for this feature to make it more publicly accessible.
Currently, to actually use it, you have two choices:

Install version of 'ggplot2' from my fork (which you might "overwrite" when updating 'ggplot2' later on) via:

# install.packages("remotes")
remotes::install_github("echasnovski/ggplot2", ref="geom_density-bounds")

Somehow use the code from my fork directly. All relevant updates are in this file.

Answer 7 · 2020-11-25T20:51:36.000Z

Hi @echasnovski - thanks very much for this: I used your fork to good effect for a plot I was making, and also this issue made me aware of how the default clipped geom_densitys produced might not have area=1, which I find troubling.

Reading further into the subject, and for anyone else interested, I found the evmix package, and trying the examples for bckden it looks to be very good. Also their paper gives some more descriptions of the different options for dealing with boundary correction.

I suspect ultimately it will be necessary to do the computations for this outside of ggplot2 (perhaps one can use these pre-calculated densities with geom_density(stat="identity" ?), though of course much more convenient to have a bounds argument as you demonstrate!

Answer 8 · 2023-01-31T19:57:48.000Z

This issue was fixed by #4013