LKremer/ggpointdensity

Points at Infinity are Missing

const-ae opened this issue · 10 comments

Hi Lukas,

I have a feature suggestion: I just noticed that geom_pointdensity() does not plot points that have an x or y value of ±Inf, unlike geom_point(), which places them all at the border of the plot:

library(ggplot2)
df <- data.frame(x = rnorm(200),
                 y = c(rnorm(100), rep(-Inf, 100)))

ggplot(df, aes(x=x, y = y)) +
  geom_point()

ggplot(df, aes(x=x, y = y)) +
  ggpointdensity::geom_pointdensity()
#> Warning: Removed 100 rows containing non-finite values (stat_pointdensity).

Created on 2020-01-27 by the reprex package (v0.3.0)

This feature would be quite useful because it would show how dense the overplotting at the bottom of the plot is.

Best, Constantin

Hi Constantin,

Interesting find! I guess from a mathematical point of view it's impossible to assign a density to those points 😅
Do you want to calculate the density in 1D instead, or what do you suggest?

Yes, I think the most reasonable behaviour would be to calculate the density as if the points with Inf were at .Machine$double.xmax ≈ 1.79e+308.
That way the density for the finite points wouldn't change, but we could still see the density at the border of the plot.

@lysogeny added the points at infinity. They lack a density estimate for now.
@const-ae I have to test whether your suggestion works. I'm worried that replacing infinite values with huge numbers could cause, e.g., a float overflow in the C code, because these numbers will be squared in the density calculation.
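
The overflow concern can be illustrated with a minimal C sketch, assuming IEEE 754 double arithmetic. The function names `squared_distance` and `replaced_point_overflows` are hypothetical and are not taken from the package's C code; they only mimic the kind of squared term a neighbor count would compute.

```c
#include <float.h>
#include <math.h>

/* Hypothetical helper (not the package's code): squared Euclidean
   distance between two 2D points. */
double squared_distance(double x1, double y1, double x2, double y2) {
    double dx = x1 - x2;
    double dy = y1 - y2;
    return dx * dx + dy * dy;
}

/* Returns 1 if the squared distance between a finite point at the origin
   and a point whose y = -Inf was replaced by -DBL_MAX overflows to
   infinity, 0 otherwise. */
int replaced_point_overflows(void) {
    double d2 = squared_distance(0.0, 0.0, 0.0, -DBL_MAX);
    return isinf(d2) ? 1 : 0;
}
```

Under that assumption, squaring a difference of magnitude DBL_MAX does overflow to infinity. Whether that matters depends on how the real code uses the value: an infinite squared distance simply compares as larger than any finite radius.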

Hey Lukas,

I just came across the problem again and gave it a try myself, building on lysogeny's PR.

library(ggplot2)

df <- data.frame(x = rnorm(20001),
                 y = c(rnorm(20001 - 100), rep(-Inf, 100)))

ggplot(df, aes(x=x, y = y)) +
  geom_point()

ggplot(df, aes(x=x, y = y)) +
  ggpointdensity::geom_pointdensity(method = "kde2d") 

ggplot(df, aes(x=x, y = y)) +
  ggpointdensity::geom_pointdensity(method = "default")

Created on 2020-06-15 by the reprex package (v0.3.0)

I modified the C count_neighbors() function to treat the distance along an axis on which a point is infinite as 0. I think this is reasonable, because that is how the points appear on the plot.

I didn't, however, modify the kde2d density estimation, which means that the results of the two methods can differ, as you can see in the reprex.

Oh, I realize that I might have made a mistake. Of course, I only want to treat the distance as 0 if both points are at an infinite position.
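
The rule described above could be sketched in C roughly as follows. This is an illustration of the stated intent, not the code from the PR: `axis_diff` and `count_neighbors_sketch` are hypothetical names, the same-sign check for infinite values is an assumption, and whether the real function excludes the point itself is not known from this thread.

```c
#include <math.h>

/* Per-axis difference under the rule above (assumed, not the PR's code):
   two coordinates that are both infinite with the same sign sit on the
   same plot border, so their distance on that axis is treated as 0.
   If only one is infinite, a - b is +/-Inf and the pair can never fall
   within a finite radius. */
double axis_diff(double a, double b) {
    if (isinf(a) && isinf(b)) {
        return (a == b) ? 0.0 : INFINITY;
    }
    return a - b;
}

/* Hypothetical neighbor count: number of points j != i whose squared
   distance to point i is at most r2 (the squared radius). */
int count_neighbors_sketch(const double *x, const double *y, int n,
                           int i, double r2) {
    int count = 0;
    for (int j = 0; j < n; j++) {
        if (j == i) continue;
        double dx = axis_diff(x[i], x[j]);
        double dy = axis_diff(y[i], y[j]);
        if (dx * dx + dy * dy <= r2) count++;
    }
    return count;
}
```

With this sketch, points at y = -Inf are neighbors of each other whenever their x values are close, while a finite point and a point at infinity are never neighbors.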

Okay, I updated the PR (#14).

The plots now look better as well:

library(ggplot2)

df <- data.frame(x = rnorm(20001),
                 y = c(rnorm(20001 - 100), rep(-Inf, 100)))

ggplot(df, aes(x=x, y = y)) +
  geom_point()

ggplot(df, aes(x=x, y = y)) +
  ggpointdensity::geom_pointdensity(method = "kde2d") 

ggplot(df, aes(x=x, y = y)) +
  ggpointdensity::geom_pointdensity(method = "default")

Created on 2020-06-15 by the reprex package (v0.3.0)

Thanks for fixing this, @const-ae. The only issue is that method="kde2d" and method="default" now handle infinite values differently (i.e. kde2d doesn't calculate the density for those values at all). At some point I want to fix this inconsistency, but for now I'm closing this.

Thanks for merging the PR. I see that it is not ideal that the methods differ. I just took a look at how kde2d actually works, and the implementation (https://github.com/cran/MASS/blob/c2ff394b1c45d58ebe72811699c683a3ca59e097/R/kde2d.R) doesn't seem super complicated, so I guess it might be possible to write your own version of kde2d that can handle infinity :)

I've installed the package from GitHub per the instructions in the README and updated the suggested packages; however, I still get:

Warning message: Removed 49 rows containing non-finite values (stat_pointdensity).

Am I missing something?

P.S. Great package, BTW.

To follow up on my previous post, "Yes, I'm missing something." Restarting R fixed my problem.