Minimum density can be zero
It seems that with large, skewed datasets the density estimate for a point can be exactly zero. This doesn't make sense to me, since every plotted point is itself an observation and should contribute to the density at its own location. It also presents a technical issue if I wanted to, say, log-transform the color scale.
library(ggplot2)
library(ggpointdensity)
df <- data.frame(x = c(rep(0, 100000), rnorm(100000)),
                 y = c(rep(0, 100000), rnorm(100000)))
p <- ggplot(df, aes(x = x, y = y)) +
  geom_pointdensity()
p
#> geom_pointdensity using method='kde2d' due to large number of points (>20k)
p + scale_color_continuous(trans = "log10")
#> geom_pointdensity using method='kde2d' due to large number of points (>20k)
#> Warning: Transformation introduced infinite values in discrete y-axis
Created on 2024-02-08 with reprex v2.0.2
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.1 (2023-06-16)
#>  os       macOS Sonoma 14.2.1
#>  system   x86_64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Phoenix
#>  date     2024-02-08
#>  pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package        * version    date (UTC) lib source
#>  cli              3.6.2      2023-12-11 [1] CRAN (R 4.3.0)
#>  colorspace       2.1-0      2023-01-23 [1] CRAN (R 4.3.0)
#>  curl             5.2.0      2023-12-08 [1] CRAN (R 4.3.0)
#>  digest           0.6.34     2024-01-11 [1] CRAN (R 4.3.0)
#>  dplyr            1.1.4      2023-11-17 [1] CRAN (R 4.3.0)
#>  evaluate         0.23       2023-11-01 [1] CRAN (R 4.3.0)
#>  fansi            1.0.6      2023-12-08 [1] CRAN (R 4.3.0)
#>  farver           2.1.1      2022-07-06 [1] CRAN (R 4.3.0)
#>  fastmap          1.1.1      2023-02-24 [1] CRAN (R 4.3.0)
#>  fs               1.6.3      2023-07-20 [1] CRAN (R 4.3.0)
#>  generics         0.1.3      2022-07-05 [1] CRAN (R 4.3.0)
#>  ggplot2        * 3.4.4      2023-10-12 [1] CRAN (R 4.3.0)
#>  ggpointdensity * 0.1.0      2024-02-01 [1] Github (LKremer/ggpointdensity@02f3ab2)
#>  glue             1.7.0      2024-01-09 [1] CRAN (R 4.3.0)
#>  gtable           0.3.4      2023-08-21 [1] CRAN (R 4.3.0)
#>  highr            0.10       2022-12-22 [1] CRAN (R 4.3.0)
#>  htmltools        0.5.7      2023-11-03 [1] CRAN (R 4.3.0)
#>  knitr            1.45       2023-10-30 [1] CRAN (R 4.3.0)
#>  labeling         0.4.3      2023-08-29 [1] CRAN (R 4.3.0)
#>  lifecycle        1.0.4      2023-11-07 [1] CRAN (R 4.3.0)
#>  magrittr         2.0.3      2022-03-30 [1] CRAN (R 4.3.0)
#>  MASS             7.3-60.0.1 2024-01-13 [1] CRAN (R 4.3.0)
#>  munsell          0.5.0      2018-06-12 [1] CRAN (R 4.3.0)
#>  pillar           1.9.0      2023-03-22 [1] CRAN (R 4.3.0)
#>  pkgconfig        2.0.3      2019-09-22 [1] CRAN (R 4.3.0)
#>  purrr            1.0.2      2023-08-10 [1] CRAN (R 4.3.0)
#>  R.cache          0.16.0     2022-07-21 [1] CRAN (R 4.3.0)
#>  R.methodsS3      1.8.2      2022-06-13 [1] CRAN (R 4.3.0)
#>  R.oo             1.25.0     2022-06-12 [1] CRAN (R 4.3.0)
#>  R.utils          2.12.2     2022-11-11 [1] CRAN (R 4.3.0)
#>  R6               2.5.1      2021-08-19 [1] CRAN (R 4.3.0)
#>  reprex           2.0.2      2022-08-17 [1] CRAN (R 4.3.0)
#>  rlang            1.1.3      2024-01-10 [1] CRAN (R 4.3.0)
#>  rmarkdown        2.25       2023-09-18 [1] CRAN (R 4.3.0)
#>  rstudioapi       0.15.0     2023-07-07 [1] CRAN (R 4.3.0)
#>  scales           1.3.0      2023-11-28 [1] CRAN (R 4.3.1)
#>  sessioninfo      1.2.2      2021-12-06 [1] CRAN (R 4.3.0)
#>  styler           1.10.2     2023-08-29 [1] CRAN (R 4.3.0)
#>  tibble           3.2.1      2023-03-20 [1] CRAN (R 4.3.0)
#>  tidyselect       1.2.0      2022-10-10 [1] CRAN (R 4.3.0)
#>  utf8             1.2.4      2023-10-22 [1] CRAN (R 4.3.0)
#>  vctrs            0.6.5      2023-12-01 [1] CRAN (R 4.3.0)
#>  withr            3.0.0      2024-01-16 [1] CRAN (R 4.3.0)
#>  xfun             0.41       2023-11-01 [1] CRAN (R 4.3.0)
#>  xml2             1.3.5      2023-07-06 [1] CRAN (R 4.3.0)
#>  yaml             2.3.8      2023-12-11 [1] CRAN (R 4.3.0)
#>
#> [1] /Users/ericscott/Library/R/x86_64/4.3/library
#> [2] /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
This may be related to the default bandwidth estimator used by MASS::kde2d(). If I supply my own values of h using a different bandwidth estimator (e.g. bw.nrd0()), I don't have this issue or the issue with bandwidth == 0 (#21). Even the documentation says that bw.nrd() "has remained the default for historical and compatibility reasons, rather than as a general recommendation". Perhaps it would be better for stat_pointdensity() to calculate its own bandwidth rather than relying on the defaults for kde2d().
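
For anyone who wants to see the difference between the two estimators outside of ggplot2, here is a minimal sketch that calls MASS::kde2d() directly. The data are an assumption on my part: they are deliberately more skewed than the reprex (75% of the values are exactly 0) so that the quartiles coincide and the difference is deterministic. The names h_default and h_nrd0 are just illustrative.

```r
library(MASS)

set.seed(42)

# Hypothetical data, more skewed than the reprex: 75% of the values are
# exactly 0, so the 25% and 75% quantiles coincide and the IQR is exactly 0.
x <- c(rep(0, 3000), rnorm(1000))
y <- c(rep(0, 3000), rnorm(1000))

# Estimator kde2d() falls back on when h is not supplied: it is built on
# min(sd, IQR / 1.34) and has no fallback when the IQR is 0.
h_default <- c(bandwidth.nrd(x), bandwidth.nrd(y))

# bw.nrd0() falls back to the standard deviation when the quartiles coincide.
h_nrd0 <- c(bw.nrd0(x), bw.nrd0(y))

print(h_default)
print(h_nrd0)

# Supplying h explicitly bypasses the default estimator.
# Note: kde2d() divides h by 4 internally, so these values are on a
# different scale than bandwidth.nrd() and are illustrative, not tuned.
dens <- kde2d(x, y, h = h_nrd0, n = 50)
print(range(dens$z))
```

How h would be passed through stat_pointdensity() depends on the ggpointdensity interface, which this sketch does not assume anything about; it only shows that the choice of estimator decides whether the bandwidth collapses for this kind of data.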