surprising singleton intervals if `max(x) == max(brks)` and `close_end = FALSE`
dpprdan opened this issue Β· 4 comments
chop()
/tab()
creates (surprising) singleton intervals if max(x) == max(brks)
.
library(santoku)
x <- 1:10
brks <- c(1, 5, 10)
Note that the brks
contain all of x
(or rather min(x) == min(brks)
and max(x) == max(brks)
), so there is no need to extend
.
tab(x, brks, close_end = FALSE)
#> [1, 5) [5, 10) {10}
#> 4 5 1
The creation of a singleton interval is surprising, IMO.
I think this ought to rather be (desired outcome)
#> [1, 5) [5, 10) <NA>
#> 4 5 1
To get the desired outcome, right now, one needs to call
tab(x, brks, close_end = FALSE, extend = FALSE)
#> [1, 5) [5, 10) <NA>
#> 4 5 1
So the problem is that the breaks are extended, even though this isnβt necessary, i.e. max(x) == max(brks)
not max(x) > max(breaks)
Is this an unintended consequence of the recent changes - maybe we even discussed this π€?
IMO the correct code to get the current output would be
tab(1:10, c(2, 5, 10, 10), close_end = TRUE)
#> [1, 2) [2, 5) [5, 10) {10}
#> 1 3 5 1
Session info
sessioninfo::session_info()
#> β Session info βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
#> setting value
#> version R version 4.2.1 (2022-06-23 ucrt)
#> os Windows 10 x64 (build 19044)
#> system x86_64, mingw32
#> ui RTerm
#> language en
#> collate German_Germany.utf8
#> ctype German_Germany.utf8
#> tz Europe/Berlin
#> date 2022-10-28
#> pandoc 2.19.2 @ C:/Program Files/RStudio/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> β Packages βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
#> package * version date (UTC) lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0)
#> cli 3.4.1 2022-09-23 [1] CRAN (R 4.2.1)
#> digest 0.6.30 2022-10-18 [1] CRAN (R 4.2.1)
#> evaluate 0.17 2022-10-07 [1] RSPM
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
#> fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.0)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.2.0)
#> htmltools 0.5.3 2022-07-18 [1] CRAN (R 4.2.1)
#> knitr 1.40 2022-08-24 [1] CRAN (R 4.2.1)
#> lifecycle 1.0.3 2022-10-07 [1] RSPM
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
#> purrr 0.3.5 2022-10-06 [1] RSPM
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.2.1)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.2.0)
#> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.2.0)
#> R.utils 2.12.0 2022-06-28 [1] CRAN (R 4.2.1)
#> Rcpp 1.0.9 2022-07-08 [1] CRAN (R 4.2.1)
#> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.2.1)
#> rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.1)
#> rmarkdown 2.17 2022-10-07 [1] RSPM
#> rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.1)
#> santoku * 0.8.0.9000 2022-10-28 [1] Github (hughjonesd/santoku@bc73cb7)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
#> stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.1)
#> stringr 1.4.1 2022-08-20 [1] CRAN (R 4.2.1)
#> styler 1.8.0 2022-10-22 [1] CRAN (R 4.2.1)
#> vctrs 0.5.0 2022-10-22 [1] CRAN (R 4.2.1)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
#> xfun 0.34 2022-10-18 [1] CRAN (R 4.2.1)
#> yaml 2.3.6 2022-10-18 [1] CRAN (R 4.2.1)
#>
#> [1] C:/Users/Daniel/AppData/Local/R/win-library/4.2
#> [2] C:/Program Files/R/R-4.2.1/library
#>
#> ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Why would you say it's not necessary to extend the breaks? The unextended break [5, 10)
doesn't include 10. extend
is to ensure that all data is categorized. If you want [5, 10]
you could use close_end = TRUE
(which has become the default in master).
There's a separate question which is "should we just close the break rather than adding a new singleton interval"? Technically this is harder, and also it feels weird because it involves changing the requested intervals, rather than adding new intervals around them. I'm against it.
NB would like to know what you think of #42 and branch experiment-named-breaks.
If you want
[5, 10]
you could useclose_end = TRUE
I knew that, but I intentionally chose closed_end = FALSE
. E.g. to get the equivalent of
x <- 1:10
brks <- c(1, 5, 10)
cut(x, brks, right = FALSE)
#> [1] [1,5) [1,5) [1,5) [1,5) [5,10) [5,10) [5,10) [5,10) [5,10) <NA>
#> Levels: [1,5) [5,10)
extend
is to ensure that all data is categorized.
By default, yes. This is also (though slightly less strict) documented:
- By default,
chop()
always covers the whole range of the data, so you wonβt get unexpectedNA
values.
And indeed, by default, the whole range of the data is covered π
library(santoku)
tab(x, brks)
#> [1, 5) [5, 10]
#> 4 6
(Note that the breaks are not extended here, though!)
However, the docs also state that with extend = NULL
intervals will only be extended if max(x) > max(breaks)
not max(x) >= max(breaks)
.
If
extend
isNULL
(the default), intervals will be extended to
and , only if
necessary β i.e. ifmin(x) < min(breaks)
andmax(x) > max(breaks)
respectively.
This is clearly violated here, where max(x) == max(breaks)
tab(x, brks, close_end = FALSE)
#> [1, 5) [5, 10) {10}
#> 4 5 1
or min(x) == min(breaks)
tab(x, brks, left = FALSE, close_end = FALSE)
#> {1} (1, 5] (5, 10]
#> 1 4 5
If we compare the default to the latter two, then close_end
determines whether or not the breaks are extended/a singleton interval is created. That is β¦ surprising?!
First, why does close_end
determine if intervals are extend
ed? (well, to be pedantic, a new interval is created and not an existing one extended).
Second, why does close_end = FALSE
(βLogical. Close last break at right?β) create an interval that is closed at the right?
Relatedly, the following documentation is currently violated by tab(x, brks, close_end = FALSE)
as well
If
close_end
isTRUE
the end break will be closed at both ends [β¦]
(Well, this at least hints that close_end = FALSE
means that the last interval will not be closed at both ends)
If
left
isTRUE
andclose_end
isFALSE
, all breaks will look like...[x1, x2) ...
.
By contrast, we're out of default-land, so the "cover the whole range" claim isn't binding anymore.
There's a separate question which is "should we just close the break rather than adding a new singleton interval"?
Nope, do not close, because close_end = FALSE
.
Finally I think that singleton intervals are special and should therefore only be created explicitly, not implicitly like here. But that is just a matter of taste.
Having said all that, even though I clearly have a preference, I donβt feel very strongly about it, as long as it is possible to somehow get all the mentioned outputs. Naturally, the functions behaviour and the documentation should match in the end.
Yeah, the documentation was a bit inaccurate. Fixed. But the intention of extend
is definitely "cover all the data unless extend = FALSE
". So if the unextended breaks will cover x
, we're good; if not, we need to extend. close_end
matters because if close_end
is FALSE
, then 10 won't be covered, and we need to extend. In other words, close_end
changes what the set of unextended breaks would be, so it shouldn't be surprising that it affects whether we have to extend.
Re: tab(x, brks, left = FALSE, close_end = FALSE)
creating a closed break at the end, that's because when left = FALSE
, close_end
applies to the first break - as the docs say. It's true that they say "end break" at one point, which is a little ambiguous, but I don't want to have to say "last (or first)" everywhere. And the example makes it clear:
If βleftβ is βFALSEβ and βclose_endβ is βTRUEβ, breaks will look like β[x1, x2], (x2, x3] ... (x_n-1, x_n]β.