hughjonesd/santoku

surprising singleton intervals if `max(x) == max(brks)` and `close_end = FALSE`

dpprdan opened this issue Β· 4 comments

chop()/tab() creates (surprising) singleton intervals if max(x) == max(brks).

library(santoku)
x <- 1:10
brks <- c(1, 5, 10)

Note that the brks contain all of x (or rather min(x) == min(brks) and max(x) == max(brks)), so there is no need to extend.

tab(x, brks, close_end = FALSE)
#>  [1, 5) [5, 10)    {10} 
#>       4       5       1

The creation of a singleton interval is surprising, IMO.

I think this ought to rather be (desired outcome)

#>  [1, 5) [5, 10)    <NA> 
#>      4       5       1 

To get the desired outcome, right now, one needs to call

tab(x, brks, close_end = FALSE, extend = FALSE)
#>  [1, 5) [5, 10)    <NA> 
#>       4       5       1

So the problem is that the breaks are extended, even though this isn’t necessary, i.e. max(x) == max(brks) not max(x) > max(breaks)

Is this an unintended consequence of the recent changes - maybe we even discussed this πŸ€”?

IMO the correct code to get the current output would be

tab(1:10, c(2, 5, 10, 10), close_end = TRUE)
#>  [1, 2)  [2, 5) [5, 10)    {10} 
#>       1       3       5       1
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language en
#>  collate  German_Germany.utf8
#>  ctype    German_Germany.utf8
#>  tz       Europe/Berlin
#>  date     2022-10-28
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date (UTC) lib source
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.2.0)
#>  cli           3.4.1      2022-09-23 [1] CRAN (R 4.2.1)
#>  digest        0.6.30     2022-10-18 [1] CRAN (R 4.2.1)
#>  evaluate      0.17       2022-10-07 [1] RSPM
#>  fastmap       1.1.0      2021-01-25 [1] CRAN (R 4.2.0)
#>  fs            1.5.2      2021-12-08 [1] CRAN (R 4.2.0)
#>  glue          1.6.2      2022-02-24 [1] CRAN (R 4.2.0)
#>  highr         0.9        2021-04-16 [1] CRAN (R 4.2.0)
#>  htmltools     0.5.3      2022-07-18 [1] CRAN (R 4.2.1)
#>  knitr         1.40       2022-08-24 [1] CRAN (R 4.2.1)
#>  lifecycle     1.0.3      2022-10-07 [1] RSPM
#>  magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.2.0)
#>  purrr         0.3.5      2022-10-06 [1] RSPM
#>  R.cache       0.16.0     2022-07-21 [1] CRAN (R 4.2.1)
#>  R.methodsS3   1.8.2      2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo          1.25.0     2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils       2.12.0     2022-06-28 [1] CRAN (R 4.2.1)
#>  Rcpp          1.0.9      2022-07-08 [1] CRAN (R 4.2.1)
#>  reprex        2.0.2      2022-08-17 [1] CRAN (R 4.2.1)
#>  rlang         1.0.6      2022-09-24 [1] CRAN (R 4.2.1)
#>  rmarkdown     2.17       2022-10-07 [1] RSPM
#>  rstudioapi    0.14       2022-08-22 [1] CRAN (R 4.2.1)
#>  santoku     * 0.8.0.9000 2022-10-28 [1] Github (hughjonesd/santoku@bc73cb7)
#>  sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi       1.7.8      2022-07-11 [1] CRAN (R 4.2.1)
#>  stringr       1.4.1      2022-08-20 [1] CRAN (R 4.2.1)
#>  styler        1.8.0      2022-10-22 [1] CRAN (R 4.2.1)
#>  vctrs         0.5.0      2022-10-22 [1] CRAN (R 4.2.1)
#>  withr         2.5.0      2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.34       2022-10-18 [1] CRAN (R 4.2.1)
#>  yaml          2.3.6      2022-10-18 [1] CRAN (R 4.2.1)
#> 
#>  [1] C:/Users/Daniel/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.1/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Why would you say it's not necessary to extend the breaks? The unextended break [5, 10) doesn't include 10. extend is to ensure that all data is categorized. If you want [5, 10] you could use close_end = TRUE (which has become the default in master).

There's a separate question which is "should we just close the break rather than adding a new singleton interval"? Technically this is harder, and also it feels weird because it involves changing the requested intervals, rather than adding new intervals around them. I'm against it.

NB would like to know what you think of #42 and branch experiment-named-breaks.

If you want [5, 10] you could use close_end = TRUE

I knew that, but I intentionally chose closed_end = FALSE. E.g. to get the equivalent of

x <- 1:10
brks <- c(1, 5, 10)

cut(x, brks, right = FALSE)
#>  [1] [1,5)  [1,5)  [1,5)  [1,5)  [5,10) [5,10) [5,10) [5,10) [5,10) <NA>  
#> Levels: [1,5) [5,10)

extend is to ensure that all data is categorized.

By default, yes. This is also (though slightly less strict) documented:

  • By default, chop() always covers the whole range of the data, so you won’t get unexpected NA values.

And indeed, by default, the whole range of the data is covered πŸ‘

library(santoku)
tab(x, brks)
#>  [1, 5) [5, 10] 
#>       4       6

(Note that the breaks are not extended here, though!)

However, the docs also state that with extend = NULL intervals will only be extended if max(x) > max(breaks) not max(x) >= max(breaks).

If extend is NULL (the default), intervals will be extended to
and , only if
necessary – i.e. if min(x) < min(breaks) and max(x) > max(breaks)
respectively.

This is clearly violated here, where max(x) == max(breaks)

tab(x, brks, close_end = FALSE)
#>  [1, 5) [5, 10)    {10} 
#>       4       5       1

or min(x) == min(breaks)

tab(x, brks, left = FALSE, close_end = FALSE)
#>     {1}  (1, 5] (5, 10] 
#>       1       4       5

If we compare the default to the latter two, then close_end determines whether or not the breaks are extended/a singleton interval is created. That is … surprising?!
First, why does close_end determine if intervals are extended? (well, to be pedantic, a new interval is created and not an existing one extended).
Second, why does close_end = FALSE (β€œLogical. Close last break at right?”) create an interval that is closed at the right?

Relatedly, the following documentation is currently violated by tab(x, brks, close_end = FALSE) as well

If close_end is TRUE the end break will be closed at both ends […]

(Well, this at least hints that close_end = FALSE means that the last interval will not be closed at both ends)

If left is TRUE and close_end is FALSE, all breaks will look like ...[x1, x2) ....

By contrast, we're out of default-land, so the "cover the whole range" claim isn't binding anymore.

There's a separate question which is "should we just close the break rather than adding a new singleton interval"?

Nope, do not close, because close_end = FALSE.

Finally I think that singleton intervals are special and should therefore only be created explicitly, not implicitly like here. But that is just a matter of taste.

Having said all that, even though I clearly have a preference, I don’t feel very strongly about it, as long as it is possible to somehow get all the mentioned outputs. Naturally, the functions behaviour and the documentation should match in the end.

Yeah, the documentation was a bit inaccurate. Fixed. But the intention of extend is definitely "cover all the data unless extend = FALSE". So if the unextended breaks will cover x, we're good; if not, we need to extend. close_end matters because if close_end is FALSE, then 10 won't be covered, and we need to extend. In other words, close_end changes what the set of unextended breaks would be, so it shouldn't be surprising that it affects whether we have to extend.

Re: tab(x, brks, left = FALSE, close_end = FALSE) creating a closed break at the end, that's because when left = FALSE, close_end applies to the first break - as the docs say. It's true that they say "end break" at one point, which is a little ambiguous, but I don't want to have to say "last (or first)" everywhere. And the example makes it clear:

If β€˜left’ is β€˜FALSE’ and β€˜close_end’ is β€˜TRUE’, breaks will look like β€˜[x1, x2], (x2, x3] ... (x_n-1, x_n]’.