easystats/datawizard

categorize(): add labels="range" option

Closed this issue · 9 comments

Evil @profandyfield is threatening not to include datawizard in his book update until we add an option to categorize() that labels the categories by their range (labels="range"), mimicking the default of cut_interval():

ggplot2::cut_interval(mtcars$mpg, n=5)
#>  [1] (19.8,24.5] (19.8,24.5] (19.8,24.5] (19.8,24.5] (15.1,19.8] (15.1,19.8]
#>  [7] [10.4,15.1] (19.8,24.5] (19.8,24.5] (15.1,19.8] (15.1,19.8] (15.1,19.8]
#> [13] (15.1,19.8] (15.1,19.8] [10.4,15.1] [10.4,15.1] [10.4,15.1] (29.2,33.9]
#> [19] (29.2,33.9] (29.2,33.9] (19.8,24.5] (15.1,19.8] (15.1,19.8] [10.4,15.1]
#> [25] (15.1,19.8] (24.5,29.2] (24.5,29.2] (29.2,33.9] (15.1,19.8] (15.1,19.8]
#> [31] [10.4,15.1] (19.8,24.5]
#> Levels: [10.4,15.1] (15.1,19.8] (19.8,24.5] (24.5,29.2] (29.2,33.9]

Created on 2024-10-02 with reprex v2.1.0

Ours:

datawizard::categorize(mtcars$mpg, n_groups=5, labels="median")
#>  [1] 22.80 22.80 22.80 22.80 15.20 15.20 15.20 22.80 22.80 22.80 15.20 15.20
#> [13] 15.20 15.20 15.20 15.20 15.20 22.80 22.80 22.80 22.80 15.20 15.20 15.20
#> [25] 22.80 22.80 22.80 22.80 15.20 22.80 15.20 22.80
#> Levels: 15.20 22.80

Created on 2024-10-02 with reprex v2.1.0

Before January, @DominiqueMakowski, before January. The clock of evil is ticking ...

I always confuse this:

is (3, 5] including 3 and excluding 5, or excluding 3 and including 5?
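One way to settle this is to let base R's cut() answer it. The snippet below is only a sketch with made-up toy values (3, 4 and 5): a round bracket excludes the endpoint next to it and a square bracket includes it, so (3, 5] excludes 3 and includes 5.

```r
# Hypothetical toy values; the breaks define a single interval from 3 to 5
x <- c(3, 4, 5)

# right = TRUE (the default of cut()): open on the left, closed on the right
right_closed <- cut(x, breaks = c(3, 5), right = TRUE)
as.character(right_closed)
# 3 is not part of (3,5] and becomes NA; 4 and 5 are labelled "(3,5]"

# right = FALSE: closed on the left, open on the right
left_closed <- cut(x, breaks = c(3, 5), right = FALSE)
as.character(left_closed)
# 3 and 4 are labelled "[3,5)"; 5 is not part of [3,5) and becomes NA
```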

@profandyfield I think you can start working on that chapter:

datawizard::categorize(mtcars$mpg, "equal_length", n_groups = 5)
#>  [1] 3 3 3 3 2 2 1 3 3 2 2 2 2 2 1 1 1 5 5 5 3 2 2 1 2 4 4 5 2 2 1 3

datawizard::categorize(mtcars$mpg, "equal_length", n_groups = 5, labels = "range")
#>  [1] [19.8,24.5) [19.8,24.5) [19.8,24.5) [19.8,24.5) [15.1,19.8) [15.1,19.8)
#>  [7] [10.4,15.1) [19.8,24.5) [19.8,24.5) [15.1,19.8) [15.1,19.8) [15.1,19.8)
#> [13] [15.1,19.8) [15.1,19.8) [10.4,15.1) [10.4,15.1) [10.4,15.1) [29.2,33.9]
#> [19] [29.2,33.9] [29.2,33.9] [19.8,24.5) [15.1,19.8) [15.1,19.8) [10.4,15.1)
#> [25] [15.1,19.8) [24.5,29.2) [24.5,29.2) [29.2,33.9] [15.1,19.8) [15.1,19.8)
#> [31] [10.4,15.1) [19.8,24.5)
#> Levels: [10.4,15.1) [15.1,19.8) [19.8,24.5) [24.5,29.2) [29.2,33.9]

datawizard::categorize(mtcars$mpg, "equal_length", n_groups = 5, labels = "observed")
#>  [1] (21-24.4)   (21-24.4)   (21-24.4)   (21-24.4)   (15.2-19.7) (15.2-19.7)
#>  [7] (10.4-15)   (21-24.4)   (21-24.4)   (15.2-19.7) (15.2-19.7) (15.2-19.7)
#> [13] (15.2-19.7) (15.2-19.7) (10.4-15)   (10.4-15)   (10.4-15)   (30.4-33.9)
#> [19] (30.4-33.9) (30.4-33.9) (21-24.4)   (15.2-19.7) (15.2-19.7) (10.4-15)  
#> [25] (15.2-19.7) (26-27.3)   (26-27.3)   (30.4-33.9) (15.2-19.7) (15.2-19.7)
#> [31] (10.4-15)   (21-24.4)  
#> Levels: (10.4-15) (15.2-19.7) (21-24.4) (26-27.3) (30.4-33.9)

Created on 2024-10-02 with reprex v2.1.1

👑

And for the sake of completeness, it is now also possible to decide whether breaks are exclusive (the default) or inclusive, i.e. whether the right argument in cut() is FALSE or TRUE:

datawizard::categorize(mtcars$mpg, "equal_length", n_groups = 5, labels = "range")
#>  [1] [19.8,24.5) [19.8,24.5) [19.8,24.5) [19.8,24.5) [15.1,19.8) [15.1,19.8)
#>  [7] [10.4,15.1) [19.8,24.5) [19.8,24.5) [15.1,19.8) [15.1,19.8) [15.1,19.8)
#> [13] [15.1,19.8) [15.1,19.8) [10.4,15.1) [10.4,15.1) [10.4,15.1) [29.2,33.9]
#> [19] [29.2,33.9] [29.2,33.9] [19.8,24.5) [15.1,19.8) [15.1,19.8) [10.4,15.1)
#> [25] [15.1,19.8) [24.5,29.2) [24.5,29.2) [29.2,33.9] [15.1,19.8) [15.1,19.8)
#> [31] [10.4,15.1) [19.8,24.5)
#> Levels: [10.4,15.1) [15.1,19.8) [19.8,24.5) [24.5,29.2) [29.2,33.9]
datawizard::categorize(mtcars$mpg, "equal_length", n_groups = 5, labels = "range", breaks = "inclusive")
#>  [1] (19.8,24.5] (19.8,24.5] (19.8,24.5] (19.8,24.5] (15.1,19.8] (15.1,19.8]
#>  [7] [10.4,15.1] (19.8,24.5] (19.8,24.5] (15.1,19.8] (15.1,19.8] (15.1,19.8]
#> [13] (15.1,19.8] (15.1,19.8] [10.4,15.1] [10.4,15.1] [10.4,15.1] (29.2,33.9]
#> [19] (29.2,33.9] (29.2,33.9] (19.8,24.5] (15.1,19.8] (15.1,19.8] [10.4,15.1]
#> [25] (15.1,19.8] (24.5,29.2] (24.5,29.2] (29.2,33.9] (15.1,19.8] (15.1,19.8]
#> [31] [10.4,15.1] (19.8,24.5]
#> Levels: [10.4,15.1] (15.1,19.8] (19.8,24.5] (24.5,29.2] (29.2,33.9]

Created on 2024-10-02 with reprex v2.1.1
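For comparison, the same two labelling schemes can be sketched with base R's cut() alone. This is an illustration under assumptions, not necessarily how categorize() is implemented: the break points are taken to be six equally spaced values between the minimum and maximum, and include.lowest = TRUE is used so that no value falls outside the outermost interval.

```r
x <- mtcars$mpg

# Six equally spaced break points give five equal-length intervals
brks <- seq(min(x), max(x), length.out = 6)

# right = FALSE: intervals of the form [lower, upper), as with breaks = "exclusive"
excl <- cut(x, breaks = brks, right = FALSE, include.lowest = TRUE)

# right = TRUE: intervals of the form (lower, upper], as with breaks = "inclusive"
incl <- cut(x, breaks = brks, right = TRUE, include.lowest = TRUE)

levels(excl)
levels(incl)

# No mpg value sits exactly on an inner break point, so group sizes are identical
table(excl)
table(incl)
```

The two variants only assign a value differently when it lies exactly on an inner break point, which does not happen for mtcars$mpg with these breaks.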

datawizard::categorize(mtcars$mpg, n_groups=5, labels="median")

btw, in that example, n_groups is ignored because the default split (argument split) is at the median ;-)

For the observed range (labels = "observed", which shows only values that are actually in the data), all values shown in a label belong to that category, i.e. no break is included or excluded. So, should we use parentheses, brackets, or just the values without any brackets for the labels?

a) (3-5)
b) [3-5]
c) 3-5

?

b), I'd say, since the exact values shown in the label belong to that category

The difference between range and observed is small and non-obvious tho no? Don't you think it might add more confusion than benefits?

minor: I'd put a space after the comma in the range (though I think a dash is even better)
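To make the difference between the two label types concrete, here is a base R sketch (an illustration only, not the categorize() internals; the equally spaced breaks and the "[min, max]" label format are assumptions chosen for the example). Range labels are built from the break points, while observed labels are built from the smallest and largest data value that actually falls into each group.

```r
x <- mtcars$mpg

# Assumed break points: five equal-length intervals across the range of x
brks <- seq(min(x), max(x), length.out = 6)
grp <- cut(x, breaks = brks, right = FALSE, include.lowest = TRUE)

# "range"-style labels: derived from the break points themselves
levels(grp)

# "observed"-style labels: smallest and largest value present in each group
obs_min <- tapply(x, grp, min)
obs_max <- tapply(x, grp, max)
observed_labels <- paste0("[", obs_min, ", ", obs_max, "]")
observed_labels
# "[10.4, 15]" "[15.2, 19.7]" "[21, 24.4]" "[26, 27.3]" "[30.4, 33.9]"
```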

The difference between range and observed is small and non-obvious tho no?

Yeah, but it's optional, and I think it's well documented.
Agree with brackets and a comma, in line with cut().