categorize() add labels="range" option

Question

categorize() add labels="range" option

Closed this issue 3 months ago · 9 comments

DominiqueMakowski commented 3 months ago

evil @profandyfield is threatening not to include datawizard in his book update until we add an option to categorize() that labels the categories by "range", to mimic cut_interval()'s default

ggplot2::cut_interval(mtcars$mpg, n=5)
#>  [1] (19.8,24.5] (19.8,24.5] (19.8,24.5] (19.8,24.5] (15.1,19.8] (15.1,19.8]
#>  [7] [10.4,15.1] (19.8,24.5] (19.8,24.5] (15.1,19.8] (15.1,19.8] (15.1,19.8]
#> [13] (15.1,19.8] (15.1,19.8] [10.4,15.1] [10.4,15.1] [10.4,15.1] (29.2,33.9]
#> [19] (29.2,33.9] (29.2,33.9] (19.8,24.5] (15.1,19.8] (15.1,19.8] [10.4,15.1]
#> [25] (15.1,19.8] (24.5,29.2] (24.5,29.2] (29.2,33.9] (15.1,19.8] (15.1,19.8]
#> [31] [10.4,15.1] (19.8,24.5]
#> Levels: [10.4,15.1] (15.1,19.8] (19.8,24.5] (24.5,29.2] (29.2,33.9]

Created on 2024-10-02 with reprex v2.1.0

Ours:

datawizard::categorize(mtcars$mpg, n_groups=5, labels="median")
#>  [1] 22.80 22.80 22.80 22.80 15.20 15.20 15.20 22.80 22.80 22.80 15.20 15.20
#> [13] 15.20 15.20 15.20 15.20 15.20 22.80 22.80 22.80 22.80 15.20 15.20 15.20
#> [25] 22.80 22.80 22.80 22.80 15.20 22.80 15.20 22.80
#> Levels: 15.20 22.80

Created on 2024-10-02 with reprex v2.1.0

DominiqueMakowski commented 3 months ago

👑

Answer 1 · 2024-10-02T14:31:24.000Z

Before January, @DominiqueMakowski, before January. The clock of evil is ticking ...

Answer 2 · 2024-10-02T14:42:14.000Z

I always confuse this:

is (3, 5] including 3 and excluding 5, or excluding 3 and including 5?
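
For reference: in standard interval notation a parenthesis marks an excluded endpoint and a square bracket an included one, so (3, 5] excludes 3 and includes 5. A minimal base R sketch with cut() shows both conventions; the values and breaks are made up for illustration only.

```r
# Base R only. With right = TRUE, cut() builds intervals of the form (a, b].
x <- c(3, 4, 5)
right_closed <- cut(x, breaks = c(1, 3, 5), right = TRUE)
# 3 lands in (1,3] because (3,5] excludes 3; 5 lands in (3,5].
as.character(right_closed)

# With right = FALSE, the intervals are of the form [a, b) instead.
left_closed <- cut(x, breaks = c(3, 5, 7), right = FALSE)
# 3 lands in [3,5); 5 lands in [5,7) because [3,5) excludes 5.
as.character(left_closed)
```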

Answer 3 · 2024-10-02T15:09:13.000Z

@profandyfield I think you can start working on that chapter:

datawizard::categorize(mtcars$mpg, "equal_length", n_groups = 5)
#>  [1] 3 3 3 3 2 2 1 3 3 2 2 2 2 2 1 1 1 5 5 5 3 2 2 1 2 4 4 5 2 2 1 3

datawizard::categorize(mtcars$mpg, "equal_length", n_groups = 5, labels = "range")
#>  [1] [19.8,24.5) [19.8,24.5) [19.8,24.5) [19.8,24.5) [15.1,19.8) [15.1,19.8)
#>  [7] [10.4,15.1) [19.8,24.5) [19.8,24.5) [15.1,19.8) [15.1,19.8) [15.1,19.8)
#> [13] [15.1,19.8) [15.1,19.8) [10.4,15.1) [10.4,15.1) [10.4,15.1) [29.2,33.9]
#> [19] [29.2,33.9] [29.2,33.9] [19.8,24.5) [15.1,19.8) [15.1,19.8) [10.4,15.1)
#> [25] [15.1,19.8) [24.5,29.2) [24.5,29.2) [29.2,33.9] [15.1,19.8) [15.1,19.8)
#> [31] [10.4,15.1) [19.8,24.5)
#> Levels: [10.4,15.1) [15.1,19.8) [19.8,24.5) [24.5,29.2) [29.2,33.9]

datawizard::categorize(mtcars$mpg, "equal_length", n_groups = 5, labels = "observed")
#>  [1] (21-24.4)   (21-24.4)   (21-24.4)   (21-24.4)   (15.2-19.7) (15.2-19.7)
#>  [7] (10.4-15)   (21-24.4)   (21-24.4)   (15.2-19.7) (15.2-19.7) (15.2-19.7)
#> [13] (15.2-19.7) (15.2-19.7) (10.4-15)   (10.4-15)   (10.4-15)   (30.4-33.9)
#> [19] (30.4-33.9) (30.4-33.9) (21-24.4)   (15.2-19.7) (15.2-19.7) (10.4-15)  
#> [25] (15.2-19.7) (26-27.3)   (26-27.3)   (30.4-33.9) (15.2-19.7) (15.2-19.7)
#> [31] (10.4-15)   (21-24.4)  
#> Levels: (10.4-15) (15.2-19.7) (21-24.4) (26-27.3) (30.4-33.9)

Created on 2024-10-02 with reprex v2.1.1

Answer 4 · 2024-10-02T15:49:34.000Z

And for the sake of completeness, it is now also possible to decide whether breaks are inclusive or exclusive (equivalent to setting the right argument in cut() to TRUE or FALSE, respectively):

datawizard::categorize(mtcars$mpg, "equal_length", n_groups = 5, labels = "range")
#>  [1] [19.8,24.5) [19.8,24.5) [19.8,24.5) [19.8,24.5) [15.1,19.8) [15.1,19.8)
#>  [7] [10.4,15.1) [19.8,24.5) [19.8,24.5) [15.1,19.8) [15.1,19.8) [15.1,19.8)
#> [13] [15.1,19.8) [15.1,19.8) [10.4,15.1) [10.4,15.1) [10.4,15.1) [29.2,33.9]
#> [19] [29.2,33.9] [29.2,33.9] [19.8,24.5) [15.1,19.8) [15.1,19.8) [10.4,15.1)
#> [25] [15.1,19.8) [24.5,29.2) [24.5,29.2) [29.2,33.9] [15.1,19.8) [15.1,19.8)
#> [31] [10.4,15.1) [19.8,24.5)
#> Levels: [10.4,15.1) [15.1,19.8) [19.8,24.5) [24.5,29.2) [29.2,33.9]

datawizard::categorize(mtcars$mpg, "equal_length", n_groups = 5, labels = "range", breaks = "inclusive")
#>  [1] (19.8,24.5] (19.8,24.5] (19.8,24.5] (19.8,24.5] (15.1,19.8] (15.1,19.8]
#>  [7] [10.4,15.1] (19.8,24.5] (19.8,24.5] (15.1,19.8] (15.1,19.8] (15.1,19.8]
#> [13] (15.1,19.8] (15.1,19.8] [10.4,15.1] [10.4,15.1] [10.4,15.1] (29.2,33.9]
#> [19] (29.2,33.9] (29.2,33.9] (19.8,24.5] (15.1,19.8] (15.1,19.8] [10.4,15.1]
#> [25] (15.1,19.8] (24.5,29.2] (24.5,29.2] (29.2,33.9] (15.1,19.8] (15.1,19.8]
#> [31] [10.4,15.1] (19.8,24.5]
#> Levels: [10.4,15.1] (15.1,19.8] (19.8,24.5] (24.5,29.2] (29.2,33.9]

Created on 2024-10-02 with reprex v2.1.1
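
The two labelings shown here line up with what base R's cut() prints for right = FALSE and right = TRUE when include.lowest = TRUE. The sketch below only illustrates that correspondence and is not datawizard's implementation; the break values are copied from the printed levels rather than computed.

```r
x <- mtcars$mpg
# Break values copied from the printed levels (an assumption for illustration).
breaks <- c(10.4, 15.1, 19.8, 24.5, 29.2, 33.9)

# Intervals closed on the left, like the default ("exclusive") labels:
# [a, b), with the last interval closed so that the maximum is kept.
excl <- cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE)
levels(excl)

# Intervals closed on the right, like breaks = "inclusive":
# (a, b], with the first interval closed so that the minimum is kept.
incl <- cut(x, breaks = breaks, right = TRUE, include.lowest = TRUE)
levels(incl)
```

No mpg value sits exactly on an interior break, so both variants assign every car to the same group here; only the labels differ.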

Answer 5 · 2024-10-02T15:57:23.000Z

datawizard::categorize(mtcars$mpg, n_groups=5, labels="median")

btw, in that example, n_groups is ignored because the default split (argument split) is at the median ;-)
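
A rough base R illustration of why n_groups has nothing to act on in that call: a median split only ever produces two groups. The sketch assumes that values equal to the median go to the upper group, which is what the output in the original post suggests (both 19.2 values are labelled 22.80); it is not taken from datawizard's source.

```r
x <- mtcars$mpg
# Assumed rule: values at or above the median form the upper group.
upper <- x >= median(x)
# Median of each half, which is what labels = "median" appears to show.
group_medians <- as.vector(tapply(x, upper, median))
group_medians
```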

Answer 6 · 2024-10-03T08:10:42.000Z

@profandyfield I think you can start working on that chapter:

datawizard::categorize(mtcars$mpg, "equal_length", n_groups = 5)
#>  [1] 3 3 3 3 2 2 1 3 3 2 2 2 2 2 1 1 1 5 5 5 3 2 2 1 2 4 4 5 2 2 1 3

datawizard::categorize(mtcars$mpg, "equal_length", n_groups = 5, labels = "range")
#>  [1] [19.8,24.5) [19.8,24.5) [19.8,24.5) [19.8,24.5) [15.1,19.8) [15.1,19.8)
#>  [7] [10.4,15.1) [19.8,24.5) [19.8,24.5) [15.1,19.8) [15.1,19.8) [15.1,19.8)
#> [13] [15.1,19.8) [15.1,19.8) [10.4,15.1) [10.4,15.1) [10.4,15.1) [29.2,33.9]
#> [19] [29.2,33.9] [29.2,33.9] [19.8,24.5) [15.1,19.8) [15.1,19.8) [10.4,15.1)
#> [25] [15.1,19.8) [24.5,29.2) [24.5,29.2) [29.2,33.9] [15.1,19.8) [15.1,19.8)
#> [31] [10.4,15.1) [19.8,24.5)
#> Levels: [10.4,15.1) [15.1,19.8) [19.8,24.5) [24.5,29.2) [29.2,33.9]

datawizard::categorize(mtcars$mpg, "equal_length", n_groups = 5, labels = "observed")
#>  [1] (21-24.4)   (21-24.4)   (21-24.4)   (21-24.4)   (15.2-19.7) (15.2-19.7)
#>  [7] (10.4-15)   (21-24.4)   (21-24.4)   (15.2-19.7) (15.2-19.7) (15.2-19.7)
#> [13] (15.2-19.7) (15.2-19.7) (10.4-15)   (10.4-15)   (10.4-15)   (30.4-33.9)
#> [19] (30.4-33.9) (30.4-33.9) (21-24.4)   (15.2-19.7) (15.2-19.7) (10.4-15)  
#> [25] (15.2-19.7) (26-27.3)   (26-27.3)   (30.4-33.9) (15.2-19.7) (15.2-19.7)
#> [31] (10.4-15)   (21-24.4)  
#> Levels: (10.4-15) (15.2-19.7) (21-24.4) (26-27.3) (30.4-33.9)

Created on 2024-10-02 with reprex v2.1.1

For the observed range (lower example, only values that are actually in the data), all shown values of that range belong to the categories, i.e. there's no inclusion or exclusion of breaks. Thus, should we use parentheses, brackets, or just values w/o any parentheses for the labels?

a) (3-5)
b) [3-5]
c) 3-5

?
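
To make the three options concrete, here is a small base R sketch that builds each label style from the observed minimum and maximum per group. The break values are copied from the "range" levels printed earlier, and the grouping via cut() is an assumption for illustration, not datawizard's code.

```r
x <- mtcars$mpg
# Break values copied from the printed "range" levels (illustrative assumption).
breaks <- c(10.4, 15.1, 19.8, 24.5, 29.2, 33.9)
# Integer group codes 1..5 instead of factor labels.
grp <- cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE, labels = FALSE)

# Smallest and largest value actually observed in each group.
lo <- as.vector(tapply(x, grp, min))
hi <- as.vector(tapply(x, grp, max))

label_a <- paste0("(", lo, "-", hi, ")")  # option a: parentheses
label_b <- paste0("[", lo, "-", hi, "]")  # option b: brackets
label_c <- paste0(lo, "-", hi)            # option c: bare values
rbind(label_a, label_b, label_c)
```

Option a) reproduces the labels in the "observed" output above; b) and c) differ only in the delimiters.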

Answer 7 · 2024-10-03T08:32:59.000Z

b), I'd say, given that the exact values shown in the label belong to that category

The difference between range and observed is small and non-obvious tho no? Don't you think it might add more confusion than benefits?

minor: I'd put a space after the comma in the range (though I think a dash is even better)

Answer 8 · 2024-10-03T09:24:33.000Z

The difference between range and observed is small and non-obvious tho no?

Yeah, but it's optional, and I think well documented.
Agree with brackets and comma, in line with cut()