tidyverse/forcats

Add function that creates factor in order of case_when matches

dchiu911 opened this issue · 5 comments

A common workflow of mine is to map one vector to another using some (possibly complex) conditions, then coerce the result to a factor whose levels follow the order in which the conditions appear in dplyr::case_when(). It would be helpful to have a wrapper that creates the factor without having to specify the levels manually.
Currently, I'd do something like this:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

set.seed(2022)
x <- sample(
  c("low", "intermediate", "high"),
  prob = c(0.5, 0.2, 0.3),
  size = 100,
  replace = TRUE
)
z <- rbinom(
  n = 100,
  size = 100,
  prob = 0.3
)
y <- case_when(
  x == "intermediate" | (x == "low" & z < 30) ~ "B",
  x == "low" ~ "A",
  x == "high" ~ "C",
  TRUE ~ NA_character_
) %>%
  factor(levels = c("B", "A", "C"))
str(y)
#>  Factor w/ 3 levels "B","A","C": 1 3 2 3 2 3 2 1 1 3 ...

Created on 2022-02-01 by the reprex package (v2.0.1)
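
To show why the levels have to be spelled out at all, here is a small sketch in base R. The vector `y_chr` is made up for illustration and stands in for the character result of a case_when() call. Without `levels`, factor() sorts the levels, and ordering by first appearance follows the data rather than the order of the conditions.

```r
# Toy stand-in for the character output of case_when(); values are illustrative.
y_chr <- c("C", "B", "A", "C", NA)

# Default: levels are sorted, so the condition order is lost.
lv_sorted <- levels(factor(y_chr))

# First appearance in the data: depends on which row happens to come first.
lv_seen <- levels(factor(y_chr, levels = unique(y_chr[!is.na(y_chr)])))

# Explicit levels: the only way to get the condition order today.
lv_cases <- levels(factor(y_chr, levels = c("B", "A", "C")))

print(lv_sorted)
print(lv_seen)
print(lv_cases)
```

Only the last call reproduces the B, A, C order, which is why the levels currently end up written twice.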

Can we add a function that makes y into a factor with the level order the same as specified in the case_when()? For example,

y <- fct_case(
  x == "intermediate" | (x == "low" & z < 30) ~ "B",
  x == "low" ~ "A",
  x == "high" ~ "C",
  TRUE ~ NA_character_
)

I think we'd need to make the syntax more limiting than case_when() because the RHS of a case_when() can itself use data values, and reasoning through how those values should interact between conditions seems hard.
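
As a rough illustration of that point (toy data, not from the issue): when the RHS is itself a vector, the set of output values is only known after evaluation, so there is no obvious "order of the cases" to turn into levels.

```r
library(dplyr)

# Illustrative inputs only.
x <- c("low", "high", "low")
z <- c(10, 20, 40)

# The first RHS is the data vector `x`, not a single label.
y <- case_when(
  z < 30 ~ x,
  TRUE ~ "other"
)

# y is c("low", "high", "other"): the first case alone contributed two
# distinct values, so "one level per case" is not well defined here.
print(y)
```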

Since we'd want to restrict each expression to a single character level, we could put that level on the LHS of =, something like:

something(
  "B" = x == "intermediate" | (x == "low" & z < 30),
  "A" = x == "low",
  "C" = x == "high",
)

But I don't know if any existing tidyverse function uses similar syntax.
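
A minimal base R sketch of what that syntax could do, assuming first-match-wins semantics like case_when(). The name `fct_cases()` is hypothetical and not part of forcats, and the toy `x` and `z` are made up. Unlike case_when(), this sketch does not recycle a length-1 condition such as `TRUE`, and it does not accept a trailing comma.

```r
# Hypothetical helper (not a forcats function): each argument is
# `"level" = <logical vector>`; the first TRUE condition wins.
fct_cases <- function(...) {
  conds <- list(...)
  lvls <- names(conds)
  stopifnot(length(conds) > 0, !is.null(lvls), all(nzchar(lvls)))

  n <- length(conds[[1]])
  out <- rep(NA_integer_, n)  # index of the matching case, NA if none

  for (i in seq_along(conds)) {
    cond <- conds[[i]]
    stopifnot(is.logical(cond), length(cond) == n)
    # Fill only positions that are still unmatched; NA conditions never match.
    hit <- is.na(out) & !is.na(cond) & cond
    out[hit] <- i
  }

  # Levels follow the order in which the cases were written.
  factor(lvls[out], levels = unique(lvls))
}

# Toy data for illustration.
x <- c("low", "intermediate", "high", "low", NA)
z <- c(25, 40, 31, 35, 20)

y <- fct_cases(
  "B" = x == "intermediate" | (x == "low" & z < 30),
  "A" = x == "low",
  "C" = x == "high"
)
print(y)
```

Elements that match no condition stay NA, which plays the role of the `TRUE ~ NA_character_` fallback in the original call.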

I do think dropping the ~ would make it more consistent, since the case_when() syntax is fairly unusual.

> But I don't know if any existing tidyverse function uses similar syntax.

FWIW this is basically how fct_recode() works (the name is the new level, the value is the old level), so it wouldn't be unheard of to let the name represent the new level and the value be the logical condition.
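
For reference, a short example of the existing fct_recode() convention, using a made-up factor:

```r
library(forcats)

# Illustrative factor; levels are "apple" and "bear".
f <- factor(c("apple", "bear", "apple"))

# The name is the new level, the value is the old level.
g <- fct_recode(f, fruit = "apple")
print(levels(g))
```

The "apple" level is renamed to "fruit" in place, and "bear" is left untouched.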

Will wait until lower-level functions are exposed by vctrs.

I would think it would be convenient to solve this from case_when() itself, with something like this:

set.seed(2022)
x <- sample(
  c("low", "intermediate", "high"),
  prob = c(0.5, 0.2, 0.3),
  size = 100,
  replace = TRUE
)
z <- rbinom(
  n = 100,
  size = 100,
  prob = 0.3
)
y <- case_when(
  x == "intermediate" | (x == "low" & z < 30) ~ "B",
  x == "low" ~ "A",
  x == "high" ~ "C",
  TRUE ~ NA_character_,
  .ptype = "factor"
)
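
For comparison, a sketch of how close this gets with a `.ptype` argument, assuming dplyr 1.1.0 or later, where case_when() accepts `.ptype` as a prototype object rather than a string. The toy `x` and `z` are made up. The levels still have to be written out by hand, so this only replaces the trailing factor() call and does not derive the order from the cases.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for the .ptype argument

# Toy data for illustration.
x <- c("low", "intermediate", "high", "low")
z <- c(25, 40, 31, 35)

y <- case_when(
  x == "intermediate" | (x == "low" & z < 30) ~ "B",
  x == "low" ~ "A",
  x == "high" ~ "C",
  TRUE ~ NA_character_,
  # The prototype carries the level order; the character RHS values are cast to it.
  .ptype = factor(levels = c("B", "A", "C"))
)
str(y)
```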