tidyverse/forcats

Unclear warnings and errors generated when setting levels for a factor generated from a character vector

wtimmerman-fitp opened this issue · 3 comments

When I use fct_relevel() with a levels argument, I receive a warning that does not clearly indicate what is going wrong. Similarly, when I pass a levels argument to forcats::as_factor() (on the assumption that arguments in ... will be passed on to methods), I receive the error "Arguments in ... must be used". Both results are unexpected to me based on my understanding of the function help text and of base::factor().

For background, my intention is to convert a character column into a factor column using a pre-specified list of levels (the pre-specified list is somewhat important as a check and for consistency, for reasons that I won't get into here). I have reviewed the forcats issues and don't see an exact match for this problem:

  • Using base::factor(), I can pass the vector of levels to the levels argument; this is fine, but it is not noisy enough if the levels provided do not match the character column I am mutating into a factor.
  • Using forcats::as_factor(), when I pass the levels argument I receive the error "Arguments in ... must be used." I am not clear if I am misusing the function.
  • Using forcats::fct_relevel(), I receive the warning "Outer names are only allowed for unnamed scalar atomic inputs". This comes from vctrs, and I also see it referenced in the fct_relevel() help, but it doesn't seem to apply in the reprex I've generated below.
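
A minimal sketch of the first bullet, using made-up vectors rather than the reprex data, and base R only. The unmatched_values() helper is hypothetical; it is just one way a noisier check could look:

```r
# Hypothetical data, not taken from the reprex.
x <- c("a", "b", "c")
lv <- c("a", "b")

# "c" is not among the levels: base::factor() turns it into NA
# without any warning or error.
f <- factor(x, levels = lv)
anyNA(f)
#> [1] TRUE

# Hypothetical helper: report data values that the levels do not cover.
unmatched_values <- function(x, levels) {
  setdiff(x, c(levels, NA))
}

unmatched <- unmatched_values(x, lv)
unmatched
#> [1] "c"
```

A check like this could be run before the conversion and turned into a warning or an error, depending on how strict the pipeline needs to be.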

My questions are:

  • Should these forcats functions be generating different/more-specific warnings?
  • Should these forcats functions behave differently when passed the levels argument?
  • Should I be using these functions differently (or a different function altogether) given my use case?

Reprex

library(tidyverse)

mtcars2 <-
  mtcars %>% 
  tibble::rownames_to_column(var = "make_model") %>% 
  dplyr::filter(
    dplyr::row_number() <= 5
  )

use_levels <-
  mtcars2 %>% 
  dplyr::pull(make_model) 

# this works as expected, since the provided levels will by definition match the values in the make_model column.
mtcars2_factor <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = base::factor(
      make_model,
      levels = use_levels
    )
  )

# I don't understand why this is an error based on the as_factor() help.
mtcars2_as_factor <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = forcats::as_factor(
      make_model,
      levels = use_levels
    )
  )
#> Error in `dplyr::mutate()`:
#> ! Problem while computing `make_model = forcats::as_factor(make_model,
#>   levels = use_levels)`.
#> Caused by error:
#> ! Arguments in `...` must be used.
#> x Problematic argument:
#> * levels = use_levels

# I don't understand why this generates this warning since use_levels does not have names
mtcars2_fct_relevel <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = forcats::fct_relevel(
      make_model,
      levels = use_levels
    )
  )
#> Warning: Outer names are only allowed for unnamed scalar atomic inputs

# When I modify use_levels to include a value not present in the column, more challenges arise.
use_levels_mod <-
  c(use_levels, "Other Car")

# base::factor() gives no indication that some factor levels are not present in the data.
mtcars2_mod_factor <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = base::factor(
      make_model,
      levels = use_levels_mod
    )
  )

# as_factor() continues to error
mtcars2_mod_as_factor <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = forcats::as_factor(
      make_model,
      levels = use_levels_mod
    )
  )
#> Error in `dplyr::mutate()`:
#> ! Problem while computing `make_model = forcats::as_factor(make_model,
#>   levels = use_levels_mod)`.
#> Caused by error:
#> ! Arguments in `...` must be used.
#> x Problematic argument:
#> * levels = use_levels_mod

# fct_relevel generates an expected warning, but still has the 
# original warning that makes little sense in this case.

mtcars2_mod_fct_relevel <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = forcats::fct_relevel(
      make_model,
      levels = use_levels_mod
    )
  )
#> Warning: Outer names are only allowed for unnamed scalar atomic inputs
#> Warning: Unknown levels in `f`: Other Car

Created on 2022-08-09 by the reprex package (v2.0.1)

Session info
sessionInfo()
#> R version 4.0.5 (2021-03-31)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19043)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.9     purrr_0.3.4    
#> [5] readr_2.1.2     tidyr_1.2.0     tibble_3.1.8    ggplot2_3.3.6  
#> [9] tidyverse_1.3.2
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.1.2    xfun_0.31           haven_2.5.0        
#>  [4] gargle_1.2.0        colorspace_2.0-3    vctrs_0.4.1        
#>  [7] generics_0.1.3      htmltools_0.5.3     yaml_2.3.5         
#> [10] utf8_1.2.2          rlang_1.0.4         pillar_1.8.0       
#> [13] glue_1.6.2          withr_2.5.0         DBI_1.1.3          
#> [16] dbplyr_2.2.1        readxl_1.4.0        modelr_0.1.8       
#> [19] lifecycle_1.0.1     munsell_0.5.0       gtable_0.3.0       
#> [22] cellranger_1.1.0    rvest_1.0.2         evaluate_0.15      
#> [25] knitr_1.39          tzdb_0.3.0          fastmap_1.1.0      
#> [28] fansi_1.0.3         highr_0.9           broom_1.0.0        
#> [31] backports_1.4.1     scales_1.2.0        googlesheets4_1.0.0
#> [34] jsonlite_1.8.0      fs_1.5.2            hms_1.1.1          
#> [37] digest_0.6.29       stringi_1.7.8       grid_4.0.5         
#> [40] cli_3.3.0           tools_4.0.5         magrittr_2.0.3     
#> [43] crayon_1.5.1        pkgconfig_2.0.3     ellipsis_0.3.2     
#> [46] xml2_1.3.3          reprex_2.0.1        googledrive_2.0.0  
#> [49] lubridate_1.8.0     assertthat_0.2.1    rmarkdown_2.14     
#> [52] httr_1.4.3          rstudioapi_0.13     R6_2.5.1           
#> [55] compiler_4.0.5

Based on a quick read, I think you might be interested in fct()? More in #299.

Oh, this is perfect! Thank you for the pointer! I think this will solve my issue: there is a named levels argument, there are no errors or warnings if an additional level is listed that is not in the data, and (unlike base::factor()) it errors if a value in the data is not among the supplied levels.

I'll close the issue and look forward to fct() getting into a future release.

(Example below if anyone is curious.)

#setup ----
library(tidyverse)

fct <- function(x = character(), levels = NULL, na = character()) {
  if (!is.character(x)) {
    cli::cli_abort("{.arg x} must be a character vector")
  }
  if (!is.character(na)) {
    cli::cli_abort("{.arg na} must be a character vector")
  }
  
  x[x %in% na] <- NA
  
  if (is.null(levels)) {
    levels <- unique(x)
  } else if (!is.character(levels)) {
    cli::cli_abort("{.arg levels} must be a character vector")
  }
  
  invalid <- setdiff(x, c(levels, NA))
  
  if (length(invalid) > 0 ) {
    cli::cli_abort(c(
      "Values of {.arg x} must be members of {.arg levels}", 
      i = "Invalid value{?s}: {.str {invalid}}"
    ))
  }
  factor(x, levels = levels, exclude = NULL)
}

mtcars2 <-
  mtcars %>% 
  tibble::rownames_to_column(var = "make_model") %>% 
  dplyr::filter(
    dplyr::row_number() <= 5
  )

# Match levels----
match_levels <-
  mtcars2 %>% 
  dplyr::pull(make_model) 

mtcars2_factor <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = base::factor(
      make_model,
      levels = match_levels
    )
  )

mtcars2_fct <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = fct(
      make_model,
      levels = match_levels
    )
  )

# Add Levels ----
add_levels <-
  c(match_levels, "Other Car")

mtcars2_add_factor <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = base::factor(
      make_model,
      levels = add_levels
    )
  )

mtcars2_add_fct <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = fct(
      make_model,
      levels = add_levels
    )
  )

levels(mtcars2_add_fct$make_model)
#> [1] "Mazda RX4"         "Mazda RX4 Wag"     "Datsun 710"       
#> [4] "Hornet 4 Drive"    "Hornet Sportabout" "Other Car"

# Miss Levels ----
miss_levels <-
  match_levels[-1]

mtcars2_miss_factor <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = base::factor(
      make_model,
      levels = miss_levels
    )
  )

mtcars2_miss_fct <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = fct(
      make_model,
      levels = miss_levels
    )
  )
#> Error in `dplyr::mutate()`:
#> ! Problem while computing `make_model = fct(make_model, levels =
#>   miss_levels)`.
#> Caused by error in `fct()`:
#> ! Values of `x` must be members of `levels`
#> i Invalid value: "Mazda RX4"

Created on 2022-08-09 by the reprex package (v2.0.1)

Also, if anyone runs into the same warning I got with fct_relevel() (Warning: Outer names are only allowed for unnamed scalar atomic inputs), it's because that function has no levels argument, so levels = use_levels ends up in the ellipsis as a named argument. Just pass the vector of level names (in this case, use_levels) into the ellipsis unnamed, like:

mtcars2_fct_relevel <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = forcats::fct_relevel(
      make_model,
      use_levels
    )
  )
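
For anyone wondering where the "outer name" comes from, here is a minimal base R sketch. capture_dots() is a hypothetical stand-in for a function that collects its ellipsis; it is not how forcats is implemented, and the base c() behaviour at the end is only an analogy for the situation vctrs warns about:

```r
# Hypothetical helper: collect whatever arrives in `...`.
capture_dots <- function(f, ...) list(...)

# Passed with a name, the vector arrives under the outer name "levels".
named <- capture_dots("f", levels = c("a", "b"))
names(named)
#> [1] "levels"

# Passed without a name, there is no outer name at all.
unnamed <- capture_dots("f", c("a", "b"))
names(unnamed)
#> NULL

# Base R's c() shows what an outer name on a non-scalar input looks like:
# the name has to be spread over several elements.
c(levels = c("a", "b"))
#> levels1 levels2
#>     "a"     "b"
```

So the warning is about the name levels on the argument itself, not about names inside use_levels.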