tidymodels/recipes

undocumented char -> factor conversion in recipe creates non-commutative condition

beansrowning opened this issue · 1 comments

The problem

Perhaps I've missed some documentation, but I seem to have identified an issue where {recipes} converts character features to factor invisibly to the user, and this in turn creates a condition where all_string() and all_string_predictors() operate differently depending on where in the recipe they're used.

In my example, I have columns of several different types. I have a few models that will use character features, and some which won't. In this case, I want to remove those features and only keep factors, integers, doubles, etc.

I'd assume that step_rm(all_string_predictors()) would sort this out quickly, but this actually results in unpredictable behavior depending on where in the recipe chain you place it.

I could remove these features beforehand, or pre-compute their values and remove them by name, but this seems somewhat antithetical to the entire tidymodels approach.

Is this expected behavior, and if so, what is the "preferred" solution to handling it?

Reproducible example

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(recipes)
#> Warning: package 'recipes' was built under R version 4.4.1
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step


df <- structure(list(
  NEK = c(
    221119035L, 221213318L, 211030043L, 220842741L,
    220161193L, 221215066L
  ), DateChanged = structure(c(
    1667865600,
    1670284800, 1634169600, 1660867200, 1643587200, 1670371200
  ), class = c(
    "POSIXct",
    "POSIXt"
  ), tzone = ""), DateCollected = structure(c(
    1667779200,
    1670198400, 1634083200, 1660780800, 1643500800, 1670371200
  ), class = c(
    "POSIXct",
    "POSIXt"
  ), tzone = ""), DateofTreatment = structure(c(
    1653177600,
    1669334400, 1632614400, 1660262400, 1642636800, 1665964800
  ), class = c(
    "POSIXct",
    "POSIXt"
  ), tzone = ""), AgeYrs = c(555, 555, 555, 555, 555, 555), DrugUse = structure(c(1L, 3L, 1L, 1L, 1L, 1L), levels = c(
    "THERAPEUTIC",
    "ABUSE", "SELF-HARM", "ASSAULT", "UNKNOWN INTENT", "NOT AN ADE"
  ), class = "factor"), Comments = c(
    "lorem ipsum", "lorem ipsum",
    "lorem ipsum", "lorem ipsum", "lorem ipsum", "lorem ipsum"
  ),
  DiagOther = c(
    "lorem ipsum", "lorem ipsum", "lorem ipsum",
    "lorem ipsum", "lorem ipsum", "lorem ipsum"
  ), DiagOther2 = c(
    "lorem ipsum",
    "lorem ipsum", "lorem ipsum", "lorem ipsum", "lorem ipsum",
    "lorem ipsum"
  ), Drug1 = c(
    "lorem ipsum", "lorem ipsum", "lorem ipsum",
    "lorem ipsum", "lorem ipsum", "lorem ipsum"
  ),
  CaseStatus = rbinom(6, 1, 0.5)
), row.names = c(
  NA,
  -6L
), class = c("tbl_df", "tbl", "data.frame"))

glimpse(df)
#> Rows: 6
#> Columns: 11
#> $ NEK             <int> 221119035, 221213318, 211030043, 220842741, 220161193,…
#> $ DateChanged     <dttm> 2022-11-07 19:00:00, 2022-12-05 19:00:00, 2021-10-13 …
#> $ DateCollected   <dttm> 2022-11-06 19:00:00, 2022-12-04 19:00:00, 2021-10-12 …
#> $ DateofTreatment <dttm> 2022-05-21 20:00:00, 2022-11-24 19:00:00, 2021-09-25 …
#> $ AgeYrs          <dbl> 555, 555, 555, 555, 555, 555
#> $ DrugUse         <fct> THERAPEUTIC, SELF-HARM, THERAPEUTIC, THERAPEUTIC, THER…
#> $ Comments        <chr> "lorem ipsum", "lorem ipsum", "lorem ipsum", "lorem ip…
#> $ DiagOther       <chr> "lorem ipsum", "lorem ipsum", "lorem ipsum", "lorem ip…
#> $ DiagOther2      <chr> "lorem ipsum", "lorem ipsum", "lorem ipsum", "lorem ip…
#> $ Drug1           <chr> "lorem ipsum", "lorem ipsum", "lorem ipsum", "lorem ip…
#> $ CaseStatus      <int> 1, 0, 1, 0, 0, 0

# --- With two steps, this seems to work as expected -------------------------------------
lr_recipe <- recipe(CaseStatus ~ ., data = df) |>
  step_rm(NEK, all_string_predictors()) |>
  step_date(DateofTreatment) |>
  step_time(DateofTreatment) |>
  step_holiday(DateofTreatment) |>
  step_rm(all_datetime_predictors()) # remove unique ID, any string predictors, dates

prep(lr_recipe)
#> 
#> -- Recipe ----------------------------------------------------------------------
#> 
#> -- Inputs
#> Number of variables by role
#> outcome:    1
#> predictor: 10
#> 
#> -- Training information
#> Training data contained 6 data points and no incomplete rows.
#> 
#> -- Operations
#> • Variables removed: NEK, Comments, DiagOther, DiagOther2, Drug1 | Trained
#> • Date features from: DateofTreatment | Trained
#> • Time features from: DateofTreatment | Trained
#> • Holiday features from: DateofTreatment | Trained
#> • Variables removed: DateChanged, DateCollected, DateofTreatment | Trained
juice(prep(lr_recipe))
#> # A tibble: 6 × 12
#>   AgeYrs DrugUse     CaseStatus DateofTreatment_dow DateofTreatment_month
#>    <dbl> <fct>            <int> <fct>               <fct>                
#> 1    555 THERAPEUTIC          1 Sat                 May                  
#> 2    555 SELF-HARM            0 Thu                 Nov                  
#> 3    555 THERAPEUTIC          1 Sat                 Sep                  
#> 4    555 THERAPEUTIC          0 Thu                 Aug                  
#> 5    555 THERAPEUTIC          0 Wed                 Jan                  
#> 6    555 THERAPEUTIC          0 Sun                 Oct                  
#> # ℹ 7 more variables: DateofTreatment_year <int>, DateofTreatment_hour <int>,
#> #   DateofTreatment_minute <int>, DateofTreatment_second <dbl>,
#> #   DateofTreatment_LaborDay <int>, DateofTreatment_NewYearsDay <int>,
#> #   DateofTreatment_ChristmasDay <int>

# --- If all_string_predictors() is not in the first step, fails -----------------------------------
lr_recipe <- recipe(CaseStatus ~ ., data = df) |>
  step_rm(NEK) |>
  step_rm(all_string_predictors()) |>
  step_date(DateofTreatment) |>
  step_time(DateofTreatment) |>
  step_holiday(DateofTreatment) |>
  step_rm(all_datetime_predictors()) # remove unique ID, any string predictors, dates

prep(lr_recipe)
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:    1
#> predictor: 10
#> 
#> ── Training information
#> Training data contained 6 data points and no incomplete rows.
#> 
#> ── Operations
#> • Variables removed: NEK | Trained
#> • Variables removed: <none> | Trained
#> • Date features from: DateofTreatment | Trained
#> • Time features from: DateofTreatment | Trained
#> • Holiday features from: DateofTreatment | Trained
#> • Variables removed: DateChanged, DateCollected, DateofTreatment | Trained
juice(prep(lr_recipe))
#> # A tibble: 6 × 16
#>   AgeYrs DrugUse     Comments    DiagOther   DiagOther2  Drug1       CaseStatus
#>    <dbl> <fct>       <fct>       <fct>       <fct>       <fct>            <int>
#> 1    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          1
#> 2    555 SELF-HARM   lorem ipsum lorem ipsum lorem ipsum lorem ipsum          0
#> 3    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          1
#> 4    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          0
#> 5    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          0
#> 6    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          0
#> # ℹ 9 more variables: DateofTreatment_dow <fct>, DateofTreatment_month <fct>,
#> #   DateofTreatment_year <int>, DateofTreatment_hour <int>,
#> #   DateofTreatment_minute <int>, DateofTreatment_second <dbl>,
#> #   DateofTreatment_LaborDay <int>, DateofTreatment_NewYearsDay <int>,
#> #   DateofTreatment_ChristmasDay <int>

# --- Calling step_rm at the end of the chain fails, recipes only finds NEK and datetime cols because strings have already been converted to factor ----
lr_recipe <- recipe(CaseStatus ~ ., data = df) |>
  step_date(DateofTreatment) |>
  step_time(DateofTreatment) |>
  step_holiday(DateofTreatment) |>
  step_rm(NEK, all_string_predictors(), all_datetime_predictors()) # remove unique ID, any string predictors, dates

prep(lr_recipe)
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:    1
#> predictor: 10
#> 
#> ── Training information
#> Training data contained 6 data points and no incomplete rows.
#> 
#> ── Operations
#> • Date features from: DateofTreatment | Trained
#> • Time features from: DateofTreatment | Trained
#> • Holiday features from: DateofTreatment | Trained
#> • Variables removed: NEK, DateChanged, DateCollected, ... | Trained
juice(prep(lr_recipe))
#> # A tibble: 6 × 16
#>   AgeYrs DrugUse     Comments    DiagOther   DiagOther2  Drug1       CaseStatus
#>    <dbl> <fct>       <fct>       <fct>       <fct>       <fct>            <int>
#> 1    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          1
#> 2    555 SELF-HARM   lorem ipsum lorem ipsum lorem ipsum lorem ipsum          0
#> 3    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          1
#> 4    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          0
#> 5    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          0
#> 6    555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum          0
#> # ℹ 9 more variables: DateofTreatment_dow <fct>, DateofTreatment_month <fct>,
#> #   DateofTreatment_year <int>, DateofTreatment_hour <int>,
#> #   DateofTreatment_minute <int>, DateofTreatment_second <dbl>,
#> #   DateofTreatment_LaborDay <int>, DateofTreatment_NewYearsDay <int>,
#> #   DateofTreatment_ChristmasDay <int>

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.0 (2024-04-24 ucrt)
#>  os       Windows 10 x64 (build 19045)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.utf8
#>  ctype    English_United States.utf8
#>  tz       America/New_York
#>  date     2024-09-27
#>  pandoc   3.1.2 @ C:/Users/<snip>/AppData/Local/Pandoc/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version    date (UTC) lib source
#>  class          7.3-22     2023-05-03 [1] CRAN (R 4.4.0)
#>  cli            3.6.2      2023-12-11 [1] CRAN (R 4.4.0)
#>  clock          0.7.1      2024-07-18 [1] CRAN (R 4.4.1)
#>  codetools      0.2-20     2024-03-31 [1] CRAN (R 4.4.0)
#>  data.table     1.15.4     2024-03-30 [1] CRAN (R 4.4.0)
#>  digest         0.6.35     2024-03-11 [1] CRAN (R 4.4.0)
#>  dplyr        * 1.1.4      2023-11-17 [1] CRAN (R 4.4.0)
#>  evaluate       0.23       2023-11-01 [1] CRAN (R 4.4.0)
#>  fansi          1.0.6      2023-12-08 [1] CRAN (R 4.4.0)
#>  fastmap        1.2.0      2024-05-15 [1] CRAN (R 4.4.0)
#>  fs             1.6.4      2024-04-25 [1] CRAN (R 4.4.0)
#>  future         1.33.2     2024-03-26 [1] CRAN (R 4.4.0)
#>  future.apply   1.11.2     2024-03-28 [1] CRAN (R 4.4.0)
#>  generics       0.1.3      2022-07-05 [1] CRAN (R 4.4.0)
#>  globals        0.16.3     2024-03-08 [1] CRAN (R 4.4.0)
#>  glue           1.7.0      2024-01-09 [1] CRAN (R 4.4.0)
#>  gower          1.0.1      2022-12-22 [1] CRAN (R 4.4.0)
#>  hardhat        1.4.0      2024-06-02 [1] CRAN (R 4.4.1)
#>  htmltools      0.5.8.1    2024-04-04 [1] CRAN (R 4.4.0)
#>  ipred          0.9-15     2024-07-18 [1] CRAN (R 4.4.1)
#>  knitr          1.46       2024-04-06 [1] CRAN (R 4.4.0)
#>  lattice        0.22-6     2024-03-20 [1] CRAN (R 4.4.0)
#>  lava           1.8.0      2024-03-05 [1] CRAN (R 4.4.1)
#>  lifecycle      1.0.4      2023-11-07 [1] CRAN (R 4.4.0)
#>  listenv        0.9.1      2024-01-29 [1] CRAN (R 4.4.0)
#>  lubridate      1.9.3      2023-09-27 [1] CRAN (R 4.4.0)
#>  magrittr       2.0.3      2022-03-30 [1] CRAN (R 4.4.0)
#>  MASS           7.3-60.2   2024-04-24 [1] local
#>  Matrix         1.7-0      2024-03-22 [1] CRAN (R 4.4.0)
#>  nnet           7.3-19     2023-05-03 [1] CRAN (R 4.4.0)
#>  parallelly     1.37.1     2024-02-29 [1] CRAN (R 4.4.0)
#>  pillar         1.9.0      2023-03-22 [1] CRAN (R 4.4.0)
#>  pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.4.0)
#>  prodlim        2024.06.25 2024-06-24 [1] CRAN (R 4.4.1)
#>  purrr          1.0.2      2023-08-10 [1] CRAN (R 4.4.0)
#>  R6             2.5.1      2021-08-19 [1] CRAN (R 4.4.0)
#>  Rcpp           1.0.12     2024-01-09 [1] CRAN (R 4.4.0)
#>  recipes      * 1.1.0      2024-07-04 [1] CRAN (R 4.4.1)
#>  reprex         2.1.1      2024-07-06 [1] CRAN (R 4.4.1)
#>  rlang          1.1.3      2024-01-10 [1] CRAN (R 4.4.0)
#>  rmarkdown      2.27       2024-05-17 [1] CRAN (R 4.4.0)
#>  rpart          4.1.23     2023-12-05 [1] CRAN (R 4.4.0)
#>  sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.4.1)
#>  survival       3.5-8      2024-02-14 [1] CRAN (R 4.4.0)
#>  tibble         3.2.1      2023-03-20 [1] CRAN (R 4.4.0)
#>  tidyselect     1.2.1      2024-03-11 [1] CRAN (R 4.4.0)
#>  timechange     0.3.0      2024-01-18 [1] CRAN (R 4.4.0)
#>  timeDate       4041.110   2024-09-22 [1] CRAN (R 4.4.1)
#>  tzdb           0.4.0      2023-05-12 [1] CRAN (R 4.4.0)
#>  utf8           1.2.4      2023-10-22 [1] CRAN (R 4.4.0)
#>  vctrs          0.6.5      2023-12-01 [1] CRAN (R 4.4.0)
#>  withr          3.0.0      2024-01-16 [1] CRAN (R 4.4.0)
#>  xfun           0.43       2024-03-25 [1] CRAN (R 4.4.0)
#>  yaml           2.3.8      2023-12-11 [1] CRAN (R 4.4.0)
#> 
#>  [1] C:/Users/<snip>/AppData/Local/Programs/R/R-4.4.0/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

This is happening because of the default value of the strings_as_factors argument to prep(), which is TRUE. Setting it to FALSE will likely resolve your issue.

This is a known issue, and we are planning to move the argument to recipe() so that it is more central to the recipe object. See #331.
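
For anyone landing here, a minimal sketch of the suggested workaround follows. It assumes the prep() interface of recipes 1.1.0 (the version in the session info above) and uses a small hypothetical data frame (`toy`, with made-up columns `id`, `note`, `grp`, `outcome`) rather than the reporter's data. The only change from the failing pattern is passing strings_as_factors = FALSE, so character columns should still be seen as strings when a later step_rm(all_string_predictors()) is trained.

```r
library(recipes)

# Hypothetical toy data: an ID, a character column, a factor, and an outcome
toy <- data.frame(
  id      = 1:6,
  note    = rep("lorem ipsum", 6),
  grp     = factor(c("a", "b", "a", "a", "b", "a")),
  outcome = c(1L, 0L, 1L, 0L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Same shape as the failing recipe: the string selector is NOT in the first step
rec <- recipe(outcome ~ ., data = toy) |>
  step_rm(id) |>
  step_rm(all_string_predictors())

# Keep character columns as character while the steps are trained
trained <- prep(rec, strings_as_factors = FALSE)

# Processed training set; inspect which columns remain
baked <- bake(trained, new_data = NULL)
names(baked)
```

If the workaround behaves as described, `note` is dropped while the genuine factor `grp` is kept. Note that this also means any character columns you do keep stay character instead of being converted to factors, which may matter for downstream steps or models.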