undocumented char -> factor conversion in recipe creates non-commutative condition
beansrowning opened this issue · 1 comments
The problem
Perhaps I've missed some documentation, but I seem to have identified an issue where {recipes} silently converts character features to factors, which in turn makes all_string() and all_string_predictors() behave differently depending on where in the recipe they're used.
In my example, I have columns of several different types. I have a few models that will use character features, and some which won't. In this case, I want to remove those features and only keep factors, integers, doubles, etc.
I'd assume that step_rm(all_string_predictors()) would sort this out quickly, but the result actually depends on where in the recipe chain you place it.
I could remove these features beforehand, or pre-compute which columns they are and remove them by name, but this seems somewhat antithetical to the entire tidymodels approach.
Is this expected behavior, and if so, what is the "preferred" solution to handling it?
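For reference, the name-based workaround mentioned above might look like the following minimal sketch. The `toy` data frame and its column names are hypothetical stand-ins for the real data; the idea is only that names recorded from the raw data do not depend on how the recipe types the columns later.

```r
library(recipes)

# Hypothetical toy data standing in for the real `df`
toy <- data.frame(
  id = 1:4,
  note = c("a", "b", "c", "d"),
  grp = factor(c("x", "y", "x", "y")),
  y = c(0, 1, 0, 1),
  stringsAsFactors = FALSE
)

# Record the character columns from the raw data, before any recipe exists
chr_cols <- names(toy)[vapply(toy, is.character, logical(1))]

# Removing by name should not depend on the step's position in the chain
rec <- recipe(y ~ ., data = toy) |>
  step_rm(id) |>
  step_rm(all_of(chr_cols))

baked <- bake(prep(rec), new_data = NULL)
names(baked)
```

This sidesteps the type selectors entirely, at the cost of computing the column list outside the recipe.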
Reproducible example
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(recipes)
#> Warning: package 'recipes' was built under R version 4.4.1
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
df <- structure(list(
NEK = c(
221119035L, 221213318L, 211030043L, 220842741L,
220161193L, 221215066L
), DateChanged = structure(c(
1667865600,
1670284800, 1634169600, 1660867200, 1643587200, 1670371200
), class = c(
"POSIXct",
"POSIXt"
), tzone = ""), DateCollected = structure(c(
1667779200,
1670198400, 1634083200, 1660780800, 1643500800, 1670371200
), class = c(
"POSIXct",
"POSIXt"
), tzone = ""), DateofTreatment = structure(c(
1653177600,
1669334400, 1632614400, 1660262400, 1642636800, 1665964800
), class = c(
"POSIXct",
"POSIXt"
), tzone = ""), AgeYrs = c(555, 555, 555, 555, 555, 555), DrugUse = structure(c(1L, 3L, 1L, 1L, 1L, 1L), levels = c(
"THERAPEUTIC",
"ABUSE", "SELF-HARM", "ASSAULT", "UNKNOWN INTENT", "NOT AN ADE"
), class = "factor"), Comments = c(
"lorem ipsum", "lorem ipsum",
"lorem ipsum", "lorem ipsum", "lorem ipsum", "lorem ipsum"
),
DiagOther = c(
"lorem ipsum", "lorem ipsum", "lorem ipsum",
"lorem ipsum", "lorem ipsum", "lorem ipsum"
), DiagOther2 = c(
"lorem ipsum",
"lorem ipsum", "lorem ipsum", "lorem ipsum", "lorem ipsum",
"lorem ipsum"
), Drug1 = c(
"lorem ipsum", "lorem ipsum", "lorem ipsum",
"lorem ipsum", "lorem ipsum", "lorem ipsum"
),
CaseStatus = rbinom(6, 1, 0.5)
), row.names = c(
NA,
-6L
), class = c("tbl_df", "tbl", "data.frame"))
glimpse(df)
#> Rows: 6
#> Columns: 11
#> $ NEK <int> 221119035, 221213318, 211030043, 220842741, 220161193,…
#> $ DateChanged <dttm> 2022-11-07 19:00:00, 2022-12-05 19:00:00, 2021-10-13 …
#> $ DateCollected <dttm> 2022-11-06 19:00:00, 2022-12-04 19:00:00, 2021-10-12 …
#> $ DateofTreatment <dttm> 2022-05-21 20:00:00, 2022-11-24 19:00:00, 2021-09-25 …
#> $ AgeYrs <dbl> 555, 555, 555, 555, 555, 555
#> $ DrugUse <fct> THERAPEUTIC, SELF-HARM, THERAPEUTIC, THERAPEUTIC, THER…
#> $ Comments <chr> "lorem ipsum", "lorem ipsum", "lorem ipsum", "lorem ip…
#> $ DiagOther <chr> "lorem ipsum", "lorem ipsum", "lorem ipsum", "lorem ip…
#> $ DiagOther2 <chr> "lorem ipsum", "lorem ipsum", "lorem ipsum", "lorem ip…
#> $ Drug1 <chr> "lorem ipsum", "lorem ipsum", "lorem ipsum", "lorem ip…
#> $ CaseStatus <int> 1, 0, 1, 0, 0, 0
# --- With two steps, this seems to work as expected -------------------------------------
lr_recipe <- recipe(CaseStatus ~ ., data = df) |>
  step_rm(NEK, all_string_predictors()) |>
  step_date(DateofTreatment) |>
  step_time(DateofTreatment) |>
  step_holiday(DateofTreatment) |>
  step_rm(all_datetime_predictors()) # remove unique ID, any string predictors, dates
prep(lr_recipe)
#>
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome: 1
#> predictor: 10
#>
#> ── Training information
#> Training data contained 6 data points and no incomplete rows.
#>
#> ── Operations
#> • Variables removed: NEK, Comments, DiagOther, DiagOther2, Drug1 | Trained
#> • Date features from: DateofTreatment | Trained
#> • Time features from: DateofTreatment | Trained
#> • Holiday features from: DateofTreatment | Trained
#> • Variables removed: DateChanged, DateCollected, DateofTreatment | Trained
juice(prep(lr_recipe))
#> # A tibble: 6 × 12
#> AgeYrs DrugUse CaseStatus DateofTreatment_dow DateofTreatment_month
#> <dbl> <fct> <int> <fct> <fct>
#> 1 555 THERAPEUTIC 1 Sat May
#> 2 555 SELF-HARM 0 Thu Nov
#> 3 555 THERAPEUTIC 1 Sat Sep
#> 4 555 THERAPEUTIC 0 Thu Aug
#> 5 555 THERAPEUTIC 0 Wed Jan
#> 6 555 THERAPEUTIC 0 Sun Oct
#> # ℹ 7 more variables: DateofTreatment_year <int>, DateofTreatment_hour <int>,
#> # DateofTreatment_minute <int>, DateofTreatment_second <dbl>,
#> # DateofTreatment_LaborDay <int>, DateofTreatment_NewYearsDay <int>,
#> # DateofTreatment_ChristmasDay <int>
# --- If all_string_predictors() is not in the first step, it selects nothing and the string columns survive ---
lr_recipe <- recipe(CaseStatus ~ ., data = df) |>
  step_rm(NEK) |>
  step_rm(all_string_predictors()) |>
  step_date(DateofTreatment) |>
  step_time(DateofTreatment) |>
  step_holiday(DateofTreatment) |>
  step_rm(all_datetime_predictors()) # remove unique ID, any string predictors, dates
prep(lr_recipe)
#>
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome: 1
#> predictor: 10
#>
#> ── Training information
#> Training data contained 6 data points and no incomplete rows.
#>
#> ── Operations
#> • Variables removed: NEK | Trained
#> • Variables removed: <none> | Trained
#> • Date features from: DateofTreatment | Trained
#> • Time features from: DateofTreatment | Trained
#> • Holiday features from: DateofTreatment | Trained
#> • Variables removed: DateChanged, DateCollected, DateofTreatment | Trained
juice(prep(lr_recipe))
#> # A tibble: 6 × 16
#> AgeYrs DrugUse Comments DiagOther DiagOther2 Drug1 CaseStatus
#> <dbl> <fct> <fct> <fct> <fct> <fct> <int>
#> 1 555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum 1
#> 2 555 SELF-HARM lorem ipsum lorem ipsum lorem ipsum lorem ipsum 0
#> 3 555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum 1
#> 4 555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum 0
#> 5 555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum 0
#> 6 555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum 0
#> # ℹ 9 more variables: DateofTreatment_dow <fct>, DateofTreatment_month <fct>,
#> # DateofTreatment_year <int>, DateofTreatment_hour <int>,
#> # DateofTreatment_minute <int>, DateofTreatment_second <dbl>,
#> # DateofTreatment_LaborDay <int>, DateofTreatment_NewYearsDay <int>,
#> # DateofTreatment_ChristmasDay <int>
# --- Calling step_rm() at the end of the chain also fails: recipes only finds NEK and the datetime columns because the strings have already been converted to factors ---
lr_recipe <- recipe(CaseStatus ~ ., data = df) |>
  step_date(DateofTreatment) |>
  step_time(DateofTreatment) |>
  step_holiday(DateofTreatment) |>
  step_rm(NEK, all_string_predictors(), all_datetime_predictors()) # remove unique ID, any string predictors, dates
prep(lr_recipe)
#>
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome: 1
#> predictor: 10
#>
#> ── Training information
#> Training data contained 6 data points and no incomplete rows.
#>
#> ── Operations
#> • Date features from: DateofTreatment | Trained
#> • Time features from: DateofTreatment | Trained
#> • Holiday features from: DateofTreatment | Trained
#> • Variables removed: NEK, DateChanged, DateCollected, ... | Trained
juice(prep(lr_recipe))
#> # A tibble: 6 × 16
#> AgeYrs DrugUse Comments DiagOther DiagOther2 Drug1 CaseStatus
#> <dbl> <fct> <fct> <fct> <fct> <fct> <int>
#> 1 555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum 1
#> 2 555 SELF-HARM lorem ipsum lorem ipsum lorem ipsum lorem ipsum 0
#> 3 555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum 1
#> 4 555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum 0
#> 5 555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum 0
#> 6 555 THERAPEUTIC lorem ipsum lorem ipsum lorem ipsum lorem ipsum 0
#> # ℹ 9 more variables: DateofTreatment_dow <fct>, DateofTreatment_month <fct>,
#> # DateofTreatment_year <int>, DateofTreatment_hour <int>,
#> # DateofTreatment_minute <int>, DateofTreatment_second <dbl>,
#> # DateofTreatment_LaborDay <int>, DateofTreatment_NewYearsDay <int>,
#> # DateofTreatment_ChristmasDay <int>
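One thing that may be worth testing against the failing recipes above: `prep()` has a `strings_as_factors` argument that defaults to `TRUE`. The sketch below assumes that this argument is what drives the character-to-factor conversion seen here, and that turning it off keeps the columns typed as strings for later steps. That is an assumption, not confirmed behavior, and the `toy` data frame is a hypothetical stand-in for the real data.

```r
library(recipes)

# Hypothetical toy data standing in for the real `df`
toy <- data.frame(
  id = 1:4,
  note = c("a", "b", "c", "d"),
  grp = factor(c("x", "y", "x", "y")),
  y = c(0, 1, 0, 1),
  stringsAsFactors = FALSE
)

# Same shape as the failing case: the string selector is NOT in the first step
rec <- recipe(y ~ ., data = toy) |>
  step_rm(id) |>
  step_rm(all_string_predictors())

# Assumption: with the conversion switched off, `note` is still a string
# when the second step_rm() is trained, so the selector can find it
prepped <- prep(rec, strings_as_factors = FALSE)
out <- bake(prepped, new_data = NULL)
names(out)
```

If this does remove `note`, the trade-off is that any character columns a model keeps stay as character rather than factor, so models that want factors would need an explicit conversion step.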
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.4.0 (2024-04-24 ucrt)
#> os Windows 10 x64 (build 19045)
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_United States.utf8
#> ctype English_United States.utf8
#> tz America/New_York
#> date 2024-09-27
#> pandoc 3.1.2 @ C:/Users/<snip>/AppData/Local/Pandoc/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> class 7.3-22 2023-05-03 [1] CRAN (R 4.4.0)
#> cli 3.6.2 2023-12-11 [1] CRAN (R 4.4.0)
#> clock 0.7.1 2024-07-18 [1] CRAN (R 4.4.1)
#> codetools 0.2-20 2024-03-31 [1] CRAN (R 4.4.0)
#> data.table 1.15.4 2024-03-30 [1] CRAN (R 4.4.0)
#> digest 0.6.35 2024-03-11 [1] CRAN (R 4.4.0)
#> dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)
#> evaluate 0.23 2023-11-01 [1] CRAN (R 4.4.0)
#> fansi 1.0.6 2023-12-08 [1] CRAN (R 4.4.0)
#> fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)
#> fs 1.6.4 2024-04-25 [1] CRAN (R 4.4.0)
#> future 1.33.2 2024-03-26 [1] CRAN (R 4.4.0)
#> future.apply 1.11.2 2024-03-28 [1] CRAN (R 4.4.0)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.4.0)
#> globals 0.16.3 2024-03-08 [1] CRAN (R 4.4.0)
#> glue 1.7.0 2024-01-09 [1] CRAN (R 4.4.0)
#> gower 1.0.1 2022-12-22 [1] CRAN (R 4.4.0)
#> hardhat 1.4.0 2024-06-02 [1] CRAN (R 4.4.1)
#> htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
#> ipred 0.9-15 2024-07-18 [1] CRAN (R 4.4.1)
#> knitr 1.46 2024-04-06 [1] CRAN (R 4.4.0)
#> lattice 0.22-6 2024-03-20 [1] CRAN (R 4.4.0)
#> lava 1.8.0 2024-03-05 [1] CRAN (R 4.4.1)
#> lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0)
#> listenv 0.9.1 2024-01-29 [1] CRAN (R 4.4.0)
#> lubridate 1.9.3 2023-09-27 [1] CRAN (R 4.4.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.0)
#> MASS 7.3-60.2 2024-04-24 [1] local
#> Matrix 1.7-0 2024-03-22 [1] CRAN (R 4.4.0)
#> nnet 7.3-19 2023-05-03 [1] CRAN (R 4.4.0)
#> parallelly 1.37.1 2024-02-29 [1] CRAN (R 4.4.0)
#> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.4.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0)
#> prodlim 2024.06.25 2024-06-24 [1] CRAN (R 4.4.1)
#> purrr 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.4.0)
#> Rcpp 1.0.12 2024-01-09 [1] CRAN (R 4.4.0)
#> recipes * 1.1.0 2024-07-04 [1] CRAN (R 4.4.1)
#> reprex 2.1.1 2024-07-06 [1] CRAN (R 4.4.1)
#> rlang 1.1.3 2024-01-10 [1] CRAN (R 4.4.0)
#> rmarkdown 2.27 2024-05-17 [1] CRAN (R 4.4.0)
#> rpart 4.1.23 2023-12-05 [1] CRAN (R 4.4.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.1)
#> survival 3.5-8 2024-02-14 [1] CRAN (R 4.4.0)
#> tibble 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)
#> tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0)
#> timechange 0.3.0 2024-01-18 [1] CRAN (R 4.4.0)
#> timeDate 4041.110 2024-09-22 [1] CRAN (R 4.4.1)
#> tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.4.0)
#> utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.0)
#> vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0)
#> withr 3.0.0 2024-01-16 [1] CRAN (R 4.4.0)
#> xfun 0.43 2024-03-25 [1] CRAN (R 4.4.0)
#> yaml 2.3.8 2023-12-11 [1] CRAN (R 4.4.0)
#>
#> [1] C:/Users/<snip>/AppData/Local/Programs/R/R-4.4.0/library
#>
#> ──────────────────────────────────────────────────────────────────────────────