tidymodels/hardhat

`mold.formula()` creates two predictor columns per logical feature

cregouby opened this issue · 4 comments

The problem

I'm having trouble with mold.formula() when one predictor is logical. The mold result$predictors then includes two dbl columns per logical feature what is not expected.

Neither mold.recipe() nor mold.data.frame() on the same data suffer this issue.

Reproducible example

Unexpected two dbl columns

library(hardhat)
# Data with logical feature --------------------------------------
data <- data.frame(
  feat_logical = sample(c(TRUE, FALSE), 10, replace = TRUE),
  target = factor(sample(c("yes", "no"), 10, replace = TRUE))
)

mold_formula <-  mold(target~. , data=data)
# expecting a single predictor column here
mold_formula$predictors
#> # A tibble: 10 × 2
#>    feat_logicalFALSE feat_logicalTRUE
#>                <dbl>            <dbl>
#>  1                 1                0
#>  2                 0                1
#>  3                 1                0
#>  4                 1                0
#>  5                 1                0
#>  6                 1                0
#>  7                 1                0
#>  8                 1                0
#>  9                 1                0
#> 10                 1                0

Created on 2021-12-15 by the reprex package (v2.0.1)

Unexpected difference between fomula and reciepe

library(hardhat)
# Data with logical feature --------------------------------------
data <- data.frame(
  feat_logical = sample(c(TRUE, FALSE), 10, replace = TRUE),
  target = factor(sample(c("yes", "no"), 10, replace = TRUE))
)

# expecting predictors to be equals between formula and void recipe
mold_formula <-  mold(target~. , data=data)
mold_recipe <- mold(recipes::recipe(target~., data), data=data)

waldo::compare(mold_formula$predictors, mold_recipe$predictors)
#> `old` is length 2
#> `new` is length 1
#> 
#> `names(old)`: "feat_logicalFALSE" "feat_logicalTRUE"
#> `names(new)`: "feat_logical"                        
#> 
#> `old$feat_logicalFALSE` is a double vector (1, 1, 1, 1, 1, ...)
#> `new$feat_logicalFALSE` is absent
#> 
#> `old$feat_logicalTRUE` is a double vector (0, 0, 0, 0, 0, ...)
#> `new$feat_logicalTRUE` is absent
#> 
#> `old$feat_logical` is absent
#> `new$feat_logical` is a logical vector (FALSE, FALSE, FALSE, FALSE, FALSE, ...)

Created on 2021-12-15 by the reprex package (v2.0.1)

Workaround

Either

  • encapsulate the formula in a void recipe(), like in the second part of the reprex, or
  • mutate the logical vector into factor or integer before mold() (with different results depending on the downstream model)

This comes from model.matrix() which I guess treats logical columns like factors? So dummy variables are created? And when there is no intercept factors are always fully expanded (this is expected). The whole point of the formula interface is to use model.matrix() with very little modifications, so on first thought I don't think we are going to want to make any changes here.

data <- data.frame(
  feat_logical = sample(c(TRUE, FALSE), 10, replace = TRUE),
  target = factor(sample(c("yes", "no"), 10, replace = TRUE))
)

model.matrix(
  target ~ feat_logical + 0, 
  data
)
#>    feat_logicalFALSE feat_logicalTRUE
#> 1                  0                1
#> 2                  0                1
#> 3                  1                0
#> 4                  0                1
#> 5                  1                0
#> 6                  1                0
#> 7                  0                1
#> 8                  0                1
#> 9                  0                1
#> 10                 1                0
#> attr(,"assign")
#> [1] 1 1
#> attr(,"contrasts")
#> attr(,"contrasts")$feat_logical
#> [1] "contr.treatment"

# we have no intercept by default
hardhat::mold(
  target ~ feat_logical, 
  data
)$predictors
#> # A tibble: 10 × 2
#>    feat_logicalFALSE feat_logicalTRUE
#>                <dbl>            <dbl>
#>  1                 0                1
#>  2                 0                1
#>  3                 1                0
#>  4                 0                1
#>  5                 1                0
#>  6                 1                0
#>  7                 0                1
#>  8                 0                1
#>  9                 0                1
#> 10                 1                0

model.matrix(
  target ~ feat_logical, 
  data
)
#>    (Intercept) feat_logicalTRUE
#> 1            1                1
#> 2            1                1
#> 3            1                0
#> 4            1                1
#> 5            1                0
#> 6            1                0
#> 7            1                1
#> 8            1                1
#> 9            1                1
#> 10           1                0
#> attr(,"assign")
#> [1] 0 1
#> attr(,"contrasts")
#> attr(,"contrasts")$feat_logical
#> [1] "contr.treatment"

hardhat::mold(
  target ~ feat_logical, 
  data, 
  blueprint = hardhat::default_formula_blueprint(intercept = TRUE)
)$predictors
#> # A tibble: 10 × 2
#>    `(Intercept)` feat_logicalTRUE
#>            <dbl>            <dbl>
#>  1             1                1
#>  2             1                1
#>  3             1                0
#>  4             1                1
#>  5             1                0
#>  6             1                0
#>  7             1                1
#>  8             1                1
#>  9             1                1
#> 10             1                0

Created on 2021-12-16 by the reprex package (v2.0.1)

I'm pretty sure this is expected since model.matrix() has some code that checks is.factor() || is.logical() for determining if something is factorish
https://github.com/wch/r-source/blob/79298c499218846d14500255efd622b5021c10ec/src/library/stats/R/models.R#L642

Hello David,
Thanks for this very clear and documented investigation.
I'll be able to leave with it !

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.