tidymodels/hardhat

BTM-library breaks tune_grid()

LasWin opened this issue · 11 comments

It seems that having BTM-library loaded breaks the tune_grid() function.

The following code works but when library(BTM) is called, fitting the model no longer works.

I can work around this problem either by choosing another topic model algorithm or doing the fitting within caret but this is an annoying little bug which propably should be fixed.

library(tidymodels)
library(dplyr)

data(cells, package = "modeldata")

cell_split <- initial_split(cells %>% select(-case), 
                            strata = class)
cell_train <- training(cell_split)
cell_test  <- testing(cell_split)

tune_spec <- 
  decision_tree(
    cost_complexity = tune(),
    tree_depth = tune()
  ) %>% 
  set_engine("rpart") %>% 
  set_mode("classification")

tree_grid <- grid_regular(cost_complexity(),
                          tree_depth(),
                          levels = 5)

cell_folds <- vfold_cv(cell_train)


tree_wf <- workflow() %>%
  add_model(tune_spec) %>%
  add_formula(class ~ .)

tree_res <- 
  tree_wf %>% 
  tune_grid(
    resamples = cell_folds,
    grid = tree_grid
  )

library(BTM)

tree_res <- 
  tree_wf %>% 
  tune_grid(
    resamples = cell_folds,
    grid = tree_grid
  )

When BTM is loaded, BTM:::terms.data.frame gets called. I just stepped through this with the debugger and it looks like it is happening in hardhat::model_frame():

terms <- terms(frame)

library(tidymodels)
library(dplyr)
library(BTM)

data(cells, package = "modeldata")

cell_split <- initial_split(cells %>% select(-case), 
                            strata = class)
cell_train <- training(cell_split)
cell_test  <- testing(cell_split)

tune_spec <- 
  decision_tree(
    cost_complexity = tune(),
    tree_depth = tune()
  ) %>% 
  set_engine("rpart") %>% 
  set_mode("classification")

tree_grid <- grid_regular(cost_complexity(),
                          tree_depth(),
                          levels = 5)

cell_folds <- vfold_cv(cell_train)


tree_wf <- workflow() %>%
  add_model(tune_spec) %>%
  add_formula(class ~ .)

tree_wf %>% 
  tune_grid(
    resamples = cell_folds,
    grid = tree_grid
  )
#> x Fold01: preprocessor 1/1: Error in terms.data.frame(frame): please provide in da...
#> x Fold02: preprocessor 1/1: Error in terms.data.frame(frame): please provide in da...
#> x Fold03: preprocessor 1/1: Error in terms.data.frame(frame): please provide in da...
#> x Fold04: preprocessor 1/1: Error in terms.data.frame(frame): please provide in da...
#> x Fold05: preprocessor 1/1: Error in terms.data.frame(frame): please provide in da...
#> x Fold06: preprocessor 1/1: Error in terms.data.frame(frame): please provide in da...
#> x Fold07: preprocessor 1/1: Error in terms.data.frame(frame): please provide in da...
#> x Fold08: preprocessor 1/1: Error in terms.data.frame(frame): please provide in da...
#> x Fold09: preprocessor 1/1: Error in terms.data.frame(frame): please provide in da...
#> x Fold10: preprocessor 1/1: Error in terms.data.frame(frame): please provide in da...
#> Warning: All models failed. See the `.notes` column.
#> Warning: This tuning result has notes. Example notes on model fitting include:
#> preprocessor 1/1: Error in terms.data.frame(frame): please provide in data a data.frame with 2 columns as indicated in the help of BTM
#> preprocessor 1/1: Error in terms.data.frame(frame): please provide in data a data.frame with 2 columns as indicated in the help of BTM
#> preprocessor 1/1: Error in terms.data.frame(frame): please provide in data a data.frame with 2 columns as indicated in the help of BTM
#> # Tuning results
#> # 10-fold cross-validation 
#> # A tibble: 10 x 4
#>    splits             id     .metrics .notes          
#>    <list>             <chr>  <list>   <list>          
#>  1 <split [1363/152]> Fold01 <NULL>   <tibble [1 × 1]>
#>  2 <split [1363/152]> Fold02 <NULL>   <tibble [1 × 1]>
#>  3 <split [1363/152]> Fold03 <NULL>   <tibble [1 × 1]>
#>  4 <split [1363/152]> Fold04 <NULL>   <tibble [1 × 1]>
#>  5 <split [1363/152]> Fold05 <NULL>   <tibble [1 × 1]>
#>  6 <split [1364/151]> Fold06 <NULL>   <tibble [1 × 1]>
#>  7 <split [1364/151]> Fold07 <NULL>   <tibble [1 × 1]>
#>  8 <split [1364/151]> Fold08 <NULL>   <tibble [1 × 1]>
#>  9 <split [1364/151]> Fold09 <NULL>   <tibble [1 × 1]>
#> 10 <split [1364/151]> Fold10 <NULL>   <tibble [1 × 1]>

Created on 2021-02-15 by the reprex package (v1.0.0)

It is a bit invasive to make a method like this for a class you don't own, as it can cause breakage in unexpected places like this, where we are using terms() in a very "normal" way and expecting the terms.default method to get called.

I would have suggested either subclassing "data.frame" and making a terms() method for that subclass, or using a non-generic btm_terms() in place of: https://github.com/bnosac/BTM/blob/d5276595a1f2e788a73179651b7c3c14938b35da/R/btm.R#L387

@jwijffels would you consider changing BTM to not export a method for terms.data.frame? Calling terms() on a data frame and expecting the default method to get run is very common, so I think this could break more than just our packages. ?stats::model.frame even documents that the return value is a data frame with a "terms" attribute attached, and the documentation of ?terms states that:

the default method just extracts the terms component of the object, or failing that a "terms" attribute (as used by model.frame).

BTM has an S3 method that cannot be overridden; since the stats package is always attached on startup, you can never load it after BTM. Any since those S3 methods are not fully exported, we can't call them by namespace either.

I thought that conflicted package would be able to fix it but that isn't the case.

It doesn't help that the S language basics are not well designed. Surprisingly,

> class(model.frame(dist ~ speed, data = cars))
[1] "data.frame"

Also, functions like expand.model.frame() might make you think that they are S3 methods (it is not).

I'm not familiar with hardhat and what it does. But if you need stats:::terms.default. Why don't you add the code to hardhat, it's 3 lines?

stats:::terms.default <- function (x, ...) {
    v <- x$terms
    if (is.null(v)) {
        v <- attr(x, "terms")
        if (is.null(v)) 
            stop("no terms component nor attribute")
    }
    v
}

@jwijffels, the hardhat part of this is not relevant. Would you not consider the fact that the return value is different before and after loading BTM to be a bug in BTM? It seems in direct conflict with base R's intended and documented behavior for terms() when used on the result of model.frame().

mf <- model.frame(dist ~ speed, data = cars)

terms(mf)
#> dist ~ speed
#> attr(,"variables")
#> list(dist, speed)
#> attr(,"factors")
#>       speed
#> dist      0
#> speed     1
#> attr(,"term.labels")
#> [1] "speed"
#> attr(,"order")
#> [1] 1
#> attr(,"intercept")
#> [1] 1
#> attr(,"response")
#> [1] 1
#> attr(,".Environment")
#> <environment: R_GlobalEnv>
#> attr(,"predvars")
#> list(dist, speed)
#> attr(,"dataClasses")
#>      dist     speed 
#> "numeric" "numeric"

library(BTM)
#> Warning: package 'BTM' was built under R version 4.0.2

terms(mf)
#> $n
#> [1] 19
#> 
#> $tokens
#>    id token freq
#> 1   0     4    2
#> 2   1     7    2
#> 3   2     8    1
#> 4   3     9    1
#> 5   4    10    3
#> 6   5    11    2
#> 7   6    12    4
#> 8   7    13    4
#> 9   8    14    4
#> 10  9    15    3
#> 11 10    16    2
#> 12 11    17    3
#> 13 12    18    4
#> 14 13    19    3
#> 15 14    20    5
#> 16 15    22    1
#> 17 16    23    1
#> 18 17    24    4
#> 19 18    25    1

Created on 2021-02-16 by the reprex package (v1.0.0)

Sure your remark on the documented behaviour of function terms in the stats package is valid.
But I wouldn't call it a bug. Would you call filter(...) and library(dplyr); filter(...) a bug?

Would you call filter(...) and library(dplyr); filter(...) a bug?

I would if there were no way to mediate name conflicts.

I don't think that there is any issue making S3 methods for base or stats package generics; we do that all of the time.

It's just that BTM::terms.data.frame() is very incongruent with the intention of what terms methods do:

  • It does not appear to return a terms object. The generic is documented to "extract terms objects from various kinds of R data objects." (their emphasis)
  • You are using a very general data format (data frame) for a very narrow purpose (obtain biterms from a tokenized data frame for text analysis).

If an S3 method implementation cause significant side effects elsewhere, I would consider my code to be a bug.

Completely overriding a generic function and adding an S3 method that changes pre-existing behavior have very different levels of invasiveness.

I would say that your S3 method that changes pre-existing behavior is more invasive, since there is no way to "opt out" of it and ensure that we continue to get the behavior of terms() that happened before loading your package.

In the filter() case, you:

  1. Get a warning upon package load
  2. Can prefix with stats:: to retain existing behavior

Neither of these are the case with BTM, which to me makes it more invasive, to the point of being a bug.

mf <- model.frame(dist ~ speed, data = cars)
library(BTM)
stats::terms(mf)
#> $n
#> [1] 19
#> 
#> $tokens
#>    id token freq
#> 1   0     4    2
#> 2   1     7    2
#> 3   2     8    1
#> 4   3     9    1
#> 5   4    10    3
#> 6   5    11    2
#> 7   6    12    4
#> 8   7    13    4
#> 9   8    14    4
#> 10  9    15    3
#> 11 10    16    2
#> 12 11    17    3
#> 13 12    18    4
#> 14 13    19    3
#> 15 14    20    5
#> 16 15    22    1
#> 17 16    23    1
#> 18 17    24    4
#> 19 18    25    1

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
stats::filter(1:10, rep(1, 3))
#> Time Series:
#> Start = 1 
#> End = 10 
#> Frequency = 1 
#>  [1] NA  6  9 12 15 18 21 24 27 NA

There is also a S4 generic terms function in the topicmodels R package, which doesn't do similar things as stats::terms what is your opinion about that?

what is your opinion about that?

We generally don't engage with Whataboutism. We'd like you to change this but that, of course, is your prerogative.

@LasWin We're sorry that you have this problem. I don't think that it is something that we are going to code around, based on my comments above.