Workflows - Error with add_formula() when factor column on RHS

Question

mdancho84 opened this issue 3 years ago · 2 comments

@DavisVaughan - I'm hitting an odd issue with workflows while testing a new package called gamsnip. When I add a formula with add_formula(), I get an error about a factor column used on the RHS.

Problem

Error: Functions involving factors or characters have been detected on the RHS of `formula`. These are not allowed when `indicators = "none"`. Functions involving factors were detected for the following columns: 'id'.

Reproducible Example

library(modeltime)
library(tidymodels)
library(gamsnip)
#> Loading required package: mgcv
#> Loading required package: nlme
#> 
#> Attaching package: 'nlme'
#> The following object is masked from 'package:dplyr':
#> 
#>     collapse
#> This is mgcv 1.8-34. For overview type 'help("mgcv-package")'.
library(tidyverse)
library(timetk)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

m4_monthly_extended <- m4_monthly %>%
    group_by(id) %>%
    future_frame(.length_out = 24, .bind_data = TRUE) %>%
    mutate(lag_24 = lag(value, 24)) %>%
    ungroup() %>%
    mutate(date_num = as.numeric(date)) %>%
    mutate(date_month = month(date))
#> .date_var is missing. Using: date

m4_monthly_train  <- m4_monthly_extended %>% drop_na()
m4_monthly_future <- m4_monthly_extended %>% filter(is.na(value))

splits <- time_series_split(m4_monthly_train, assess = 24, cumulative = TRUE)
#> Using date_var: date
#> Data is not ordered by the 'date_var'. Resamples will be arranged by `date`.
#> Overlapping Timestamps Detected. Processing overlapping time series together using sliding windows.

splits
#> <Analysis/Assess/Total>
#> <1382/96/1478>

wflw_fit_gam <- workflow() %>%
    add_model(
        gam_mod(mode = "regression") %>%
            set_engine("gam", method = "REML")
    ) %>%
    add_formula(
        value ~ s(date_month, by = id) 
        + s(date_num, by = id) 
        + s(date_num, date_month, by = id) 
        + id
    ) %>%
    fit(training(splits))
#> Error: Functions involving factors or characters have been detected on the RHS of `formula`. These are not allowed when `indicators = "none"`. Functions involving factors were detected for the following columns: 'id'.

Created on 2021-03-26 by the reprex package (v1.0.0)

Answer 1 · 2021-03-26T17:49:14.000Z

add_formula() is primarily used to specify the terms / variables in the model (it also does some light pre-processing using the standard model.matrix() infrastructure). Notably, it is not aware of any model-specific "special" functions like s().

What you want is to supply a model formula through add_model(formula = ). This is different from the variable selection / preprocessing formula that you supply in add_formula(). A model formula is passed all the way through to the mgcv call (or whichever package is used); no tidymodels package does anything with it.
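To illustrate why s() has to reach mgcv untouched, here is a minimal sketch using mgcv directly. The simulated data frame `dat` and its columns `x`, `y`, and `id` are made up for illustration and are not part of the original reprex. s() does not transform any data; it only records a smooth specification that gam() later interprets.

```r
library(mgcv)

# s() only records *what* to smooth; it never evaluates x or id here
sm <- s(x, by = id)
class(sm)  # a smooth specification object
sm$term    # name of the smoothed variable
sm$by      # name of the `by` variable

# Hypothetical toy data: two series with different shapes
set.seed(123)
n <- 200
dat <- data.frame(
  id = factor(rep(c("A", "B"), each = n / 2)),
  x  = runif(n)
)
dat$y <- ifelse(dat$id == "A", sin(2 * pi * dat$x), cos(2 * pi * dat$x)) +
  rnorm(n, sd = 0.2)

# gam() is the function that understands s(); with a factor `by`,
# it builds a separate smooth for each level of id
fit <- gam(y ~ s(x, by = id) + id, data = dat, method = "REML")
n_smooths <- length(fit$smooth)
n_smooths
```

This is the formula that needs to arrive at the engine intact, which is why it belongs in add_model(formula = ) rather than add_formula().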

So I would do something like this: specify the variables that are going to be used in the model with add_variables() (or add_recipe()), and then specify exactly how the model should be fit with add_model(formula = ).

library(modeltime)
library(tidymodels)
library(gamsnip)
library(tidyverse)
library(timetk)
library(lubridate)

m4_monthly_extended <- m4_monthly %>%
  group_by(id) %>%
  future_frame(.length_out = 24, .bind_data = TRUE) %>%
  mutate(lag_24 = lag(value, 24)) %>%
  ungroup() %>%
  mutate(date_num = as.numeric(date)) %>%
  mutate(date_month = month(date))

m4_monthly_train  <- m4_monthly_extended %>% drop_na()
m4_monthly_future <- m4_monthly_extended %>% filter(is.na(value))

splits <- time_series_split(m4_monthly_train, assess = 24, cumulative = TRUE)

spec <- gam_mod(mode = "regression") %>%
  set_engine("gam", method = "REML")

wflw_fit_gam <- workflow() %>%
  add_variables(value, c(date_month, date_num, id)) %>%
  add_model(
    spec, 
    formula = value ~ s(date_month, by = id) + 
      s(date_num, by = id) + 
      s(date_num, date_month, by = id) + 
      id
  ) %>%
  fit(training(splits))

wflw_fit_gam
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Variables
#> Model: gam_mod()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> Outcomes: value
#> Predictors: c(date_month, date_num, id)
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> value ~ s(date_month, by = id) + s(date_num, by = id) + s(date_num, 
#>     date_month, by = id) + id
#> 
#> Estimated degrees of freedom:
#>  8.59  6.35  7.71  4.75  1.39  1.01  1.00 
#>  1.00 24.05 18.59  9.93 16.91  total = 105.27 
#> 
#> REML score: 10465.4
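
For anyone without gamsnip installed, the same pattern can be sketched with parsnip's own gen_additive_mod() and the "mgcv" engine. This swaps in a different model spec than the one used above and assumes a parsnip version that ships gen_additive_mod(). The toy data and the names `dat`, `wf_fit`, and `preds` are hypothetical.

```r
library(tidymodels)

# Hypothetical toy data: two series with different shapes
set.seed(123)
n <- 200
dat <- tibble(
  id = factor(rep(c("A", "B"), each = n / 2)),
  x  = runif(n)
) %>%
  mutate(
    y = if_else(id == "A", sin(2 * pi * x), cos(2 * pi * x)) +
      rnorm(n, sd = 0.2)
  )

# Assumption: gen_additive_mod() stands in for gamsnip's gam_mod()
spec <- gen_additive_mod(mode = "regression") %>%
  set_engine("mgcv", method = "REML")

wf_fit <- workflow() %>%
  # Preprocessor: only declares which columns are used, no formula parsing
  add_variables(outcomes = y, predictors = c(x, id)) %>%
  # Model formula: handed to mgcv as-is, so s() and `by = id` are allowed
  add_model(spec, formula = y ~ s(x, by = id) + id) %>%
  fit(data = dat)

# Predictions come back as a tibble with a .pred column
preds <- predict(wf_fit, new_data = dat[1:5, ])
preds
```

The factor column passes through add_variables() unchanged, so mgcv receives `id` as a factor and can use it in `by =`.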

Answer 2 · 2021-03-26T18:03:35.000Z

This is exactly what I needed. 👍