Workflows - Error with add_formula() when factor column on RHS
mdancho84 opened this issue · 2 comments
@DavisVaughan - I'm getting a weird issue with workflows with a new package we are testing out called gamsnip. I'm trying to add a formula and I'm running into an error with factors in the RHS.
Problem
Error: Functions involving factors or characters have been detected on the RHS of `formula`. These are not allowed when `indicators = "none"`. Functions involving factors were detected for the following columns: 'id'.
Reproducible Example
library(modeltime)
library(tidymodels)
library(gamsnip)
#> Loading required package: mgcv
#> Loading required package: nlme
#>
#> Attaching package: 'nlme'
#> The following object is masked from 'package:dplyr':
#>
#> collapse
#> This is mgcv 1.8-34. For overview type 'help("mgcv-package")'.
library(tidyverse)
library(timetk)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
m4_monthly_extended <- m4_monthly %>%
    group_by(id) %>%
    future_frame(.length_out = 24, .bind_data = TRUE) %>%
    mutate(lag_24 = lag(value, 24)) %>%
    ungroup() %>%
    mutate(date_num = as.numeric(date)) %>%
    mutate(date_month = month(date))
#> .date_var is missing. Using: date
m4_monthly_train <- m4_monthly_extended %>% drop_na()
m4_monthly_future <- m4_monthly_extended %>% filter(is.na(value))
splits <- time_series_split(m4_monthly_train, assess = 24, cumulative = TRUE)
#> Using date_var: date
#> Data is not ordered by the 'date_var'. Resamples will be arranged by `date`.
#> Overlapping Timestamps Detected. Processing overlapping time series together using sliding windows.
splits
#> <Analysis/Assess/Total>
#> <1382/96/1478>
wflw_fit_gam <- workflow() %>%
    add_model(
        gam_mod(mode = "regression") %>%
            set_engine("gam", method = "REML")
    ) %>%
    add_formula(
        value ~ s(date_month, by = id)
            + s(date_num, by = id)
            + s(date_num, date_month, by = id)
            + id
    ) %>%
    fit(training(splits))
#> Error: Functions involving factors or characters have been detected on the RHS of `formula`. These are not allowed when `indicators = "none"`. Functions involving factors were detected for the following columns: 'id'.
Created on 2021-03-26 by the reprex package (v1.0.0)
add_formula() is primarily used to specify terms / variables in the model (although it also does some light pre-processing using the standard model.matrix() infrastructure). Notably, it is not aware of any "special" functions like s() that are model specific.
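To illustrate the point about model.matrix(), here is a minimal base R sketch. It is an illustration only, using the built-in mtcars data rather than the M4 data above, and it does not reproduce what workflows does internally. It shows that the standard formula machinery treats a function call on the RHS as a plain R expression to evaluate, so it has no notion of a smooth term like s().

```r
# Base R only; mtcars ships with R.
# An ordinary function on the RHS is simply evaluated as a column transformation.
mm <- model.matrix(~ log(hp) + wt, data = mtcars)
colnames(mm)

# s() is not defined in base R, so the same machinery has nothing to evaluate.
# (mgcv is deliberately not attached here.)
smooth_attempt <- tryCatch(
  {
    model.matrix(~ s(hp) + wt, data = mtcars)
    "built a design matrix"
  },
  error = function(e) paste("error:", conditionMessage(e))
)
smooth_attempt
```

The first call yields a column for the evaluated log(hp) values. The second call errors, because only the modeling package (mgcv here) knows how to interpret s().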
What you want is to supply a model formula through add_model(formula = ). This is different from the variable selection / preprocessing formula that you supply in add_formula(). A model formula will be passed all the way through to the mgcv call (or whatever pkg is used); no tidymodels package will do anything with that model formula.
So I would do something like this, specifying variables that are going to be used in the model with add_variables() (or you could use add_recipe()), and then specifying exactly how the model should be fit with add_model(formula = ).
library(modeltime)
library(tidymodels)
library(gamsnip)
library(tidyverse)
library(timetk)
library(lubridate)
m4_monthly_extended <- m4_monthly %>%
    group_by(id) %>%
    future_frame(.length_out = 24, .bind_data = TRUE) %>%
    mutate(lag_24 = lag(value, 24)) %>%
    ungroup() %>%
    mutate(date_num = as.numeric(date)) %>%
    mutate(date_month = month(date))
m4_monthly_train <- m4_monthly_extended %>% drop_na()
m4_monthly_future <- m4_monthly_extended %>% filter(is.na(value))
splits <- time_series_split(m4_monthly_train, assess = 24, cumulative = TRUE)
spec <- gam_mod(mode = "regression") %>%
    set_engine("gam", method = "REML")
wflw_fit_gam <- workflow() %>%
    add_variables(value, c(date_month, date_num, id)) %>%
    add_model(
        spec,
        formula = value ~ s(date_month, by = id) +
            s(date_num, by = id) +
            s(date_num, date_month, by = id) +
            id
    ) %>%
    fit(training(splits))
wflw_fit_gam
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Variables
#> Model: gam_mod()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> Outcomes: value
#> Predictors: c(date_month, date_num, id)
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#>
#> Family: gaussian
#> Link function: identity
#>
#> Formula:
#> value ~ s(date_month, by = id) + s(date_num, by = id) + s(date_num,
#> date_month, by = id) + id
#>
#> Estimated degrees of freedom:
#> 8.59 6.35 7.71 4.75 1.39 1.01 1.00
#> 1.00 24.05 18.59 9.93 16.91 total = 105.27
#>
#> REML score: 10465.4
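The same split between preprocessing variables and the model formula can be tried without gamsnip or the M4 data. The sketch below rests on assumptions that are not part of this thread: it swaps in parsnip's gen_additive_mod() with the "mgcv" engine as a stand-in for gam_mod(), and it uses the built-in mtcars data with an arbitrary formula chosen only for illustration.

```r
library(tidymodels)
library(mgcv)

# Assumption: parsnip's gen_additive_mod() stands in for gamsnip's gam_mod().
spec <- gen_additive_mod(mode = "regression") %>%
  set_engine("mgcv", method = "REML")

wflw_fit <- workflow() %>%
  # Preprocessor: only declares which columns are handed to the model.
  add_variables(outcomes = mpg, predictors = c(hp, wt)) %>%
  # Model formula: passed through to mgcv, so s() is interpreted there.
  add_model(spec, formula = mpg ~ s(hp) + wt) %>%
  fit(data = mtcars)

# Pull out the underlying mgcv fit and check which formula it received.
gam_fit <- extract_fit_engine(wflw_fit)
formula(gam_fit)

# Predictions go through the workflow as usual.
preds <- predict(wflw_fit, new_data = head(mtcars))
preds
```

Every column used in the model formula has to be listed in add_variables() (or kept by the recipe), since the model only sees what the preprocessor passes on.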
This is exactly what I needed. 👍