ModelOriented/iBreakDown

Factor Variables Converted To Numeric Makes Results Less User Friendly

alexanderjwhite opened this issue · 3 comments

Factors are converted to numerics, so variables in the plots are labeled with the factor's underlying integer code rather than its label. I've identified where this occurs. The stack trace is shown below along with example images. nice_format, which is called by nice_pair (shown below), calls as.character, which turns the factor into its numeric code.

Result of reprex:
image
image

Issue isolation
image
image
image

reprex

library(tidymodels)
library(modeldata) 
library(DALEXtra)
data(ames)

rf_model <- 
  rand_forest(trees = 1000) %>% 
  set_engine("ranger") %>% 
  set_mode("regression")

rf_wflow <- 
  workflow() %>% 
  add_formula(
    Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + 
      Latitude + Longitude) %>% 
  add_model(rf_model) 

rf_fit <- rf_wflow %>% fit(data = ames)

exp_train <- ames %>% 
  select(-Sale_Price)

exp_rf <- 
  explain_tidymodels(
    rf_fit, 
    data = exp_train, 
    y = ames$Sale_Price,
    verbose = TRUE
  )

first_obs <- exp_train %>% 
  slice(1)

breakdown <- predict_parts(explainer = exp_rf, new_observation = first_obs, type = "break_down")
breakdown[1:7,]
first_obs$Bldg_Type
first_obs$Neighborhood

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] DALEXtra_2.1.1 DALEX_2.3.0 yardstick_0.0.8 workflowsets_0.1.0
[5] workflows_0.2.3 tune_0.1.5 tidyr_1.1.3 tibble_3.1.1
[9] rsample_0.1.0 recipes_0.1.16 purrr_0.3.4 parsnip_0.1.5
[13] modeldata_0.1.0 infer_0.5.4 ggplot2_3.3.3 dplyr_1.0.6
[17] dials_0.0.9 scales_1.1.1 broom_0.7.6 tidymodels_0.1.3

loaded via a namespace (and not attached):
[1] jsonlite_1.7.2 splines_4.0.2 foreach_1.5.1 prodlim_2019.11.13
[5] vip_0.3.2 assertthat_0.2.1 GPfit_1.0-8 globals_0.14.0
[9] ipred_0.9-11 pillar_1.6.1 backports_1.2.1 lattice_0.20-44
[13] glue_1.4.2 reticulate_1.20 visdat_0.5.3 pROC_1.17.0.1
[17] digest_0.6.27 hardhat_0.1.6 colorspace_2.0-1 Matrix_1.2-18
[21] plyr_1.8.6 timeDate_3043.102 pkgconfig_2.0.3 lhs_1.1.1
[25] DiceDesign_1.9 listenv_0.8.0 ranger_0.12.1 gower_0.2.2
[29] lava_1.6.9 generics_0.1.0 ellipsis_0.3.2 withr_2.4.2
[33] furrr_0.2.3 nnet_7.3-14 cli_3.0.0 survival_3.1-12
[37] magrittr_2.0.1 crayon_1.4.1 future_1.21.0 fansi_0.4.2
[41] parallelly_1.26.1 MASS_7.3-51.6 class_7.3-17 tools_4.0.2
[45] lifecycle_1.0.0 munsell_0.5.0 compiler_4.0.2 rlang_0.4.11
[49] grid_4.0.2 iterators_1.0.13 rstudioapi_0.13 rappdirs_0.3.3
[53] gtable_0.3.0 codetools_0.2-16 DBI_1.1.1 R6_2.5.0
[57] gridExtra_2.3 lubridate_1.7.10 utf8_1.2.1 iBreakDown_2.0.1
[61] parallel_4.0.2 Rcpp_1.0.6 vctrs_0.3.8 rpart_4.1-15
[65] png_0.1-7 tidyselect_1.1.1

Indeed, this is due to new_observation being a tibble; related:

as.character(factor("a"))          # "a": the label is kept
library(tidyr)                     # re-exports tibble()
as.character(tibble(factor("a")))  # "1": the integer code, not the label
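
The difference can be reproduced without tibble. The sketch below is an illustration in base R only, under the assumption that the value reaching as.character is a one-cell table rather than a plain vector: `drop = FALSE` stands in for tibble's `[`, which does not drop to a vector. The column name and level are taken from the reprex; nothing here is iBreakDown code.

```r
# Base-R illustration: a one-cell data.frame behaves like a one-cell tibble here.
df <- data.frame(Bldg_Type = factor("OneFam"), stringsAsFactors = TRUE)

# data.frame subsetting drops to a factor vector by default.
as_vector <- df[1, "Bldg_Type"]

# drop = FALSE keeps a 1x1 data.frame, which is what tibble's `[` returns.
as_frame <- df[1, "Bldg_Type", drop = FALSE]

label_from_vector <- as.character(as_vector)  # the factor label
label_from_frame  <- as.character(as_frame)   # the integer code as a string

print(label_from_vector)
print(label_from_frame)
```

In the second case as.character is applied to a list whose element is a factor, so the label is lost and only the integer code survives.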

I see. It works with a data.frame. Is this the intended behavior? This code is called by DALEX/DALEXtra, which is built to provide tidyverse extensions, and tibbles are a central component of the tidyverse. A new observation (frequently a multi-variable slice of a tibble) will likely be passed here often. Wouldn't it be beneficial to make this robust to tibbles?

Yes, we shall patch this (probably by updating nice_format).
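
One possible shape for such a patch, as a minimal sketch only: `nice_format_sketch` is a hypothetical name, and this is not the actual iBreakDown implementation, whose body is not shown in this thread. It assumes the function receives either a plain vector or a one-cell tibble/data.frame, and it unwraps the latter before converting.

```r
# Hypothetical helper for illustration; not the real nice_format from iBreakDown.
nice_format_sketch <- function(x) {
  # A one-cell tibble or data.frame is a list of columns, so as.character()
  # would return the factor's integer code. Extract the column vector first.
  if (is.data.frame(x)) {
    x <- x[[1]]
  }
  as.character(x)
}

# Both input shapes now give the factor label.
from_vector <- nice_format_sketch(factor("OneFam"))
from_frame  <- nice_format_sketch(data.frame(Bldg_Type = factor("OneFam")))
print(from_vector)
print(from_frame)
```

Because tibbles inherit from data.frame, the `is.data.frame` check covers both classes without depending on the tibble package.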