curso-r/treesnip

Can we optain VIP()- variable importance from lightGbm or catBoost the tidy-way?

Closed this issue · 2 comments

I'm thinking of Julia Sigles example here:

https://stackoverflow.com/questions/62772397/integration-of-variable-importance-plots-within-the-tidy-modelling-framework

There seems to be a generic one and a custom one- over the model-specific function predict().
Both with the tidymodels standard variable importance package VIP.

Reproduceable example (generic case, with a simple linear model):
`
library(lightgbm)

if (require("lightgbm")) {
library(AmesHousing)
library(janitor)
library(dplyr)
library(ggplot2)
library(rsample)
library(recipes)
library(parsnip)
library(tune)
library(dials)
library(workflows)
library(yardstick)
library(catboost)
library(tidymodels)
library(treesnip)
library(tidyverse)
library(vip)
info<-(.packages())
print(info)
}

set.seed(1234)

ames_data <- make_ames() %>%
janitor::clean_names()

ames_split <- rsample::initial_split(
ames_data,
prop = 0.8,
strata = sale_price
)

rec <- recipes::recipe(sale_price ~ ., data = training(ames_split)) %>%

recipes::step_other(all_nominal(), threshold = 0.01) %>%

recipes::step_nzv(all_nominal())

myPrep <- prep(rec)

spec <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")

vipFit<-spec %>%
set_engine("lm") %>%
fit(sale_price ~ .,data = juice(myPrep))

impObj <- vipFit %>% vip::vi(scale=FALSE)

print(impObj)
`
Output:

A tibble: 193 x 3

Variable Importance Sign

1 second_flr_sf 11.5 POS
2 first_flr_sf 10.4 POS
3 misc_val 10.1 NEG
4 overall_qualVery_Excellent 9.04 POS
5 overall_qualExcellent 8.56 POS
6 neighborhoodStone_Brook 8.32 POS
7 neighborhoodNorthridge 7.51 POS
8 overall_condGood 6.84 POS
9 neighborhoodNorthridge_Heights 6.51 POS
10 overall_condVery_Good 6.44 POS

... with 183 more rows

Reproduceable example (custom case, with a simple linear model via pred_wrapper):

rec <- recipes::recipe(sale_price ~ ., data = training(ames_split)) %>%
recipes::step_other(all_nominal(), threshold = 0.01) %>%
recipes::step_nzv(all_nominal())

spec <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")

final_model <- spec
final_wf <- workflow() %>%
add_model(final_model) %>%
add_recipe(rec)

final_fitted_wf <- final_wf %>% fit(data=training(ames_split))

myVip<-final_fitted_wf %>%
extract_fit_parsnip() %>%
vip(pred_wrapper = stats::predict, train = training(ames_split))

print(myVip)

Output
image

In theory this could be implemented in this package, but I'd ask first in the VIP package repository. For instance, the xgboost vip implementation used in that example is here:

https://github.com/koalaverse/vip/blob/fa5dc6e9b8233145145d4cc7a0e67ea42409cee6/R/vi_model.R#L1310-L1343

It would be better to keep the implementation there because others that might be not using treesnip could benefit of it.
Please refer to: https://github.com/koalaverse/vip/issues

If anybody is interested in doing it without vip with catBoost/treesnip/tidymodels...

myModel<-extract_fit_engine(final_fitted_wf)
impObj <- catboost.get_feature_importance(myModel, 
  pool = NULL, 
  type = 'FeatureImportance',
  thread_count = -1)

Output (pretty ugly):
[,1]
ms_sub_class 1.24522954
ms_zoning 1.12307172
lot_frontage 2.94193678
lot_area 2.01862041
lot_shape 0.90817355
lot_config 0.35386407
[...]

PS: Suprise suprise...this also works with lightGbm/treesnip/tidymodels...

myModel<-extract_fit_engine(final_fitted_wf); impObj <- lgb.importance(myModel, percentage = TRUE)

Output (pretty fine):
Feature Gain Cover Frequency
1: myXreg32 28304.0115 39998 72
2: myXreg52 14347.0080 23272 41
3: myXreg31 10914.2301 34374 56
4: myXreg33 10746.1890 53054 96
5: myXreg7 10681.6466 30822 54
6: myXreg4 10121.8224 34187 60
7: myXreg46 9604.2502 30287 54
8: myXreg23 9591.9631 40724 72
9: myXreg6 9545.2380 47203 84
10: myXreg17 9544.5662 34146 58
[...]