curso-r/treesnip

categorical vars


Built-in support for categorical variables is an advantage of both catboost and lightgbm.
Figure out and document how to pass them correctly.

So I've used lightgbm with Python extensively over the last year, and with pandas you can manually specify which columns of the dataframe are categorical; lightgbm then picks those up. For R data.frames I think it would be nice if character and factor columns were selected automatically, unless you manually specify otherwise. Let me dive into the lightgbm R docs for more info.

edit: the tidymodels docs suggest factors for categorical data:
Categorical data should be in factor variables (as opposed to binary or integer representations).

another edit:
So I guess it would make most sense to pass an optional argument to train, something like categorical_columns = c("col1", "col2", "etc"). If that argument is NULL / not specified, train should treat factors as categorical columns and pass those column numbers on (1-indexed to lightgbm's lgb.train via the argument categorical_feature, and 0-indexed to catboost's catboost.load_pool via the argument cat_features); see the sketch below.
Why do I think this makes most sense? User-facing, an R user expects categorical features (factors) to be handled automatically; I believe that happens everywhere from lm to CART. Specifying the columns manually should override the defaults: when you supply an optional argument, clearly you know what you are doing (I hope).
And also, as a user I don't want to think about the magic incantations that make my wishes heard in lightgbm versus catboost. The parsnip (in this case treesnip) package takes care of those pesky backend-specific commands.
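A minimal sketch of that fallback logic; the helper name resolve_categorical_cols() and the categorical_columns argument are made up here, not part of treesnip:

resolve_categorical_cols <- function(data, categorical_columns = NULL) {
  # default: treat every factor or character column as categorical
  if (is.null(categorical_columns)) {
    is_cat <- vapply(data, function(col) is.factor(col) || is.character(col), logical(1))
    categorical_columns <- names(data)[is_cat]
  }
  idx <- match(categorical_columns, names(data))
  list(
    lightgbm = idx,      # lightgbm wants 1-indexed column positions
    catboost = idx - 1L  # catboost.load_pool wants 0-indexed positions
  )
}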

lightgbm docs say

The categorical features must be indexed like in R (1-indexed, not 0-indexed)

So: column numbers.
But the lgb.train docs also say a string is allowed:

categorical_feature: list of str or int. Type int represents a column index; type str represents a feature name.
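For illustration, a sketch of passing a categorical column to lgb.Dataset by name (the toy data is made up; since lgb.Dataset takes a numeric matrix, the factor is stored as its integer codes):

library(lightgbm)

x <- data.frame(
  num_col = rnorm(100),
  cat_col = factor(sample(c("a", "b", "c"), 100, replace = TRUE))
)
y <- rnorm(100)

x$cat_col <- as.integer(x$cat_col)  # integer codes, to fit in a numeric matrix
dtrain <- lgb.Dataset(
  data = as.matrix(x),
  label = y,
  categorical_feature = "cat_col"   # or the 1-indexed position, 2L
)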

catboost also uses column indices, but 0-indexed according to its docs:

A vector of categorical features indices.

The indices are zero-based and can differ from the ones given in the columns description file.

So something like this:
training_data <- catboost.load_pool(dataset, label = label_values, cat_features = c(0, 3))

I agree with treating factor/string columns as categorical variables. My main concern is how we should deal with new levels in the categorical features, or whether this is already handled by catboost and lightgbm.

We will probably need to change how we deal with cat vars here:

https://github.com/curso-r/treesnip/blob/master/R/lightgbm.R#L180

And also change the interface here: https://github.com/curso-r/treesnip/blob/master/R/lightgbm.R#L16 to data.frame, since IIUC matrix means numeric only.

As far as I can see, catboost does not allow you to specify the categorical features in the train function, but only in the data loading part. In lightgbm you do have to specify it in the train function and also in the data.
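For reference, a sketch of the catboost flow, reusing the placeholder dataset and label_values from the example above: the categorical columns are fixed when the pool is built, and catboost.train() only ever sees the pool.

library(catboost)

pool <- catboost.load_pool(
  data = dataset,         # placeholder from the example above
  label = label_values,
  cat_features = c(0, 3)  # 0-indexed column positions
)
model <- catboost.train(pool, params = list(iterations = 100))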

I agree with treating factor/string columns as categorical variables. My main concern is how we should deal with new levels in the categorical features, or whether this is already handled by catboost and lightgbm.

Isn't that more a recipes problem? There is step_other() or something like it in recipes.
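Something like this, as a sketch (the names train_df, new_df and outcome are made up): step_novel() maps levels unseen at prep() time to a placeholder level, and step_other() pools infrequent levels into an "other" level.

library(recipes)

rec <- recipe(outcome ~ ., data = train_df) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%                 # handle unseen levels
  step_other(all_nominal(), -all_outcomes(), threshold = 0.05)   # pool rare levels

prepped <- prep(rec, training = train_df)
new_baked <- bake(prepped, new_data = new_df)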

In this more recent commit, I don't think you pass the indices of the categorical variables to lgbm's train function, only to lgb.Dataset?

Do we need to pass for both the dataset and lgb.train?
I would expect that passing to lgb.Dataset is enough since we also only pass the label into the dataset. But it's not clear in the docs...
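If in doubt, a belt-and-braces sketch that sets it in both places, reusing x and y from the sketch further up; this assumes the R package's lgb.train() accepts a categorical_feature argument, which it does in the versions I've seen:

dtrain <- lgb.Dataset(as.matrix(x), label = y, categorical_feature = "cat_col")
booster <- lgb.train(
  params = list(objective = "regression"),
  data = dtrain,
  nrounds = 10,
  categorical_feature = "cat_col"  # redundant if the dataset-level setting is enough
)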

I also don't know how to make sure the categorical variables are being used as categoricals. Do you have a simple test case that we could use here: https://github.com/curso-r/treesnip/blob/master/tests/testthat/helper-model.R#L62-L84

No, I'm afraid not.

I would test the categorical check and maybe the dataset that it creates; those are simple to do (see the sketch below). Creating a dataset that tests whether categorical variables are actually used is hella difficult, because to my understanding both catboost and lightgbm do incredibly sophisticated things to both numeric and categorical features. So even if we created factors that should be separable into groups or something, the trees would split on them if they were numbers too. Maybe we can check the unit tests inside lightgbm or catboost for examples?
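For the simple part, a sketch with testthat, reusing the hypothetical resolve_categorical_cols() helper from earlier in this thread:

library(testthat)

test_that("factor and character columns are detected as categorical", {
  df <- data.frame(
    a = 1:3,
    b = factor(c("x", "y", "x")),
    c = c("p", "q", "r"),
    stringsAsFactors = FALSE
  )
  res <- resolve_categorical_cols(df)
  expect_equal(res$lightgbm, c(2L, 3L))  # 1-indexed, for lightgbm
  expect_equal(res$catboost, c(1L, 2L))  # 0-indexed, for catboost
})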