ja-thomas/OMLbots

Data conversion for xgboost


xgboost only accepts numerical features.
We should decide how to convert factor variables. For now I have implemented an automatic conversion to numeric: each factor is replaced by the integer index of its level, following the order of the levels. Alternatively, we could convert each factor into several binary features, but the feature space can get big for factors with many levels.
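To make the two options concrete, here is a minimal sketch in plain Python. It is not the bot's R code; the function names `integer_encode` and `one_hot_encode` and the toy `colour` column are made up for illustration.

```python
def integer_encode(values, levels):
    """Replace each value by the 1-based position of its level."""
    index = {level: i + 1 for i, level in enumerate(levels)}
    return [index[v] for v in values]


def one_hot_encode(values, levels):
    """Replace each value by one 0/1 indicator per level."""
    return [[1 if v == level else 0 for level in levels] for v in values]


# Toy factor column (hypothetical data).
colour = ["red", "green", "blue", "green"]
levels = sorted(set(colour))  # level order: blue, green, red

as_integer = integer_encode(colour, levels)  # a single numeric column
as_one_hot = one_hot_encode(colour, levels)  # len(levels) binary columns

print(as_integer)
print(as_one_hot)
```

The integer version imposes an order on the levels (here blue < green < red) that a tree learner can split on, while the binary version adds one column per level, which is where the growth in feature space comes from.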

We need the second and third solutions as wrappers. Otherwise we overfit.

Also, this is a more general issue and not just relevant for xgboost.

Oh, I just realized this wasn't in the mlr repo but in the OMLbot.

I think we should have a look at the datasets we will run and preprocess them. Then use the preprocessed study_14 data for all learners.
What I don't want to do is something like: if(learner == "xyz") then use this and that preprocessing.

I see this differently. To me it would be very "unfair" if all variables were converted to numeric, because some learners can actually handle factors, and can handle them better than the transformed numeric versions. I think we should rather think about a good method to transform the variables.

I agree with @PhilippPro. You cannot convert everything into a format that xgboost likes.

You need to add wrappers to such algorithms.
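A rough sketch of what such a wrapper could look like, in plain Python rather than mlr. Everything here is hypothetical: `OneHotWrapper`, the toy `NearestNeighbour` learner (a stand-in for any numeric-only learner), and the data. The point is only that the encoding lives next to the learner that needs it, so the raw data stays untouched for learners that handle factors themselves.

```python
class NearestNeighbour:
    """Toy numeric-only learner (1-nearest-neighbour), used as a stand-in."""

    def fit(self, X, y):
        self.X_, self.y_ = X, y
        return self

    def predict(self, X):
        def closest(x):
            dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in self.X_]
            return self.y_[dists.index(min(dists))]
        return [closest(x) for x in X]


class OneHotWrapper:
    """Hypothetical wrapper: one-hot encodes factor columns, then delegates."""

    def __init__(self, learner, categorical_columns):
        self.learner = learner
        self.categorical_columns = categorical_columns

    def fit(self, rows, targets):
        # Learn the level sets from the training rows only.
        self.levels_ = {c: sorted({row[c] for row in rows})
                        for c in self.categorical_columns}
        self.learner.fit([self._encode(r) for r in rows], targets)
        return self

    def predict(self, rows):
        return self.learner.predict([self._encode(r) for r in rows])

    def _encode(self, row):
        out = []
        for col, value in enumerate(row):
            if col in self.levels_:
                # A level not seen during fit becomes an all-zero block.
                out.extend(1.0 if value == level else 0.0
                           for level in self.levels_[col])
            else:
                out.append(float(value))
        return out


train = [["a", 1.0], ["b", 2.0], ["c", 3.0]]
labels = ["x", "y", "z"]
model = OneHotWrapper(NearestNeighbour(), categorical_columns=[0]).fit(train, labels)
predictions = model.predict([["b", 2.1], ["d", 3.0]])
print(predictions)
```

Because the levels are learned in `fit` and only reused in `predict`, the same wrapper can sit inside a resampling loop without looking at the test rows.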

I don't want to transform the data only for xgboost; I want all learners to work on numeric datasets without NA values. We could either write a generic wrapper for this within R or create a transformed study_14 dataset.
I think this is necessary so that we can use the database to compare learners later on. AFAIK ranger uses data.matrix to convert factors to numerics. If we now write a wrapper that uses model.matrix to convert factors for xgboost, we can't really compare the two models, because they use different transformations for factors (and xgboost might perform better only because of this).
Using a transformed study_14 would also reduce the overhead for learners that apply transformations like this.

I think ranger does not convert factors to numerics. I think it splits factor variables by trying out every available subset of the levels. I talked about this with Marvin recently. That's why it also cannot handle too many levels in factor variables (around 50 max). We can talk about that on Thursday. ;)
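As a back-of-the-envelope check on why subset-based splitting stops scaling: a factor with k unordered levels can be split into two non-empty groups in 2^(k-1) - 1 ways. The sketch below is plain Python and says nothing about how ranger is actually implemented; the function names are made up, and the brute-force enumeration is only there to confirm the formula on small inputs.

```python
from itertools import combinations


def n_binary_partitions(n_levels):
    """Number of ways to split n_levels unordered levels into two non-empty groups."""
    if n_levels < 2:
        return 0
    return 2 ** (n_levels - 1) - 1


def enumerate_partitions(levels):
    """Brute-force list of (left, right) splits, for small inputs only."""
    levels = list(levels)
    first, rest = levels[0], levels[1:]
    splits = []
    # Pin the first level to the left group so mirror images are not counted
    # twice; r < len(rest) keeps the right group non-empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = {first, *extra}
            splits.append((left, set(levels) - left))
    return splits


for k in (2, 3, 5, 10, 20, 50):
    print(k, n_binary_partitions(k))
```

With 10 levels there are 511 candidate splits per node; with 50 levels the count is in the hundreds of trillions, so any exhaustive search needs either a cap on the number of levels or a different strategy.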

https://www.r-bloggers.com/on-ranger-respect-unordered-factors/
I'm not sure if this has changed in the current version, though.

Oh ok, I didn't know this. We have to be careful here, maybe make a new hyperparameter or just set it to a different value.