ja-thomas/OMLbots

Conversion of Hyperparameters

PhilippPro opened this issue · 11 comments

I just had a look at the hyperparameters extracted from the OpenML platform, and there are some problems when using them for surrogate models.

We have quite a lot of NA values. To use the data in the surrogate models we have to convert them. My suggestion: -1 for numeric variables and an "NA" level for factor variables.
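To make this concrete, a minimal sketch of the conversion I have in mind (the example data frame and its columns are made up):

```r
# Made-up example of extracted hyperparameters with NAs
runs <- data.frame(
  eta       = c(0.1, NA, 0.3),
  max_depth = c(5, NA, 8),
  booster   = factor(c("gbtree", "gblinear", NA))
)

convertNAs <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      # numeric hyperparameters: code NA as -1
      df[[col]][is.na(df[[col]])] <- -1
    } else if (is.factor(df[[col]])) {
      # factor hyperparameters: add an explicit "NA" level
      df[[col]] <- addNA(df[[col]])
      levels(df[[col]])[is.na(levels(df[[col]]))] <- "NA"
    }
  }
  df
}

convertNAs(runs)
```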

Making a hierarchical model structure (e.g., if the booster in xgboost is linear, then use this model, etc.) is too complicated.

why do you have NAs?

if it is because of subordinate params we already have a standard approach for this.

and no, coding numerics just as -1 is almost certainly a bad idea (if the param can become negative as well)

Different reasons:

  1. We have hierarchical hyperparameters. If we use gblinear, we do not have hyperparameters for
    max_depth, min_child_weight, colsample_bytree, colsample_bylevel (a sketch of this dependency structure is at the end of this post).
  2. If the booster is gbtree it is not written in the output in OpenML and hence NA. We can also solve this manually.
  3. In some cases I do not know why the information of some hyperparameters is not available in OpenML.
    e.g. in rpart sometimes the information for maxdepth and minsplit is missing, maybe because it is the default value?

We have no hyperparameters that can get negative. What is the standard approach?
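For illustration, this is roughly how the dependency structure looks as a ParamHelpers param set (a sketch; the bounds are placeholders, not the bot's actual ranges):

```r
library(ParamHelpers)

# Sketch of the xgboost dependency structure; bounds are placeholders,
# not the actual ranges used by the bot.
par.set <- makeParamSet(
  makeDiscreteParam("booster", values = c("gbtree", "gblinear")),
  makeNumericParam("eta", lower = 0.01, upper = 0.3),
  # subordinate parameters: only active if booster == "gbtree"
  makeIntegerParam("max_depth", lower = 1, upper = 15,
    requires = quote(booster == "gbtree")),
  makeNumericParam("min_child_weight", lower = 1, upper = 20,
    requires = quote(booster == "gbtree")),
  makeNumericParam("colsample_bytree", lower = 0.3, upper = 1,
    requires = quote(booster == "gbtree")),
  makeNumericParam("colsample_bylevel", lower = 0.3, upper = 1,
    requires = quote(booster == "gbtree"))
)
```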

For 2) and 3): I checked it; the values are missing because they are set to the default! We can reset this manually.

@DanielKuehn87 should we do this before we put the data into the database? I think this would be cleaner...

We have hierarchical hyperparameters

ok that is normal, happens in MBO all the time.

If the booster is gbtree it is not written in the output in OpenML and hence NA. We can also solve this manually.

that sounds weird and worrisome. is it because it is the default?
i really hope oml does not "swallow" important info?

In some cases I do not know why the information of some hyperparameters is not available in OpenML.

again, have you figured out what goes wrong here?

We have no hyperparameters that can get negative.

of course you have negative values. your surrogate model must use the log scale of the params where you optimize on a log scale.
that is hopefully clear?
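to spell it out, a minimal ParamHelpers sketch (placeholder bounds): the surrogate sees the value before the trafo, which can be negative; the learner only ever gets the transformed, positive value.

```r
library(ParamHelpers)

# lambda is optimized on a log2 scale: the surrogate/optimizer works on
# [-10, 10] (can be negative), the learner gets 2^x (always positive).
par.set <- makeParamSet(
  makeNumericParam("lambda", lower = -10, upper = 10,
    trafo = function(x) 2^x)
)

design <- generateDesign(n = 3, par.set)              # values on the log scale, possibly negative
trafoValue(par.set, list(lambda = design$lambda[1]))  # transformed value, always > 0
```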

What is the standard approach?

please read the mbo paper and properly read ?makeMBOLearner

that sounds weird and worrisome. is it because it is the default?
i really hope oml does not "swallow" important info?

Yes, it is because it is the default.

again, have you figured out what goes wrong here?

Yes, read my post above, please. For 2) and 3) the problem is that they are the defaults.

of course you have neg values. your model must use the logscale of params where you optimize on logscale.

Ok, I only thought about the transformed values, you are right here.

please read the mbo paper and properly read ?makeMBOLearner

I found this in the mbo paper, we can do it the same way (it is something similar to my -1 approach ;)):

For the surrogate we need a regression model that is more flexible and can handle categorical features as well as missing values to support dependent parameters. A slightly modified random forest can be used for this purpose. If a hyperparameter is not active in a design point in the training set (due to unfulfilled conditions), we will mark its value as missing. Although the random forest could potentially directly handle missing values, many implementations do not. Hence, we impute these values in the following way: For categorical parameters we code missing values as a new level, and for numerical parameters we code the imputed value out of the range of the box-constraints of the parameter under consideration. This is known as the separate-class method and was shown to perform best for decision trees in a prediction-oriented study, when missingness is related to the outcome.
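So roughly something like the following sketch of the separate-class imputation (imputing with twice the upper bound is just one possible out-of-range choice):

```r
# Sketch of the separate-class imputation: factor NAs become a new level,
# numeric NAs are set to a value outside the box constraints (here twice
# the upper bound, which is just one possible out-of-range choice).
imputeSeparateClass <- function(df, upper.bounds) {
  for (col in names(df)) {
    x <- df[[col]]
    if (is.numeric(x)) {
      x[is.na(x)] <- 2 * upper.bounds[[col]]
    } else if (is.factor(x)) {
      x <- addNA(x)
      levels(x)[is.na(levels(x))] <- ".MISSING"
    }
    df[[col]] <- x
  }
  df
}

# made-up example data and bounds
runs <- data.frame(
  max_depth = c(5, NA, 8),
  booster   = factor(c("gbtree", "gblinear", NA))
)
imputeSeparateClass(runs, upper.bounds = list(max_depth = 15))
```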

I added a function to insert the default values in getResults.R on the master branch; you can use this for the database extraction, @DanielKuehn87.
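For context, the idea is roughly the following (a hypothetical sketch, not the actual code in getResults.R; the rpart defaults shown are the documented ones):

```r
# Hypothetical sketch of the idea (not the actual code in getResults.R):
# hyperparameters that OpenML omits because they were left at their
# default are filled in from a list of known defaults.
addDefaults <- function(df, defaults) {
  for (col in intersect(names(df), names(defaults))) {
    df[[col]][is.na(df[[col]])] <- defaults[[col]]
  }
  df
}

# rpart's documented defaults for the two parameters mentioned above
rpart.defaults <- list(maxdepth = 30, minsplit = 20)
runs <- data.frame(maxdepth = c(10, NA), minsplit = c(NA, 5))
addDefaults(runs, rpart.defaults)
```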

I updated the functions in getResults.R. Now it really works.

@PhilippPro: I just merged my database branch in PR #30.
In this PR I also updated the functions in getResults quite a bit to catch various problems. Could you check if this breaks your addDefaultValues function? :)

It seems ok. I re-added it to your functions.