jakob-r/mlrHyperopt

How to use mlrHyperopt in nested CV setting?

Closed this issue · 8 comments

pat-s commented

Given the following minimal example of a nested CV of ksvm() using mlr:

library(mlr)

# df: the data.frame holding the (binary) factor target "x"
task_ec <- makeClassifTask(id = "data", data = df, 
                           target = "x", positive = "1")

## Tuning in inner resampling loop
# bounds are on a log2 scale; the trafo maps them back to positive values for ksvm
ps <- makeParamSet(makeNumericParam("C", lower = -5, upper = 10,
                                    trafo = function(x) 2^x),
                   makeNumericParam("sigma", lower = -15, upper = 15,
                                    trafo = function(x) 2^x))

ctrl <- makeTuneControlRandom(maxit = 200)
inner <- makeResampleDesc("CV", iters = 5)

# predict.type for 'auc' (AUROC)
lrn_ec <- makeLearner("classif.ksvm", 
                      predict.type = "prob")

lrn_ksvm <- makeTuneWrapper(lrn_ec, resampling = inner, par.set = ps, 
                            control = ctrl, show.info = TRUE, measures = list(auc))

## Outer resampling loop
outer <- makeResampleDesc("RepCV", folds = 5, reps = 2)
resa_ksvm <- resample(lrn_ksvm, task_ec, resampling = outer, extract = getTuneResult, 
                      show.info = TRUE, measures = list(auc))

How can I rewrite this with mlrHyperopt?

Hi,

  1. mlrHyperopt is not meant to do nested resampling, as I see nested resampling as a method to analyze the performance of different tuning and preprocessing strategies. You could also argue that, as in the given example, you want to analyze the "stability" of the tuning with respect to the performance and the proposed parameters. mlrHyperopt is just meant to optimize the parameters for a given data-learner combination in the best possible way, so there is no need to compare it anymore 😉 Here you can see how I did the nested resampling using batchtools.

  2. mlrHyperopt just sits on top of mlr and thus is not integrated in the mlr workflow. However, as it uses mlr for everything (including the tuning itself), you can use all kinds of helpers if you want to mimic the behaviour of mlrHyperopt. In your example you run the random search with 200 iterations of a 5-fold CV. In mlrHyperopt terms this is a budget of 5 * 200 = 1000 evaluations. There is a lot to argue about this generalization, and also about how the budget is split between resampling iterations and tuning iterations; I quite possibly have to come up with something more clever than the current approach. At the moment the tuning method is determined based on the ParamSet (whether it is numerical, categorical, etc.). Each tuning method then has a desired number of tuning iterations that I picked somewhat arbitrarily as useful. If this number is high, we have to settle for an inner resampling with fewer iterations, and vice versa. Unfortunately, for a high budget this simple heuristic tends to prefer overly repeated resamplings instead of increasing the number of tuning iterations.

library(mlrHyperopt)
# a binary example task from mlr; 'auc' below requires a two-class problem
task_ec <- sonar.task
lrn_ec <- makeLearner("classif.ksvm", predict.type = "prob")
hyper.control <- generateHyperControl(task = task_ec, learner = lrn_ec, budget.evals = 1000)
par.config <- generateParConfig(learner = lrn_ec)

## Tuning in inner resampling loop
ps <- getParConfigParSet(par.config)

ctrl <- getHyperControlMlrControl(hyper.control)
inner <- getHyperControlResampling(hyper.control)

# predict.type for 'auc' (AUROC)
lrn_ksvm <- makeTuneWrapper(lrn_ec, resampling = inner, par.set = ps, 
                            control = ctrl, show.info = TRUE, measures = list(auc))

## Outer resampling loop
outer <- makeResampleDesc("RepCV", folds = 5, reps = 2)
resa_ksvm <- resample(lrn_ksvm, task_ec, resampling = outer, extract = getTuneResult, 
                      show.info = TRUE, measures = list(auc))
  3. mlrHyperopt is mainly meant to provide some default tuning ParamSets and the interface to the webservice, rather than to be the best tuning package. The tuning itself is only some kind of additional beta feature.
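As a minimal sketch of that basic usage (assuming one of the binary example tasks shipped with mlr):

library(mlrHyperopt)

# hyperopt() looks up a default ParConfig for the learner (locally or via the
# webservice) and picks a tuning strategy and budget for you
res <- hyperopt(sonar.task, learner = "classif.ksvm")
res$x  # proposed hyperparameters
res$y  # corresponding resampled performance
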
pat-s commented

Thanks for the detailed answer!

In your example you run the random search with 200 iterations of a 5-fold CV. In mlrHyperopt terms this is a budget of 5 * 200 = 1000 evaluations. There is a lot to argue about this generalization, and also about how the budget is split between resampling iterations and tuning iterations.

So you are using different resampling methods, depending on the given "budget"?

I simply want to provide meaningful default ranges of hyperparameters to be optimized during tuning in the inner loop of my nested resampling setup. I do not care whether I get these from the database of the web service or your default search spaces :)
I assume that the default search spaces of mlrHyperopt will be updated from time to time to reflect the "most successful" search spaces of the database?

mlrHyperopt includes default search spaces for the most common machine learning methods like random forest, svm and boosting.

That's basically what I'm interested in, especially for RF and SVM. So if I use

hyper.control <- generateHyperControl(task = task_ec, learner = lrn_ec, budget.evals = 1000)
par.config <- generateParConfig(learner = lrn_ec)

## Tuning in inner resampling loop
ps <- getParConfigParSet(par.config)

ctrl <- getHyperControlMlrControl(hyper.control)
inner <- getHyperControlResampling(hyper.control)

I would get tuning in the inner loop with a budget of 1000, using your default search space for ksvm()?

That's basically what I'm interested in, especially for RF and SVM.

If you are only interested in the search spaces then just use

par.config <- generateParConfig(learner = lrn_ec)
ps <- getParConfigParSet(par.config)

and probably

ctrl <- getHyperControlMlrControl(hyper.control)

but if one evaluation is not too expensive and you can afford 200 random search iterations, I would stick to the random search.

For the rest, stick to your mlr workflow and your own resampling definitions. There is no need for the rest of mlrHyperopt's functions.
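
Putting the pieces together, a minimal sketch of that recommendation (only the default search space comes from mlrHyperopt; everything else is the plain mlr workflow from your question, with task_ec defined as before):

library(mlr)
library(mlrHyperopt)

lrn_ec <- makeLearner("classif.ksvm", predict.type = "prob")

## default search space from mlrHyperopt, nothing else
par.config <- generateParConfig(learner = lrn_ec)
ps <- getParConfigParSet(par.config)

## tuning in the inner resampling loop, defined by hand as before
ctrl <- makeTuneControlRandom(maxit = 200)
inner <- makeResampleDesc("CV", iters = 5)
lrn_ksvm <- makeTuneWrapper(lrn_ec, resampling = inner, par.set = ps,
                            control = ctrl, measures = list(auc))

## outer resampling loop, also as before
outer <- makeResampleDesc("RepCV", folds = 5, reps = 2)
resa_ksvm <- resample(lrn_ksvm, task_ec, resampling = outer,
                      extract = getTuneResult, measures = list(auc))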

I assume that the default search spaces of mlrHyperopt will be updated from time to time to reflect the "most successful" search spaces of the database?

What is marked as default in the database reflects the defaults within the package, yes.
And they will be updated to reflect what we deem to be most successful.
Currently I am waiting for the results of a paper by @PhilippPro about the tunability of learners, where good ranges are determined on the basis of many random evaluations on a large set of data sets.
They will hopefully find their way into the database.

Yes, probably we will get it via the nightly snapshot (https://www.openml.org/guide#!devels); all other ways take too long.

Okay. Now we are getting a bit off-topic. So you want to run SQL queries against a local copy of the database to get the parameter configurations and the performance values?

Have you done some work on it already?

I have started. At the moment I have problems connecting to the SQL database, maybe you can help? There is an ExpDB_SNAPSHOT.sql I want to connect to. src_mysql does not work... I am also on Hangouts.
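
Roughly what I am attempting, once the dump is loaded into a local MySQL server (a sketch; the database name and credentials below are placeholders):

## the dump first has to be imported into a running MySQL server, e.g. from the shell:
##   mysql -u <user> -p openml_expdb < ExpDB_SNAPSHOT.sql

library(DBI)
library(RMySQL)

# placeholder database name and credentials
con <- dbConnect(MySQL(), dbname = "openml_expdb", host = "localhost",
                 user = "openml", password = "secret")
dbListTables(con)  # check which tables the snapshot contains

## or, analogously, via dplyr:
# src <- dplyr::src_mysql("openml_expdb", host = "localhost",
#                         user = "openml", password = "secret")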