ili3p/HORD

Is this under any active development?

adrianog opened this issue · 6 comments

I read the reddit thread on this and was very interested in trying it, but I notice there have been no recent commits.
Is it worth using it at this stage?
regards

ili3p commented

This repo is only for reproducing the AAAI paper experiments. The optimization tool is at https://github.com/dme65/pySOT/ . You can use the code in this repo as a use case example and as a guideline on which combination of surrogate and search strategy works best for hyperparameter optimization.

Thanks, clearer.

I have another question (apologies in advance if it is very basic; I'm relatively new to this).
Can I use pySOT (or is there any code in HORD I can use for inspiration) to define a heterogeneous, possibly nested / conditional parameter space, possibly across different models (using a variety of parameter types), or is it only useful for numeric parameters?
E.g. see hyperopt example here: https://goo.gl/i1zynY

And if it is not currently possible, could it be reasonably implemented, or is there anything inherent in the model / implementation that would prevent such a parameter space from being usable with HORD?

In any case, thanks for the unique resource.

ili3p commented

I understand your question as:

Does HORD or pySOT support optimization of categorical parameters, i.e. choices such as the model type or the activation function, and conditional parameters, such as parameters specific to the model type, or simply the number of layers and the number of nodes in each layer?

The answer is that they do not support it. The search strategies are designed specifically for numerical optimization; however, they can easily be modified to support this.
For categorical parameters, you can assign an integer to each of the choices, e.g. ReLU is 1, tanh is 2, etc., and optimize this parameter as an integer.
For conditional parameters, you optimize all possible parameters but ignore the ones whose condition is not satisfied.

This should work since pySOT and HORD don't use gradient information, so they do not require an ordering of the parameter values, i.e. 1 < 2 doesn't need to have a meaning. They work by exploring the parameter space efficiently, and the only assumption is that nearby points in the parameter space have similar objective function values.
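For example, a rough sketch of that encoding in plain Python (not tied to pySOT's actual interface; the parameter names, choices, and the dummy training function are made up for illustration):

```python
import math

# Hypothetical search space: every parameter is encoded as a number.
# x[0]: model type        (integer code: 0 = "mlp", 1 = "cnn")
# x[1]: activation        (integer code: 0 = "relu", 1 = "tanh", 2 = "sigmoid")
# x[2]: learning rate     (continuous, log10 scale in [-5, -1])
# x[3]: number of layers  (integer in [1, 4])
# x[4]: conv kernel size  (integer in [3, 7], only meaningful when the model is "cnn")

MODELS = ["mlp", "cnn"]
ACTIVATIONS = ["relu", "tanh", "sigmoid"]

def train_and_validate(config):
    # Placeholder for the real training run; returns a fake validation error.
    return abs(math.log10(config["lr"]) + 3) + 0.1 * config["n_layers"]

def objective(x):
    # Decode the integer-coded categorical parameters back into choices.
    model = MODELS[int(round(x[0]))]
    activation = ACTIVATIONS[int(round(x[1]))]
    config = {"model": model,
              "activation": activation,
              "lr": 10 ** x[2],
              "n_layers": int(round(x[3]))}

    # Conditional parameter: always part of the search vector, but simply
    # ignored when its condition does not hold.
    if model == "cnn":
        config["kernel_size"] = int(round(x[4]))

    return train_and_validate(config)
```

Any optimizer that handles mixed integer/continuous variables over a box can then just minimize `objective` over the five-dimensional vector; with pySOT you would declare which dimensions are integer-valued.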

The other way to do it is to run the optimization separately for each condition and category. For example, run one optimization for each model type and/or each activation function etc.

Well, thanks for the explanation, it makes perfect sense.
Since we're at it, I'll take advantage and ask another couple of questions.

  1. I notice that hyperband leverages hyperopt to "sample" the runs and concentrate on the most promising ones to run fully - could HORD be plugged in for the parameter sampling instead of hyperopt? Or am I missing something?

  2. How does HORD compare to MaxLIPO+TR, as implemented in Dlib? Do you have any comments in that respect?

I reiterate my thanks for your inputs thus far!

ili3p commented
  1. Sure, hyperband is basically smart (most of the time) early stopping, so you can use hyperband and HORD together; I like this implementation of hyperband: https://github.com/zygmuntz/hyperband. You can also implement simple early stopping in your training script, i.e. check the validation or training error from time to time and decide when a hyperparameter set is not worth training any further (a rough sketch of that is below, after point 2). However, hyperband does not work well when optimizing the learning rate or dropout rate, since these (and possibly other hyperparameters) affect the shape of the training error curve. A low learning rate might look very bad at the beginning and thus get stopped by hyperband, yet the same low learning rate can give the best final performance if allowed to run for as long as it needs. The same goes for dropout: networks without dropout converge quickly, but to lower final performance than networks with dropout.

  2. I am not familiar enough with MaxLIPO+TR to comment, but from what I can read it seems to use gradient information. I personally don't like this, since the objective function in hyperparameter optimization is very spiky, so gradients do not work well most of the time.
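Here is the kind of in-script early stopping I mean in point 1, as a rough sketch (the `build_model` / `train_one_epoch` / `evaluate` functions are stand-ins for your own training code; here they just simulate a learning curve):

```python
import random

def build_model(hyperparams):
    # Placeholder: stands in for constructing the real network.
    return {"hp": hyperparams, "val_error": 1.0}

def train_one_epoch(model):
    # Placeholder: pretend one epoch of training changes the error a little.
    model["val_error"] *= random.uniform(0.95, 1.01)

def evaluate(model):
    # Placeholder: return the current (simulated) validation error.
    return model["val_error"]

def objective(hyperparams, max_epochs=100, patience=3, check_every=5):
    """Train with in-script early stopping: give up on a hyperparameter set
    once the validation error stops improving for `patience` checks."""
    model = build_model(hyperparams)
    best_val_error = float("inf")
    bad_checks = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)
        if (epoch + 1) % check_every == 0:
            val_error = evaluate(model)
            if val_error < best_val_error:
                best_val_error, bad_checks = val_error, 0
            else:
                bad_checks += 1
            if bad_checks >= patience:
                break  # not worth training this configuration any further
    return best_val_error

print(objective({"lr": 1e-3}))
```

Keep in mind the caveat above: this kind of stopping rule has the same blind spot as hyperband for slow-starting configurations such as low learning rates or dropout.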

RE: 2) I thought that was gradient-free as well, i.e. see this comment from the author. The LIPO part is gradient-free (it only relies on estimating K), and the "classic trust region method" (e.g. BOBYQA) is also derivative-free.

But yes, it does still use gradient information in a sense, even though it never performs additional function evaluations to estimate the gradient. It estimates k (for the upper bound) from the largest slope observed so far, so the problem you mention still seems to apply: "the objective function of hyperparameter optimization is very spiky so gradients do not work well most of the time". Evaluating two points too close to an irregularity might cause the constant k to explode.
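To make that concrete with a toy example (nothing to do with dlib's actual code, just the slope-based estimate of k described above): one pair of samples straddling a spike is enough to blow up k, and then the Lipschitz bound built from it becomes too loose to rule anything out.

```python
import itertools

def estimate_k(points, values):
    # Steepest observed slope between any pair of evaluated points;
    # this is the kind of estimate a LIPO-style method uses for its bound.
    return max(abs(fi - fj) / abs(xi - xj)
               for (xi, fi), (xj, fj) in itertools.combinations(zip(points, values), 2))

# A smooth-looking objective...
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
fs = [0.9, 0.7, 0.6, 0.55, 0.5]
print(estimate_k(xs, fs))   # modest k, the bound is informative

# ...plus two samples straddling a spike (e.g. a diverged training run):
xs += [0.600, 0.601]
fs += [0.58, 5.0]
print(estimate_k(xs, fs))   # k explodes, so the Lipschitz bound built from
                            # it excludes almost nothing of the search space
```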

Furthermore, the author also seems to discourage using it for neural network hyperparam optimisation, probably because of that.

Would be interested in knowing your comments on this statement on that same page:
"I wouldn't attempt to optimize functions with more than 10s of parameters with a derivative free optimizer."

Thanks again, your comments have been invaluable so far.