[Feature Request] Performance Benchmarks in Documentation
jkterry1 opened this issue · 6 comments
So I'm currently trying to do hyperparameter tuning for an RL problem that's kind of a worst-case scenario. It's a 10D space, trials are terribly expensive and noisy, and for 3 hyperparameters (including the 2 most important ones) we really have no idea what the values should be, so the bounds are very large. The last point means I presumably need to use Bayesian or genetic algorithm methods instead of the random/PBT methods that are more common.
Because of this I've been looking specifically into Ax and Nevergrad (they appear to be the only open-source, production-grade tools for this). If you already have them, including benchmarks for how Ax compares to other similar methods for hyperparameter tuning or other tasks would be super helpful to new users. E.g. I'm currently having to decide between Ax and TBPSA from Nevergrad and can find no discussion anywhere of which approach is superior, even though I'm likely a fairly common use case for your library.
It would also be helpful to new users to include examples of recommended budgets with Ax in certain scenarios, if you have the data. For example, in my case I assume it's somewhere between 100 and 1000 trials, but that's a rather large difference in terms of cost.
I don't believe we currently have benchmarks ready on-hand for Ax, but we might have them for BoTorch (the methods library Ax uses under the hood). @Balandat, might you be aware of such benchmarks? I also know that making trial-number recommendations is tricky, as the problems can really vary, but I will let @Balandat chime in on that as well. I believe 100 trials should be a good place to start for this case.
Adding formal benchmarking and performance comparison to Ax docs is something we've had on our roadmap for a while, but have not yet gotten to. Thank you for the suggestion!
We have some BoTorch benchmarks published in https://arxiv.org/abs/1910.06403, though they are not terribly comprehensive.
A 10D space should be quite manageable with our methods. We often get good results in fewer than 100 trials, though it depends very much on the hardness of the problem (the shape of the response surface and the noise level of the observations). It's definitely worth throwing Ax with standard settings at the problem.
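For concreteness, here's a minimal sketch of what "standard settings" could look like with the Service API. The parameter names, bounds, and the `train_and_evaluate` function are placeholders for your actual RL setup, and exact argument names may differ slightly across Ax versions:

```python
from ax.service.ax_client import AxClient

ax_client = AxClient()
ax_client.create_experiment(
    name="rl_hparam_tuning",
    # Placeholder 10D search space -- replace names, bounds, and scales
    # with your actual hyperparameters.
    parameters=[
        {"name": f"x{i}", "type": "range", "bounds": [1e-6, 1.0], "log_scale": True}
        for i in range(10)
    ],
    objective_name="mean_return",
    minimize=False,
)

for _ in range(100):  # ~100 trials is a reasonable starting budget here
    parameters, trial_index = ax_client.get_next_trial()
    # train_and_evaluate is your (expensive, noisy) RL training + evaluation run.
    mean_return = train_and_evaluate(parameters)
    ax_client.complete_trial(trial_index=trial_index, raw_data=mean_return)

best_parameters, values = ax_client.get_best_parameters()
```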
I'm curious, in your setup, are you able to get cheaper observations (that are maybe noisier or biased)? Wondering if multi-fidelity / multi-task optimization would be an option.
@Balandat I can actually arbitrarily make my observations cheaper at the expense of more noise, down to the limit where they're virtually free but are literally pure noise. My current expensive runs are approximately the cheapest that could plausibly give sufficient information to make a legitimate decision about the quality of the hyperparameters.
Could you tell me a little more about what you're thinking with multifidelity optimization?
The basic idea of multi-fidelity optimization is to use cheaper, but potentially less informative, observations to efficiently optimize the function at "full fidelity". Typically there is some "fidelity parameter" (in your case that might be the amount of data, or the number of Monte Carlo samples in some simulation) that one can adjust to evaluate the function at lower cost.
The main goal is to use this information in a principled way to learn about the true function at full fidelity. This is typically done by learning a joint model over both the hyperparameters and the fidelity parameter(s), and then using approaches such as the one discussed in https://arxiv.org/pdf/1903.04703.pdf (this is implemented in BoTorch and can be used through Ax as well).
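As a rough sketch of how that could be wired up through the Service API: the fidelity knob (`num_env_steps` below) and its bounds are just placeholders for whatever makes your runs cheaper, and the generation strategy assumes the knowledge-gradient model (`Models.GPKG`) available in recent Ax versions:

```python
from ax.modelbridge.generation_strategy import GenerationStep, GenerationStrategy
from ax.modelbridge.registry import Models
from ax.service.ax_client import AxClient

# Sobol warm-up followed by a knowledge-gradient model that can exploit
# the fidelity parameter.
gs = GenerationStrategy(
    steps=[
        GenerationStep(model=Models.SOBOL, num_trials=8),
        GenerationStep(model=Models.GPKG, num_trials=-1),
    ]
)

ax_client = AxClient(generation_strategy=gs)
ax_client.create_experiment(
    name="rl_hparam_tuning_mf",
    parameters=[
        {"name": "lr", "type": "range", "bounds": [1e-5, 1e-1], "log_scale": True},
        # ... your remaining hyperparameters ...
        {
            # Placeholder fidelity knob: how much training each evaluation gets.
            "name": "num_env_steps",
            "type": "range",
            "bounds": [1e4, 1e6],
            "is_fidelity": True,
            "target_value": 1e6,  # "full fidelity" we ultimately care about
        },
    ],
    objective_name="mean_return",
    minimize=False,
)
```

The trial loop is the same as in the earlier sketch; your evaluation function would read `num_env_steps` out of `parameters` to decide how cheap each run is.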
In your case, do you know more about the noise characteristics? I.e., is there some CLT at play, so that you have an unbiased estimate whose SEM behaves like 1/sqrt(sample_size)? In that case one can also try to be smarter and encode that information into the noise model, so one doesn't have to actually learn the joint hyperparameter & fidelity-parameter model.
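If so, one way to pass that information along is to report the estimated SEM alongside the mean when completing a trial, instead of letting the model infer the noise level from scratch. A sketch, where `run_one_episode` is a hypothetical stand-in for one noisy evaluation of your policy:

```python
import numpy as np

def evaluate(parameters, num_episodes=30):
    # run_one_episode is a placeholder for a single (noisy) rollout/evaluation.
    returns = np.array([run_one_episode(parameters) for _ in range(num_episodes)])
    mean = returns.mean()
    sem = returns.std(ddof=1) / np.sqrt(num_episodes)  # CLT: SEM ~ 1/sqrt(n)
    # Reporting (mean, SEM) tells Ax the observation noise directly.
    return {"mean_return": (mean, sem)}

parameters, trial_index = ax_client.get_next_trial()
ax_client.complete_trial(trial_index=trial_index, raw_data=evaluate(parameters))
```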
We will now be tracking wishlist items / feature requests in a master issue for improved visibility: #566. Of course, please feel free to still open new feature request issues; we'll take care of thinking them through and adding them to the master issue.