paulknysh/blackbox

Comparing to naive optimization

Closed this issue · 4 comments

For a high-dimensional problem, one naive approach is to calculate all the partial derivatives in parallel and then apply SGD or another algorithm. Have you compared this naive approach with your method, say in terms of the number of function calls and the convergence rate?
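For concreteness, the "naive" approach described above might look like the following sketch: each partial derivative is a forward difference, the differences are evaluated concurrently, and a plain gradient-descent update is applied. The objective `f` and all names here are illustrative stand-ins, not part of the blackbox library.

```python
from concurrent.futures import ThreadPoolExecutor

def f(x):
    # stand-in for an expensive black-box objective (sum of squares)
    return sum(xi ** 2 for xi in x)

def parallel_gradient(x, h=1e-6):
    """Forward-difference gradient; the D extra evaluations run concurrently.
    For a truly expensive f, a process pool or cluster would be used instead."""
    fx = f(x)
    def d(i):
        xh = list(x)
        xh[i] += h
        return (f(xh) - fx) / h
    with ThreadPoolExecutor() as pool:
        return list(pool.map(d, range(len(x))))

def sgd_step(x, lr=0.1):
    g = parallel_gradient(x)
    return [xi - lr * gi for xi, gi in zip(x, g)]

# a few descent steps on the toy objective
x = [3.0, -2.0]
for _ in range(50):
    x = sgd_step(x)
```

Note the cost profile: each step needs D + 1 function calls (D partials plus the base point), which is exactly what parallelism hides, at the price of D workers per step.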

This method is gradient-free (only uses function values). Let me know if that answers.

Thanks.

Hi Paul, I don't think that really answers the question Hugo was getting at. Actually, I had a similar question myself and found this issue before asking.

As far as I know, with black-box problems one can either use an algorithm like yours, specifically designed for black boxes, or use, say, a BFGS optimizer, which relies on gradients. The problem is that calculating the gradient is slow when there are many parameters, but if one can solve the problem with parallel computing (as you are doing), then one could potentially calculate the gradients in parallel too and just use BFGS...

Do you know whether it is faster to use your black-box algorithm, or to simply use BFGS and compute the gradient in parallel at each step (adding essentially no wall-clock time per step)? To rephrase: are N steps of your derivative-free algorithm faster than one BFGS step with the gradient computed in parallel?
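The BFGS-with-parallel-gradients idea from this question can be sketched by passing a custom `jac` to `scipy.optimize.minimize` that evaluates forward differences with a thread pool. The objective `f` is again an illustrative stand-in; whether this beats a derivative-free method depends on the cost and smoothness of the real objective.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np
from scipy.optimize import minimize

def f(x):
    # illustrative smooth objective with minimum at x = (1, ..., 1)
    return float(np.sum((x - 1.0) ** 2))

def parallel_jac(x, h=1e-7):
    """Forward-difference Jacobian; each partial derivative runs concurrently,
    so one 'gradient' costs roughly one function evaluation of wall-clock time."""
    fx = f(x)
    def d(i):
        xh = x.copy()
        xh[i] += h
        return (f(xh) - fx) / h
    with ThreadPoolExecutor() as pool:
        return np.array(list(pool.map(d, range(len(x)))))

res = minimize(f, x0=np.zeros(5), jac=parallel_jac, method="BFGS")
```

On a smooth objective like this one, BFGS with numerical gradients converges in a handful of steps; the caveat raised below is that noisy or cliff-ridden objectives can make those finite differences meaningless.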

The current method (blackbox) is primarily for hyper-parameter optimization (i.e., an expensive function in a relatively low-dimensional space). For such cases, the current approach (reconstructing the function from its values) is definitely more efficient than computing gradients, especially since the function may have noise and/or irregularities (cliffs, valleys, etc.). However, this is an empirical/intuitive statement; I can't provide exact numbers or proofs. Feel free to prove me wrong.
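To illustrate the "reconstructing the function from its values" idea, here is a minimal surrogate-model sketch: fit a cheap RBF interpolant to a batch of expensive samples (the batch itself can be evaluated in parallel), then search the surrogate instead of the true function. This captures the spirit of the approach, not the exact blackbox algorithm, and `expensive` is an illustrative stand-in.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(0)

def expensive(x):
    # stand-in 2-D objective with minimum at the origin
    return np.sum(x ** 2, axis=-1)

# one parallel batch of expensive evaluations
samples = rng.uniform(-2, 2, size=(40, 2))
values = expensive(samples)

# cheap surrogate reconstructed from those values alone (no gradients)
surrogate = RBFInterpolator(samples, values)

# searching the surrogate is nearly free, so a dense grid is affordable
grid = np.stack(np.meshgrid(np.linspace(-2, 2, 101),
                            np.linspace(-2, 2, 101)), axis=-1).reshape(-1, 2)
best = grid[np.argmin(surrogate(grid))]
```

In a real surrogate loop, `best` would be evaluated with the expensive function and fed back into the model; the key point is that no finite difference is ever taken, so noise in individual evaluations is smoothed rather than amplified.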

At the opposite end of the spectrum (a cheap function in high dimensions), gradient-based methods would probably be the only option, because the function-reconstruction step itself becomes quite expensive in high dimensions.

For intermediate cases, I assume the answer is "it depends".