
BORE: Bayesian Optimization by Density-Ratio Estimation

georgedeath opened this issue · 16 comments

Description of issue
The recent work of Tiao et al. (BORE) provides more insight than the Bergstra et al. paper discussed in the section on Conditional density estimation (p. 197). In particular, it shows the link between class-probability estimation and density-ratio estimation, and provides an algorithm that allows BO using EI to be straightforwardly transformed into a classification problem.

I think the BORE work has great potential to revolutionise BO (or at least spark the revolution), particularly when working with higher-dimensional problems and mixed input types, due to the flexibility of only needing a classifier trained with a proper scoring rule.

Proposed resolution
A discussion of the key ideas, which are only a small but powerful extension of Bergstra's work with TPE, would prove a useful addition to the book.
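For concreteness, the core loop is roughly the following. This is only a minimal sketch of the idea, not code from the paper (the propose_next helper is mine, and I've assumed scikit-learn's RandomForestClassifier, but any classifier trained with a proper scoring rule would do):

    # Minimal sketch of the classification-based loop: label the best
    # gamma-fraction of observations as "good", fit a probabilistic classifier,
    # and use its predicted class probability as the acquisition function.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def propose_next(X, y, candidates, gamma=0.25):
        """Return the candidate with the highest estimated Pr(y <= tau | x)."""
        tau = np.quantile(y, gamma)                   # threshold at the gamma-quantile
        labels = (y <= tau).astype(int)               # 1 = "good", 0 = "bad"
        clf = RandomForestClassifier(n_estimators=100).fit(X, labels)
        scores = clf.predict_proba(candidates)[:, 1]  # class-probability estimate
        return candidates[np.argmax(scores)]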

Hi George, thank you for this pointer as well! This will get lumped together with #1; this subsection clearly needs a bit of work. I may rewrite it entirely in light of the BORE viewpoint.

I think that'd be a great addition; the BORE framework opens up many avenues of future work that we all somehow missed after the TPE paper.

It is probably worth noting that the idea of using a classifier to split between 'good' and 'bad' values has been done quite extensively in the field of evolutionary computation, albeit mainly for multi-objective optimization where the 'good' class contains locations on (or close to) the current Pareto front of evaluated solutions, and the 'bad' class contains the rest of the solutions. This was all done, and is still being done, without any connection to BO or EI.

@georgedeath a question about the BORE paper -- it seems to me that the proposal is ultimately to maximize π(x) = Pr(y > y* | x, D). But that's just probability of improvement over y*, right? In the appendix the authors point out that PI and EI are proportional "in the formulation of Bergstra et al." (which I interpret to mean class-conditional density estimation), but that PI and EI are not proportional given a Gaussian predictive distribution. However, I can't reconcile this with the observation that one way to estimate ℓ(x) and g(x) would be from a GP belief (e.g., after normalization), where EI and PI are not proportional. What am I missing?

I don't think we can estimate ℓ(x) and g(x) from a GP directly; rather, we can estimate the CPE via a GP classifier -- I may be misunderstanding your final sentence. I do find it interesting/odd that EI and PI are not proportional given a Gaussian predictive distribution, but are proportional in the density ratio formulation.

@ltiao Could you enlighten us on these points?

Looking at this more closely, I don't see how it's valid to pull ℓ(x) out of the integral on the first page of the BORE supplement (or the equivalent on page 4 of Bergstra et al.). My analysis is below.

Let's get some definitions out of the way:

       γ = Pr(y < τ)
    ℓ(x) = p(x | y < τ)

Let a = (-∞, τ) be the interval of integration. We may derive the following
expression for ℓ(x) as an integral on a:

    ℓ(x) = p(x | y < τ)
         = ∫ p(x | y) p(y | y < τ) dy
         = γ⁻¹ ∫ₐ p(x | y) p(y) dy,

where the γ⁻¹ comes from the normalization constant in p(y | y < τ), which equals
γ⁻¹ p(y) on a (and zero elsewhere). Thus

    ∫ₐ p(x | y) p(y) dy = γℓ(x).

Now we must compute

    ∫ₐ (τ - y) p(x | y) p(y) dy,

which we break into two integrals. The first is

    τ ∫ₐ p(x | y) p(y) dy = τγℓ(x).

This seems fine. But now we are left with

            ∫ₐ y p(x | y) p(y) dy
    =? ℓ(x) ∫ₐ y p(y) dy,

and I don't see how ℓ(x) can be taken out of the integral here since p(x | y) is not constant wrt y.

To interpret that last bit, the integral wants to weight p(x | y) according to the magnitude of y, which makes sense given the context of EI. (Incorrectly) pulling ℓ(x) out of the integral ignores that weighting, which (as can be verified in the second part of the BORE supplement) actually leaves us with PI rather than EI.
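A quick numerical sanity check on a small discrete joint distribution (my own sketch, not taken from either paper) shows the two sides disagree whenever p(x | y) varies with y:

    # Compare the exact integral with the version obtained by pulling ℓ(x) out,
    # on a toy discrete p(x, y).
    import numpy as np

    rng = np.random.default_rng(0)
    joint = rng.random((3, 5))
    joint /= joint.sum()                       # joint p(x, y) on a 3 x 5 grid

    y_vals = np.arange(5, dtype=float)
    tau = 3.0
    a = y_vals < tau                           # the region of integration

    p_y = joint.sum(axis=0)                    # p(y)
    p_x_given_y = joint / p_y                  # p(x | y)
    gamma = p_y[a].sum()                       # γ = Pr(y < τ)
    ell = joint[:, a].sum(axis=1) / gamma      # ℓ(x) = p(x | y < τ)

    lhs = (y_vals[a] * p_x_given_y[:, a] * p_y[a]).sum(axis=1)  # ∫ₐ y p(x | y) p(y) dy
    rhs = ell * (y_vals[a] * p_y[a]).sum()                      # ℓ(x) ∫ₐ y p(y) dy
    print(lhs, rhs)                            # these differ in general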

I don't think we can estimate ℓ(x) and g(x) from a GP directly; rather, we can estimate the CPE via a GP classifier

Sure you can. Write ℓ(x) = p(x | y < τ) ∝ Pr(y < τ | x) p(x). We do need to choose a prior p(x), but we could just e.g. take it to be uniform. Now ℓ(x) ∝ Pr(y < τ | x), which is just PI. I actually went through the whole construction with a GP and can confirm that everything works out perfectly with the PI claim (perfect coincidence in the acquisition function via either construction). I currently believe there to be an error in the EI claim, as outlined above.
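Something like the following sketch captures the construction (made-up posterior moments and a uniform prior over a finite candidate set, not the actual code I ran): form ℓ(x) and g(x) via Bayes' rule from the GP's predictive probabilities, and the density-ratio acquisition picks the same point as PI.

    # Sketch of the construction: ℓ(x) ∝ Pr(y < τ | x) p(x), g(x) ∝ Pr(y ≥ τ | x) p(x),
    # with a uniform p(x); maximizing ℓ(x)/g(x) coincides with maximizing PI(x).
    import numpy as np
    from scipy.stats import norm

    mu = np.array([0.3, -0.1, 0.6, 0.2])        # GP posterior means at candidates
    sigma = np.array([0.5, 0.4, 0.7, 0.3])      # GP posterior standard deviations
    tau = 0.0                                   # improvement threshold (minimizing y)
    p_x = np.full(mu.shape, 1.0 / mu.size)      # uniform prior over the candidates

    pi = norm.cdf((tau - mu) / sigma)           # PI(x) = Pr(y < τ | x)
    gamma = (pi * p_x).sum()                    # γ = Pr(y < τ) under this prior
    ell = pi * p_x / gamma                      # ℓ(x) = p(x | y < τ)
    g = (1.0 - pi) * p_x / (1.0 - gamma)        # g(x) = p(x | y ≥ τ)

    assert np.argmax(ell / g) == np.argmax(pi)  # same maximizer either way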

Sure you can. Write ℓ(x) = p(x | y < τ) ∝ Pr(y < τ | x) p(x).

You're absolutely right, I wasn't thinking!

(Incorrectly) pulling ℓ(x) out of the integral ignores that weighting, which (as can be verified in the second part of the BORE supplement) actually leaves us with PI rather than EI.

The Bergstra and BORE papers both just take ℓ(x) = p(x | y) over (-∞, τ) and pull it out of their integrals. I did a similar derivation to your analysis above and came to the same conclusion: p(x | y) can't just be substituted for ℓ(x) because of the weighting by y -- I'd argue this is one of the things that makes EI > PI in the first place.

Considering TPE has been cited over 3000 times, you'd have hoped someone else would have noticed this by now (or maybe they have and I haven't seen the work).

Thinking about my last comment a bit further, you don't actually need to choose a prior p(x) because in the density ratio it cancels (assuming p(x) has support over the whole domain) and you're left with ℓ(x)/g(x) ∝ PI/(1 - PI).
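Writing it out with Bayes' rule (with g(x) = p(x | y ≥ τ) and PI(x) = Pr(y < τ | x) as above):

    ℓ(x) = p(x | y < τ) = Pr(y < τ | x) p(x) / γ
    g(x) = p(x | y ≥ τ) = Pr(y ≥ τ | x) p(x) / (1 − γ)

    ℓ(x)/g(x) = [(1 − γ)/γ] · Pr(y < τ | x) / Pr(y ≥ τ | x)
              = [(1 − γ)/γ] · PI(x) / (1 − PI(x)),

and the constant factor doesn't change the maximizer.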

Thanks for your supporting analysis. I'll follow up with the authors of both papers to triple check I'm not mistaken. Fortunately, I don't think there's a huge disaster here -- maximizing the density ratio is equivalent to maximizing probability of improvement, which is a perfectly reasonable policy!

Thanks for your supporting analysis. I'll follow up with the authors of both papers to triple check I'm not mistaken. Fortunately, I don't think there's a huge disaster here -- maximizing the density ratio is equivalent to maximizing probability of improvement, which is a perfectly reasonable policy!

No worries -- I've also asked a few colleagues to take a look at this thread and let me know if they agree with your analysis. And you're right, no disaster here, the works are still valuable because we're able to perform a BO-esque procedure using arbitrary (appropriately trained) classification models.

I haven't yet heard back from @ltiao or @jaberg but I attempted to reframe/clarify this content in the most recent version of the book. Closing for now but feel free to reopen!

ltiao commented

Hi @bayesoptbook @georgedeath

Many thanks for initiating this discussion! Consistent with your analysis, we interpreted the step in Bergstra et al. (2011) of taking p(x | y) to be equivalent to p(x | y < τ) = ℓ(x) for all y < τ as a simplifying modelling assumption. In other words, it is assumed that p(x | y) is piecewise constant in y, with the two pieces lying on either side of τ. In general, there will be a discontinuity at τ, and we can think of TPE/BORE as operating on the left side of this discontinuity.
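Under that assumption the step in question becomes exact: if p(x | y) = ℓ(x) for all y < τ, then

    ∫ₐ y p(x | y) p(y) dy = ℓ(x) ∫ₐ y p(y) dy,

and the EI factorization goes through.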

Thank you for clarifying this point in your book, and we will also try to update our paper to be more explicit regarding this important point.

@ltiao thank you for your comments! I can see that assuming p(x | y) is constant for y < τ allows you to pull it out of the integral as ℓ(x), but it seems to be a bit of a crude approximation as ℓ(x) is itself the average of p(x | y) over all y < τ, weighted by p(y):

ℓ(x) = p(x | y < τ)
     = ∫ p(x | y) p(y | y < τ) dy

So the definition of ℓ(x) in some sense acknowledges the nonconstancy of p(x | y) wrt y. Of course there's no such thing as a "wrong" approximation!

Hi @bayesoptbook

Thanks for the discussion on this topic! I was having similar concerns about TPE/BORE and EI at about the same time, which led me to submit this workshop paper at NeurIPS: http://bayesiandeeplearning.org/2021/papers/54.pdf

Ha what a funny coincidence!

@jiamings and @LantaoYu have since come out with this great extension towards likelihood-free BO -- really neat ideas, particularly how to use EI in these classification-based settings: https://proceedings.mlr.press/v162/song22b.html

@bayesoptbook It'd be really good if this interpretation could make it into the book over the BORE/TPE density-ratio explanation, but I fear it may be too close to publication. Maybe for v2 of the book? :)

@georgedeath Thanks for promoting this!

@bayesoptbook Let me know if you have any questions regarding our work.