bayesoptbook/bayesoptbook.github.io

Updated posterior mean incorrect or unclear

kiudee opened this issue · 8 comments

Thank you for the great book. Much like "Gaussian Processes for Machine Learning" by Rasmussen and Williams is for Gaussian processes, it is a great reference for Bayesian optimization.

Describe the erratum
It appears that the update formula for the posterior mean on page 161 is either incorrect or stated unclearly, since implementing it as written yields incorrect results:
[screenshot of the posterior mean update formula from page 161]
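
For reference, here is my transcription of the formula (the notation is mine and may not match the book's typesetting exactly):

$$\mu_{\mathcal{D}'}(x') = \mu_{\mathcal{D}}(x') + \frac{y - \mu_{\mathcal{D}}(x)}{s^2} \, K_{\mathcal{D}}(x', x)$$

where $\mu_{\mathcal{D}}$ and $K_{\mathcal{D}}$ are the posterior mean and covariance functions given the current data $\mathcal{D}$, $(x, y)$ is the new observation, and $s$ is a standard deviation whose exact meaning is part of what I am asking about below.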

I implemented this update formula and get the following behavior:
[plot: ground-truth function (dashed), current posterior means (blue), updated posterior means (green), new observation (black)]
The dashed line shows the ground-truth function. The blue points depict the current posterior means at the existing data points (the GP was fit using a noise standard deviation of 0.01), so the GP is already modelling the function quite well.
Now I simulate what would happen if I gave the model the global minimum as an additional data point (x=5.145735, y=-1.899599, shown in black).
I use the update formula from above to compute the updated posterior means of the existing data points (shown in green).
As is apparent, the green points are far off from the existing fit.

The pseudocode of the example is the following:

# The object gp is our existing model and X is the existing data
x = 5.145735
y = -1.899599

xp = np.concatenate([X, [x]])
# mu_D_xp corresponds to the blue points:
mu_D_xp, s = gp.predict(xp, return_std=True)
k = (
    (y - mu_D_xp[-1]) / np.power(s[-1], 2) * 
    gp.kernel_(xp, [xp[-1]]).flatten()
)
# mu_Dp_xp are the green points:
mu_Dp_xp = mu_D_xp + k

Version and Location
Version: Draft as of 8 October 2021
Page number: 161

Proposed correction
If the equation from the book is correct, then clarify the constituents of the equation.
Otherwise, fix the equation.

Screenshots
See above.

Additional context
I am in the process of implementing the noisy version of expected improvement. Since the criterion was producing nonsensical values, I isolated the problem and arrived at the vector b (equation (8.12)) as the root cause.

The formula is correct. Your code isn't complete enough to run/debug so I'm not sure what the problem is exactly. Note that you need to evaluate the posterior covariance between x' and x. Feel free to reopen and comment with more complete code.

That’s fair. I will implement a minimal working example using the GP implementation in sklearn. That should make it easy to run.

In any case, gp.kernel_(xp, [xp[-1]]) is the posterior covariance between x' and x.

I reimplemented the code using the GP implementation in sklearn. I fixed the random seed for full reproducibility. The results are similar:
[plot: the sklearn reproduction, showing the same behavior as above]

Here is the code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

#plt.style.use("science")  # Needs the package SciencePlots
colors = plt.cm.get_cmap("Set1").colors

# Create the kernel, random generator and GP model:
kernel = ConstantKernel() * Matern(nu=2.5) + WhiteKernel()
rand = np.random.RandomState(42)
gpr = GaussianProcessRegressor(kernel=kernel, random_state=rand, normalize_y=False)


def problem02(x):
    """This is our target function."""
    return np.sin(x) + np.sin(10 / 3 * x)

# Generate a few random samples and fit the GP:
X = rand.uniform(2.7, 7.5, size=14)
Y = problem02(X) + rand.randn(14) * 0.01
gpr.fit(X.reshape(-1, 1), Y)

# Try to add the global optimum as a point to test the update formula for the
# posterior mean:
x = 5.145735
y = -1.899599
xp = np.concatenate([X, [x]])
mu_D_xp, s = gpr.predict(xp.reshape(-1, 1), return_std=True)
k = (
    (y - mu_D_xp[-1])
    / np.power(s[-1], 2)
    * gpr.kernel_(xp.reshape(-1, 1), [[xp[-1]]]).flatten()  # gpr.kernel_ is the posterior kernel function
)
mu_Dp_xp = mu_D_xp + k

# Plot the results:
fig, ax = plt.subplots(figsize=(8, 5))
# First the posterior means for the data points before updating:
ax.plot(xp, mu_D_xp, "o", label=r"$\mu_{\mathcal{D}}(x)$", color=colors[1])
# Then the posterior means after applying the update formula:
ax.plot(xp, mu_Dp_xp, "o", label=r"$\mu_{\mathcal{D}'}(x)$", color=colors[2])

xx = np.linspace(2.7, 7.5, num=1000)
ax.plot(
    xx,
    problem02(xx),
    "--k",
    label="Target function",
)
ax.plot(
    xx,
    gpr.predict(xx.reshape(-1, 1)),
    color=colors[1],
    label="GP posterior mean process",
)

ax.plot(x, y, "ok", label="Observation")
ax.axvline(x, alpha=0.3, color="k")
ax.axhline(y, alpha=0.3, color="k")
ax.legend()

I used numpy==1.19.1, matplotlib==3.3.1 and sklearn==0.23.2.

The learned kernel hyperparameters are:

>>> gpr.kernel_
0.917**2 * Matern(length_scale=0.555, nu=2.5) + WhiteKernel(noise_level=9.02e-05)

I played around a bit and tried a variation where I use the posterior kernel hyperparameters for s:

mu_D_xp = gpr.predict(xp.reshape(-1, 1))
# Construct s from the fitted signal variance and noise variance (kernel_.theta holds
# the log-hyperparameters in the order [constant_value, length_scale, noise_level]):
s = np.sqrt((np.exp(gpr.kernel_.theta)[[0, 2]]).sum())
cov = gpr.kernel_(xp.reshape(-1, 1), [[xp[-1]]]).flatten()
#cov = gpr.predict(xp.reshape(-1, 1), return_cov=True)[1][-1]
k = (
    (y - mu_D_xp[-1])
    / np.power(s, 2)
    * cov
)
mu_Dp_xp = mu_D_xp + k

This produces a much more plausible result:
[plot: updated posterior means using the hyperparameter-based s, staying close to the existing fit]

So it seems the main problem was using the predictive standard deviation from predict in place of a standard deviation constructed from the fitted signal and noise variances.

edit: In that case, I wonder how best to deal with heteroscedastic noise if it is only added to the diagonal of the kernel matrix and not modelled as a function.

From my reading of the documentation, gpr.kernel_ does not return the posterior covariance but rather the prior covariance with fitted hyperparameters. I believe you need to use gpr.predict(..., return_cov=True) (and modify the rest of the code appropriately).

The formula in the book is correct (with the predictive standard deviation s).

Using the posterior covariance from gpr.predict(..., return_cov=True) indeed looks very plausible:
[plot: updated posterior means computed with the posterior covariance]

Regarding s: if this is the predictive standard deviation, then it should be a vector of length |x'|, correct? It is not clear from that section of the book; since s is not bolded, it could also be the predictive standard deviation at the single point x where we simulate y.
edit: The scalar predictive standard deviation at the point x seems more likely, since that is the point we integrate over.

No, s is the (scalar) predictive standard deviation of y, see (8.5).

Thanks for your help and patience. I think everything is resolved now, and noisy EI is working as well.
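
For future readers, putting the corrections from this thread together, the update can be sketched against sklearn's GaussianProcessRegressor roughly as follows. This is an untested sketch: it assumes that, because the WhiteKernel is part of the kernel, the covariance returned by gpr.predict(..., return_cov=True) already includes the noise variance on its diagonal, so its last diagonal entry is the predictive variance s**2 of y at the new point x.

# Posterior mean and covariance of the current model, evaluated at X plus the new x:
mu_D_xp, cov_D = gpr.predict(xp.reshape(-1, 1), return_cov=True)
# Predictive variance of the observation y at x (posterior variance of f plus noise,
# since the WhiteKernel contributes to the returned diagonal):
s2 = cov_D[-1, -1]
# Rank-one update of the posterior mean after conditioning on the simulated (x, y).
# (The last entry of cov_D[:, -1] also contains the noise variance, which slightly
# over-corrects the mean at x itself; with the tiny fitted noise here that is negligible.)
mu_Dp_xp = mu_D_xp + (y - mu_D_xp[-1]) / s2 * cov_D[:, -1]

Applied to the example above, the updated (green) points should then stay close to the current fit, moving only slightly toward the new observation.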