kiudee/bayes-skopt

Allow optimization-based acquisition functions


Computing acquisition functions only on a set of sampled candidate points is problematic in high-dimensional spaces, where the distance from the nearest candidate to the true optimum of the acquisition function will be large on average.
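
To make the claim concrete: for a fixed candidate budget, the expected distance from a random target point to its nearest candidate grows quickly with the dimension. A minimal sketch (the candidate budget, trial count, and dimensions are arbitrary choices for illustration):

import numpy as np

rng = np.random.RandomState(0)
n_candidates, n_trials = 1000, 200  # fixed candidate budget, repeated trials

for dim in (1, 2, 5, 10, 20):
    dists = []
    for _ in range(n_trials):
        # Uniform candidates and a random target point in the unit cube
        candidates = rng.uniform(size=(n_candidates, dim))
        target = rng.uniform(size=(1, dim))
        dists.append(np.linalg.norm(candidates - target, axis=1).min())
    print(f"dim={dim:2d}: mean distance to nearest candidate ~ {np.mean(dists):.3f}")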

Hi @kiudee,
The package looks quite good! I'm particularly interested in Thompson Sampling; however, my problem is high-dimensional, so I need to use the approximation from Hernández-Lobato et al. (2014).
Did you have a look into it already?
Cheers

Hi @michaelosthege, thank you for the reference.
One of my near-future todos was to support parallel evaluation of configurations in a way that goes beyond a simple constant liar strategy.
Thompson Sampling already seemed like the most natural acquisition function for parallelization.

From having a cursory glance at the paper, it looks like it should be straightforward to implement.
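
To make the parallelization idea concrete, here is a minimal batch Thompson sampling sketch over a discrete candidate set (scikit-learn's GaussianProcessRegressor and the toy objective are stand-ins I am assuming, not the bayes-skopt API):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.RandomState(0)

# Toy 1D objective and a few observed points (stand-ins for a real problem)
def f(x):
    return np.sin(3 * x) + 0.1 * x

X_obs = rng.uniform(0, 5, size=(8, 1))
y_obs = f(X_obs).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_obs, y_obs)

# Batch Thompson sampling: draw one posterior sample per worker and propose
# the candidate minimizing each sample.
X_cand = np.linspace(0, 5, 500).reshape(-1, 1)
samples = gp.sample_y(X_cand, n_samples=4, random_state=0)  # shape (500, 4)
batch = X_cand[np.argmin(samples, axis=0)]
print("proposed batch:", batch.ravel())

Each posterior sample is minimized independently, so the resulting batch can be evaluated in parallel without a constant liar heuristic.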

The paper was accompanied by a MATLAB implementation for which I also found a Python port.

Unfortunately, due to the code formatting & absence of useful docstrings/comments, I found it hard to understand. I have trouble figuring out which values I'll have to pass & which constraints this approximation may put on my GP model.

In your case I would look at quadrature Fourier features (QFF) to implement the Thompson sampling part:
https://papers.nips.cc/paper/8115-efficient-high-dimensional-bayesian-optimization-with-additivity-and-quadrature-fourier-features
I found an implementation here:
https://github.com/Mojusko/QFF
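
For reference, here is a minimal sketch of Fourier-feature-based Thompson sampling for an RBF kernel (it uses plain random Fourier features rather than the quadrature construction from the paper, and the lengthscale, noise level, and toy data are assumptions):

import numpy as np
from scipy.optimize import minimize

rng = np.random.RandomState(0)

# Toy problem and assumed RBF kernel hyperparameters
dim, n_features, noise = 2, 500, 1e-3       # noise = observation noise variance
lengthscale, signal_var = 0.5, 1.0
X_obs = rng.uniform(size=(20, dim))
y_obs = np.sin(3 * X_obs).sum(axis=1)

# Random Fourier features approximating the RBF kernel:
# k(x, x') ~ phi(x) @ phi(x'), with omega drawn from the kernel's spectral density
omega = rng.normal(scale=1.0 / lengthscale, size=(dim, n_features))
b = rng.uniform(0, 2 * np.pi, size=n_features)
phi = lambda X: np.sqrt(2 * signal_var / n_features) * np.cos(X @ omega + b)

# Bayesian linear regression in feature space, then draw one weight vector
# from its posterior -- this is one approximate sample of the GP posterior.
Phi = phi(X_obs)
A = Phi.T @ Phi + noise * np.eye(n_features)
mean_w = np.linalg.solve(A, Phi.T @ y_obs)
cov_w = noise * np.linalg.inv(A)
theta = rng.multivariate_normal(mean_w, cov_w)

# The sampled function is a cheap linear model in the features, so it can be
# handed to any continuous optimizer instead of a fixed candidate grid.
sample_f = lambda x: (phi(x.reshape(1, -1)) @ theta).item()
res = minimize(sample_f, x0=rng.uniform(size=dim), bounds=[(0, 1)] * dim)
print("Thompson sample proposes:", res.x)

Repeating the draw-and-minimize step with fresh weight samples yields a batch of proposals for parallel evaluation.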

Use it in conjunction with predictive variance reduction search (PVRS):
https://bayesopt.github.io/papers/2017/13.pdf
which works similarly to PES but is faster and simpler to implement.

edit: In bayes-skopt PVRS is implemented as follows:

import numpy as np
from scipy.linalg import cholesky, cho_solve


# FullGPAcquisition is the acquisition base class provided by bayes-skopt.
class PVRS(FullGPAcquisition):
    """Implements the predictive variance reduction search algorithm.

    The algorithm draws a set of Thompson samples (samples from the optimum
    distribution) and proposes the point which reduces the predictive variance
    of these samples the most.

    References
    ----------
    [1] Nguyen, Vu, et al. "Predictive variance reduction search." Workshop on
        Bayesian optimization at neural information processing systems (NIPSW). 2017.
    """

    def __call__(self, X, gp, *args, n_thompson=10, random_state=None, **kwargs):
        n = len(X)
        # Thompson samples: each posterior draw is minimized over the candidate
        # set, giving a set of plausible optimum locations.
        thompson_sample = gp.sample_y(
            X, sample_mean=True, n_samples=n_thompson, random_state=random_state
        )
        thompson_points = np.array(X)[np.argmin(thompson_sample, axis=0)]
        covs = np.empty(n)
        for i in range(n):
            # Pretend candidate i has been evaluated (noise-free) and recompute
            # the kernel matrix of the augmented training set.
            X_train_aug = np.concatenate([gp.X_train_, [X[i]]])
            K = gp.kernel_(X_train_aug)
            if np.iterable(gp.alpha):
                K[np.diag_indices_from(K)] += np.concatenate([gp.alpha, [0.0]])
            L = cholesky(K, lower=True)
            K_trans = gp.kernel_(thompson_points, X_train_aug)
            v = cho_solve((L, True), K_trans.T)
            cov = K_trans.dot(v)
            # Total variance at the Thompson points explained by the augmented
            # training set; larger values mean a bigger variance reduction.
            covs[i] = np.diag(cov).sum()
        return covs

Note, however, that this samples from a random set of points instead of using Fourier features.
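
As a hypothetical usage sketch (assuming gp is a fitted bayes-skopt model exposing the attributes used above, and that PVRS can be instantiated without constructor arguments):

import numpy as np

rng = np.random.RandomState(0)

# Random candidates in the unit cube; dimensionality taken from the training data
X_cand = rng.uniform(size=(500, gp.X_train_.shape[1]))

# The returned values are the variance at the Thompson points explained by each
# augmented training set, so the candidate maximizing them reduces the
# predictive variance the most.
explained_var = PVRS()(X_cand, gp, n_thompson=10, random_state=rng)
x_next = X_cand[np.argmax(explained_var)]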

Hi @kiudee,
Thank you for those links. It took me a while to read them, as well as Bradford (2018), which has a really good description of the RFF algorithm.

My impression is that QFF makes some assumptions about the independence of the problem dimensions that we don't feel comfortable with for our problem. (That might change, though, if we run into problems with computational complexity 😅)

My RFF implementation still has a bug, resulting in too much posterior variance, but I think it's just a **2 or numpy.log somewhere.
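
In case it helps with the debugging: a quick sanity check is to compare the Fourier-feature Gram matrix against the exact RBF kernel on random points (the lengthscale and feature count below are arbitrary assumptions); a missing sqrt(2 / n_features) factor or a lengthscale that is off by a square shows up immediately:

import numpy as np

rng = np.random.RandomState(0)
dim, n_features, lengthscale = 3, 2000, 0.7  # arbitrary choices

# RFF map for the RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 * lengthscale**2));
# omega is drawn from the kernel's spectral density N(0, 1 / lengthscale**2).
omega = rng.normal(scale=1.0 / lengthscale, size=(dim, n_features))
b = rng.uniform(0, 2 * np.pi, size=n_features)
phi = lambda X: np.sqrt(2.0 / n_features) * np.cos(X @ omega + b)

X = rng.uniform(size=(50, dim))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-sq_dists / (2 * lengthscale ** 2))
K_approx = phi(X) @ phi(X).T

print("max abs error:", np.abs(K_exact - K_approx).max())
print("diagonal (should be ~1):", K_approx.diagonal()[:5])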