EmuKit/emukit

Questions about the multi-fidelity model

AnthonyLarroque opened this issue · 9 comments

Hi everybody,

I have some questions about the multi-fidelity model:

  1. For the hyperparameters of the multi-fidelity model, does emukit find them by maximizing the log-likelihood of the whole data set (cheap and expensive samples) in one step, or does it, as in the paper of Kennedy and O'Hagan (2000), first find the hyperparameters of the cheap kernel with the cheap data, and then the scale and the hyperparameters of the error kernel with the difference between the expensive and cheap outputs at the common points? Is the Markov property automatically respected by the optimization of the multi-fidelity model in emukit?

  2. I saw that in the notebook combining multi-fidelity and Bayesian optimization, you constrained the lengthscales between 0.01 and 0.5. Why did you do that? Is it to avoid local optima of the likelihood at large lengthscales, or is it because large lengthscales make little sense for a domain of size 1? Do you have any tips on choosing these constraints according to the domain size of the function?

  3. If there are constraints on the lengthscales, are there cases where constraints on the variance (of the RBF kernels, for example) should be applied as well? With multi-fidelity (and even with single fidelity), I sometimes get variances on the order of millions, and I wonder whether such high variances make much sense.

  4. I also saw that in the notebook combining multi-fidelity and Bayesian optimization you added a noise of 0.1 to avoid overfitting. However, in the notebook with only the multi-fidelity model, this noise is fixed to 0. So how can you know a priori whether some noise is needed? By the way, correct me if I am wrong, but GPy automatically adds a noise variance of 1e-8 (the jitter) to avoid ill-conditioned matrices, right?

  5. Finally, do you think that, when working with multi-fidelity, the observations should be normalized to mean 0 and standard deviation 1 for each fidelity, or should they be kept in their original form?

Kind regards,
Anthony Larroque

Hi Anthony,

Thanks for the questions. I am not a huge expert in MF, so the answers might not be completely satisfactory. Nevertheless, let's try:

  1. It seems to be the latter approach; the relevant model update code is here: https://github.com/EmuKit/emukit/blob/main/emukit/multi_fidelity/models/non_linear_multi_fidelity_model.py#L340
  2. A lot of the time such constraints are added for stability reasons, to ensure the optimization does not crash on some bad random seeds. The general advice, however, is that such parameter priors should come from your knowledge and understanding of the problem, which may or may not be connected to the size of the domain. A general rule of thumb like "the parameter can be no more than X% of the range" just does not exist, I am afraid.
  3. Variance can certainly be bounded in the same way in GPy; here is one example of that: SheffieldML/GPy#735 (comment), and see the sketch after this list. Again, I will refrain from any general advice.
  4. In these particular notebooks it would make no difference, since in both cases we don't seem to add any noise to the objective. In more complex situations you would know whether your observations are exact or noisy. I believe you are correct about the jitter in GPy.
  5. Normalization can be beneficial for GP regression, so generally speaking the answer is yes. But it can get tricky, so one needs to apply some care.
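For points 2 and 3, this is roughly what such bounds look like in GPy. This is only a minimal sketch on toy data; the exact parameter paths depend on your kernel structure:

```python
import numpy as np
import GPy

# Toy data, just to have something to fit
X = np.random.rand(20, 1)
Y = np.sin(10 * X) + 0.1 * np.random.randn(20, 1)

model = GPy.models.GPRegression(X, Y, GPy.kern.RBF(input_dim=1))

# Keep the lengthscale within the range used in the BO notebook
model.kern.lengthscale.constrain_bounded(0.01, 0.5)

# Variance can be bounded in exactly the same way, if you have a reason to
model.kern.variance.constrain_bounded(1e-2, 1e2)

# Or fix the observation noise instead of learning it (0.1 as in the notebook)
model.Gaussian_noise.variance.fix(0.1)

model.optimize()
```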

To aid your interest in multi-fidelity, I would suggest checking out this great lecture from Prof. Lawrence: https://mlatcl.github.io/mlphysical/lectures/05-02-multifidelity.html . It also uses Emukit, and goes into many concepts much deeper than I ever could.

Hi apaleyes,

Thank you very much for the answers and the useful documentation! That helps a lot!

Even though I saw the tutorial notebook on multi-fidelity, I am not really familiar with the non-linear multi-fidelity model. With my first question, I was referring more to the linear multi-fidelity model and the linear multi-fidelity kernel. Are there any lines in linear_multi_fidelity_kernel.py or in linear_model.py that show that the hyperparameters are first optimized on the low-fidelity model and then on the error function?

I also attach a zip file containing a notebook with a small model wrapper that I created for the error function. I sometimes get different results than with the current linear multi-fidelity wrapper in emukit; sometimes these results are worse, sometimes better. So I am pretty confused about the right way to treat the hyperparameters of the multi-fidelity model. Still, optimizing the hyperparameters of the multi-fidelity model appears to be faster with the wrapper that I created.

Kind regards,
Anthony Larroque
hp_mf.zip

Hi there!

The linear model in Emukit is implemented as a single GP; refer to this notebook for an explanation: https://nbviewer.org/github/emukit/emukit/blob/main/notebooks/Emukit-tutorial-multi-fidelity.ipynb . So there are no separate low and high fidelity models, there is just one model.

These models aren't deterministic, so some stochasticity in the outputs should be expected. That's why, as far as I understand it, GPy has optimisation restarts. Perhaps consider the same thing with MF models?
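For reference, here is a minimal sketch of how that single GP is typically put together, following the tutorial notebook (toy data for illustration only):

```python
import numpy as np
import GPy
from emukit.multi_fidelity.convert_lists_to_array import convert_xy_lists_to_arrays
from emukit.multi_fidelity.kernels import LinearMultiFidelityKernel
from emukit.multi_fidelity.models import GPyLinearMultiFidelityModel
from emukit.model_wrappers.gpy_model_wrappers import GPyMultiOutputWrapper

# Toy data: high-fidelity points nested in the low-fidelity ones
x_low = np.linspace(0, 1, 12)[:, None]
y_low = np.sin(8 * x_low)
x_high = x_low[::4]
y_high = 1.2 * np.sin(8 * x_high) + 0.3

# Stack both fidelities into one data set; an extra input column holds the fidelity index
X, Y = convert_xy_lists_to_arrays([x_low, x_high], [y_low, y_high])

# One kernel per fidelity: the first models the cheap function, the second the error
kernel = LinearMultiFidelityKernel([GPy.kern.RBF(1), GPy.kern.RBF(1)])
gpy_model = GPyLinearMultiFidelityModel(X, Y, kernel, n_fidelities=2)

# n_optimization_restarts re-runs the likelihood optimization from
# different starting points to avoid bad local optima
model = GPyMultiOutputWrapper(gpy_model, n_outputs=2, n_optimization_restarts=5)
model.optimize()
```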

Hi,

"These models aren't determnistic, so having some stochasiticity in outputs should be expected. That's why, as far as i understand it, GPy has optimisation restarts."

I do understand that the linear model is in the end a GP, and that these models are not deterministic and provide a mean and a variance for the prediction. However, I do not really understand the point about the optimization restarts. What I understand is that GPy has optimization restarts because the optimization of the marginal likelihood of the model, used to find the hyperparameters, is based on the L-BFGS-B algorithm, a gradient-based method. Thus, with several optimization restarts, corresponding to different starting points of the optimization, we avoid the L-BFGS-B algorithm being trapped in a local optimum of the marginal likelihood. Isn't that the reason for the optimization restarts? Could you please explain the link between the stochastic behaviour of the model and the optimization restarts?
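For concreteness, this is the kind of GPy call I mean (a toy sketch of my understanding, not emukit code):

```python
import numpy as np
import GPy

# Toy data and model
X = np.random.rand(15, 1)
Y = np.sin(6 * X)
model = GPy.models.GPRegression(X, Y)

# A single gradient-based run (L-BFGS-B) can get stuck in a local
# optimum of the marginal likelihood
model.optimize('lbfgsb')

# Restarts re-run the optimizer from different random initial
# hyperparameters and keep the best optimum found
model.optimize_restarts(num_restarts=10, verbose=False)
```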

Also, my question is more related to the treatment of the hyperparameters of the linear model. What I understand is that in emukit the multi-fidelity model is treated, as you said, as a single GP, and thus all the hyperparameters (the scale, and the variances and lengthscales of kern_low and kern_err in the notebook that I sent) seem to be optimized in a single step. However, what I read in Kennedy and O'Hagan (2000) is that the hyperparameters of kern_low are optimized with only the low-fidelity data, and then the hyperparameters of kern_err and the scale are found by optimizing the marginal likelihood of the linear difference between the high-fidelity and low-fidelity outputs at the common points. Indeed, they insist on the Markov property:

$cov(f_{high}(x), f_{low}(x^\prime)|f_{low}(x)) = 0$,
meaning that if we know $f_{low}$ at $x$, we can learn no more about $f_{high}$ at $x$ from any other run $f_{low}(x^\prime)$ with $x^\prime$ different from $x$. That is why I developed the error model in the notebook that I sent, in order to proceed as in the original paper: the low-fidelity model is first fitted to the low-fidelity data, then the error model and the scale are optimized at the common points, and finally the multi-fidelity model is reconstructed with the hyperparameters found. As you can see, with some inputs we can observe differences from the current implementation in emukit.
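To make the procedure concrete, here is a minimal sketch of the two-step fit I have in mind (my own illustration on toy data, not emukit code; the high-fidelity points are assumed to be a subset of the low-fidelity ones, and the scale $\rho$ is simply profiled over a grid instead of being estimated jointly with the error kernel as in the paper):

```python
import numpy as np
import GPy

# Toy data: high-fidelity points nested in the low-fidelity ones
x_low = np.linspace(0, 1, 12)[:, None]
y_low = np.sin(8 * x_low)
x_high = x_low[::4]
y_high = 1.2 * np.sin(8 * x_high) + 0.3
y_low_at_high = y_low[::4]  # low-fidelity outputs at the common points

# Step 1: fit the cheap GP on the low-fidelity data alone
model_low = GPy.models.GPRegression(x_low, y_low, GPy.kern.RBF(1))
model_low.optimize_restarts(5, verbose=False)

# Step 2: choose the scale rho and the error-kernel hyperparameters from the
# discrepancy y_high - rho * y_low at the common points
best_ll, best_rho, model_err = -np.inf, None, None
for rho in np.linspace(0.1, 2.0, 40):
    d = y_high - rho * y_low_at_high
    m = GPy.models.GPRegression(x_high, d, GPy.kern.RBF(1))
    m.optimize_restarts(3, verbose=False)
    if m.log_likelihood() > best_ll:
        best_ll, best_rho, model_err = m.log_likelihood(), rho, m
```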

So, my questions are: could you confirm that the hyperparameters of the linear multi-fidelity model are all optimized in a single step? If so, the Markov property is not respected, is it? And if it is not respected, do you know of any recent developments that explain why it is not important or relevant to consider the Markov property?

Kind regards,
Anthony Larroque

Could you please explain the link between the stochastic behaviour of the model and the optimization restarts?

Ok, I concur, your explanation for the restarts is better than mine!

So, my questions are: could you confirm that the hyperparameters of the linear multi-fidelity model are all optimized in a single step? If so, the Markov property is not respected, is it? And if it is not respected, do you know of any recent developments that explain why it is not important or relevant to consider the Markov property?

I am not aware of any such developments, as I don't follow MF research. But I can see that in the original paper they do the same thing that is done in Emukit, that is, represent the linear MF model as a single GP with a modified prior (page 4, eq. 4), and then comment that the property holds (page 5, top). So I have to assume that the property is still respected in Emukit's implementation, as it follows the paper.
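If it helps, my reading of that "single GP with a modified prior" construction, using the kern_low/kern_err names from your notebook and writing $\rho$ for the scale, is the following (do check it against the paper):

$$\begin{aligned} \mathrm{cov}(f_{low}(x), f_{low}(x^\prime)) &= k_{low}(x, x^\prime) \\ \mathrm{cov}(f_{high}(x), f_{low}(x^\prime)) &= \rho \, k_{low}(x, x^\prime) \\ \mathrm{cov}(f_{high}(x), f_{high}(x^\prime)) &= \rho^2 \, k_{low}(x, x^\prime) + k_{err}(x, x^\prime) \end{aligned}$$

so one covariance matrix, and hence one likelihood, covers both fidelities at once.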

Overall, I am afraid this conversation pushes way beyond my understanding of MF modelling, so I probably won't be of much assistance if you have further concerns.

Hey, I came across this discussion and thought I would share my 2 cents on it.

I believe the Markov property is always respected, otherwise you could not define the autoregressive model proposed by O'Hagan. Notice that in the text of the paper, between equation 1 (the Markov property) and equation 2 (the AR model), it is clearly stated that the assumption in 1 implies the model in 2.
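In the notation used above in this thread (with $\rho$ the scale and $\delta$ the error GP, independent of $f_{low}$), that AR model reads:

$$f_{high}(x) = \rho \, f_{low}(x) + \delta(x)$$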

I also had the same question about the emukit implementation, but I think it is again a matter of interpretation of the paper. My personal interpretation is that the equations derived in section 2.4 (and the independence condition that splits the hyperparameters) hold under the assumption that the design points are nested.


I have tried emukit with design points that are not nested, so I assume the implementation is general and solves the likelihood optimization for all the parameters at the same time. Still, in my personal interpretation, this does not break the Markov property.

Also, maybe this paper will be useful for you: https://arxiv.org/pdf/1210.0686.pdf
The author develops a recursive estimation of the parameters, assuming nested samples.

Hi,

Sorry for the late reply, I have been busy recently.

Thank you @apaleyes for your answers, and thank you @GranjalCruz for joining the conversation and providing your useful insights!

Indeed, after thinking about your messages, I believe that the Markov property is always automatically respected in the posterior covariance matrix of the model provided.

Still, the optimization of the hyperparameters makes me wonder. All the articles I have read that use the multi-fidelity model of Kennedy and O'Hagan follow the two-step process of first building the low-fidelity model and then building the model of the error function. In chapter 8 of "Engineering Design via Surrogate Modelling" by Forrester et al. (2008), they also follow this two-step process, and when the low-fidelity function has not been evaluated at the high-fidelity sample locations, they instead take the prediction of the low-fidelity model there in order to build the bridge function.
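In code terms, reusing the names from the two-step sketch in my earlier message (model_low is the cheap GP and rho the scale), that fallback is just:

```python
# x_high not nested in x_low: use the cheap model's posterior mean as a
# stand-in for the missing low-fidelity outputs (Forrester et al., ch. 8)
y_low_at_high, _ = model_low.predict(x_high)
d = y_high - rho * y_low_at_high  # discrepancy used to fit the bridge function
```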

Also, I have experienced what I believe are some strange behaviours of the current implementation of the linear multi-fidelity model regarding the optimization of the hyperparameters. I am attaching a zip file with a notebook containing some tests that I ran. In paragraph 2.1 of this notebook, I ran the current implementation of the linear multi-fidelity model with 6 low-fidelity and 3 high-fidelity samples. Note that around the evaluated points, the variance and the mean have a rough transition. In paragraph 2.2, I added 2 high-fidelity samples. In that case, the variance of the multi-fidelity model completely disappears. Note also that the hyperparameters of the cheap kernel changed when the high-fidelity samples were added, so I assume that all the hyperparameters are optimized in a single step in emukit. In chapter 8 of Forrester et al., they mention: "our cheap data is considered to be independent of the expensive data". If the hyperparameters of the cheap kernel change when we add high-fidelity samples, I believe that the cheap data is not treated as independent of the expensive data. I also suspect that the independence between the low-fidelity model and the model of the error function is no longer maintained when the hyperparameters are optimized in a single step (even when the high-fidelity samples are a subset of the low-fidelity samples, as in the original paper of Kennedy and O'Hagan). Also, for some runs, I observed significant noise on the high-fidelity samples even though the noise is fixed to 0.

In paragraphs 3.1 and 3.2, I ran the same experiments but with the hyperparameters optimized successively. The curves are different from those of paragraphs 2.1 and 2.2: the mean and variance transitions seem smoother and the results are more reproducible.

So that makes me wonder whether optimizing the hyperparameters in a single step is the right way to do it. The optimization process seems more "unstable", in the sense that even with several optimization restarts it can lead to different results. The optimization of the hyperparameters also becomes more difficult as the domain size increases (especially when the objective function is defined over a design space of several dimensions and ARD is used for the kernels), and the process takes longer than optimizing the hyperparameters successively. I am also wondering if that may be the reason for the overfitting mentioned in #234: if the hyperparameters are optimized in a single step, the hyperparameters of the cheap kernel will try to fit the whole data set (as paragraph 2.2 of this notebook seems to show).

Kind regards,
Anthony Larroque
hp_mf_2.zip

Hey, I took a quick look at the script. I think your issue from 2.1 to 2.2 is the fact that the ylim of the plot is on a totally different scale, since the 2 added samples boost the uncertainty at low x locations.

The only good way to compare 3.1 to 2.1 is to pass the same dataset, which from a quick look at the plots does not seem to be the case. The model in 3.2 is the best because the new samples are 'optimal' with respect to some acquisition function, so it makes sense that it yields the best model of all the examples.
I suggest you compare the models on the same data; there might be a small difference from fixing the likelihood (or not, in this simple case), but it should provide a better comparison.

Unfortunately, I cannot conclude from these examples whether optimizing all the hyperparameters together is better or worse (or correct/incorrect).

P.S.: I am not using emukit for multi-fidelity optimization but for multi-fidelity modelling (my application is different from the classic GPyOpt framework).

Hi @GranjalCruz ,

Thank you for your answer !

"The only good way to compare 3.1 to 2.1 is if you pass the same dataset, which from a quick look of the plots seems not to be the case."

It is the case: the datasets of 2.1 and 3.1 are the same. But sorry, the structure of my notebook did not make the comparison easy. I am attaching the same notebook restructured so that it is easier to compare the results between optimizing the hyperparameters in a single step and optimizing them successively. In this new notebook, the plot of section 2.3.1 should be compared with the one of section 2.3.2, and the plot of section 3.3.1 with the one of 3.3.2.

"The model of 3.2 is the best since the new samples are 'optimal' based on some acquisition function, therefore it makes sense that it outputs the best model of all the examples."

I am not using any acquisition function in that notebook. I just study two cases: one with 3 high-fidelity samples (section 2 in the new notebook) and another with 5 high-fidelity samples (section 3 in the new notebook). For each case, I then compare optimizing the hyperparameters in one step with optimizing them successively.

"P.S: I am not using emukit for multi-fidelity optimization but for multi-fidelity modelling (application case is different than classic GPyOpt framework)"

Even though I am ultimately interested in multi-fidelity optimization, all the notebooks that I sent deal only with multi-fidelity modelling. When I talk about optimization here, for the moment I just mean the optimization of the hyperparameters so that the model fits the data as well as possible according to the marginal likelihood.

Regards,
Anthony Larroque

hp_mf_3.zip