facebook/Ax

[Feature Request] Ability to specify input uncertainties (uncertainties of measured parameters)

sgbaird opened this issue · 13 comments

For example, you do some wet-lab synthesis, and your mass scale has some uncertainty in the weight measurements, so the measured parameters (the fractional prevalence of each component) have some uncertainty associated with them.

In its current state, is Ax capable of incorporating this type of uncertainty? It's ok if the answer is no, I just wanted to know.

Similar to error-in-variables analysis.

@sgbaird, I might be misunderstanding the question, but ax_client.complete_trial lets you specify both mean and SEM for each metric you log the data for. In the evaluation function we use here: https://ax.dev/tutorials/gpei_hartmann_service.html#3.-Define-how-to-evaluate-trials, we manually set the SEM to 0.0, to indicate that we know the function to be noiseless (as it's synthetic). You could instead pass a non-zero value for the SEM (or None to indicate that you know the function to be noisy, but would like Ax to estimate the noise for you).
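
As a minimal sketch of the data format described above: the helper below (hypothetical name `summarize_replicates`) reduces repeated measurements of one metric to a `(mean, SEM)` pair, which is the shape of the per-metric values in the linked tutorial. The metric name `"yield"` and the measurement values are made up for illustration, and the `ax_client.complete_trial` call appears only as a comment since it needs a configured client.

```python
import math
import statistics


def summarize_replicates(values):
    """Return (mean, SEM) for repeated measurements of one metric.

    With a single measurement the SEM cannot be estimated from the data,
    so None is returned, which asks Ax to estimate the noise itself.
    """
    mean = statistics.fmean(values)
    if len(values) < 2:
        return mean, None
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


# Made-up replicate measurements for a hypothetical metric named "yield".
raw_data = {"yield": summarize_replicates([0.71, 0.68, 0.74])}

# With a configured AxClient this would be passed along as:
# ax_client.complete_trial(trial_index=trial_index, raw_data=raw_data)
```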

I understand that uncertainty in outcomes is not exactly the same as uncertainty in what parameter value was actually given to the evaluation. I think it should be okay to just use the outcome uncertainty for this, but would be curious to hear what @Balandat thinks. Also cc @eytan, @qingfeng10, @dme65

@lena-kashtelyan, that's a good point about incorporating the parameter uncertainty into the outcome uncertainty. I'm sure there's a reasonable way to do that (i.e. uncertainty propagation). After another look at ax_client.attach_trial, it looks like specifying parameter uncertainty isn't built-in, at least.

An Idea for Propagating Parameter Uncertainty to Outcomes

One way might be to use a sampler (e.g. Gibbs sampler) in the parameter space around the parameter set and determine a "mean effect" or something similar on the outcome, then incorporate that into the outcome uncertainty. Maybe this would just be an addition to the original uncertainty (sem = outcome_sigma_propagated + outcome_sigma). That said, given my background, implementing it would be non-trivial for me: I struggle with the theory and implementation of samplers (this was mostly in MATLAB at the time). Interestingly, something along these lines was suggested to me a while ago (see also discussion of this SE post below).
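
A sampler-free way to sketch this idea is plain Monte Carlo propagation: perturb the nominal parameters according to their measurement uncertainty, push each draw through a model of the outcome, and take the spread of the results as the propagated uncertainty. The sketch below assumes independent Gaussian parameter errors and a cheap stand-in function `f` (in a real campaign this would have to be a surrogate model, since the experiment itself cannot be re-run thousands of times). It combines the two contributions in quadrature rather than by direct addition, on the assumption that they are independent. All names are illustrative, and `sem` is named only to match the expression above.

```python
import math
import random
import statistics


def propagate_input_sigma(f, x, x_sigma, n_samples=20000, seed=0):
    """Estimate the outcome standard deviation caused by input noise alone.

    f        : callable taking a list of floats (stand-in for a surrogate model)
    x        : nominal parameter values
    x_sigma  : standard deviation of each parameter (assumed independent Gaussian)
    """
    rng = random.Random(seed)
    outcomes = [
        f([rng.gauss(xi, si) for xi, si in zip(x, x_sigma)])
        for _ in range(n_samples)
    ]
    return statistics.stdev(outcomes)


def combine_sigmas(outcome_sigma, outcome_sigma_propagated):
    """Combine independent uncertainty contributions in quadrature."""
    return math.hypot(outcome_sigma, outcome_sigma_propagated)


def f(x):
    # Hypothetical linear response, chosen so the exact answer is known:
    # sigma_out = sqrt((2 * 0.1)**2 + (3 * 0.2)**2) = sqrt(0.40)
    return 2.0 * x[0] + 3.0 * x[1]


outcome_sigma_propagated = propagate_input_sigma(f, [1.0, 2.0], [0.1, 0.2])
sem = combine_sigmas(0.5, outcome_sigma_propagated)
```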

Some More Context

This is something that has been on my mind for a while (see my SE post). Back then, someone mentioned Stan, but it wasn't straightforward enough for me to justify the time to keep troubleshooting initial attempts. Now that I'm using Ax, I remembered this and thought it would be worth bringing up.

If it becomes a higher priority for me in the future, I'll probably try to flesh out example implementation(s) based on any other feedback in this thread. The other question here is whether incorporating parameter uncertainty actually improves the outcome (e.g. based on Bayesian optimization benchmark schemes such as this materials science example).

EDIT: Another REF on propagating input uncertainty to outputs (just discussion, no simple code examples)

it looks like specifying parameter uncertainty isn't built-in, at least

We don't have a way to express parameter uncertainty explicitly, unfortunately. I think some form of incorporating that uncertainty into outcome uncertainty should make sense though.

eytan commented

cc @sdaulton , @saitcakmak who have been working on optimization under input uncertainty. This BoTorch tutorial considers value at risk (which is more of a robust design thing, like, you want to achieve at least some value of an outcome at least 90% of the time), but one can also use the mean as the risk statistic. https://botorch.org/tutorials/risk_averse_bo_with_input_perturbations
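
To make the difference between the two risk statistics concrete, here is a sketch in plain Python (it does not use the BoTorch API from the tutorial). It assumes a maximization problem, Gaussian input perturbations with a known standard deviation, and a hypothetical one-dimensional objective with a tall narrow peak and a lower broad peak. `value_at_risk` returns a value that the outcome meets or exceeds in at least a fraction `level` of the sampled perturbations.

```python
import math
import random
import statistics


def objective(x):
    # Hypothetical objective: narrow peak of height 1.0 at x = 0.2,
    # broad peak of height 0.8 at x = 0.7.
    narrow = 1.0 * math.exp(-((x - 0.2) ** 2) / (2 * 0.01 ** 2))
    broad = 0.8 * math.exp(-((x - 0.7) ** 2) / (2 * 0.15 ** 2))
    return narrow + broad


def perturbed_outcomes(f, x, x_sigma, n_samples=2000, seed=0):
    """Outcomes of f when the implemented input is x plus Gaussian noise."""
    rng = random.Random(seed)
    return [f(rng.gauss(x, x_sigma)) for _ in range(n_samples)]


def value_at_risk(samples, level=0.9):
    """Value met or exceeded by at least `level` of the samples."""
    ordered = sorted(samples)
    k = int((1.0 - level) * len(ordered))
    return ordered[k]


x_sigma = 0.05  # assumed input uncertainty
results = {}
for x in (0.2, 0.7):
    outcomes = perturbed_outcomes(objective, x, x_sigma)
    results[x] = {
        "nominal": objective(x),
        "mean": statistics.fmean(outcomes),
        "var": value_at_risk(outcomes, level=0.9),
    }
```

Under these assumptions the nominal value favours the narrow peak, while both the mean and the value-at-risk statistic favour the broad one, which is the kind of "robust" candidate the risk-averse approach is after.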

I don't think there is currently a standardized way to surface this in Ax, but maybe it's possible to do via Modular BoTorch?

eytan commented

Re: the stochastic inputs: I agree that if the noise from repeat measurements is greater than the noise from the measurement error, might be best to ignore that (and instead just infer the noise). @Balandat and @Ryan-Rhys have been working on heteroskedastic noise GP models that infer the noise, and (should hopefully?) get less tripped up by repeat observations at the same design point or design points that are close to one another. There are also other models like the one implemented in https://www.jstatsoft.org/article/view/v098i13 which efficiently deal w/ replicated points.

Interesting problem setting. It reminds me of this paper. They study the setting where there's noise in the query location, i.e. the observation is y = f(x-tilde) + \epsilon where x-tilde is only known up to some distributional info, such as x-tilde ~ Normal(x, sigma(x)). Since you don't know the query location but know the distribution, their proposal is to fit a GP using the parameters of the distribution as the input. So, instead of using train_x = [x-tilde-1, x-tilde-2, ....], you'd use train_x = [(x-1, sigma(x-1)), (x-2, sigma(x-2)), ...] etc.
I haven't tried this type of model in BoTorch (or Ax), but I think it should be rather straightforward to do so. We already have the risk averse methods (that Eytan linked to above) implemented, which assume you observe (x, y) but end up implementing x + perturbation, so it shouldn't be too hard to combine the two ideas and see how it does.

Hi @sgbaird,

We originally designed this method:

[1] https://iopscience.iop.org/article/10.1088/2632-2153/ac298c

for materials synthesis experiments (solar cells) where the outcome uncertainty was expected to vary as a function of the input space (material composition space in our case). I guess this setting is quite similar to your wet lab example in so far as a material is comprised of say 3 components and there may be some error in the fractional component prevalence (e.g. 20% component A, 30% component B and 50% component C) caused by the mass scale (uncertain inputs).
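
As an illustration of how scale error turns into composition error in that example, the sketch below simulates weighing three components, assuming independent Gaussian scale error with the same standard deviation for every weighing (the 10 g batch and the 0.05 g figure are made up). Because the fractions are renormalised to sum to one in every draw, their errors are coupled rather than independent.

```python
import random
import statistics


def simulate_fractions(masses, scale_sigma, n_samples=20000, seed=0):
    """Monte Carlo estimate of the mean and std of each mass fraction.

    masses      : nominal weighed masses of each component (same units)
    scale_sigma : standard deviation of the scale error per weighing
                  (assumed independent and Gaussian)
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_samples):
        noisy = [rng.gauss(m, scale_sigma) for m in masses]
        total = sum(noisy)
        draws.append([m / total for m in noisy])
    columns = list(zip(*draws))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.stdev(c) for c in columns]
    return means, stds


# Made-up target: 20% A, 30% B, 50% C in a 10 g batch, scale sigma of 0.05 g.
means, stds = simulate_fractions([2.0, 3.0, 5.0], scale_sigma=0.05)
```

The resulting `stds` could serve as the per-parameter input uncertainties in any of the approaches discussed in this thread.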

I'm guessing the most appropriate approach would depend, as @eytan mentions, on the relative magnitude of the input uncertainty relative to the outcome uncertainty. In our application, our assumption was that we had a smooth latent function f(x) with large outcome uncertainty and negligible input uncertainty, and so a heteroscedastic GP seemed to be the most appropriate surrogate. In addition to the approach we took in [1], there are also other attractive models depending on the situation:

[2] https://proceedings.neurips.cc/paper/2021/file/8f97d1d7e02158a83ceb2c14ff5372cd-Paper.pdf
[3] http://proceedings.mlr.press/v51/saul16.pdf

The method in [2] relies on repeated sampling at the same input locations and is useful in cases where the cost of taking repeated measurements at any given input x is low whereas [1] and [3] do not rely on repeated sampling but may require more data points in order to accurately model input-dependent outcome uncertainty. I should mention that the VI scheme in [3] is potentially less capricious relative to the approach I took in [1] and so may be preferable in settings where there isn't much prior knowledge available about the black-box.

To summarise, my take agrees with what's previously been said!

  1. If input uncertainty > outcome uncertainty the methods suggested by @saitcakmak and @eytan are probably the most appropriate.
  2. If input uncertainty < outcome uncertainty methods [1,2,3] might be more appropriate.

I think there are a lot of interesting problems in applying BO to accelerate materials synthesis. One of the aspects for material composition experiments such as the one you described is that the design space is a simplex instead of the more standard hypercube although I'm not sure what effect if any this might have on algorithm design!

A slightly more practical aspect of the problem settings we encountered (which may be common to many wet lab experiments) was the need for some kind of offline or multifidelity evaluation of different methods. One of our chemical engineering collaborators in Singapore had a really fast microreactor that would have been ideal for benchmarking different BO approaches before they were used on slightly different but far more expensive black-box reactor functions, but alas we never got to use it because of the pandemic!

In terms of multifidelity evaluation I'm guessing some information about the smoothness of the (wet-lab) black-box might be attainable by proxy using a cheaper but lower fidelity black-box such as density functional theory. I think the deployment of those kinds of multifidelity BO loops in the lab is very exciting!

Best,
Ryan

P.S. I've tried to highlight ML terminology in bold as I always get confused about what x, f(x) and y are in science problems!

@saitcakmak, thanks for your response!

... train_x = [(x-1, sigma(x-1)), (x-2, sigma(x-2)), ...] ...

I could be reading into this too much, but is there a reason (other than visual clarity) that you put the mean and SEM in tuples? Is your suggestion equivalent to the following?

train_x = [x-1, x-2, sigma(x-1), sigma(x-2), ...]

If so, I think I understand, otherwise, I might need some more clarification.

We already have the risk averse methods (that Eytan linked to above) implemented, which assume you observe (x, y) but end up implementing x + perturbation, so it shouldn't be too hard to combine the two ideas and see how it does.

Could you clarify what you mean by "combine the two ideas"? I'm probably missing something, but I was under the impression @eytan 's suggestion and your suggestion would be separate, standalone ways of addressing the problem.

It looks like I was mistaken there. They don't propose to fit the model on the parameters of the distribution (which would be equivalent to the standard GP if the noise is constant throughout), instead the proposal is to use a kernel which is defined over the distribution itself. An example of this for the Gaussian noise model is given at the end of the appendix here. It would be a bit more involved to implement this since it would require you to implement a custom kernel in GPyTorch.
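
For intuition about what a kernel over distributions can look like, one standard closed form is the expectation of a squared-exponential kernel under independent Gaussian inputs. This is offered as an illustrative assumption and is not claimed to be the exact kernel from the linked appendix. The sketch is one-dimensional, plain Python rather than a GPyTorch kernel, and the function names are hypothetical.

```python
import math
import random


def rbf(x1, x2, lengthscale=1.0):
    """Standard squared-exponential kernel between two points."""
    return math.exp(-((x1 - x2) ** 2) / (2.0 * lengthscale ** 2))


def expected_rbf(m1, s1, m2, s2, lengthscale=1.0):
    """E[rbf(a, b)] for independent a ~ N(m1, s1^2) and b ~ N(m2, s2^2).

    Since a - b ~ N(m1 - m2, s1^2 + s2^2), the expectation is
    sqrt(l^2 / (l^2 + s^2)) * exp(-(m1 - m2)^2 / (2 * (l^2 + s^2)))
    with s^2 = s1^2 + s2^2 and l the lengthscale.
    """
    l2 = lengthscale ** 2
    total = l2 + s1 ** 2 + s2 ** 2
    return math.sqrt(l2 / total) * math.exp(-((m1 - m2) ** 2) / (2.0 * total))


# With zero input noise the kernel reduces to the ordinary RBF kernel.
k_point = expected_rbf(0.3, 0.0, 0.8, 0.0, lengthscale=0.5)

# The same pair of locations with an input standard deviation of 0.2 on each.
k_noisy = expected_rbf(0.3, 0.2, 0.8, 0.2, lengthscale=0.5)
```

A GPyTorch version would wrap the same expression in a custom kernel class operating on tensors of means and standard deviations.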

Could you clarify what you mean by "combine the two ideas"?

What I was thinking is, since you have this uncertainty that you model with this fancy new GP, why not take it one step further and use it to make risk averse decisions? You know that the new point you evaluate will be subject to some uncertainty, so you could account for that in deciding where to evaluate next. It is a bit more complicated in this setting since the uncertainty is already built into the model though.

That model will give you the "expected" predictions for any distribution from the given distributional family. Technically, you could consider any given fixed point as a distribution by assuming it is paired with 0 noise (x = Normal(x, 0)). So, you can get point-wise predictions from the model, and use these with the actual (or estimated) noise levels to calculate risk measure predictions. This would be a more conservative alternative to the expected predictions the model readily gives you, and would help in selecting more "robust" candidates to evaluate. It's hard to say how well it would work, but this would be one way to combine the model over distributions with what we've done on the risk averse setting.

Merging this into our "wishlist" master issue, as my sense is that this is not something we will work on in the short-term.

eytan commented

@saitcakmak do we have a task for robust BO in Ax? Even though the approaches we are considering are distinct from the linked literature, it still should provide a solution to the problem, assuming we can specify an error distribution for any given target input measurement. Not sure if this was linked to already, but FYI @sgbaird we are working on integrating more first-class support for optimization of (M)VaR via the specification of input noise distributions ( https://arxiv.org/pdf/2202.07549.pdf ) and highD generalizations.

do we have a task for robust BO in Ax?

What we currently have in the BoTorch tutorials should be exposed in Ax (via developer API & MBM) within the next two weeks. Getting MARS hooked up requires some additional work, so that may take a bit longer. Regarding high-D generalizations, if you were referring to TRBO / MORBO, that's not available in OSS Ax, so that won't be exposed here. But I think we can still support high-D use cases by combining the standard robust BO with SAASBO, which should be as simple as selecting SAASBO as the model in MBM.