Selective noise inference for some observations in field experiment
nwrim opened this issue · 16 comments
Hi! We are a group of social scientists trying to use Bayesian optimization in our experiments. We are running the optimization in a full field experiment, which means there are cases where we can only obtain a very small number of observations for certain parameterizations. As a result, we cannot get a good estimate of the SEM (bootstrapping naturally gives SEM=0 when n=1). The data we feed into the Ax experiment might therefore look something like this (all numbers are arbitrary):
| | arm_name | metric_name | mean | sem | trial_index | n |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0_0 | score | 6.51 | 0.94 | 0 | 4 |
| 1 | 0_1 | score | 7.33 | 0.55 | 0 | 4 |
| 2 | 0_2 | score | 6.94 | 0.53 | 0 | 3 |
| 3 | 0_3 | score | 9.42 | 0.91 | 0 | 2 |
| 4 | 0_4 | score | 3.91 | 0 | 0 | 1 |
| 5 | 0_5 | score | 2.50 | 0 | 0 | 1 |
We were wondering whether Ax takes into account that an SEM of 0 when n=1 does not mean we are fully confident we have the right value. If it does not, what is the best way to proceed? More generally, what can we do when we are relatively less confident about the observed values for some arms?
We know that we can indicate unknown variance by passing in np.nan, but it looks like we can't do that selectively for only the arms we are not confident about - when we tried, it raised this error:
ValueError: Mix of known and unknown variances indicates valuation function errors. Variances should all be specified, or none should be.
Let us know if anything is unclear, and thank you so much in advance!
So one could technically try to infer a noise level selectively for some observations, but that would require changes pretty deep down in the modeling code and would take some time to implement.
To get you off the ground, would it be reasonable to assume that the observation noise is (approximately) homoskedastic (independent of the treatment)? This may be first-order correct if the variance across subjects dominates the noise. In that case, could you impute the sem as sem_4 = sqrt(n_1 / n_4) * sem_1 (so with n_1 = 4 and n_4 = 1, twice sem_1)?
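For concreteness, a rough sketch of what that imputation could look like, assuming a single pooled per-subject SD (the pooling rule below is a simple root-mean-square of the per-arm SD estimates, just one of several reasonable choices):

```python
import numpy as np
import pandas as pd

# Example data in the same format as the table above.
df = pd.DataFrame({
    "arm_name": ["0_0", "0_1", "0_2", "0_3", "0_4", "0_5"],
    "mean": [6.51, 7.33, 6.94, 9.42, 3.91, 2.50],
    "sem": [0.94, 0.55, 0.53, 0.91, 0.0, 0.0],
    "n": [4, 4, 3, 2, 1, 1],
})

# Under homoskedastic noise, sem_i = sigma / sqrt(n_i), so each arm with n > 1
# gives an estimate of the per-subject SD: sigma_hat_i = sem_i * sqrt(n_i).
multi = df[df["n"] > 1]
sigma_hat = np.sqrt(np.mean((multi["sem"] * np.sqrt(multi["n"])) ** 2))

# Impute the SEM of single-observation arms from the pooled SD estimate.
single = df["n"] == 1
df.loc[single, "sem"] = sigma_hat / np.sqrt(df.loc[single, "n"])
print(df)
```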
Unfortunately, the noise might be heteroskedastic, so the approach you suggested might not work for our case (though I would have to talk with my co-workers). Another question we had was whether it would be reasonable to put in a value for n when we don't supply a value for the sem (if we decide that we would rather not supply the variance for any of the values). Could you give us some insight on this issue too?
Also, another question we had was whether it would be reasonable to put in a value for n when we don't supply a value for the sem
What exactly is the goal of doing this? Would this be to do some kind of importance weighting of the observations? Anything that you pass in that is not either mean or sem will be ignored by our models, so there is no straightforward way to do this (at least currently).
Great, thanks for the answer. The reasoning was that even though we do not input the variance, it seemed sensible to say that we are more confident about estimates based on a larger sample size (n). I think we will have to come up with a clever way to impute the sem. I will leave this open for a few more days so that anybody else with good intuition can chime in. Thanks again!
Hello again! We have been running a small pilot related to this issue and wanted to see if the developers or anyone else has a suggestion/recommendation on a problem we are encountering.
To briefly summarize our situation (although it was mentioned a bit in the first post), we are planning a field experiment in which we test combinations of parameters with human subjects (currently aiming for n=4 per arm). One problem is that, since this is a field experiment involving many human factors, we sometimes end up with parameter values that are similar to the ones we aimed for but not exactly the same, so we get some n=1 or n=2 data for such arms. And sometimes these low-n trials give extreme values that are likely to be measurement error or outliers.
For example, our current pilot data look something like this (not the full data; the sem for the n=1 trials was imputed heuristically, as discussed above):
| | arm_name | metric_name | mean | sem | trial_index | n |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0_0 | score | 6.51 | 0.94 | 0 | 4 |
| 1 | 0_1 | score | 3.37 | 0.56 | 0 | 4 |
| 2 | 0_2 | score | 5.95 | 0.59 | 0 | 3 |
| 3 | 0_3 | score | 3.49 | 0.42 | 0 | 2 |
| 4 | 0_4 | score | 13.91 | 0.99 | 0 | 1 |
| 5 | 0_5 | score | 4.80 | 0.99 | 0 | 1 |
Obviously, since the 0_4 arm performs so well, GPEI generates only one arm (the 0_4 arm) when we call ax.modelbridge.factory.get_GPEI().gen(10). However, we think it is very likely that the extreme value for 0_4 is an outlier or the result of some measurement error. We expect the model will eventually start generating other recommendations if we feed it more data for the 0_4 parameters with a relatively normal mean (although when we gave it mean=5.5 and sem=1 in the next four consecutive trials, it still generated only that one recommendation). Since the field experiment is quite costly (both logistically and monetarily), we wanted to see if there is any way to handle this kind of "outlier" trial a bit better.
Some other options we thought of are:

1. Exclude outliers (or abandon arms) in the model. But the boundary of what counts as an outlier quickly becomes very subjective, and we do not want to lose data that is costly to collect.
2. Delay putting in the data for the outlier arm until we collect more data for that arm. This shares a similar problem with 1): data are costly, so we want to use the full data if possible, and it becomes subjective which arms to delay.
3. "Infer" the mean of n=1 trials, similar to imputing the sem for n=1 trials (maybe averaging it with the values of other arms with similar parameters).
4. Weight the arms in the experiment by n - for example, assigning a weight of 4 to 0_0 and 0_1 and a weight of 1 to 0_4 and 0_5 in the example above. However, we were not sure if this is what the weighting in Ax/BoTorch was designed for.
As you can probably tell from my tone, we are leaning a bit toward 3) or 4), but wanted to hear some thoughts from the experts.
Any recommendations or thoughts would be much appreciated! Let us know if anything is unclear. We think the responsiveness of the developers on the issue page is awesome, and Ax/BoTorch is a really great platform. Thanks so much in advance!
cc @Balandat
Curious to hear @Balandat's thoughts, but a few questions in the meantime just so I get a better understanding of your problem:
- What does your search space look like? When you say "the GPEI generates only 1 arm (the 0_4 arm) when we call ax.modelbridge.factory.get_GPEI().gen(10)", that indicates to me that it's a discrete search space?
- Are there actually only 6 arms in the experiment, or was that just a snapshot above?
- "although when we gave it mean=5.5 and sem=1 in the next four consecutive trials which still generated only one recommendation" -- this is strange to me, which is why I'd like to get a better understanding of what the trial looks like.
- re: #4 -- Yea, unfortunately that's not what the weighting's for -- it's to indicate the size of the population that the arm should be exposed to. See our documentation re: BatchTrial here for more of an explanation!
Great! Thanks for asking the questions.
- Our search space is a combination of integer variables and ordered categorical variables, so I think it is a discrete search space (and the arms get merged together after rounding).
- We have 10 arms, but I just posted 6 (the ones we intended to run, plus the by-products that caused n to be low).
- We actually have not run any trials in the field yet, so "feeding it four consecutive trials" means that we added a BatchTrial with one arm using the 0_4 parameters and fed it mean=5.5, sem=1 four times (so we would have 10 arms in trial 0, and 1 arm in each of trials 1, 2, 3, and 4). This was surprising to us because there were arms with means near or above 6, and we thought that seeing 4 more trials with mean=5.5 would be enough to make the model explore some other part of the space.
- Thanks for the clarification! The reason I brought up weights was that the documentation says "for field experiments the weights could describe the fraction of the total experiment population assigned to the different treatment arms", and I thought we could treat n as how many cases we can assign to that parameterization.
Let me know if you have any other questions!
The most reasonable approach in my mind would be to reflect the fact that you're very uncertain about the value of some arms by inflating the sem that you are passing in. The model will automatically give less credence to observations with high noise levels and so this would essentially be very similar to weighting certain arms, but could be achieved without changing the models to take in weights for the observations. Do you have ways of estimating the variance in the observation, e.g. from similar experiments? Also, how reasonable is it to assume that in the small sample regime the observation noise can be reasonably approximated by Gaussian noise? If that's not the case then one might want to take a look at other likelihoods that might be more appropriate for this setting.
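To make this concrete, here is a minimal sketch of inflating the sem for low-n arms before handing the data over; the inflation factor below is an arbitrary placeholder, not a recommendation:

```python
import pandas as pd

# Observations in the same column format used in the tables above.
df = pd.DataFrame({
    "arm_name": ["0_0", "0_4"],
    "metric_name": ["score", "score"],
    "mean": [6.51, 13.91],
    "sem": [0.94, 0.99],
    "trial_index": [0, 0],
    "n": [4, 1],
})

# Inflate the SEM for low-n arms so the model gives them less credence.
# The factor of 3 is a placeholder -- choose it based on how much you
# distrust single observations (e.g. from pilot data or domain knowledge).
INFLATION_FACTOR = 3.0
low_n = df["n"] == 1
df.loc[low_n, "sem"] *= INFLATION_FACTOR

# The resulting dataframe can then be wrapped in an Ax Data object and
# attached to the experiment as usual.
print(df)
```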
Additionally, I'll just add that if you're suspicious of the behavior you're seeing and want us to investigate further, feel free to send a reproducible example (you can anonymize the search space / data however you like).
Thank you @Balandat! I agree that inflating the SEM would be the best way to proceed without modifying the model. I don't think there are similar experiments we can refer to (ours is relatively new), but we will definitely try to search for some or try other methods to estimate the variance. Given the novelty of the experiment, we are not confident that the noise can be reasonably approximated by Gaussian noise - we chose it because we thought it was a conservative option. Are there other off-the-shelf models in Ax/BoTorch that do not use Gaussian noise that we could look into?
Thank you @ldworkin! I remember the team testing out a reproducible example I sent on another issue, and that was really helpful. Thank you for the suggestion, and thank you again for all the hard work.
Are there other off-the-shelf models in Ax/BoTorch that do not use Gaussian noise that we could look into?
We don't have any such off-the-shelf models currently. One challenge is that inference is generally not closed-form anymore, so one has to resort to approximate or variational methods. If the application requires it, that might be worth it, but it would take some work to get this working in Ax. The first step would be to get an idea of what the noise characteristics actually are.
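For reference, outside of Ax this kind of non-Gaussian-noise model would typically look something like a variational GP in GPyTorch, e.g. with a Student-t likelihood for heavy-tailed noise. A rough sketch on toy data (not something Ax sets up for you out of the box):

```python
import torch
import gpytorch

# Toy 1D data with heavy-tailed (non-Gaussian) noise.
train_x = torch.linspace(0, 1, 20).unsqueeze(-1)
noise = torch.distributions.StudentT(df=3.0).sample((20,))
train_y = torch.sin(6.0 * train_x).squeeze(-1) + 0.2 * noise


class VariationalGP(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points):
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0)
        )
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution, learn_inducing_locations=True
        )
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )


model = VariationalGP(inducing_points=train_x[:10])
likelihood = gpytorch.likelihoods.StudentTLikelihood()  # heavier tails than Gaussian
mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=train_y.numel())

# Fit the variational approximation by maximizing the ELBO.
optimizer = torch.optim.Adam(
    list(model.parameters()) + list(likelihood.parameters()), lr=0.05
)
model.train()
likelihood.train()
for _ in range(200):
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()
```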
We will now be tracking wishlist items / feature requests in a master issue for improved visibility: #566. Of course please feel free to still open new feature requests issues; we'll take care of thinking them through and adding them to the master issue.
@Balandat (or anyone else on the team!) I just had a quick sanity check/follow-up on your statement
The most reasonable approach in my mind would be to reflect the fact that you're very uncertain about the value of some arms by inflating the sem that you are passing in.
in your response above - my colleague pointed out that, since the SEM is calculated as the estimated SD divided by the square root of the number of observations, this inflation is already built in whenever the uncertainty largely comes from the low number of observations (i.e., if the number of observations is low, the SEM will be large by definition). I think this makes sense, but wanted to see if you have any wisdom on this issue. Thanks so much in advance!
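As a quick numeric sanity check of that point (the SD of 2.0 is just an arbitrary illustrative value):

```python
import math

SD = 2.0  # hypothetical per-subject standard deviation
for n in (1, 2, 4):
    print(n, round(SD / math.sqrt(n), 2))
# n=1 -> 2.0, n=2 -> 1.41, n=4 -> 1.0: the single-observation arm already
# carries twice the SEM of an n=4 arm before any additional inflation.
```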
@nwrim That indeed makes sense and inasmuch as your setup does this it should do the right thing already.
I don't have much additional wisdom to dispense, only that if you're trying to estimate a SEM from a very small number of observations then your error will likely not be Gaussian. So technically you're going to be violating some of the modeling assumptions, but from a practical perspective you're probably going to be fine (at least from an optimization perspective, but maybe be careful not to trust the model too much).