banditml/offline-policy-evaluation

Add UCB bandit using MDNs (mixture density networks)

econti opened this issue · 3 comments

I think we can get a big performance gain by doing exploration via mixture density networks (MDNs).

We train an MDN to output the mu and sigma of our reward distribution, then use that distribution to do upper confidence bound (UCB) exploration.

MDNs tutorial: https://engineering.taboola.com/predicting-probability-distributions/
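
Roughly what I have in mind, as a hedged sketch (PyTorch; `GaussianRewardHead`, `gaussian_nll`, and `ucb_score` are illustrative names, not existing banditml APIs):

```python
# Sketch: a single-Gaussian density head that predicts mu and sigma of the
# reward for a (context, action) feature vector, plus a UCB score.
import torch
import torch.nn as nn


class GaussianRewardHead(nn.Module):
    """Predicts mean and std of the reward distribution."""

    def __init__(self, in_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, 1)
        self.log_sigma = nn.Linear(hidden_dim, 1)  # predict log-std so sigma > 0

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), torch.exp(self.log_sigma(h))


def gaussian_nll(mu, sigma, reward):
    """Negative log-likelihood of observed rewards under N(mu, sigma^2)."""
    dist = torch.distributions.Normal(mu, sigma)
    return -dist.log_prob(reward).mean()


def ucb_score(mu, sigma, alpha: float = 1.0):
    """UCB: rank actions by mu + alpha * sigma, pick the argmax."""
    return mu + alpha * sigma
```

Training would minimize `gaussian_nll` on logged (context, action, reward) tuples; at decision time we'd score each candidate action with `ucb_score` and take the argmax.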

@cyrilou242 what do you think?

Hello, looks like a great idea!

Some questions:

  • Are you thinking about a 1-Gaussian mixture? (Not really a mixture.) I'm not sure the reward noise distribution would necessarily follow a Gaussian. UCB would indeed be easy and fast afterwards.
  • If we decide we don't want to assume a normal distribution for the reward noise, going to an n-Gaussian mixture is easy, but I'm not sure it's easy to get a confidence interval without sampling/approximating; I'll have a look (see the note below this list).
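
For reference, the standard mixture identities: with weights $\pi_i$, component means $\mu_i$, and component stds $\sigma_i$, the mixture mean and variance are closed form,

$$
\mu = \sum_{i=1}^{n} \pi_i \mu_i, \qquad
\sigma^2 = \sum_{i=1}^{n} \pi_i \left(\sigma_i^2 + \mu_i^2\right) - \mu^2,
$$

so a $\mu + \alpha\sigma$ style UCB stays cheap even for $n > 1$; exact quantiles (true confidence intervals), though, generally require sampling or numerically inverting the mixture CDF.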

What are your thoughts on these points?

EDIT: I messed up badly in my first comment, confusing the noise and reward distributions; fixed now. Conclusion: let's try a single normal distribution.

Good points.

I think you're right regarding the reward noise distribution. We don't really know how it's distributed, and it's likely problem-specific.

My thinking is we model it as a 1-Gaussian mixture for now; if that doesn't work well for a specific problem, perhaps dropout at inference time (although more computationally expensive) is a better fit.
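
A rough sketch of that dropout-at-inference alternative (MC dropout), assuming a PyTorch model that uses `nn.Dropout` and returns a single tensor; `mc_dropout_mu_sigma` is just an illustrative name:

```python
import torch


def mc_dropout_mu_sigma(model, x, n_samples: int = 20):
    """Estimate mu/sigma via MC dropout: keep dropout active at inference
    and take the empirical mean/std over several stochastic forward passes.
    Assumes `model(x)` returns a single reward-prediction tensor."""
    model.train()  # .train() keeps nn.Dropout layers stochastic
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)
```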

I'll try to add this in the next week unless you have time before then to give it a shot.

Not sure I'll find the time before next week.