mrahtz/learning-from-human-preferences

Adjusting softmax function

jakkarn opened this issue · 4 comments

The preference for each pair of video clips is calculated based on a softmax over the predicted latent reward values for each clip. In the paper, "Rather than applying a softmax directly...we assume there is a 10% chance that the human responds uniformly at random. Conceptually this adjustment is needed because human raters have a constant probability of making an error, which doesn’t decay to 0 as the difference in reward difference becomes extreme." I wasn't sure how to implement this - at least, I couldn't see a way to implement it that would actually affect the gradients - so we just do the softmax directly.
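For reference, the plain (unadjusted) preference softmax described above can be sketched in a few lines. This is only an illustration: `preference_probs` is a hypothetical helper, not something from this repo, and it assumes the per-step predicted rewards of each clip are summed before the softmax is applied.

```python
import math

def preference_probs(rewards1, rewards2):
    """Softmax over the summed predicted rewards of two clips.

    Hypothetical helper for illustration; assumes each argument is a list of
    per-step predicted rewards for one clip.
    """
    s1 = sum(rewards1)
    s2 = sum(rewards2)
    # Subtract the max before exponentiating for numerical stability.
    m = max(s1, s2)
    e1 = math.exp(s1 - m)
    e2 = math.exp(s2 - m)
    z = e1 + e2
    return e1 / z, e2 / z

# Clip 1 has the higher total predicted reward, so p1 comes out above 0.5.
p1, p2 = preference_probs([0.2, 0.4, 0.1], [0.1, 0.0, 0.1])
```

With equal total rewards this returns 0.5 for both clips, and as the gap between the totals grows, one probability approaches 1 and the other approaches 0.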

After talking about this with a friend who knows more about statistics, my understanding is this:

They adjust the predictor model's probability p1 before using it in the loss function, assuming that:

  • 90% of the human's decisions are rational
  • 10% of the decisions are random or simply mistaken

And the adjusted probability should then be:

p2 = 0.9*p1 + 0.1*0.5 = 0.9*p1 + 0.05
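As a minimal sketch, the formula above amounts to mixing the model's probability with a coin flip. The name `adjust_preference` and the `epsilon` parameter are made up here for illustration.

```python
def adjust_preference(p1, epsilon=0.1):
    """Mix the model's preference probability with a uniform random response.

    With probability (1 - epsilon) the human is assumed to follow the model's
    probability p1; with probability epsilon they pick either clip at random.
    """
    return (1.0 - epsilon) * p1 + epsilon * 0.5

# The adjusted probability stays inside [0.05, 0.95] for epsilon = 0.1.
lowest = adjust_preference(0.0)
highest = adjust_preference(1.0)
```

So even when p1 is exactly 0 or 1, p2 never reaches 0 or 1, and p1 = 0.5 is left unchanged.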

I'm not completely sure this is the correct way, but I think so. Great work implementing this btw!

Thanks for commenting!

p2 = 0.9*p1 + 0.1*0.5 = 0.9*p1 + 0.05

I think my initial intuition for how to implement it was something similar, but since in this case the cross-entropy loss is only being applied over p2 and 1 - p2, the loss is either p2 or 1 - p2 - so the extra 0.05 wouldn't affect the gradients, would it? (I guess it'll make some difference at inference time, but I don't think it'll affect training, will it?)

(I was about to post this reply, then thought "Wait, what about the softmax?" - but the softmax has already been applied in order to get to p1 in the first place.)
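One way to probe the gradient question numerically is to compare the derivative of the loss term with respect to p1, with and without the adjustment. This is only a sketch under stated assumptions: it looks at a single term, -log(p), for the case where clip 1 was preferred, and every name in it is hypothetical.

```python
import math

EPSILON = 0.1  # assumed probability of a uniformly random human response

def loss_raw(p1):
    # Loss term when clip 1 was preferred and no adjustment is applied.
    return -math.log(p1)

def loss_adjusted(p1, epsilon=EPSILON):
    # Same loss term, computed on the adjusted probability p2.
    p2 = (1.0 - epsilon) * p1 + epsilon * 0.5
    return -math.log(p2)

def numerical_grad(f, x, h=1e-6):
    # Central finite difference approximation of df/dx.
    return (f(x + h) - f(x - h)) / (2.0 * h)

for p1 in (0.5, 0.1, 0.01):
    print(p1, numerical_grad(loss_raw, p1), numerical_grad(loss_adjusted, p1))
```

Analytically, the two derivatives are -1/p1 and -(1 - epsilon)/p2, so they share a sign but differ in size, most visibly when p1 is close to 0. Whether that difference matters for training in practice is not something this sketch settles.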

Hm... There is this mention in the article:

Conceptually this adjustment is needed because human raters have a constant probability of making an error, which doesn’t decay to 0 as the difference in reward difference becomes extreme.

So I'm wondering: does this have a significant impact on each entry's loss term, u(1)log(p1) + u(2)log(1-p1), when the network is more certain of the rewards? If even one entry's estimate decays to zero (or close to it), that could have a huge impact on the resulting sum (the cross-entropy loss), since the logarithm tends to negative infinity there. Couldn't it?
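To make the concern concrete, here is a small sketch comparing the per-entry loss with and without the 10% adjustment when the predictor is very confident and the human label disagrees. The names `cross_entropy` and `adjust` and the example value of p1 are assumptions for illustration, not taken from the paper's code.

```python
import math

def cross_entropy(p, mu1, mu2):
    # Per-entry loss: -(mu1 * log(p) + mu2 * log(1 - p)),
    # where mu1 and mu2 are the human's preference labels for clips 1 and 2.
    return -(mu1 * math.log(p) + mu2 * math.log(1.0 - p))

def adjust(p1, epsilon=0.1):
    # Assume the human answers uniformly at random with probability epsilon.
    return (1.0 - epsilon) * p1 + epsilon * 0.5

# The model is almost certain clip 2 is better, but the human chose clip 1.
p1 = 1e-9
raw_loss = cross_entropy(p1, 1.0, 0.0)               # grows without bound as p1 -> 0
adjusted_loss = cross_entropy(adjust(p1), 1.0, 0.0)  # capped near -log(0.05)
```

Under these assumptions the adjusted loss for a single entry can never exceed -log(0.05), which is about 3.0, while the unadjusted loss keeps growing as p1 approaches 0.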

Also, the article actually reads "rather than applying a softmax directly...". Does that imply that they adjust it before the softmax? I couldn't get that to make sense in my mind, so that's why I assumed it was like we both seem to have thought.

Sooooo I asked Jan Leike nicely and it turns out he still had a copy of the original code for the paper lying around :)

Searching for '0.1' and '0.9' in the codebase, the only relevant thing I could find is:

def __init__(self, *args, epsilon=0.1, **kwargs):
    self.epsilon = epsilon
...

def p_preferred(self, obs1=None, act1=None, obs2=None, act2=None):
    reward = [prediction1, prediction2]
    mean_rewards = tf.reduce_mean(tf.stack(reward, axis=1), axis=2)
    p_preferred_raw = tf.nn.softmax(mean_rewards)
    return (1 - self.epsilon) * p_preferred_raw + self.epsilon * 0.5

This is the approach you originally suggested, which, yeah, I'm pretty sure doesn't affect gradients...so either a) this just isn't necessary and they never found out because they didn't do an ablation, or b) it only makes a difference at inference time in situations where, as you say, the reward predictor is particularly sure. (Though...considering that the most important inference function is predicted reward, rather than predicted preference, I lean towards a). shrug Everyone makes mistakes. Or it serves some completely different purpose that's not occurring to me right now...)

[image: case_closed]

Okay! Nice that you took the time. I'm playing around with this a bit now, so I'll leave a comment here if I find anything substantial related to this.

I love Sherlock btw. Made me laugh when I saw your meme.