vwxyzjn/lm-human-preference-details

Reward Shape

QiyaoWei opened this issue · 1 comment

Dear authors,

Many thanks for creating this wonderful repo! The blog post that accompanies this repo remarks that the model "produced rewards and values of shape (B, T, 1)". However, I recall that PPO in RLHF only uses the reward at the final token. Is there a contradiction here? Could you kindly point out which line of code resolves this question? Thanks!

We do take the reward from the last token. The backbone produces per-token latents, but the scalar head is applied only to the latent of the last token:

reward = self.scalar_head(last_reward_latents)
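To make the shape bookkeeping concrete, here is a minimal sketch (not the repo's actual module; the class name `RewardHead` and the `hidden_size` argument are assumptions for illustration) of how a scalar head over the last token's latent turns per-token hidden states of shape (B, T, H) into a single reward per sequence:

```python
import torch
import torch.nn as nn


class RewardHead(nn.Module):
    """Hypothetical sketch: scalar head applied to the last token's latent."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.scalar_head = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (B, T, H) per-token latents from the transformer backbone.
        # Applying the head to every position would give rewards of shape (B, T, 1),
        # but for PPO in RLHF only the reward at the final token is used.
        last_reward_latents = hidden_states[:, -1, :]    # (B, H)
        reward = self.scalar_head(last_reward_latents)   # (B, 1)
        return reward
```

So there is no contradiction: the (B, T, 1) shape describes what the head could produce over all positions, while the line quoted above selects only the last-token latent before computing the scalar reward.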