vwxyzjn/lm-human-preference-details

Reward Shape

QiyaoWei opened this issue · 1 comment

Dear authors,

Many thanks for creating this wonderful repo! The blog post that accompanies this repo remarks that the model "produced rewards and values of shape (B, T, 1)". However, I recall that PPO in RLHF only uses the reward at the final token. Is there a contradiction here? Could you kindly point out which line of code resolves this question? Thanks!

We do take the reward from the last token. The backbone produces per-token latents, but the scalar head is applied only to the latent of the last token:

reward = self.scalar_head(last_reward_latents)
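To make the shape bookkeeping concrete, here is a minimal sketch (not the repo's actual module; the class name `RewardHead` and the `hidden_size` argument are assumptions for illustration) of how a scalar head over the last token's latent turns per-token hidden states of shape (B, T, H) into a single reward per sequence:

```python
import torch
import torch.nn as nn


class RewardHead(nn.Module):
    """Hypothetical sketch: scalar head applied to the last token's latent."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.scalar_head = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (B, T, H) per-token latents from the transformer backbone.
        # Applying the head to every position would give rewards of shape (B, T, 1),
        # but for PPO in RLHF only the reward at the final token is used.
        last_reward_latents = hidden_states[:, -1, :]    # (B, H)
        reward = self.scalar_head(last_reward_latents)   # (B, 1)
        return reward
```

So there is no contradiction: the (B, T, 1) shape describes what the head could produce over all positions, while the line quoted above selects only the last-token latent before computing the scalar reward.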