Reward Shape
QiyaoWei opened this issue · 1 comment
QiyaoWei commented
Dear authors,
Many thanks for creating this wonderful repo! In the blog post that accompanies this repo, there is a remark saying that "produced rewards and values of shape (B, T, 1)". However, I recall that PPO in RLHF only takes the final reward. Is there a contradiction here? And could you kindly point out which line of code would resolve this question? Thanks!
vwxyzjn commented
We do take the reward from the last token.
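To illustrate how the two statements fit together, here is a minimal NumPy sketch. It is not the repo's actual code: `last_token_reward` and `sequence_lengths` are hypothetical names, and it assumes right-padded sequences. The reward model emits one value per position, giving shape (B, T, 1), and a single scalar per sequence is then picked at the last non-padding token.

```python
import numpy as np

def last_token_reward(per_token_rewards, sequence_lengths):
    """Pick one scalar reward per sequence from (B, T, 1) reward-model outputs.

    per_token_rewards: array of shape (B, T, 1), one value per token position.
    sequence_lengths:  int array of shape (B,), number of non-padding tokens
                       (hypothetical input; assumes right padding).
    """
    rewards = np.asarray(per_token_rewards).squeeze(-1)  # (B, T)
    last_index = np.asarray(sequence_lengths) - 1         # (B,) index of final real token
    batch_index = np.arange(rewards.shape[0])             # (B,)
    return rewards[batch_index, last_index]               # (B,)

# Toy batch with B=2, T=4; the second sequence has two padding positions.
per_token = np.arange(8, dtype=np.float32).reshape(2, 4, 1)
lengths = np.array([4, 2])
scores = last_token_reward(per_token, lengths)
print(scores.shape)  # (2,)
```

So a (B, T, 1) output and a single final reward per sequence are compatible: the extra positions are computed but only one is used as the score.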