openai/summarize-from-feedback

human feedback in validation dataset?

ShiYaya opened this issue · 6 comments

I want to know whether the human feedback you collected was on the validation set.
If so, there is information leakage when the trained reward model is used as an evaluation metric, since, for example, the contexts (posts) in the validation set would have been seen by the reward model during training.

We made sure only to train on our training set. (We split the posts into train, validation, and test sets; we sampled summaries for posts from the training set and collected feedback on those summaries.)
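In pseudocode terms, the data handling looks roughly like this (an illustrative sketch with made-up names and fractions, not the actual repo code): posts are split once, and human comparisons are collected only for summaries of training-set posts.

```python
import random

def split_posts(posts, seed=0, frac_train=0.8, frac_valid=0.1):
    """Split posts into disjoint train/validation/test sets (illustrative only)."""
    rng = random.Random(seed)
    posts = list(posts)
    rng.shuffle(posts)
    n = len(posts)
    n_train = int(frac_train * n)
    n_valid = int(frac_valid * n)
    return posts[:n_train], posts[n_train:n_train + n_valid], posts[n_train + n_valid:]

# Human comparisons are gathered only for the train split, so the reward model
# never sees validation or test posts during training, e.g.:
#   train_posts, valid_posts, test_posts = split_posts(all_posts)
#   comparisons = collect_human_feedback(sample_summaries(train_posts))
```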

Thank you very much for your reply; I have another question.
For Table 20, which dataset did you use: the train, validation, or test set?

In my opinion, here is another experiment you could do: collect human feedback on the validation set, train a reward model on that validation-set feedback, and then use this reward model to train a policy.
I think that under the experimental setting I propose, the policy's performance would be worse than in your paper. In your setting, the reward model is trained on the training set, so it can closely approximate human feedback, which is nearly equivalent to using human feedback directly as the reward for reinforcement learning, and the reward model therefore works well. But if the reward model were trained on the validation set, its performance would decrease.
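To make the proposed setup concrete, I mean something like the following sketch (all names here are made up for illustration; this is not your actual API):

```python
def alternative_experiment(valid_posts, rl_posts, initial_policy,
                           sample_summaries, collect_human_feedback,
                           train_reward_model, train_policy_with_rl):
    # 1. Sample candidate summaries for *validation* posts and collect
    #    human preference comparisons on them.
    valid_summaries = sample_summaries(initial_policy, valid_posts)
    comparisons = collect_human_feedback(valid_summaries)

    # 2. Train the reward model only on this validation-set feedback.
    reward_model = train_reward_model(comparisons)

    # 3. Use that reward model to optimize a policy; rl_posts are whatever
    #    posts serve as prompts during RL (left unspecified in my proposal).
    return train_policy_with_rl(initial_policy, reward_model, rl_posts)
```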

If I have any errors above, please tell me.

@ShiYaya It's a bit complicated. For Table 20, we use both train+valid for most entries, but for all rows/columns involving the reward model or supervised model, we use only the validation set.

@WuTheFWasThat
Thank you very much for your reply. I'm sorry that my question was confusing. In Table 20, your purpose is to evaluate agreement rates between humans and various automated metrics, so for a fair comparison these pairs should all be evaluated on the same set. I want to know which set you used.

Yes, it's true that they're not all evaluated on the same set, and this could be considered unfair. However, since the validation set is distributed the same as the training set, they are all estimates of agreement rates on the same underlying distribution. Thus we decided to include the training set for higher statistical power in cases where the metrics didn't depend on the training set, as this decreases the per-cell bootstrap errors.
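For reference, the per-cell bootstrap error on an agreement rate can be estimated roughly as follows (an illustrative sketch, not the exact procedure used for the table); the point is just that more comparisons give a smaller standard error.

```python
import numpy as np

def bootstrap_agreement(agreements, n_boot=10000, seed=0):
    """agreements: array of 0/1 flags, 1 where the metric agrees with the human label.

    Returns the mean agreement rate and its bootstrap standard error.
    """
    rng = np.random.default_rng(seed)
    agreements = np.asarray(agreements)
    n = len(agreements)
    # Resample comparisons with replacement and recompute the agreement rate.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_rates = agreements[idx].mean(axis=1)
    return agreements.mean(), boot_rates.std()

# Doubling the number of comparisons (e.g. train + validation instead of
# validation alone) shrinks the bootstrap standard error by roughly sqrt(2).
```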