About the evaluation set used
Closed this issue · 3 comments
Hi, thanks for your great work!
- In Figure 3 of the paper, the TL;DR summarization task is used to report the ROUGE metric. I'm wondering where the dataset comes from. Is it
load_dataset('openai/summarize_from_feedback', 'comparisons', split='validation')
with ROUGE computed between the generated summary and the higher-scored summary?
- In Figure 4, what does the multiple-choice prompt look like?
Thanks for the nice words.
Yeah, the validation split is used for evaluation, with the instruction being 'a good summary is:'. For evaluation on hh-rlhf, the choice template is 'The following is a dialogue: {dialogue}. This dialogue is {choice}', where {choice} is either 'good' or 'bad', chosen by likelihood.
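The likelihood-based choice above can be sketched as follows. `sequence_log_prob` is a stand-in for a real LM scoring function (in practice, the sum of the model's token log-probabilities for the filled-in template); here it is a toy scorer so the example is self-contained, and its bonus term is purely illustrative.

```python
def sequence_log_prob(text: str) -> float:
    # Toy stand-in: a real implementation would sum the model's
    # token log-probabilities for `text`. The length penalty and
    # keyword bonus here are illustrative only.
    return -0.1 * len(text) + (5.0 if "good" in text else 0.0)

def classify_dialogue(dialogue: str, choices=("good", "bad")) -> str:
    # Fill the template from the comment above with each candidate
    # choice and keep the one the scorer assigns higher likelihood.
    template = "The following is a dialogue: {dialogue}. This dialogue is {choice}"
    scores = {
        c: sequence_log_prob(template.format(dialogue=dialogue, choice=c))
        for c in choices
    }
    return max(scores, key=scores.get)

print(classify_dialogue("Human: Hi! Assistant: Hello, how can I help?"))
```

With a real model, the only change is replacing `sequence_log_prob` with an actual log-likelihood computation; the template and argmax over choices stay the same.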
Thanks.
I notice that the validation set can contain multiple summaries for the same document, spread across different instances; I guess only the one with "policy" = "ref" is human-written. Did you first preprocess the data to keep the "ref" summary for each document (which would shrink the validation set), or did you use the higher-scored summary in each instance as the oracle (which may not be human-written, as in the figure below)?
I apologize for the delay. We chose the higher-scored summary as the oracle in our experiments.
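So, per instance, the reference for ROUGE is simply the candidate with the higher score. A minimal sketch, assuming each instance carries a list of candidate summaries with a numeric `score` field (the field names here are illustrative, not the exact schema of openai/summarize_from_feedback):

```python
def oracle_summary(instance: dict) -> str:
    """Return the candidate summary with the higher score.

    This summary is then used as the ROUGE reference for the
    instance, regardless of whether it is human-written.
    """
    return max(instance["summaries"], key=lambda s: s["score"])["text"]

# Toy instance with two candidate summaries (illustrative layout).
instance = {
    "summaries": [
        {"text": "A short summary.", "score": 2.0},
        {"text": "A better, more faithful summary.", "score": 6.5},
    ],
}

print(oracle_summary(instance))
```

Note that no deduplication across instances is needed under this scheme: every instance contributes its own higher-scored summary as the reference.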