About the evaluation set used
Closed this issue · 3 comments
Hi, thanks for your great work!
- In Figure 3 of the paper, the TL;DR summarization task is used to report the ROUGE metric. I'm wondering where the dataset comes from. Is it
load_dataset('openai/summarize_from_feedback', 'comparisons', split='validation')
with ROUGE computed between the generated summary and the higher-scored summary?
- In Figure 4, what does the multiple-choice prompt look like?
Thanks for the nice words.
Yeah, the validation split is used for evaluation, with the instruction being 'a good summary is:'. For evaluation on hh-rlhf, the choice template is 'The following is a dialogue: {dialogue}. This dialogue is {choice}', where {choice} is either 'good' or 'bad', chosen by likelihood.
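The likelihood-based choice above can be sketched as follows. `sequence_log_prob` is a stand-in for a real LM scoring function (in practice, the sum of the model's token log-probabilities for the filled-in template); here it is a toy scorer so the example is self-contained, and its bonus term is purely illustrative.

```python
def sequence_log_prob(text: str) -> float:
    # Toy stand-in: a real implementation would sum the model's
    # token log-probabilities for `text`. The length penalty and
    # keyword bonus here are illustrative only.
    return -0.1 * len(text) + (5.0 if "good" in text else 0.0)

def classify_dialogue(dialogue: str, choices=("good", "bad")) -> str:
    # Fill the template from the comment above with each candidate
    # choice and keep the one the scorer assigns higher likelihood.
    template = "The following is a dialogue: {dialogue}. This dialogue is {choice}"
    scores = {
        c: sequence_log_prob(template.format(dialogue=dialogue, choice=c))
        for c in choices
    }
    return max(scores, key=scores.get)

print(classify_dialogue("Human: Hi! Assistant: Hello, how can I help?"))
```

With a real model, the only change is replacing `sequence_log_prob` with an actual log-likelihood computation; the template and argmax over choices stay the same.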
Thanks.
I notice that the validation set can contain multiple summaries for the same document, spread across different instances; I guess only the one with "policy" = "ref" is human-written. Did you first preprocess the data to keep the "ref" summary for each document (which would shrink the validation set), or did you use the higher-scored summary in each instance as the oracle (which may not be human-written, as in the figure below)?
I apologize for the delay. We chose the higher-scored summary as the oracle in our experiments.
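So, per instance, the reference for ROUGE is simply the candidate with the higher score. A minimal sketch, assuming each instance carries a list of candidate summaries with a numeric `score` field (the field names here are illustrative, not the exact schema of openai/summarize_from_feedback):

```python
def oracle_summary(instance: dict) -> str:
    """Return the candidate summary with the higher score.

    This summary is then used as the ROUGE reference for the
    instance, regardless of whether it is human-written.
    """
    return max(instance["summaries"], key=lambda s: s["score"])["text"]

# Toy instance with two candidate summaries (illustrative layout).
instance = {
    "summaries": [
        {"text": "A short summary.", "score": 2.0},
        {"text": "A better, more faithful summary.", "score": 6.5},
    ],
}

print(oracle_summary(instance))
```

Note that no deduplication across instances is needed under this scheme: every instance contributes its own higher-scored summary as the reference.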