haoliu/chain-of-hindsight

About the evaluation set used


Hi, thanks for your great work!

  1. In Figure 3 of the paper, the TL;DR summarization task is used to report ROUGE. I'm wondering where that dataset comes from. Is it load_dataset('openai/summarize_from_feedback', 'validation'), with ROUGE computed between the generated summary and the higher-scored summary? (A rough sketch of what I mean is given after this list.)
  2. In Figure 4, what does the multiple-choice prompt look like?
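To make question 1 concrete, here is a minimal sketch of the evaluation I have in mind. It is only an illustration, not code from the repo: I'm assuming the `comparisons` config of `openai/summarize_from_feedback`, the `evaluate` package for ROUGE, and a placeholder `generate_summary` standing in for whatever model is being evaluated.

```python
from datasets import load_dataset
import evaluate


def generate_summary(post):
    # Stand-in for the model being evaluated; a trivial lead baseline here.
    return ' '.join(post.split()[:48])


rouge = evaluate.load('rouge')
data = load_dataset('openai/summarize_from_feedback', 'comparisons', split='validation')

predictions, references = [], []
for example in data:
    post = example['info']['post']
    # Use the higher-scored (preferred) summary of the pair as the reference.
    references.append(example['summaries'][example['choice']]['text'])
    predictions.append(generate_summary(post))

print(rouge.compute(predictions=predictions, references=references))
```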

Thanks for the nice words.

Yeah, the validation split is used for evaluation, with the instruction being 'a good summary is:'. For evaluation on hh-rlhf, the choice template is 'The following is a dialogue: {dialogue}. This dialogue is {choice}', where {choice} is either 'good' or 'bad', chosen by likelihood.
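For anyone else reading this, here is a rough sketch of that likelihood-based choice. It is not the repo's actual code (which is JAX-based); it assumes a Hugging Face causal LM, with `gpt2` as a placeholder model name, and scores the full templated sequence for each choice.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'gpt2'  # placeholder; swap in the model being evaluated
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()


def choose_label(dialogue):
    """Return whichever completion of the template the model finds more likely."""
    scores = {}
    for choice in ('good', 'bad'):
        text = f'The following is a dialogue: {dialogue}. This dialogue is {choice}'
        ids = tokenizer(text, return_tensors='pt').input_ids
        with torch.no_grad():
            # LM loss is the mean negative log-likelihood per token;
            # negate it so that higher means more likely.
            scores[choice] = -model(ids, labels=ids).loss.item()
    return max(scores, key=scores.get)
```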

Thanks.
I notice that the validation set can contain multiple summaries for the same document, spread across different instances, but I guess only the one with "policy" = "ref" is human-written. Did you preprocess the set to extract the "ref" summary for each document (which shrinks the validation set), or did you just use the higher-scored summary in each instance as the oracle (which may not be human-written, as in the screenshots below)? A sketch of the two options follows the screenshots.
[screenshots of validation-set instances]
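To spell out the two options I mean (again just an illustration; field names assume the `comparisons` config of `openai/summarize_from_feedback`, including an `info['id']` document identifier):

```python
from datasets import load_dataset

data = load_dataset('openai/summarize_from_feedback', 'comparisons', split='validation')

# Option A: keep only the human-written reference (policy == 'ref') for each
# document, which shrinks the usable validation set.
ref_oracle = {}
for ex in data:
    for s in ex['summaries']:
        if s['policy'] == 'ref':
            ref_oracle[ex['info']['id']] = s['text']

# Option B: take the higher-scored summary of each comparison pair as the
# oracle, whether or not it is human-written.
preferred_oracle = [ex['summaries'][ex['choice']]['text'] for ex in data]
```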

I apologize for the delay. We chose the higher-scored summary as the oracle in our experiments.