SigmaWe/DocAsRef

Pseudoreference using decent enough summarizers

forrestbao opened this issue

Pseudocode:

def pseudo_metric(documents: List[str], system_summaries: List[str]):
    # Summarize the documents to get pseudo-reference summaries
    pseudo_ref_summaries = pegasus(documents)
    # Score the system summaries against the pseudo references
    rouge_scores = rouge(pseudo_ref_summaries, system_summaries)
    return rouge_scores
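
A minimal runnable sketch of this, assuming HF's pipeline for the summarizer and the evaluate library for ROUGE (the default Pegasus checkpoint below is an assumption for illustration):

from typing import Dict, List

import evaluate
from transformers import pipeline

def pseudo_metric(documents: List[str], system_summaries: List[str],
                  model_name: str = "google/pegasus-cnn_dailymail") -> Dict[str, float]:
    # Summarize the documents to obtain pseudo-reference summaries
    summarizer = pipeline("summarization", model=model_name)
    pseudo_refs = [out["summary_text"] for out in summarizer(documents, truncation=True)]
    # Score the system summaries against the pseudo references with ROUGE
    rouge = evaluate.load("rouge")
    return rouge.compute(predictions=system_summaries, references=pseudo_refs)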

Let's try two summarizers for now: Google's Pegasus and Facebook's BART, both fine-tuned on a summarization dataset.
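
With the sketch above, both can be tried in one loop. The checkpoint names and the toy inputs are assumptions for illustration; any summarization-tuned Pegasus/BART variants should work:

documents = ["Sam Shleifer writes the best docstring examples in the whole world."]
system_summaries = ["Sam Shleifer writes great docstring examples."]

# Assumed checkpoints for the Pegasus and BART pseudo-reference summarizers
for checkpoint in ["google/pegasus-cnn_dailymail", "facebook/bart-large-cnn"]:
    print(checkpoint, pseudo_metric(documents, system_summaries, model_name=checkpoint))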

Before we start, let's try both using and not using HF's pipeline and see whether they produce the same result. Specifically, one approach is

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load BART fine-tuned on CNN/DailyMail directly, without the pipeline wrapper
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

while the other (doc here) is

summarizer = pipeline("summarization")
summarizer("Sam Shleifer writes the best docstring examples in the whole world.", model="bart-large-cnn", min_length=5, max_length=20)