openai/gpt-2

Release raw lambada dataset

yaroslavvb opened this issue · 9 comments

Is it possible to release the Lambada dataset used to generate accuracy numbers in Table 3 of the paper? This would make it easier to do comparisons with other models :)
@Newmu

we just use the plain text files which can be downloaded here https://zenodo.org/record/2630551#.XNxg89NKjUI

That's a post-processed version, i.e., "don't" is split into "do n't", etc. GPT-2-small gets around 31% on that set. My understanding from @Newmu was that the 45.99 figure from Table 3 in the paper was on a raw/non-processed version.

we apply "de-tokenizers" to remove some of the artifacts. Alec can verify, but I think in this case it's simply:

def preprocess(text):
    text = text.replace("“", '"')
    text = text.replace("”", '"')
    return '\n'+text.strip()
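As a quick sanity check, here is the detokenizer restated so it runs standalone (the example string is made up; the curly quotes are written as escape codes to avoid encoding issues):

```python
def preprocess(text):
    # Normalize left/right curly double quotes to plain double quotes,
    # strip surrounding whitespace, and prepend a newline.
    text = text.replace("\u201c", '"')  # “
    text = text.replace("\u201d", '"')  # ”
    return '\n' + text.strip()

print(preprocess('\u201cDon\u2019t know,\u201d said Zach.  '))
```

Note that curly apostrophes (’) are left untouched; only the double quotes are normalized.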

in fact the detokenizer should be invertible, although i don't think that's important for the accuracy numbers

This detokenizer doesn't do anything on the official Lambada dataset since there are no smart quotes in it. My understanding is that OpenAI used its own version of Lambada dataset generated from book corpus/lambada. This dataset is interesting because of the accuracy gap in GPT2-small numbers -- 34% on official Lambada vs 46% on OpenAI's version.

my bad, you're right, whoops! try this: gs://gpt-2/data/lambada_test.jsonl
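For anyone wanting to load that file: it's a JSON-lines file, and a minimal loader might look like the sketch below (the `"text"` field name and the split-on-last-space convention are assumptions about the file's layout, not confirmed here):

```python
import json

def load_lambada(path):
    """Load (context, target_word) pairs from a JSON-lines file.

    Assumes each line is a JSON object with a "text" field holding the
    full passage, where the LAMBADA task is to predict the final word.
    """
    examples = []
    with open(path) as f:
        for line in f:
            obj = json.loads(line)
            # Split off the last whitespace-delimited word as the target.
            context, _, target = obj["text"].rpartition(" ")
            examples.append((context, target))
    return examples
```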
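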

Thanks, that dataset makes a difference.

I'm now getting 41.98 with GPT-2-small on this version of the dataset, using length-5 beam-search decoding of the last word plus stop-word removal.

Simplifying the accuracy test to compare equality of the last BPE token instead of the last word brings accuracy up to 46.89.
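The two scoring criteria can be sketched as below (a minimal sketch; the function names and the exact scoring convention are my own, not from the thread):

```python
def accuracy_last_word(predictions, targets):
    """Stricter criterion: the fully decoded predicted word must
    equal the target word exactly (after stripping whitespace)."""
    correct = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return correct / len(targets)

def accuracy_last_token(predicted_ids, target_ids_list):
    """Looser criterion: the model's single next-token prediction only
    has to match the final BPE token of the target word, so a target
    that encodes to multiple tokens is scored on its last piece only."""
    correct = sum(p == t[-1] for p, t in zip(predicted_ids, target_ids_list))
    return correct / len(target_ids_list)
```

Since a rare word often splits into several BPE tokens, the last-token criterion can only score higher than (or equal to) the last-word criterion, which is consistent with the jump from 41.98 to 46.89.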

I'm wondering if this should be called "lambada-openai" or something in tables to avoid confusion. I looked at the errors between the two datasets, and OpenAI's version seems easier because the formatting (capitalization and paragraph breaks) provides extra information.

Official Lambada

she and zach were covered in dust and sweat when helen found them. "wow, lexi! you rock." lexi groaned at the bad pun. helen surveyed the work, which was nearly complete. "how did you do this?" lexi shrugged. "don't know." "it's her gift," said zach

This version

She and Zach were covered in dust and sweat when Helen found them. "Wow, Lexi! You rock."

Lexi groaned at the bad pun.

Helen surveyed the work, which was nearly complete. "How did you do this?"

Lexi shrugged. "Don't know."

"It's her gift," said Zach

yeah, i agree that keeping the extra information is potentially useful (even for non-zero-shot), and it's probably good to distinguish it from the original dataset

Hi, I'm also looking to run the same test. Can you fix the gs://gpt-2/data/lambada_test.jsonl link? I'm getting:

BucketNotFoundException: 404 gs://gpt-2 bucket does not exist.

should now be at https://openaipublic.blob.core.windows.net/gpt-2/data/lambada_test.jsonl