openai/gpt-2-output-dataset

WebText Dataset format

loretoparisi opened this issue · 1 comment

What do the length and ended fields mean in the dataset lines? For example:

{"id": 1, "ended": true, "length": 66, "text": "LeSean McCoy going through warmups with first team offense. To my eye, does not look close to 100 percent when cutting and exploding.\n\nABOUT COOKIES\n\nTo help make this website better, to improve and personalize your experience and for advertising purposes, are you happy to accept cookies and other technologies?"}

Also, I can see that some texts contain newlines followed by list indexes, as in:

{"id": 0, "ended": true, "length": 138, "text": "These girlfriends deserves a special mention for going that extra mile, hopefully doesn't set too many guys off on the path towards outrageous demands.\n\n1. She knows the severity of man-flu\n\n2. All fun and games is all good\n\n3. A voucher that says 'I love you'\n\n4. When arguments don't drag on forever.\n\n5. Providing everything he needs.\n\n6. Very understanding\n\n7. As awesome a gesture as this is, we are worried about this man's cooking skills.\n\n8. Nice cake\n\n8. Fair bargaining\n\n9. Excellent gift choice\n\n10. Very thoughtful"}

Here the text contains sequences such as \n\n3. through \n\n8. What do they mean? Is this just a scraped document in a numbered-list (questionnaire) style?

As far as I can see, the detector does not use these fields anyway: https://github.com/openai/gpt-2-output-dataset/blob/master/detector/dataset.py#L17

Thank you.

length is the length of the text in BPE tokens (see the GPT-2 paper for details of the tokenization scheme). ended indicates whether the sample contained an endoftext token, in which case the text is truncated at that token.
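
To make the field descriptions concrete, here is a minimal Python sketch that parses the id 1 record quoted in the question. It assumes each dataset line is a standalone JSON object, as the quoted lines suggest. The helper name parse_record is hypothetical and is not part of the repository. The sketch also shows that the \n\n sequences are ordinary JSON-escaped newlines, so after decoding they are blank lines between paragraphs of the scraped page.

```python
import json

# The id 1 record from the question, split across raw string literals so the
# \n escapes reach the JSON parser unchanged.
SAMPLE_LINE = (
    r'{"id": 1, "ended": true, "length": 66, "text": "LeSean McCoy going through '
    r'warmups with first team offense. To my eye, does not look close to 100 percent '
    r'when cutting and exploding.\n\nABOUT COOKIES\n\nTo help make this website '
    r'better, to improve and personalize your experience and for advertising '
    r'purposes, are you happy to accept cookies and other technologies?"}'
)


def parse_record(line: str) -> dict:
    """Decode one dataset line (hypothetical helper, not from the repo)."""
    record = json.loads(line)
    return {
        "id": record["id"],
        # True if the sample contained an endoftext token and was cut there.
        "ended": record["ended"],
        # Length in BPE tokens, not characters or words.
        "length": record["length"],
        # "\n\n" in the JSON decodes to a blank line between paragraphs.
        "paragraphs": record["text"].split("\n\n"),
    }


parsed = parse_record(SAMPLE_LINE)
print(parsed["ended"], parsed["length"], len(parsed["paragraphs"]))
for paragraph in parsed["paragraphs"]:
    print("-", paragraph[:40])
```

Note that length counts BPE tokens, so it will not generally match len(text) or a whitespace word count. Reproducing it would need the GPT-2 tokenizer, which this sketch deliberately leaves out.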