WebText Dataset format
loretoparisi opened this issue · 1 comments
loretoparisi commented
What is the meaning of length and ended in the dataset lines:
{"id": 1, "ended": true, "length": 66, "text": "LeSean McCoy going through warmups with first team offense. To my eye, does not look close to 100 percent when cutting and exploding.\n\nABOUT COOKIES\n\nTo help make this website better, to improve and personalize your experience and for advertising purposes, are you happy to accept cookies and other technologies?"}
Also, I can see that there are newlines followed by indexes, as in:
{"id": 0, "ended": true, "length": 138, "text": "These girlfriends deserves a special mention for going that extra mile, hopefully doesn't set too many guys off on the path towards outrageous demands.\n\n1. She knows the severity of man-flu\n\n2. All fun and games is all good\n\n3. A voucher that says 'I love you'\n\n4. When arguments don't drag on forever.\n\n5. Providing everything he needs.\n\n6. Very understanding\n\n7. As awesome a gesture as this is, we are worried about this man's cooking skills.\n\n8. Nice cake\n\n8. Fair bargaining\n\n9. Excellent gift choice\n\n10. Very thoughtful"}
so \n\n3...\n\n8. What does this mean? Is it just a questionnaire-style scraped document?
I can see that the detector does not use that information anyway: https://github.com/openai/gpt-2-output-dataset/blob/master/detector/dataset.py#L17
Thank you.
WuTheFWasThat commented
length is the length of the sample in BPE tokens (see the GPT-2 paper for information on the tokenization scheme). ended indicates whether the sample contained (and is truncated at) an endoftext token.
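For anyone who wants to inspect these fields, here is a minimal sketch. It assumes the file is JSON Lines (one JSON object per line, as in the lines quoted above). The two records and the summarize helper are hypothetical, made up for illustration; they are not taken from the dataset or from the repository code.

```python
import json

# Two hypothetical records in the same shape as the quoted dataset lines.
# The values are placeholders for illustration, not real dataset entries.
SAMPLE_LINES = [
    '{"id": 0, "ended": true, "length": 138, "text": "first\\n\\nexample"}',
    '{"id": 1, "ended": false, "length": 1024, "text": "second example"}',
]


def summarize(lines):
    """Parse JSON Lines records and summarize the length and ended fields."""
    records = [json.loads(line) for line in lines if line.strip()]
    return {
        "count": len(records),
        # length is a token count, so summing gives the total tokens.
        "total_tokens": sum(r["length"] for r in records),
        # ended is a boolean flag; count the samples where it is true.
        "ended": sum(1 for r in records if r["ended"]),
    }


summary = summarize(SAMPLE_LINES)
print(summary)

# After json.loads, the \n\n escapes in "text" become real newlines.
first_text = json.loads(SAMPLE_LINES[0])["text"]
print(first_text.split("\n\n"))
```

To run this on an actual file, pass an open file handle instead of SAMPLE_LINES, since iterating over a file yields one line at a time.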