EleutherAI/the-pile

ConvoKit datasets

upintheairsheep opened this issue · 2 comments

Can you integrate the ConvoKit datasets, especially the giant Reddit dataset, into the Pile or a future version of it? I would really like to bring AI further for all of humanity, not for the purpose of feeding the pigs (corporations).
https://zissou.infosci.cornell.edu/convokit/datasets/
See https://convokit.cornell.edu/documentation/datasets.html

http://cairo.lti.cs.cmu.edu/~hector/ - A similar dataset host with ~0.5 GB of Twitter tweets, ~0.3 GB of DBpedia data, and an unknown amount of wikiHow XML files

pile v2