EleutherAI/gpt-neox

Some datasets are not available

vangogh0318 opened this issue · 2 comments

Describe the bug
I cannot download the github or ArXiv datasets; the URLs no longer work.
How can I download the github and ArXiv data? Thank you.

The code in the corpora.py file, line 190:
class Github(DataDownloader):
    name = "github"
    urls = ["http://eaidata.bmk.sh/data/github_small.jsonl.zst"]

class ArXiv(DataDownloader):
    name = "arxiv"
    urls = [
        "https://the-eye.eu/public/AI/pile_preliminary_components/2020-09-08-arxiv-extracts-nofallback-until-2007-068.tar.gz"
    ]
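
To see which of the URLs listed in a downloader class still respond before starting a long download, something like the following may help. This is only a rough sketch and is not part of gpt-neox: the helper names url_is_reachable and find_dead_urls are made up for illustration, and it assumes the server answers HEAD requests (some do not, in which case a live URL would be reported as unreachable).

```python
import urllib.error
import urllib.request


def url_is_reachable(url, timeout=10):
    """Return True if a HEAD request to `url` succeeds with a status below 400.

    Hypothetical helper, not part of gpt-neox. Any network error, HTTP error,
    or malformed URL is treated as "not reachable".
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        return False


def find_dead_urls(urls, probe=url_is_reachable):
    """Return the entries of `urls` for which `probe` reports failure.

    `probe` is injectable so the check can be exercised without network access.
    """
    return [url for url in urls if not probe(url)]
```

Calling find_dead_urls(Github.urls + ArXiv.urls) would then return the list of URLs that did not respond, which can be reported here or swapped for a mirror if one exists.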

This is correct. The Pile has been taken down due to a DMCA takedown request.

Hi, how can I access the Pile data? Thanks.
@StellaAthena