EleutherAI/the-pile

"Github" code data download only

HangXue-lab opened this issue · 2 comments

The Pile is too big for me to download in full. I only want the "Github" code data, but the Pile train set is split into 30 files. I would like to know exactly which file contains the "Github" code data.

The data in the train files has already been processed at that stage, and may not be what you want. You probably want github.tar from the preliminary components (https://the-eye.eu/public/AI/pile_preliminary_components/github.tar), which you can then process yourself.
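If you do want to work from the processed train files anyway, a minimal sketch of filtering by subset might look like the following. It assumes each line of a decompressed train file is a JSON object with a `text` field and a `meta.pile_set_name` field set to "Github" for the code data. Those field names, the helper `filter_subset`, and the sample records are assumptions for illustration, not something confirmed in this thread.

```python
import json
from typing import Iterable, Iterator


def filter_subset(lines: Iterable[str], subset: str = "Github") -> Iterator[dict]:
    """Yield records whose meta.pile_set_name equals `subset`.

    Assumes one JSON object per line with a "meta" dict (an assumption
    about the file layout, not confirmed in this thread).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        meta = record.get("meta") or {}
        if meta.get("pile_set_name") == subset:
            yield record


# Tiny in-memory stand-in for a decompressed train file (made-up records).
sample = [
    '{"text": "def f(): pass", "meta": {"pile_set_name": "Github"}}',
    '{"text": "Some article.", "meta": {"pile_set_name": "Pile-CC"}}',
]

github_texts = [r["text"] for r in filter_subset(sample)]
```

For a real file you would pass an open text-mode file handle (after decompressing it) instead of `sample`. If the subsets are interleaved across the train files, which I believe is the case but have not verified here, this would need to run over all 30 files rather than a single one.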

The link is no longer working. Is there another link to obtain the data?