Code for the curation of The Stack v2 and StarCoder2 training data

Primary language: Jupyter Notebook. License: Apache-2.0.

The Stack v2 & StarCoder2Data

This repository contains the code for building The Stack v2 dataset, as well as the extra sources used to make StarCoder2Data, the training corpus of the StarCoder2 family of models.

This repository is a follow-up to the work in bigcode-dataset, which was used for The Stack v1 and StarCoderData.