RedPajama-Data: An Open Source Recipe to Reproduce the LLaMA Training Dataset
This repo contains a reproducible data recipe for the RedPajama dataset, with the following token counts:
| Dataset       | Token Count  |
|---------------|--------------|
| Commoncrawl   | 878 Billion  |
| C4            | 175 Billion  |
| GitHub        | 59 Billion   |
| Books         | 26 Billion   |
| ArXiv         | 28 Billion   |
| Wikipedia     | 24 Billion   |
| StackExchange | 20 Billion   |
| Total         | 1.2 Trillion |
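The assembled corpus is also published on the Hugging Face Hub. The snippet below is a minimal sketch of streaming a single subset with the `datasets` library; the repository id `togethercomputer/RedPajama-Data-1T`, the `arxiv` config name, and the `text` field are assumptions to verify against the Hub page, and it assumes a `datasets` release that still executes dataset loading scripts (pre-3.0).

```python
# Minimal sketch, assuming the dataset is published on the Hugging Face Hub as
# "togethercomputer/RedPajama-Data-1T" and that the installed `datasets`
# release (pre-3.0) still runs dataset loading scripts.
from datasets import load_dataset

ds = load_dataset(
    "togethercomputer/RedPajama-Data-1T",  # assumed repository id
    "arxiv",                               # assumed subset/config name
    split="train",
    streaming=True,                        # avoid downloading the multi-TB corpus
)

for example in ds.take(1):
    # "text" is the assumed field holding the raw document contents
    print(example["text"][:200])
```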
Data Preparation
In data_prep, we provide all pre-processing scripts and guidelines.
Tokenization
In tokenization, we provide an example of how to tokenize the dataset using the GPT-NeoX tokenizer.
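The sketch below shows the general idea, assuming the GPT-NeoX tokenizer published on the Hugging Face Hub as `EleutherAI/gpt-neox-20b` and a plain in-memory document; see tokenization for the repo's actual pipeline and output format.

```python
# Minimal sketch (not the repo's script): encode documents with the GPT-NeoX
# tokenizer, assuming the "EleutherAI/gpt-neox-20b" checkpoint on the Hub.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

docs = ["RedPajama is an open recipe for reproducing the LLaMA training data."]
for doc in docs:
    token_ids = tokenizer.encode(doc)
    print(f"{len(token_ids)} tokens: {token_ids[:10]} ...")
```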
Visualization
In viz, we provide a dashboard for exploring a subset of the data using Meerkat.
License
The code in this repo is licensed under the Apache 2.0 license. Unless otherwise noted,
Copyright 2023 Together Computer, ETH Zürich, Stanford University
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
The file data_prep/book/dedup.py was co-developed with Ontocord.ai.
Copyright 2023 Ontocord.ai, Together Computer, ETH Zürich, Stanford University
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
For the dataset itself, please refer to the licenses of the data subsets you use.
- Common Crawl Foundation Terms of Use
- C4 license
- GitHub was limited to MIT, BSD, or Apache licenses only
- Books: the_pile_books3 license and pg19 license
- ArXiv Terms of Use
- Wikipedia License
- StackExchange license on the Internet Archive
For full terms, see the LICENSE file. If you have any questions, comments, or concerns about licensing, please contact us.
Acknowledgement
We are grateful for the work done by the growing open-source AI community that made this project possible. That includes:
- Participants in building the RedPajama dataset including Ontocord.ai, MILA Québec AI Institute, ETH DS3Lab, Université de Montréal, Stanford Center for Research on Foundation Models (CRFM), Stanford Hazy Research research group and LAION.
- EleutherAI — this project builds on the great work of the team at EleutherAI, including the source code they provided for training GPT-NeoX.
- An award of computer time was provided by the INCITE program. This research also used resources of the Oak Ridge Leadership Computing Facility (OLCF), which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.