/llm-datasets

A collection of datasets for language model pretraining, including scripts for downloading, preprocessing, and sampling.

Primary language: Python. License: Apache License 2.0 (Apache-2.0).