/lm-datasets

A collection of datasets for language model pretraining, including scripts for downloading, preprocessing, and sampling.

Primary language: Python. License: Apache License 2.0 (Apache-2.0).
