antocodes's Stars
DS3Lab/WordScape
The WordScape repository contains code for the WordScape pipeline, which creates datasets for training document understanding models.
bgGLUE/bgglue
bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark
togethercomputer/RedPajama-Data
The RedPajama-Data repository contains code for preparing large datasets for training large language models.