Tokenized dataset?
joelburget commented
I was wondering if it'd be possible to upload the tokenized dataset. I tried following the instructions under the Pretraining header but had trouble installing MegaBlocks due to a CUDA version mismatch. Either way, I think uploading the tokenized dataset to Hugging Face would be very helpful and would save others the work.
Muennighoff commented