
Tokenized dataset?


I was wondering if it'd be possible to upload the tokenized dataset. I tried following the instructions under the Pretraining header, but I couldn't install MegaBlocks because of a CUDA version mismatch. In any case, I think uploading the tokenized dataset to Hugging Face would be very helpful and would save others the work.

Agree that this would be great; @soldni what do you think? Here are all the S3 paths of the tokenized dataset. Can we easily upload them to HF?