Suggestion: Use the SmolLM corpus
linux-leo opened this issue · 3 comments
From my understanding, we are always trying to use the best available dataset, so I'm suggesting the corpus from the new Hugging Face SmolLM: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus
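For reference, a minimal sketch (untested) of how one subset of the corpus could be streamed with the `datasets` library. The config names ("cosmopedia-v2", "fineweb-edu-dedup", "python-edu") are the ones listed on the dataset card, and the `text` field is assumed to be the main content column for this subset:

```python
# Minimal sketch: stream one subset of smollm-corpus from the Hub.
# Assumes the `datasets` library is installed; config names are taken
# from the dataset card.
from datasets import load_dataset

ds = load_dataset(
    "HuggingFaceTB/smollm-corpus",
    "cosmopedia-v2",   # or "fineweb-edu-dedup" / "python-edu"
    split="train",
    streaming=True,    # avoid downloading the full corpus up front
)

for example in ds.take(2):
    # "text" is assumed to be the main content field for this subset
    print(example["text"][:200])
```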
Can you post some (eval) results against fineweb-edu?
I haven't run any experiments or trained a model with this codebase myself, but I will if I ever get around to it.
Note that the large majority of the SmolLM corpus is fineweb-edu, only augmented with synthetic data from cosmopedia-v2 and coding data from python-edu. Since both of these sources are small compared to the fineweb-edu data, I'd expect almost no negative impact on any benchmarks relative to pure fineweb-edu models, while possibly achieving higher scores on more academic questions and reasoning tasks.
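To make that composition concrete, here is a rough sketch (untested) of a fineweb-edu-heavy mix built with `interleave_datasets`. The probabilities are illustrative, not the official SmolLM ratios; python-edu is left out because, per the dataset card, it stores file references rather than raw text; and a recent `datasets` version with `select_columns` on streaming datasets is assumed:

```python
# Rough sketch: mostly fineweb-edu-dedup with a small cosmopedia-v2 share.
# Mixing probabilities are illustrative only, not the official SmolLM ratios.
from datasets import load_dataset, interleave_datasets

fineweb_edu = load_dataset(
    "HuggingFaceTB/smollm-corpus", "fineweb-edu-dedup",
    split="train", streaming=True,
).select_columns(["text"])

cosmopedia = load_dataset(
    "HuggingFaceTB/smollm-corpus", "cosmopedia-v2",
    split="train", streaming=True,
).select_columns(["text"])

# Sample mostly from fineweb-edu-dedup, occasionally from cosmopedia-v2.
mixed = interleave_datasets(
    [fineweb_edu, cosmopedia],
    probabilities=[0.9, 0.1],  # illustrative split, adjust to taste
    seed=42,
)

for example in mixed.take(3):
    print(example["text"][:100])
```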
This is not a one-to-one comparison, but it is from the official blog post announcing SmolLM (note the comparison to Karpathy's GPT):

https://huggingface.co/blog/smollm
Note: I don't know which checkpoint they are comparing against, but even assuming it's the longest-trained one, SmolLM was still trained on more than twice the amount of tokens. Still, I don't think that by itself explains some of the improvements, especially when taking model saturation into account.