karpathy/llm.c

Suggestion: Use SmolLM corpus

linux-leo opened this issue · 3 comments

From my understanding we are always trying to use the best available dataset, so I'm suggesting the corpus from the new Hugging Face SmolLM: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus

Can you post some (eval) results against FineWeb-Edu?

I haven't run any experiments and have never trained a model with this codebase myself, but I will if I ever get around to it.

Note that the large majority of SmolLM is fineweb-edu, augmented only with synthetic data from cosmopedia-v2 and coding data from python-edu. Since both of these sources are small compared to the fineweb-edu data, in my opinion they should have almost no negative impact on any benchmark compared to pure fineweb-edu models, but they might achieve higher scores on more academic questions and reasoning tasks.
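For reference, here is a rough, untested sketch of how the corpus could be streamed and tokenized for llm.c, in the same spirit as dev/data/fineweb.py. The subset names ("cosmopedia-v2", "fineweb-edu-dedup", "python-edu") are taken from the dataset card and may change; I'm also assuming each subset exposes a plain `text` column (python-edu may need extra handling), and the actual shard writing would follow whatever format dev/data uses:

```python
# Rough sketch (untested): stream the SmolLM corpus subsets and tokenize them
# with the GPT-2 tokenizer, similar in spirit to dev/data/fineweb.py.
# Assumptions: subset names come from the dataset card, and each record has a
# "text" field (python-edu may differ and need special handling).
import numpy as np
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("gpt2")
eot = enc._special_tokens["<|endoftext|>"]  # document delimiter token

def tokenize(doc):
    # prepend the EOT token so every document starts with a delimiter
    tokens = [eot] + enc.encode_ordinary(doc["text"])
    return np.array(tokens, dtype=np.uint16)

subsets = ["cosmopedia-v2", "fineweb-edu-dedup", "python-edu"]
for name in subsets:
    ds = load_dataset("HuggingFaceTB/smollm-corpus", name,
                      split="train", streaming=True)
    for doc in ds:
        toks = tokenize(doc)
        # ... accumulate `toks` into fixed-size shards and write them out in
        # the same .bin format the existing fineweb preprocessing produces
```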

This is not a one-to-one comparison, but it is from the official blog post announcing SmolLM (notice the comparison to the karpathy GPT).

[Image: benchmark comparison chart from the SmolLM blog post]

https://huggingface.co/blog/smollm

Note: I don't know which checkpoint they are comparing against, but even assuming it is the longest-trained one, SmolLM was still trained on more than twice the amount of tokens. Still, I don't think that by itself explains some of the improvements, especially when taking model saturation into account.