Train/val split
DavidHerel opened this issue · 0 comments
Hi,
I want to ask how one can split a dataset into train/val splits. In tinystories.py I don't quite understand this comment:
train/test split. let's use only shard 0 for test split, rest train
So how many tokens from the training data end up in the validation split?
It seems that @karpathy uses 10 shards, and if only shard 0 is used as the test split, does that mean 1/10 of the data is used as the test set?
E.g., if I have a dataset with 10B tokens, are 1B tokens then used for the test/val set?
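To make the arithmetic concrete, here is a minimal sketch of how I read that comment. It is not the code from tinystories.py: the shard file names and the `split_shards` helper are hypothetical, and it assumes 10 shards that each hold the same number of tokens.

```python
def split_shards(shard_names):
    """Hold out the first shard (shard 0) for val/test, keep the rest for train."""
    shard_names = sorted(shard_names)
    return shard_names[:1], shard_names[1:]


# Hypothetical shard names; assumes 10 shards of equal size.
shards = [f"shard_{i:02d}.bin" for i in range(10)]
val_shards, train_shards = split_shards(shards)

total_tokens = 10_000_000_000  # the 10B-token dataset from my example
tokens_per_shard = total_tokens // len(shards)
val_tokens = tokens_per_shard * len(val_shards)
train_tokens = tokens_per_shard * len(train_shards)

print(f"val shards: {len(val_shards)}, train shards: {len(train_shards)}")
print(f"val tokens: {val_tokens:,} ({val_tokens / total_tokens:.0%} of the data)")
print(f"train tokens: {train_tokens:,}")
```

If the shards are not equally sized, or there are more than 10 of them, the held-out fraction would differ from 1/10.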