dataset description

Question

sunying2018 opened this issue 7 months ago · 3 comments

sunying2018 commented 7 months ago

Great work! Would it be possible to add some descriptions to clarify how the training dataset is generated? For example, the two datasets used in the script: PY007/slimpajama_llama_tokenized_upsample_4096_chunk_256K and PY007/slimpajama_llama_tokenized_upsample_4096_chunk_1M. Thanks!

Answer 1 · 2024-04-08T06:49:24.000Z

Just added some info to the dataset card: https://huggingface.co/datasets/PY007/slimpajama_llama_tokenized_upsample_4096_chunk_256K

Answer 2 · 2024-04-18T06:58:28.000Z

Both dataset cards specify --dataset_size=100m. However, a calculation shows that the 256K dataset contains 1B tokens and the 1M dataset contains 5B tokens.
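For anyone who wants to redo this kind of check, here is a minimal sketch of the arithmetic. It assumes each row holds exactly one fixed-length chunk and that "256K" and "1M" mean 256 * 1024 and 1024 * 1024 tokens; neither assumption is confirmed in this thread, and the helper names are made up for illustration.

```python
# Sketch only: assumes one fixed-length token chunk per dataset row.
# CHUNK_256K / CHUNK_1M are assumed values (256 * 1024 and 1024 * 1024 tokens).
CHUNK_256K = 256 * 1024
CHUNK_1M = 1024 * 1024


def total_tokens(num_rows: int, chunk_len: int) -> int:
    """Total tokens in a dataset made of num_rows chunks of chunk_len tokens."""
    return num_rows * chunk_len


def rows_for(target_tokens: int, chunk_len: int) -> int:
    """Smallest number of rows holding at least target_tokens (ceiling division)."""
    return -(-target_tokens // chunk_len)


# Row counts implied by the totals reported above (1B and 5B tokens).
rows_256k = rows_for(1_000_000_000, CHUNK_256K)
rows_1m = rows_for(5_000_000_000, CHUNK_1M)

# Row count that --dataset_size=100m would imply for the 256K dataset.
rows_256k_if_100m = rows_for(100_000_000, CHUNK_256K)
```

Comparing the actual row count shown on the dataset page against `rows_256k` and `rows_256k_if_100m` indicates which total the data matches.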

Answer 3 · 2024-04-19T00:48:07.000Z

@Bostoncake Yes, you are correct. I will update the dataset card. Sorry for the typo.