EleutherAI/gpt-neox

calculate epoch

mackmake opened this issue · 4 comments

Hi,
I want to know how to find the number of iterations needed to see all of my data for exactly one epoch.
Does the preprocess/train code calculate and log it somewhere, or should I calculate it myself?
The related part of my stdout file looks like this, if it helps:

 > dataset split:
    train:
     document indices in [0, 76507324) total of 76507324 documents
    validation:
     document indices in [76507324, 78875972) total of 2368648 documents
    test:
     document indices in [78875972, 78954927) total of 78955 documents
 > loading doc-idx mapping from path/to/mydata/_text_document_train_indexmap_24000000ns_2048sl_1234s_doc_idx.npy
 > loading sample-idx mapping from /path/to/mydata/_text_document_train_indexmap_24000000ns_2048sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from /path/to/mydata/_text_document_train_indexmap_24000000ns_2048sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.013 seconds
    total number of samples: 31267487
    total number of epochs: 2

If possible, please explain the formula for calculating it.
Thanks

# tokens / (global batch size * sequence length) is the number of steps needed for a single epoch.
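A minimal sketch of that formula. The sequence length (2048) and token count come from the log above; the global batch size here is a hypothetical placeholder — substitute the value from your own training config:

```python
def steps_per_epoch(num_tokens: int, global_batch_size: int, seq_length: int) -> int:
    """Number of optimizer steps needed to see `num_tokens` tokens once."""
    tokens_per_step = global_batch_size * seq_length
    # ceiling division: a partial final batch still costs one step
    return -(-num_tokens // tokens_per_step)

# example: token count from the log, batch size of 1024 assumed for illustration
print(steps_per_epoch(num_tokens=64_035_813_376, global_batch_size=1024, seq_length=2048))
```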

Thanks for your quick response,
but how can I find the # tokens?
I have a large dataset, and I think counting the tokens in it would take a lot of time.
Does the preprocess code count it, or give an approximation of it?

It looks like it's 64,035,813,376 (samples * sequence_length = 31267487 * 2048 = 64035813376).
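That figure falls straight out of the numbers already in the log above ("total number of samples" times the sequence length); a quick check:

```python
# both values are read directly from the training log
samples = 31_267_487   # "total number of samples"
seq_length = 2048      # from the indexmap filename (..._2048sl_...)

num_tokens = samples * seq_length
print(f"{num_tokens:,}")  # 64,035,813,376
```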

Oh, that's right.
Thanks very much!