calculate epoch
mackmake opened this issue · 4 comments
Hi,
I want to know how to find the number of iterations needed to see all of my data exactly once (one epoch).
Does the preprocess/train code calculate and log it somewhere, or should I calculate it myself?
The related part of my stdout file looks like this, if it helps:
> dataset split:
train:
document indices in [0, 76507324) total of 76507324 documents
validation:
document indices in [76507324, 78875972) total of 2368648 documents
test:
document indices in [78875972, 78954927) total of 78955 documents
> loading doc-idx mapping from path/to/mydata/_text_document_train_indexmap_24000000ns_2048sl_1234s_doc_idx.npy
> loading sample-idx mapping from /path/to/mydata/_text_document_train_indexmap_24000000ns_2048sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from /path/to/mydata/_text_document_train_indexmap_24000000ns_2048sl_1234s_shuffle_idx.npy
loaded indexed file in 0.013 seconds
total number of samples: 31267487
total number of epochs: 2
If possible, please explain the formula for calculating it.
Thanks!
# tokens / (global batch size * sequence length) is the number of steps needed for a single epoch.
Thanks for your quick response,
but how can I find the # tokens?
I have a large dataset, and counting the tokens in it would take a lot of time.
Does the preprocess code count it, or give an approximation of it?
It looks like it's 64,035,813,376 (samples * sequence_length = 31267487 * 2048 = 64,035,813,376).
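Putting the two replies together, here is a minimal sketch of the calculation. The sequence length and sample count come from the log above; the global batch size is an assumed placeholder, so substitute the value from your own training configuration. Note also that the log reports 2 epochs, so the token total below spans both.

```python
# Values taken from the stdout log in this thread.
sequence_length = 2048        # from the index-map filename (...2048sl...)
total_samples = 31_267_487    # "total number of samples" in the log

# tokens = samples * sequence length
total_tokens = total_samples * sequence_length  # 64,035,813,376

# Assumed for illustration only; use your training config's value.
global_batch_size = 1024

# steps per epoch = # tokens / (global batch size * sequence length),
# which reduces to total_samples // global_batch_size.
steps_per_epoch = total_tokens // (global_batch_size * sequence_length)
print(total_tokens, steps_per_epoch)
```

Since sequence_length cancels out, this is equivalent to dividing the total sample count by the global batch size.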
Oh, that's right.
Thanks very much!