facebookresearch/TransCoder

Effective batch size and number of epochs on the full data


Hi, I am curious what the effective batch size was in your experiments. Does the batch size impact training stability? In the paper, you mention that the batch size was set to 32 (sequences of length 512) per GPU and that 32 V100 GPUs were used. Does that mean the effective batch size was 1024 (32 × 32)?

Also, it seems you trained MLM for a maximum of 100,000 steps. Since the dataset contains more than 700M functions across 3 languages, were you able to pass over the entire data even once within those 100,000 steps? By my calculation, with 100k steps and an effective batch size of 1024, you covered around 100M examples (number of functions). Could you share approximately how many epochs over the whole dataset were completed with this training setup?
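
For reference, here is the back-of-the-envelope arithmetic behind my question, assuming one function per training sequence (which may not match your actual data packing), 32 sequences per GPU, 32 GPUs, 100k MLM steps, and roughly 700M functions:

```python
# Rough estimate of data coverage under the assumptions stated above.
per_gpu_batch = 32          # sequences of length 512 per GPU (from the paper)
num_gpus = 32               # V100 GPUs (from the paper)
max_steps = 100_000         # MLM training steps
dataset_size = 700_000_000  # > 700M functions across the 3 languages (assumed)

effective_batch = per_gpu_batch * num_gpus       # 1024 sequences per step
examples_seen = effective_batch * max_steps      # ~102.4M sequences in total
epochs_covered = examples_seen / dataset_size    # ~0.15 of one full epoch

print(f"effective batch size:  {effective_batch}")
print(f"examples seen:         {examples_seen:,}")
print(f"fraction of one epoch: {epochs_covered:.2f}")
```

If these assumptions are right, 100k steps would cover only a small fraction of the full dataset, which is why I am asking about the actual number of epochs.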