haoyuhu/bert-multi-gpu

Do we see different results with different global_batch_size but same iteration_steps? How does Global Batch Size play a role? Will time taken to complete training change?

Situation 1 (multi-gpu): train_batch_size = 8, num_gpu_cores = 4, num_train_epochs = 1
global_batch_size = train_batch_size * num_gpu_cores = 32
iteration_steps = num_train_examples * num_train_epochs / train_batch_size = 4000

Situation 2 (single-gpu): train_batch_size = 8, num_gpu_cores = 1, num_train_epochs = 1
global_batch_size = train_batch_size * num_gpu_cores = 8
iteration_steps = num_train_examples * num_train_epochs / train_batch_size = 4000

Let me explain why I put forth these two situations.
Even though I increase the number of GPU cores when fine-tuning the model, I don't see any decrease in training time.
Multi-GPU training may have made training more stable, but it has not reduced the total time taken at all.
Is that because iteration_steps does not change?
Or am I missing something?
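
The two situations above can be worked through numerically. The sketch below is only an illustration, not code from the repository: the `Situation` class and its property names are hypothetical, `num_train_examples = 32000` is not stated anywhere in the issue but inferred from iteration_steps = 4000 with train_batch_size = 8 and one epoch, and it assumes each iteration step consumes one global batch, as the global_batch_size formula suggests.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    """Hypothetical container for the settings quoted in the issue."""
    train_batch_size: int
    num_gpu_cores: int
    num_train_epochs: int
    num_train_examples: int

    @property
    def global_batch_size(self) -> int:
        # Formula from the issue: per-GPU batch size times number of GPUs.
        return self.train_batch_size * self.num_gpu_cores

    @property
    def iteration_steps(self) -> int:
        # Formula from the issue: note it divides by train_batch_size,
        # not by global_batch_size, so it ignores num_gpu_cores.
        return (self.num_train_examples * self.num_train_epochs
                // self.train_batch_size)

    @property
    def examples_processed(self) -> int:
        # Assumption: every step consumes one full global batch.
        return self.iteration_steps * self.global_batch_size

    @property
    def passes_over_data(self) -> float:
        # How many times the training set is effectively traversed.
        return self.examples_processed / self.num_train_examples


# Inferred, not given in the issue: 4000 steps * batch size 8 / 1 epoch.
NUM_TRAIN_EXAMPLES = 32000

multi = Situation(8, 4, 1, NUM_TRAIN_EXAMPLES)   # Situation 1
single = Situation(8, 1, 1, NUM_TRAIN_EXAMPLES)  # Situation 2

for name, s in (("multi-gpu", multi), ("single-gpu", single)):
    print(name, s.global_batch_size, s.iteration_steps,
          s.examples_processed, s.passes_over_data)
```

Under these assumptions both situations run 4000 steps, but situation 1 processes 128000 examples (four passes over the data) while situation 2 processes 32000 (one pass). If a multi-GPU step takes roughly as long as a single-GPU step, that would be consistent with the total time not decreasing. Reducing the number of steps in proportion to num_gpu_cores is one possible way to get a comparable amount of data processed in less time.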

I know that there is a difference in execution time between the two situations, but do we also see a difference in performance, even though the iteration steps are the same?

Yes, we can see a difference. The global batch size in situation 1 is much larger than in situation 2 (32 vs. 8). With the same number of iteration steps, training is more stable in situation 1, provided the other parameters (e.g. learning rate, L2 regularization) are also kept the same.
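
One way to picture the stability point is a toy simulation. This is only a sketch under strong assumptions: it treats each per-example gradient as independent Gaussian noise around zero, which is not how a real BERT fine-tuning run behaves, and `batch_mean_spread` is a hypothetical helper written for this illustration. It measures how much the average over a batch fluctuates from batch to batch for batch sizes 8 and 32.

```python
import random
import statistics


def batch_mean_spread(batch_size, num_batches=2000, seed=0):
    """Standard deviation of the batch mean across many simulated batches.

    Each "per-example gradient" is drawn from a unit Gaussian (toy assumption).
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(batch_size))
        for _ in range(num_batches)
    ]
    return statistics.pstdev(means)


spread_8 = batch_mean_spread(8)    # global batch size in situation 2
spread_32 = batch_mean_spread(32)  # global batch size in situation 1

print("batch size 8 :", spread_8)
print("batch size 32:", spread_32)
print("ratio        :", spread_8 / spread_32)
```

In this toy setting the spread shrinks roughly with the square root of the batch size, so a batch of 32 gives an averaged gradient that is about half as noisy as a batch of 8. That is one plausible reading of "more stable" here; whether it translates into better final accuracy also depends on the learning rate and the other hyperparameters.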

@shishishu thanks for the quick reply. Your answer makes a few things clearer, but I still don't understand the difference. I may have framed the question incorrectly, so I have edited the issue and rephrased it. Could you take a look and help me understand this?

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.