NVIDIA/Megatron-LM

Need a suggestion and confirmation regarding throughput calculation

Closed this issue · 2 comments

I have a follow-up question to the closed issue #541. I am currently running BERT-Large pretraining for 30,000 iterations, and for each iteration I calculate throughput (seq/sec) as global_batch_size / elapsed_time_per_iteration. Once all the iterations are done, almost all the per-iteration throughput values are similar, so what should I take as the final throughput for this experiment?

Also, is this the right way to calculate throughput (seq/sec)? I am running this experiment with the configurations below:
1 Node * 1 GPU
1 Node * 2 GPUs
1 Node * 4 GPUs
1 Node * 6 GPUs
1 Node * 8 GPUs
1 Node * 10 GPUs

Do I need to change anything when calculating throughput for the above configurations, or will it still be global_batch_size / elapsed_time_per_iteration?

Note that all the GPUs are H100s, and the micro-batch size (batch size per GPU) is 64.
Reference:
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/training/training.py#L764
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/training/training.py#L766

Once all the iterations are done, almost all the per-iteration throughput values are similar, so what should I take as the final throughput for this experiment?

You can compute the average.
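For example, a minimal sketch of averaging the per-iteration throughput, assuming you have collected the per-iteration elapsed times (e.g. parsed from the Megatron-LM training log) and a constant global batch size; the values below are hypothetical:

```python
# Minimal sketch: average per-iteration throughput over a run.
# Assumes `elapsed_times` holds the elapsed time (in seconds) of each
# iteration and the global batch size was constant for the whole run.
# The numbers below are placeholder example values.

global_batch_size = 512                    # sequences per iteration (example)
elapsed_times = [1.98, 2.01, 2.00, 1.99]   # seconds per iteration (example)

# Per-iteration throughput in sequences/second.
per_iter_throughput = [global_batch_size / t for t in elapsed_times]

# Final throughput: mean over all iterations. Since the global batch size
# is constant, this is essentially total sequences / total elapsed time.
avg_throughput = sum(per_iter_throughput) / len(per_iter_throughput)
print(f"average throughput: {avg_throughput:.1f} seq/s")
```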

Do I need to change anything when calculating throughput for the above configurations, or will it still be global_batch_size / elapsed_time_per_iteration?

Global batch size / elapsed per-iteration time will give you the throughput in sequences / second, regardless of the configuration you use.
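As a worked illustration of why the formula is unchanged across the listed configurations, here is a short sketch assuming pure data parallelism (no tensor or pipeline parallelism) and no gradient accumulation; only the global batch size grows with the number of GPUs, while the throughput formula stays the same:

```python
# Sketch of how the global batch size relates to the configurations above,
# assuming pure data parallelism and no gradient accumulation; adjust if
# your parallelism setup differs.

micro_batch_size = 64  # batch size per GPU, as in the question

for num_gpus in [1, 2, 4, 6, 8, 10]:
    global_batch_size = micro_batch_size * num_gpus
    # Throughput is still global_batch_size / elapsed_time_per_iteration;
    # only the global batch size changes with the number of GPUs.
    print(f"{num_gpus:>2} GPUs -> global batch size {global_batch_size}")
```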

Thank you for your confirmation and suggestion.