Why does RoBERTa-500K have 4.5x more computation than ELECTRA-400K?
rabbitwayne opened this issue · 3 comments
Why does RoBERTa-500K have 4.5x more computation than ELECTRA-400K in the paper https://openreview.net/pdf?id=r1xMH1BtvB? Both RoBERTa-500K and ELECTRA-400K are the same size as BERT-Large. I would have expected RoBERTa-500K to use only 1.25x the computation of ELECTRA-400K (500K vs. 400K steps). Why is it 4.5x?
Because RoBERTa uses a batch size of 8K, whereas ELECTRA uses a batch size of 2K. So even though both train for a similar number of steps (400-500K), RoBERTa processes roughly 4x as many sequences per step and therefore uses much more compute: RoBERTa was trained on 1024 V100 GPUs, while ELECTRA was, I believe, trained on either a TPU v3-256 or a TPU v2-512, judging by the batch size.
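As a rough sanity check, here is a minimal back-of-the-envelope sketch in Python. It assumes total compute scales with batch size × training steps for two models of the same size; the 8K/2K batch sizes are the ones mentioned above, and the 500K/400K step counts come from the model names.

```python
# Back-of-the-envelope compute comparison, assuming total compute is roughly
# proportional to (batch size x number of training steps) when both models
# are BERT-Large sized.
roberta_batch, roberta_steps = 8192, 500_000   # RoBERTa-500K
electra_batch, electra_steps = 2048, 400_000   # ELECTRA-400K

ratio = (roberta_batch * roberta_steps) / (electra_batch * electra_steps)
print(f"Naive compute ratio: {ratio:.1f}x")    # prints 5.0x
```

That naive estimate lands at about 5x, in the same ballpark as the 4.5x reported in the paper. The paper's figure is based on FLOP counts, so details such as ELECTRA's extra generator compute can shift it slightly, but the dominant factor is the 4x larger batch.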
Thank you for the explanation!
Hello all, I just started reading the paper and I have a few questions.
I was wondering if you could help me with them?
- What exactly does "step" mean in the step count? Does it refer to 1 epoch or 1 minibatch?
- Also, in the paper (specifically in Table 1) I saw that ELECTRA-Small and BERT-Small both have 14M parameters. How is that possible, given that ELECTRA should have more parameters because its generator and discriminator modules are both BERT-based?
- Also, what is the architecture of the generator and the discriminator? Are they both BERT or something else?
- Also, what do the 500K and 400K mean in the model names above, like RoBERTa-500K or ELECTRA-400K?
Thanks in advance